Using CUDA within Spark / boosting linear algebra

Re: Using CUDA within Spark / boosting linear algebra

fommil
Btw, I wish people would stop cheating when comparing CPU and GPU timings
for things like matrix multiply :-P

Please always compare apples with apples and include the time it takes to
set up the matrices, send them to the processing unit, do the calculation
AND copy the results back to where you need to see them.

Skipping those steps will make you believe that your GPU is thousands of
times faster than it really is. Again, jump to the end of my talk for
graphs and more discussion....  especially the bit about me being keen on
funding to investigate APU hardware further ;-) (I believe it will solve
the problem)
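
For concreteness, a minimal Scala sketch of what such an end-to-end measurement
looks like on the CPU side with Breeze (names and sizes are illustrative; with a
GPU BLAS the same stopwatch window would also have to cover the host-to-device
and device-to-host copies):

  import breeze.linalg.DenseMatrix

  def endToEndGemmSeconds(n: Int): Double = {
    val start = System.nanoTime()
    val a = DenseMatrix.rand(n, n)  // setting up the matrices is part of the measurement
    val b = DenseMatrix.rand(n, n)
    val c = a * b                   // dispatches to whatever BLAS implementation was loaded
    val check = c(n - 1, n - 1)     // read part of the result back so it must be materialized
    (System.nanoTime() - start) / 1e9
  }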
On 26 Feb 2015 21:16, "Xiangrui Meng" <[hidden email]> wrote:

> Hey Alexander,
>
> I don't quite understand the part where netlib-cublas is about 20x
> slower than netlib-openblas. What is the overhead of using a GPU BLAS
> with netlib-java?
>
> CC'ed Sam, the author of netlib-java.
>
> Best,
> Xiangrui
>
> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <[hidden email]>
> wrote:
> > Better documentation for linking would be very helpful!  Here's a JIRA:
> > https://issues.apache.org/jira/browse/SPARK-6019
> >
> >
> > On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <[hidden email]>
> > wrote:
> >
> >> Thanks for compiling all the data and running these benchmarks, Alex. The
> >> big takeaways here can be seen with this chart:
> >>
> >> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
> >>
> >> 1) A properly configured GPU matrix multiply implementation (e.g.
> >> BIDMat+GPU) can provide a substantial (but less than an order of magnitude)
> >> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
> >> netlib-java+openblas-compiled).
> >> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse
> >> than a well-tuned CPU implementation, particularly for larger matrices
> >> (netlib-f2jblas or netlib-ref). This is not to pick on netlib - this
> >> basically agrees with the author's own benchmarks (
> >> https://github.com/fommil/netlib-java).
> >>
> >> I think that most of our users are in a situation where using GPUs may not
> >> be practical - although we could consider having a good GPU backend
> >> available as an option. However, *ALL* users of MLlib could benefit
> >> (potentially tremendously) from using a well-tuned CPU-based BLAS
> >> implementation. Perhaps we should consider updating the MLlib guide with a
> >> more complete section on enabling high-performance binaries on OSX and
> >> Linux? Or, better, figure out a way for the system to fetch these
> >> automatically.
> >>
> >> - Evan
> >>
> >>
> >>
> >> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>
> >>> Just to summarize this thread, I was finally able to make all performance
> >>> comparisons that we discussed. It turns out that:
> >>> BIDMat-cublas>>BIDMat MKL==netlib-mkl==netlib-openblas-compiled>netlib-openblas-yum-repo==netlib-cublas>netlib-blas>f2jblas
> >>>
> >>> Below is the link to the spreadsheet with full results.
> >>>
> >>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
> >>>
> >>> One thing still needs exploration: does BIDMat-cublas perform copying
> >>> to/from machine’s RAM?
> >>>
> >>> -----Original Message-----
> >>> From: Ulanov, Alexander
> >>> Sent: Tuesday, February 10, 2015 2:12 PM
> >>> To: Evan R. Sparks
> >>> Cc: Joseph Bradley; [hidden email]
> >>> Subject: RE: Using CUDA within Spark / boosting linear algebra
> >>>
> >>> Thanks, Evan! It seems that the ticket was marked as a duplicate, though the
> >>> original one discusses a slightly different topic. I was able to link netlib
> >>> with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a
> >>> 60MB library.
> >>>
> >>> | A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
> >>> +-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
> >>> | 100x100*100x100         | 0.00205596  | 0.000381                      | 0.03810324                             | 0.002556              |
> >>> | 1000x1000*1000x1000     | 0.018320947 | 0.038316857                   | 0.51803557                             | 1.638475459           |
> >>> | 10000x10000*10000x10000 | 23.78046632 | 32.94546697                   | 445.0935211                            | 1569.233228           |
> >>>
> >>> It turns out that pre-compiled MKL is faster than pre-compiled OpenBlas on
> >>> my machine. Probably I’ll add two more columns with locally compiled
> >>> openblas and cuda.
> >>>
> >>> Alexander
> >>>
> >>> From: Evan R. Sparks [mailto:[hidden email]]
> >>> Sent: Monday, February 09, 2015 6:06 PM
> >>> To: Ulanov, Alexander
> >>> Cc: Joseph Bradley; [hidden email]
> >>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>
> >>> Great - perhaps we can move this discussion off-list and onto a JIRA
> >>> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
> >>>
> >>> It seems like this is going to be somewhat exploratory for a while (and
> >>> there's probably only a handful of us who really care about fast linear
> >>> algebra!)
> >>>
> >>> - Evan
> >>>
> >>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>> Hi Evan,
> >>>
> >>> Thank you for the explanation and the useful link. I am going to build OpenBLAS,
> >>> link it with Netlib-java and run the benchmark again.
> >>>
> >>> Do I understand correctly that the BIDMat binaries contain statically linked
> >>> Intel MKL BLAS? That might be the reason why I am able to run BIDMat without
> >>> having MKL BLAS installed on my server. If that is true, I wonder whether it is
> >>> OK, because Intel sells this library. Nevertheless, it seems that in my case
> >>> precompiled MKL BLAS performs better than precompiled OpenBLAS, given that
> >>> BIDMat and Netlib-java are supposed to be on par in terms of JNI overheads.
> >>>
> >>> That said, it might be interesting to link Netlib-java with Intel MKL, as
> >>> you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
> >>> (Netlib-java) would be interested in comparing their libraries.
> >>>
> >>> Best regards, Alexander
> >>>
> >>> From: Evan R. Sparks [mailto:[hidden email]]
> >>> Sent: Friday, February 06, 2015 5:58 PM
> >>> To: Ulanov, Alexander
> >>> Cc: Joseph Bradley; [hidden email]
> >>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>
> >>> I would build OpenBLAS yourself, since good BLAS performance comes from
> >>> getting cache sizes, etc. set up correctly for your particular hardware -
> >>> this is often a very tricky process (see, e.g., ATLAS), but we found that on
> >>> relatively modern Xeon chips, OpenBLAS builds quickly and yields
> >>> performance competitive with MKL.
> >>>
> >>> To make sure the right library is getting used, you have to make sure
> >>> it's first on the search path - export
> >>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
> >>>
> >>> For some examples of getting netlib-java setup on an ec2 node and some
> >>> example benchmarking code we ran a while back, see:
> >>> https://github.com/shivaram/matrix-bench
> >>>
> >>> In particular, build-openblas-ec2.sh shows you how to build the library
> >>> and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
> >>> the path set up and have that library picked up by netlib-java.
> >>>
> >>> In this way - you could probably get cuBLAS set up to be used by
> >>> netlib-java as well.
> >>>
> >>> - Evan
> >>>
> >>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>> Evan, could you elaborate on how to force BIDMat and netlib-java to load
> >>> the right blas? For netlib, there are a few JVM flags, such as
> >>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can
> >>> force it to use the Java implementation. I am not sure I understand how to
> >>> force the use of a specific blas (as opposed to a specific wrapper for blas).
> >>>
> >>> Btw, I have installed openblas (yum install openblas), so I suppose that
> >>> netlib is using it.
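> >>>
> >>> A minimal sketch of one way to check which implementation netlib-java actually
> >>> loaded at runtime (the class names are the ones netlib-java itself ships):
> >>>
> >>>   import com.github.fommil.netlib.BLAS
> >>>   // Prints e.g. NativeSystemBLAS, NativeRefBLAS or F2jBLAS, depending on what was found.
> >>>   println(s"BLAS implementation: ${BLAS.getInstance().getClass.getName}")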
> >>>
> >>> From: Evan R. Sparks [mailto:[hidden email]]
> >>> Sent: Friday, February 06, 2015 5:19 PM
> >>> To: Ulanov, Alexander
> >>> Cc: Joseph Bradley; [hidden email]
> >>>
> >>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>
> >>> Getting breeze to pick up the right blas library is critical for
> >>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
> >>> It might make sense to force BIDMat to use the same underlying BLAS library
> >>> as well.
> >>>
> >>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>> Hi Evan, Joseph
> >>>
> >>> I did a few matrix multiplication tests, and BIDMat seems to be ~10x faster
> >>> than netlib-java+breeze (sorry for the weird table formatting):
> >>>
> >>> | A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> >>> +-------------------------+-------------+-----------------------------------------------+----------------------------+
> >>> | 100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
> >>> | 1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
> >>> | 10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
> >>>
> >>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
> >>> Linux, Scala 2.11.
> >>>
> >>> Later I will run tests with Cuda. I need to install a new Cuda version for
> >>> this purpose.
> >>>
> >>> Do you have any ideas why breeze-netlib with native blas is so much
> >>> slower than BIDMat MKL?
> >>>
> >>> Best regards, Alexander
> >>>
> >>> From: Joseph Bradley [mailto:[hidden email]]
> >>> Sent: Thursday, February 05, 2015 5:29 PM
> >>> To: Ulanov, Alexander
> >>> Cc: Evan R. Sparks; [hidden email]
> >>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>
> >>> Hi Alexander,
> >>>
> >>> Using GPUs with Spark would be very exciting.  Small comment: Concerning
> >>> your question earlier about keeping data stored on the GPU rather than
> >>> having to move it between main memory and GPU memory on each iteration, I
> >>> would guess this would be critical to getting good performance.  If you
> >>> could do multiple local iterations before aggregating results, then the
> >>> cost of data movement to the GPU could be amortized (and I believe that is
> >>> done in practice).  Having Spark be aware of the GPU and using it as
> >>> another part of memory sounds like a much bigger undertaking.
> >>>
> >>> Joseph
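> >>>
> >>> A toy back-of-the-envelope model of that amortization in Scala, with purely
> >>> hypothetical per-partition costs (the numbers are placeholders, not measurements):
> >>>
> >>>   val copyIn  = 0.20  // hypothetical seconds to ship a partition to the GPU
> >>>   val copyOut = 0.20  // hypothetical seconds to copy results back to the host
> >>>   val compute = 0.05  // hypothetical seconds per local iteration on the GPU
> >>>   // Copy on every iteration vs. copy once and run k local iterations on the device.
> >>>   def perIteration(k: Int) = k * (copyIn + compute + copyOut)
> >>>   def amortized(k: Int)    = copyIn + k * compute + copyOut
> >>>   println(perIteration(10)) // ~4.5: dominated by transfers
> >>>   println(amortized(10))    // ~0.9: transfers amortized over the local iterations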
> >>>
> >>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>> Thank you for the explanation! I’ve watched the BIDMach presentation by John
> >>> Canny and I am really inspired by his talk and the comparisons with Spark MLlib.
> >>>
> >>> I am very interested to find out what will work better within Spark: BIDMat
> >>> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
> >>> benchmark them? Currently I do benchmarks on artificial neural networks in
> >>> batch mode. While that is not a “pure” test of linear algebra, it involves
> >>> some other things that are essential to machine learning.
> >>>
> >>> From: Evan R. Sparks [mailto:[hidden email]]
> >>> Sent: Thursday, February 05, 2015 1:29 PM
> >>> To: Ulanov, Alexander
> >>> Cc: [hidden email]
> >>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>
> >>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
> >>> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
> >>> layout and fewer levels of indirection - it's definitely a worthwhile
> >>> experiment to run. The main speedups I've seen from using it come from
> >>> highly optimized GPU code for linear algebra. I know that in the past Canny
> >>> has gone as far as to write custom GPU kernels for performance-critical
> >>> regions of code.[1]
> >>>
> >>> BIDMach is highly optimized for single-node performance or performance on
> >>> small clusters.[2] Once data doesn't fit easily in GPU memory (or can't be
> >>> batched that way) the performance tends to fall off. Canny argues for
> >>> hardware/software codesign and as such prefers machine configurations that
> >>> are quite different from what we find in most commodity cluster nodes -
> >>> e.g. 10 disk channels and 4 GPUs.
> >>>
> >>> In contrast, MLlib was designed for horizontal scalability on commodity
> >>> clusters and works best on very big datasets - on the order of terabytes.
> >>>
> >>> For the most part, these projects developed concurrently to address
> >>> slightly different use cases. That said, there may be bits of BIDMach we
> >>> could repurpose for MLlib - keep in mind we need to be careful about
> >>> maintaining cross-language compatibility for our Java and Python users,
> >>> though.
> >>>
> >>> - Evan
> >>>
> >>> [1] - http://arxiv.org/abs/1409.5402
> >>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
> >>>
> >>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>> Hi Evan,
> >>>
> >>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you
> >>> know what makes it faster than netlib-java?
> >>>
> >>> The same group has the BIDMach library that implements machine learning. For
> >>> some examples they use the Caffe convolutional neural network library, owned by
> >>> another group in Berkeley. Could you elaborate on how all of these might be
> >>> connected with Spark MLlib? If you take BIDMat for linear algebra, why don’t
> >>> you take BIDMach for optimization and learning?
> >>>
> >>> Best regards, Alexander
> >>>
> >>> From: Evan R. Sparks [mailto:[hidden email]]
> >>> Sent: Thursday, February 05, 2015 12:09 PM
> >>> To: Ulanov, Alexander
> >>> Cc: [hidden email]
> >>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>
> >>> I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in
> >>> many cases.
> >>>
> >>> You might consider taking a look at the codepaths that BIDMat (
> >>> https://github.com/BIDData/BIDMat) takes and comparing them to
> >>> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
> >>> to make this work really fast from Scala. I've run it on my laptop and
> >>> compared to MKL, and in certain cases it's 10x faster at matrix multiply.
> >>> There are a lot of layers of indirection here and you really want to avoid
> >>> data copying as much as possible.
> >>>
> >>> We could also consider swapping Breeze out for BIDMat, but that would be
> >>> a big project, and if we can figure out how to get breeze+cublas to
> >>> comparable performance that would be a big win.
> >>>
> >>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <[hidden email]> wrote:
> >>> Dear Spark developers,
> >>>
> >>> I am exploring how to make linear algebra operations faster within Spark.
> >>> One way of doing this is to use the Scala Breeze library that is bundled with
> >>> Spark. For matrix operations, it employs Netlib-java, which has a Java
> >>> wrapper for BLAS (basic linear algebra subprograms) and LAPACK native
> >>> binaries if they are available on the worker node. It also has its own
> >>> optimized Java implementation of BLAS. It is worth mentioning that native
> >>> binaries provide better performance only for BLAS level 3, i.e.
> >>> matrix-matrix operations or general matrix multiplication (GEMM). This is
> >>> confirmed by the GEMM test on the Netlib-java page,
> >>> https://github.com/fommil/netlib-java. I also confirmed it with my
> >>> experiments with training an artificial neural network:
> >>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
> >>> However, I would like to boost performance further.
> >>>
> >>> GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA
> >>> implementation of BLAS, called cublas. I have one Linux server with an Nvidia
> >>> GPU and I was able to do the following. I linked cublas (instead of a
> >>> cpu-based blas) with the Netlib-java wrapper and put it into Spark, so
> >>> Breeze/Netlib is using it. Then I did some performance measurements with
> >>> regard to artificial neural network batch learning in Spark MLlib, which
> >>> involves matrix-matrix multiplications. It turns out that for matrices of
> >>> size less than ~1000x780, GPU cublas has the same speed as CPU blas. Cublas
> >>> becomes slower for bigger matrices. It is worth mentioning that it was not
> >>> a test of ONLY multiplication, since there are other operations involved.
> >>> One of the reasons for the slowdown might be the overhead of copying the
> >>> matrices from computer memory to graphics card memory and back.
> >>>
> >>> So, a few questions:
> >>> 1) Do these results with CUDA make sense?
> >>> 2) If the problem is the copy overhead, are there any libraries that
> >>> allow forcing intermediate results to stay in graphics card memory, thus
> >>> removing the overhead?
> >>> 3) Any other options to speed up linear algebra in Spark?
> >>>
> >>> Thank you, Alexander
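> >>>
> >>> For reference, a minimal Scala sketch of a GEMM call going through netlib-java's
> >>> BLAS facade - the layer Breeze sits on top of, and the point where a cublas-backed
> >>> native library would be picked up (the size here is arbitrary):
> >>>
> >>>   import com.github.fommil.netlib.BLAS
> >>>
> >>>   val n = 1000
> >>>   val a = Array.fill(n * n)(scala.util.Random.nextDouble())  // column-major n x n
> >>>   val b = Array.fill(n * n)(scala.util.Random.nextDouble())
> >>>   val c = new Array[Double](n * n)
> >>>   // C := 1.0 * A * B + 0.0 * C, dispatched to whichever BLAS implementation was loaded
> >>>   BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)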
> >>>

RE: Using CUDA within Spark / boosting linear algebra

Ulanov, Alexander
Evan, thank you for the summary. I would like to add some more observations. The GPU that I used is 2.5 times more expensive than the CPU ($250 vs $100); both are 3 years old. I also did a small test with modern hardware, and the new GPU, an nVidia Titan, was slightly more than one order of magnitude faster than an Intel E5-2650 v2 on the same tests. However, it costs about as much as that CPU ($1200). My takeaway is that GPUs are making better price/performance progress.



Xiangrui, I was also surprised that BIDMat-cuda was faster than netlib-cuda, and the most reasonable explanation is that it holds the result in GPU memory, as Sam suggested. At the same time, that is OK, because you can copy the result back from the GPU only when it is needed. However, to be sure, I am going to ask the developer of BIDMat at his upcoming talk.



Best regards, Alexander



RE: Using CUDA within Spark / boosting linear algebra

Ulanov, Alexander
Typo - the CPU was 2.5x cheaper (not the GPU!)

-----Original Message-----
From: Ulanov, Alexander
Sent: Thursday, February 26, 2015 2:01 PM
To: Sam Halliday; Xiangrui Meng
Cc: [hidden email]; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Evan, thank you for the summary. I would like to add some more observations. The GPU that I used is 2.5 times cheaper than the CPU ($250 vs $100). They both are 3 years old. I've also did a small test with modern hardware, and the new GPU nVidia Titan was slightly more than 1 order of magnitude faster than Intel E5-2650 v2 for the same tests. However, it costs as much as CPU ($1200). My takeaway is that GPU is making a better price/value progress.



Xiangrui, I was also surprised that BIDMat-cuda was faster than netlib-cuda and the most reasonable explanation is that it holds the result in GPU memory, as Sam suggested. At the same time, it is OK because you can copy the result back from GPU only when needed. However, to be sure, I am going to ask the developer of BIDMat on his upcoming talk.



Best regards, Alexander


From: Sam Halliday [mailto:[hidden email]]
Sent: Thursday, February 26, 2015 1:56 PM
To: Xiangrui Meng
Cc: [hidden email]; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks
Subject: Re: Using CUDA within Spark / boosting linear algebra


Btw, I wish people would stop cheating when comparing CPU and GPU timings for things like matrix multiply :-P

Please always compare apples with apples and include the time it takes to set up the matrices, send it to the processing unit, doing the calculation AND copying it back to where you need to see the results.

Ignoring this method will make you believe that your GPU is thousands of times faster than it really is. Again, jump to the end of my talk for graphs and more discussion....  especially the bit about me being keen on funding to investigate APU hardware further ;-) (I believe it will solve the problem) On 26 Feb 2015 21:16, "Xiangrui Meng" <[hidden email]<mailto:[hidden email]>> wrote:
Hey Alexander,

I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java?

CC'ed Sam, the author of netlib-java.

Best,
Xiangrui

On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <[hidden email]<mailto:[hidden email]>> wrote:

> Better documentation for linking would be very helpful!  Here's a JIRA:
> https://issues.apache.org/jira/browse/SPARK-6019
>
>
> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
> <[hidden email]<mailto:[hidden email]>>
> wrote:
>
>> Thanks for compiling all the data and running these benchmarks, Alex.
>> The big takeaways here can be seen with this chart:
>>
>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZH
>> l6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>
>> 1) A properly configured GPU matrix multiply implementation (e.g.
>> BIDMat+GPU) can provide substantial (but less than an order of
>> BIDMat+magnitude)
>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
>> netlib-java+openblas-compiled).
>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
>> worse than a well-tuned CPU implementation, particularly for larger matrices.
>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
>> basically agrees with the authors own benchmarks (
>> https://github.com/fommil/netlib-java)
>>
>> I think that most of our users are in a situation where using GPUs
>> may not be practical - although we could consider having a good GPU
>> backend available as an option. However, *ALL* users of MLlib could
>> benefit (potentially tremendously) from using a well-tuned CPU-based
>> BLAS implementation. Perhaps we should consider updating the mllib
>> guide with a more complete section for enabling high performance
>> binaries on OSX and Linux? Or better, figure out a way for the system
>> to fetch these automatically.
>>
>> - Evan
>>
>>
>>
>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>> [hidden email]<mailto:[hidden email]>> wrote:
>>
>>> Just to summarize this thread, I was finally able to make all
>>> performance comparisons that we discussed. It turns out that:
>>> BIDMat-cublas>>BIDMat
>>> MKL==netlib-mkl==netlib-openblas-compiled>netlib-openblas-yum-repo==
>>> netlib-cublas>netlib-blas>f2jblas
>>>
>>> Below is the link to the spreadsheet with full results.
>>>
>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx3
>>> 78T9J5r7kwKSPkY/edit?usp=sharing
>>>
>>> One thing still needs exploration: does BIDMat-cublas perform
>>> copying to/from machine’s RAM?
>>>
>>> -----Original Message-----
>>> From: Ulanov, Alexander
>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>> To: Evan R. Sparks
>>> Cc: Joseph Bradley;
>>> [hidden email]<mailto:[hidden email]>
>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>
>>> Thanks, Evan! It seems that ticket was marked as duplicate though
>>> the original one discusses slightly different topic. I was able to
>>> link netlib with MKL from BIDMat binaries. Indeed, MKL is statically
>>> linked inside a 60MB library.
>>>
>>> |A*B  size | BIDMat MKL | Breeze+Netlib-MKL  from BIDMat|
>>> Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas |
>>> +-----------------------------------------------------------------------+
>>> |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
>>> |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557
>>> |1,638475459 |
>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697 |445,0935211 |
>>> 1569,233228 |
>>>
>>> It turn out that pre-compiled MKL is faster than precompiled
>>> OpenBlas on my machine. Probably, I’ll add two more columns with
>>> locally compiled openblas and cuda.
>>>
>>> Alexander
>>>
>>> From: Evan R. Sparks
>>> [mailto:[hidden email]<mailto:[hidden email]>]
>>> Sent: Monday, February 09, 2015 6:06 PM
>>> To: Ulanov, Alexander
>>> Cc: Joseph Bradley;
>>> [hidden email]<mailto:[hidden email]>
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> Great - perhaps we can move this discussion off-list and onto a JIRA
>>> ticket? (Here's one:
>>> https://issues.apache.org/jira/browse/SPARK-5705)
>>>
>>> It seems like this is going to be somewhat exploratory for a while
>>> (and there's probably only a handful of us who really care about
>>> fast linear
>>> algebra!)
>>>
>>> - Evan
>>>
>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>> wrote:
>>> Hi Evan,
>>>
>>> Thank you for explanation and useful link. I am going to build
>>> OpenBLAS, link it with Netlib-java and perform benchmark again.
>>>
>>> Do I understand correctly that BIDMat binaries contain statically
>>> linked Intel MKL BLAS? It might be the reason why I am able to run
>>> BIDMat not having MKL BLAS installed on my server. If it is true, I
>>> wonder if it is OK because Intel sells this library. Nevertheless,
>>> it seems that in my case precompiled MKL BLAS performs better than
>>> precompiled OpenBLAS given that BIDMat and Netlib-java are supposed to be on par with JNI overheads.
>>>
>>> Though, it might be interesting to link Netlib-java with Intel MKL,
>>> as you suggested. I wonder, are John Canny (BIDMat) and Sam Halliday
>>> (Netlib-java) interested to compare their libraries.
>>>
>>> Best regards, Alexander
>>>
>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:[hidden email]><mailto:
>>> [hidden email]<mailto:[hidden email]>>]
>>> Sent: Friday, February 06, 2015 5:58 PM
>>>
>>> To: Ulanov, Alexander
>>> Cc: Joseph Bradley;
>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]
>>> pache.org<mailto:[hidden email]>>
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> I would build OpenBLAS yourself, since good BLAS performance comes
>>> from getting cache sizes, etc. set up correctly for your particular
>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>>> quickly and yields performance competitive with MKL.
>>>
>>> To make sure the right library is getting used, you have to make
>>> sure it's first on the search path - export
>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
>>>
>>> For some examples of getting netlib-java setup on an ec2 node and
>>> some example benchmarking code we ran a while back, see:
>>> https://github.com/shivaram/matrix-bench
>>>
>>> In particular - build-openblas-ec2.sh shows you how to build the
>>> library and set up symlinks correctly, and scala/run-netlib.sh shows
>>> you how to get the path setup and get that library picked up by netlib-java.
>>>
>>> In this way - you could probably get cuBLAS set up to be used by
>>> netlib-java as well.
>>>
>>> - Evan
>>>
>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>> wrote:
>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>>> force loading the right blas? For netlib, I there are few JVM flags,
>>> such as
>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so
>>> I can force it to use Java implementation. Not sure I understand how to force use a specific blas (not specific wrapper for blas).
>>>
>>> Btw. I have installed openblas (yum install openblas), so I suppose
>>> that netlib is using it.
>>>
>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:[hidden email]><mailto:
>>> [hidden email]<mailto:[hidden email]>>]
>>> Sent: Friday, February 06, 2015 5:19 PM
>>> To: Ulanov, Alexander
>>> Cc: Joseph Bradley;
>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]
>>> pache.org<mailto:[hidden email]>>
>>>
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> Getting breeze to pick up the right blas library is critical for
>>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
>>> It might make sense to force BIDMat to use the same underlying BLAS
>>> library as well.
>>>
>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>> wrote:
>>> Hi Evan, Joseph
>>>
>>> I did few matrix multiplication test and BIDMat seems to be ~10x
>>> faster than netlib-java+breeze (sorry for weird table formatting):
>>>
>>> |A*B  size | BIDMat MKL | Breeze+Netlib-java
>>> |native_system_linux_x86-64|
>>> Breeze+Netlib-java f2jblas |
>>> +-----------------------------------------------------------------------+
>>> |100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
>>> |1000x1000*1000x1000 | 0,018320947 | 0,51803557 |1,638475459 |
>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211 | 1569,233228 |
>>>
>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>>> 19 Linux, Scala 2.11.
>>>
>>> Later I will make tests with Cuda. I need to install new Cuda
>>> version for this purpose.
>>>
>>> Do you have any ideas why breeze-netlib with native blas is so much
>>> slower than BIDMat MKL?
>>>
>>> Best regards, Alexander
>>>
>>> From: Joseph Bradley [mailto:[hidden email]<mailto:[hidden email]><mailto:
>>> [hidden email]<mailto:[hidden email]>>]
>>> Sent: Thursday, February 05, 2015 5:29 PM
>>> To: Ulanov, Alexander
>>> Cc: Evan R. Sparks;
>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]
>>> pache.org<mailto:[hidden email]>>
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> Hi Alexander,
>>>
>>> Using GPUs with Spark would be very exciting.  Small comment:
>>> Concerning your question earlier about keeping data stored on the
>>> GPU rather than having to move it between main memory and GPU memory
>>> on each iteration, I would guess this would be critical to getting
>>> good performance.  If you could do multiple local iterations before
>>> aggregating results, then the cost of data movement to the GPU could
>>> be amortized (and I believe that is done in practice).  Having Spark
>>> be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking.
>>>
>>> Joseph
>>>
>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>> wrote:
>>> Thank you for explanation! I’ve watched the BIDMach presentation by
>>> John Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>>>
>>> I am very interested to find out what will be better within Spark:
>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
>>> fair way to benchmark them? Currently I do benchmarks on artificial
>>> neural networks in batch mode. While it is not a “pure” test of
>>> linear algebra, it involves some other things that are essential to machine learning.
>>>
>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:[hidden email]><mailto:
>>> [hidden email]<mailto:[hidden email]>>]
>>> Sent: Thursday, February 05, 2015 1:29 PM
>>> To: Ulanov, Alexander
>>> Cc:
>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]
>>> pache.org<mailto:[hidden email]>>
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> I'd be surprised of BIDMat+OpenBLAS was significantly faster than
>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
>>> netlib-java+data
>>> layout and fewer levels of indirection - it's definitely a
>>> worthwhile experiment to run. The main speedups I've seen from using
>>> it come from highly optimized GPU code for linear algebra. I know
>>> that in the past Canny has gone as far as to write custom GPU
>>> kernels for performance-critical regions of code.[1]
>>>
>>> BIDMach is highly optimized for single node performance or
>>> performance on small clusters.[2] Once data doesn't fit easily in
>>> GPU memory (or can be batched in that way) the performance tends to
>>> fall off. Canny argues for hardware/software codesign and as such
>>> prefers machine configurations that are quite different than what we
>>> find in most commodity cluster nodes - e.g. 10 disk cahnnels and 4 GPUs.
>>>
>>> In contrast, MLlib was designed for horizontal scalability on
>>> commodity clusters and works best on very big datasets - order of terabytes.
>>>
>>> For the most part, these projects developed concurrently to address
>>> slightly different use cases. That said, there may be bits of
>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
>>> careful about maintaining cross-language compatibility for our Java
>>> and Python-users, though.
>>>
>>> - Evan
>>>
>>> [1] - http://arxiv.org/abs/1409.5402 [2] -
>>> http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>
>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>><mailto:
>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>>> wrote:
>>> Hi Evan,
>>>
>>> Thank you for suggestion! BIDMat seems to have terrific speed. Do
>>> you know what makes them faster than netlib-java?
>>>
>>> The same group has BIDMach library that implements machine learning.
>>> For some examples they use Caffe convolutional neural network
>>> library owned by another group in Berkeley. Could you elaborate on
>>> how these all might be connected with Spark Mllib? If you take
>>> BIDMat for linear algebra why don’t you take BIDMach for optimization and learning?
>>>
>>> Best regards, Alexander
>>>
>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:[hidden email]><mailto:
>>> [hidden email]<mailto:[hidden email]>><mailto:[hidden email]<mailto:[hidden email]><mailto:
>>> [hidden email]<mailto:[hidden email]>>>]
>>> Sent: Thursday, February 05, 2015 12:09 PM
>>> To: Ulanov, Alexander
>>> Cc: [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>><mailto:
>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]
>>> pache.org<mailto:[hidden email]>>>
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>>> blas in many cases.
>>>
>>> You might consider taking a look at the codepaths that BIDMat (
>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>> netlib-java/breeze. John Canny et. al. have done a bunch of work
>>> optimizing to make this work really fast from Scala. I've run it on
>>> my laptop and compared to MKL and in certain cases it's 10x faster at matrix multiply.
>>> There are a lot of layers of indirection here and you really want to
>>> avoid data copying as much as possible.
>>>
>>> We could also consider swapping out BIDMat for Breeze, but that
>>> would be a big project and if we can figure out how to get
>>> breeze+cublas to comparable performance that would be a big win.
>>>
>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>><mailto:
>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>>> wrote:
>>> Dear Spark developers,
>>>
>>> I am exploring how to make linear algebra operations faster within Spark.
>>> One way of doing this is to use Scala Breeze library that is bundled
>>> with Spark. For matrix operations, it employs Netlib-java that has a
>>> Java wrapper for BLAS (basic linear algebra subprograms) and LAPACK
>>> native binaries if they are available on the worker node. It also
>>> has its own optimized Java implementation of BLAS. It is worth
>>> mentioning, that native binaries provide better performance only for BLAS level 3, i.e.
>>> matrix-matrix operations or general matrix multiplication (GEMM).
>>> This is confirmed by GEMM test on Netlib-java page
>>> https://github.com/fommil/netlib-java. I also confirmed it with my
>>> experiments with training of artificial neural network
>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>> However, I would like to boost performance more.
>>>
>>> GPU is supposed to work fast with linear algebra and there is Nvidia
>>> CUDA implementation of BLAS, called cublas. I have one Linux server
>>> with Nvidia GPU and I was able to do the following. I linked cublas
>>> (instead of cpu-based blas) with Netlib-java wrapper and put it into
>>> Spark, so Breeze/Netlib is using it. Then I did some performance
>>> measurements with regards to artificial neural network batch
>>> learning in Spark MLlib that involves matrix-matrix multiplications.
>>> It turns out that for matrices of size less than ~1000x780 GPU
>>> cublas has the same speed as CPU blas. Cublas becomes slower for
>>> bigger matrices. It is worth mentioning that it was not a test of ONLY multiplication, since there are other operations involved.
>>> One of the reasons for slowdown might be the overhead of copying the
>>> matrices from computer memory to graphic card memory and back.
>>>
>>> So, few questions:
>>> 1) Do these results with CUDA make sense?
>>> 2) If the problem is with copy overhead, are there any libraries
>>> that allow to force intermediate results to stay in graphic card
>>> memory thus removing the overhead?
>>> 3) Any other options to speed-up linear algebra in Spark?
>>>
>>> Thank you, Alexander
>>>


Re: Using CUDA within Spark / boosting linear algebra

Evan R. Sparks
In reply to this post by fommil
I couldn't agree with you more, Sam. The GPU/Matrix guys typically don't
count their copy times, but claim that you should be doing *as much as
possible* on the GPU - so, maybe for some applications where you can
generate the data on the GPU this makes sense. But, in the context of Spark
we should be *very* careful about enumerating the applications we want GPU
support for and deciding whether it's appropriate to measure the overheads
of getting the data to the GPU.
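
As an illustration of the kind of end-to-end measurement being described here, a minimal sketch against netlib-java might look like the following (BLAS.dgemm is the only real API used; the matrix size, seed and plain wall-clock timing are purely illustrative). The clock covers allocation and fill as well as the multiply, which is also where any copy-to-device and copy-back steps would have to be charged for a GPU-backed BLAS:

import com.github.fommil.netlib.BLAS

object GemmTiming {
  def main(args: Array[String]): Unit = {
    val n = 2048
    val rnd = new java.util.Random(42)
    // Start the clock before set-up, not just before the kernel call.
    val start = System.nanoTime()
    val a = Array.fill(n * n)(rnd.nextDouble())
    val b = Array.fill(n * n)(rnd.nextDouble())
    val c = new Array[Double](n * n)
    // C := 1.0 * A * B + 0.0 * C, column-major as BLAS expects.
    BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
    val seconds = (System.nanoTime() - start) / 1e9
    println(f"end-to-end dgemm, n=$n: $seconds%.3f s (checksum ${c.sum}%.3f)")
  }
}

The same harness works unchanged whichever native backend netlib-java picks up, so it gives comparable apples-with-apples numbers for f2j, OpenBLAS, MKL or a cuBLAS-backed build.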


RE: Using CUDA within Spark / boosting linear algebra

fommil
In reply to this post by Ulanov, Alexander
I've had some email exchanges with the author of BIDMat: it does exactly
what you need to get the GPU benefit and writes higher level algorithms
entirely in the GPU kernels so that the memory stays there as long as
possible. The restriction of this approach is that it only offers
high-level algorithms, so it is not a toolkit for applied mathematics
research and development, but it works well as a toolkit for
higher-level analysis (e.g. for analysts and practitioners).

I believe BIDMat's approach is the best way to get performance out of
GPU hardware at the moment, but I also have strong evidence to suggest
that the hardware will catch up and the memory transfer costs between
CPU and GPU will disappear, meaning that there will be no need for
custom GPU kernel implementations. In other words, please continue to
use BLAS primitives when writing new algorithms, and only go to the GPU
for an alternative optimised implementation.

Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like: they
offer an API that looks like BLAS but takes pointers to buffers that
live in GPU memory. Somebody has written a wrapper around CUDA to
expose a proper BLAS library, but it gives only a marginal gain over
the CPU because of the memory transfer overhead.
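
As a rough illustration of the difference, a host-array BLAS facade over a device-pointer API ends up doing something like the sketch below around every single call. The DeviceBlas trait and its method names are invented for illustration only (they are not a real cuBLAS or JCuda binding); the point is simply that the O(n^3) kernel gets bracketed by O(n^2) transfers in and out of device memory:

// Hypothetical device-pointer API in the spirit of cuBLAS; the names are
// made up for illustration and do not correspond to a real binding.
trait DeviceBlas {
  def alloc(doubles: Int): Long                      // returns a device "handle"
  def copyIn(host: Array[Double], dev: Long): Unit   // O(n^2) transfer for an n x n matrix
  def copyOut(dev: Long, host: Array[Double]): Unit  // O(n^2) transfer back
  def dgemm(n: Int, a: Long, b: Long, c: Long): Unit // O(n^3) kernel on the device
  def free(dev: Long): Unit
}

// What a host-array wrapper is forced to do: two transfers around the kernel.
def hostDgemm(blas: DeviceBlas, n: Int,
              a: Array[Double], b: Array[Double], c: Array[Double]): Unit = {
  val (da, db, dc) = (blas.alloc(n * n), blas.alloc(n * n), blas.alloc(n * n))
  blas.copyIn(a, da)
  blas.copyIn(b, db)
  blas.dgemm(n, da, db, dc)   // the only part the "cheating" benchmarks time
  blas.copyOut(dc, c)
  Seq(da, db, dc).foreach(p => blas.free(p))
}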

This slide from my talk

  http://fommil.github.io/scalax14/#/11/2

says it all. X axis is matrix size, Y axis is logarithmic time to do
DGEMM. Black line is the "cheating" time for the GPU and the green line
is after copying the memory to/from the GPU memory. APUs have the
potential to eliminate the green line.

Best regards,
Sam



"Ulanov, Alexander" <[hidden email]> writes:

> Evan, thank you for the summary. I would like to add some more observations. The GPU that I used is 2.5 times cheaper than the CPU ($250 for the CPU vs $100 for the GPU). They are both 3 years old. I also did a small test with modern hardware, and the new GPU (nVidia Titan) was slightly more than one order of magnitude faster than an Intel E5-2650 v2 on the same tests. However, it costs about as much as the CPU ($1200). My takeaway is that GPUs are making better price/performance progress.
>
>
>
> Xiangrui, I was also surprised that BIDMat-cuda was faster than netlib-cuda, and the most reasonable explanation is that it holds the result in GPU memory, as Sam suggested. At the same time, that is fine, because you can copy the result back from the GPU only when it is needed. However, to be sure, I am going to ask the developer of BIDMat at his upcoming talk.
>
>
>
> Best regards, Alexander
>
--
Best regards,
Sam



Re: Using CUDA within Spark / boosting linear algebra

Xiangrui Meng
The copying overhead should be quadratic in n, while the computation
cost is cubic in n. I can understand that netlib-cublas is slower than
netlib-openblas on small problems. But I'm surprised to see that it is
still 20x slower on 10000x10000. I did the following on a g2.2xlarge
instance with BIDMat:

val n = 10000

val f = rand(n, n)
flip; f*f; val rf = flop

flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop

flip; g*g; val rgg = flop
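
(If I read the BIDMat timing helpers right, flip resets the timer and flop reads it back; that is what rf, rg and rgg capture above.)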

The CPU version finished in 12 seconds.
The CPU->GPU->CPU version finished in 2.2 seconds.
The GPU version finished in 1.7 seconds.

I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas
path. But based on the result, the data copying overhead is definitely
not as big as 20x at n = 10000.
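
As a rough illustration of that quadratic-versus-cubic argument, a toy model makes the trend visible. The bandwidth and flop-rate constants below are made-up placeholders rather than measurements of any particular card or CPU:

// Toy model: transfer cost grows like n^2, compute cost like n^3, so the
// ratio of copy time to kernel time falls roughly like 1/n.
val bytesPerDouble  = 8
val pcieBytesPerSec = 8e9    // assumed host<->device bandwidth (placeholder)
val gpuFlopsPerSec  = 1e12   // assumed sustained DGEMM rate (placeholder)

for (n <- Seq(100, 1000, 10000)) {
  val copySec = 3.0 * n * n * bytesPerDouble / pcieBytesPerSec  // A and B in, C out
  val gemmSec = 2.0 * math.pow(n, 3) / gpuFlopsPerSec           // ~2n^3 flops for GEMM
  println(f"n=$n%5d  copy=$copySec%.5f s  gemm=$gemmSec%.5f s  copy/gemm=${copySec / gemmSec}%.2f")
}

With constants in that ballpark the copies dominate at n = 100 and are only a modest fraction of the kernel time at n = 10000, which is consistent with the 2.2 s versus 1.7 s gap measured above.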

Best,
Xiangrui


>>>>> -----Original Message-----
>>>>> From: Ulanov, Alexander
>>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>>> To: Evan R. Sparks
>>>>> Cc: Joseph Bradley; [hidden email]<mailto:[hidden email]>
>>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> Thanks, Evan! It seems that ticket was marked as duplicate though the
>>>>> original one discusses slightly different topic. I was able to link netlib
>>>>> with MKL from BIDMat binaries. Indeed, MKL is statically linked inside a
>>>>> 60MB library.
>>>>>
>>>>> |A*B  size | BIDMat MKL | Breeze+Netlib-MKL  from BIDMat|
>>>>> Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas |
>>>>> +-----------------------------------------------------------------------+
>>>>> |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
>>>>> |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557
>>>>> |1,638475459 |
>>>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697 |445,0935211 |
>>>>> 1569,233228 |
>>>>>
>>>>> It turn out that pre-compiled MKL is faster than precompiled OpenBlas on
>>>>> my machine. Probably, I’ll add two more columns with locally compiled
>>>>> openblas and cuda.
>>>>>
>>>>> Alexander
>>>>>
>>>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:[hidden email]>]
>>>>> Sent: Monday, February 09, 2015 6:06 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: Joseph Bradley; [hidden email]<mailto:[hidden email]>
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> Great - perhaps we can move this discussion off-list and onto a JIRA
>>>>> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>>>>>
>>>>> It seems like this is going to be somewhat exploratory for a while (and
>>>>> there's probably only a handful of us who really care about fast linear
>>>>> algebra!)
>>>>>
>>>>> - Evan
>>>>>
>>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>>>>> [hidden email]> wrote:
>>>>> Hi Evan,
>>>>>
>>>>> Thank you for explanation and useful link. I am going to build OpenBLAS,
>>>>> link it with Netlib-java and perform benchmark again.
>>>>>
>>>>> Do I understand correctly that BIDMat binaries contain statically linked
>>>>> Intel MKL BLAS? It might be the reason why I am able to run BIDMat not
>>>>> having MKL BLAS installed on my server. If it is true, I wonder if it is OK
>>>>> because Intel sells this library. Nevertheless, it seems that in my case
>>>>> precompiled MKL BLAS performs better than precompiled OpenBLAS given that
>>>>> BIDMat and Netlib-java are supposed to be on par with JNI overheads.
>>>>>
>>>>> Though, it might be interesting to link Netlib-java with Intel MKL, as
>>>>> you suggested. I wonder, are John Canny (BIDMat) and Sam Halliday
>>>>> (Netlib-java) interested to compare their libraries.
>>>>>
>>>>> Best regards, Alexander
>>>>>
>>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>>> Sent: Friday, February 06, 2015 5:58 PM
>>>>>
>>>>> To: Ulanov, Alexander
>>>>> Cc: Joseph Bradley; [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> I would build OpenBLAS yourself, since good BLAS performance comes from
>>>>> getting cache sizes, etc. set up correctly for your particular hardware -
>>>>> this is often a very tricky process (see, e.g. ATLAS), but we found that on
>>>>> relatively modern Xeon chips, OpenBLAS builds quickly and yields
>>>>> performance competitive with MKL.
>>>>>
>>>>> To make sure the right library is getting used, you have to make sure
>>>>> it's first on the search path - export
>>>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
>>>>>
>>>>> For some examples of getting netlib-java setup on an ec2 node and some
>>>>> example benchmarking code we ran a while back, see:
>>>>> https://github.com/shivaram/matrix-bench
>>>>>
>>>>> In particular - build-openblas-ec2.sh shows you how to build the library
>>>>> and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
>>>>> the path setup and get that library picked up by netlib-java.
>>>>>
>>>>> In this way - you could probably get cuBLAS set up to be used by
>>>>> netlib-java as well.
>>>>>
>>>>> - Evan
>>>>>
>>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>>>>> [hidden email]> wrote:
>>>>> Evan, could you elaborate on how to force BIDMat and netlib-java to force
>>>>> loading the right blas? For netlib, I there are few JVM flags, such as
>>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can
>>>>> force it to use Java implementation. Not sure I understand how to force use
>>>>> a specific blas (not specific wrapper for blas).
>>>>>
>>>>> Btw. I have installed openblas (yum install openblas), so I suppose that
>>>>> netlib is using it.
>>>>>
>>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>>> Sent: Friday, February 06, 2015 5:19 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: Joseph Bradley; [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>
>>>>>
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> Getting breeze to pick up the right blas library is critical for
>>>>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
>>>>> It might make sense to force BIDMat to use the same underlying BLAS library
>>>>> as well.
>>>>>
>>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>>>>> [hidden email]> wrote:
>>>>> Hi Evan, Joseph
>>>>>
>>>>> I did few matrix multiplication test and BIDMat seems to be ~10x faster
>>>>> than netlib-java+breeze (sorry for weird table formatting):
>>>>>
>>>>> |A*B  size | BIDMat MKL | Breeze+Netlib-java native_system_linux_x86-64|
>>>>> Breeze+Netlib-java f2jblas |
>>>>> +-----------------------------------------------------------------------+
>>>>> |100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
>>>>> |1000x1000*1000x1000 | 0,018320947 | 0,51803557 |1,638475459 |
>>>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211 | 1569,233228 |
>>>>>
>>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
>>>>> Linux, Scala 2.11.
>>>>>
>>>>> Later I will make tests with Cuda. I need to install new Cuda version for
>>>>> this purpose.
>>>>>
>>>>> Do you have any ideas why breeze-netlib with native blas is so much
>>>>> slower than BIDMat MKL?
>>>>>
>>>>> Best regards, Alexander
>>>>>
>>>>> From: Joseph Bradley [mailto:[hidden email]]
>>>>> Sent: Thursday, February 05, 2015 5:29 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: Evan R. Sparks; [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> Hi Alexander,
>>>>>
>>>>> Using GPUs with Spark would be very exciting.  Small comment: Concerning
>>>>> your question earlier about keeping data stored on the GPU rather than
>>>>> having to move it between main memory and GPU memory on each iteration, I
>>>>> would guess this would be critical to getting good performance.  If you
>>>>> could do multiple local iterations before aggregating results, then the
>>>>> cost of data movement to the GPU could be amortized (and I believe that is
>>>>> done in practice).  Having Spark be aware of the GPU and using it as
>>>>> another part of memory sounds like a much bigger undertaking.
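>>>>>
>>>>> To make the amortization idea concrete, here is a toy sketch (not MLlib code:
>>>>> a least-squares gradient step stands in for whatever GEMM-heavy update you
>>>>> would run, and plain Breeze stands in for the accelerated backend; only the
>>>>> small weight vectors travel between the workers and the driver):
>>>>>
>>>>> import breeze.linalg.{DenseMatrix, DenseVector}
>>>>> import org.apache.spark.rdd.RDD
>>>>>
>>>>> // k local gradient steps over one partition's batch; the heavy flops are
>>>>> // the two matrix products per step
>>>>> def localSteps(x: DenseMatrix[Double], y: DenseVector[Double],
>>>>>                w0: DenseVector[Double], k: Int, lr: Double): DenseVector[Double] = {
>>>>>   var w = w0.copy
>>>>>   for (_ <- 0 until k) {
>>>>>     val grad = x.t * (x * w - y)
>>>>>     w -= grad * (lr / x.rows)
>>>>>   }
>>>>>   w
>>>>> }
>>>>>
>>>>> // one round: each partition is materialised (or shipped to the device) once,
>>>>> // iterated on k times, and only the resulting weight vectors are averaged
>>>>> def round(data: RDD[(Array[Double], Double)], w: DenseVector[Double],
>>>>>           k: Int, lr: Double): DenseVector[Double] = {
>>>>>   val models = data.mapPartitions { it =>
>>>>>     val rows = it.toArray
>>>>>     if (rows.isEmpty) Iterator.empty
>>>>>     else {
>>>>>       val x = DenseMatrix(rows.map(_._1): _*)
>>>>>       val y = DenseVector(rows.map(_._2))
>>>>>       Iterator.single((localSteps(x, y, w, k, lr), 1L))
>>>>>     }
>>>>>   }
>>>>>   val (sum, count) = models.reduce { case ((w1, n1), (w2, n2)) => (w1 + w2, n1 + n2) }
>>>>>   sum / count.toDouble
>>>>> }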
>>>>>
>>>>> Joseph
>>>>>
>>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>>>>> [hidden email]> wrote:
>>>>> Thank you for explanation! I’ve watched the BIDMach presentation by John
>>>>> Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>>>>>
>>>>> I am very interested to find out what will be better within Spark: BIDMat
>>>>> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
>>>>> benchmark them? Currently I do benchmarks on artificial neural networks in
>>>>> batch mode. While it is not a “pure” test of linear algebra, it involves
>>>>> some other things that are essential to machine learning.
>>>>>
>>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>>> Sent: Thursday, February 05, 2015 1:29 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: [hidden email]
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>>>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
>>>>> layout and fewer levels of indirection - it's definitely a worthwhile
>>>>> experiment to run. The main speedups I've seen from using it come from
>>>>> highly optimized GPU code for linear algebra. I know that in the past Canny
>>>>> has gone as far as to write custom GPU kernels for performance-critical
>>>>> regions of code.[1]
>>>>>
>>>>> BIDMach is highly optimized for single node performance or performance on
>>>>> small clusters.[2] Once data doesn't fit easily in GPU memory (or can't be
>>>>> batched in that way) the performance tends to fall off. Canny argues for
>>>>> hardware/software codesign and as such prefers machine configurations that
>>>>> are quite different from what we find in most commodity cluster nodes -
>>>>> e.g. 10 disk channels and 4 GPUs.
>>>>>
>>>>> In contrast, MLlib was designed for horizontal scalability on commodity
>>>>> clusters and works best on very big datasets - order of terabytes.
>>>>>
>>>>> For the most part, these projects developed concurrently to address
>>>>> slightly different use cases. That said, there may be bits of BIDMach we
>>>>> could repurpose for MLlib - keep in mind we need to be careful about
>>>>> maintaining cross-language compatibility for our Java and Python users,
>>>>> though.
>>>>>
>>>>> - Evan
>>>>>
>>>>> [1] - http://arxiv.org/abs/1409.5402
>>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>>>
>>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>>>>> [hidden email]> wrote:
>>>>> Hi Evan,
>>>>>
>>>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you
>>>>> know what makes it faster than netlib-java?
>>>>>
>>>>> The same group has the BIDMach library that implements machine learning. For
>>>>> some examples they use the Caffe convolutional neural network library developed
>>>>> by another group at Berkeley. Could you elaborate on how these all might be
>>>>> connected with Spark MLlib? If you take BIDMat for linear algebra, why don’t
>>>>> you take BIDMach for optimization and learning?
>>>>>
>>>>> Best regards, Alexander
>>>>>
>>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>>> Sent: Thursday, February 05, 2015 12:09 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: [hidden email]
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in
>>>>> many cases.
>>>>>
>>>>> You might consider taking a look at the codepaths that BIDMat (
>>>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>>>> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
>>>>> to make this work really fast from Scala. I've run it on my laptop and
>>>>> compared to MKL and in certain cases it's 10x faster at matrix multiply.
>>>>> There are a lot of layers of indirection here and you really want to avoid
>>>>> data copying as much as possible.
>>>>>
>>>>> We could also consider swapping out Breeze for BIDMat, but that would be
>>>>> a big project, and if we can figure out how to get breeze+cublas to
>>>>> comparable performance, that would be a big win.
>>>>>
>>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>>>>> [hidden email]> wrote:
>>>>> Dear Spark developers,
>>>>>
>>>>> I am exploring how to make linear algebra operations faster within Spark.
>>>>> One way of doing this is to use the Scala Breeze library that is bundled with
>>>>> Spark. For matrix operations, it employs netlib-java, which has a Java
>>>>> wrapper for BLAS (basic linear algebra subprograms) and LAPACK native
>>>>> binaries if they are available on the worker node. It also has its own
>>>>> optimized Java implementation of BLAS. It is worth mentioning that native
>>>>> binaries provide better performance only for BLAS level 3, i.e.
>>>>> matrix-matrix operations or general matrix multiplication (GEMM). This is
>>>>> confirmed by the GEMM test on the netlib-java page
>>>>> https://github.com/fommil/netlib-java. I also confirmed it with my
>>>>> experiments with training of an artificial neural network
>>>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>>>> However, I would like to boost performance more.
>>>>>
>>>>> GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA
>>>>> implementation of BLAS, called cuBLAS. I have one Linux server with an Nvidia
>>>>> GPU and I was able to do the following. I linked cublas (instead of a
>>>>> cpu-based blas) with the Netlib-java wrapper and put it into Spark, so
>>>>> Breeze/Netlib is using it. Then I did some performance measurements with
>>>>> regards to artificial neural network batch learning in Spark MLlib, which
>>>>> involves matrix-matrix multiplications. It turns out that for matrices of
>>>>> size less than ~1000x780, GPU cublas has the same speed as CPU blas, and
>>>>> cublas becomes slower for bigger matrices. It is worth mentioning that it
>>>>> was not a test of ONLY multiplication, since there are other operations
>>>>> involved. One of the reasons for the slowdown might be the overhead of
>>>>> copying the matrices from computer memory to graphics card memory and back.
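>>>>>
>>>>> (For scale: a 1000x780 matrix of doubles is only about 6 MB, so a single
>>>>> multiplication moves at most a few tens of MB across PCIe; if that really is
>>>>> the crossover point, the cost is probably dominated by per-call overhead -
>>>>> JNI array copies, cudaMalloc/cudaMemcpy latency - rather than by raw transfer
>>>>> bandwidth.)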
>>>>>
>>>>> So, a few questions:
>>>>> 1) Do these results with CUDA make sense?
>>>>> 2) If the problem is the copy overhead, are there any libraries that
>>>>> allow forcing intermediate results to stay in graphics card memory, thus
>>>>> removing the overhead?
>>>>> 3) Any other options to speed up linear algebra in Spark?
>>>>>
>>>>> Thank you, Alexander
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [hidden email]
>>>>> For additional commands, e-mail: [hidden email]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>
> --
> Best regards,
> Sam
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Using CUDA within Spark / boosting linear algebra

fommil
Don't use "big O" estimates - always measure. Estimates like that used to work
back in the days when double multiplication was the bottleneck. Today the
computation cost is effectively free on both the CPU and the GPU, and you're
seeing pure copying costs. Also, I'm dubious that cublas is doing what you
think it is. Can you link me to the source code for DGEMM?

I show all of this in my talk, with explanations; I can't stress enough how
much I recommend that you watch it if you want to understand high-performance
hardware acceleration for linear algebra :-)
On 27 Feb 2015 01:42, "Xiangrui Meng" <[hidden email]> wrote:

> The copying overhead should be quadratic in n, while the computation
> cost is cubic in n. I can understand that netlib-cublas is slower than
> netlib-openblas on small problems. But I'm surprised to see that it is
> still 20x slower on 10000x10000. I did the following on a g2.2xlarge
> instance with BIDMat:
>
> val n = 10000
>
> val f = rand(n, n)
> flip; f*f; val rf = flop
>
> flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
>
> flip; g*g; val rgg = flop
>
> The CPU version finished in 12 seconds.
> The CPU->GPU->CPU version finished in 2.2 seconds.
> The GPU version finished in 1.7 seconds.
>
> I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas
> path. But based on the result, the data copying overhead is definitely
> not as big as 20x at n = 10000.
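>
> Rough arithmetic, assuming the copies are what separate the two GPU runs: at
> n = 10000 each double matrix is n^2 * 8 bytes = 800 MB, so shipping f in and
> the product back is on the order of 1.6 GB, which matches the ~0.5 s gap
> between the CPU->GPU->CPU run (2.2 s) and the pure GPU run (1.7 s) at a few
> GB/s of PCIe bandwidth. The multiplication itself is 2n^3 = 2e12 flops.
> Copying alone cannot explain a 20x slowdown.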
>
> Best,
> Xiangrui
>
>
> On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday <[hidden email]>
> wrote:
> > I've had some email exchanges with the author of BIDMat: it does exactly
> > what you need to get the GPU benefit and writes higher level algorithms
> > entirely in the GPU kernels so that the memory stays there as long as
> > possible. The restriction with this approach is that it is only offering
> > high-level algorithms so is not a toolkit for applied mathematics
> > research and development --- but it works well as a toolkit for higher
> > level analysis (e.g. for analysts and practitioners).
> >
> > I believe BIDMat's approach is the best way to get performance out of
> > GPU hardware at the moment but I also have strong evidence to suggest
> > that the hardware will catch up and the memory transfer costs between
> > CPU/GPU will disappear meaning that there will be no need for custom GPU
> > kernel implementations. i.e. please continue to use BLAS primitives when
> > writing new algorithms and only go to the GPU for an alternative
> > optimised implementation.
> >
> > Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer
> > an API that looks like BLAS but takes pointers to special regions in the
> > GPU memory region. Somebody has written a wrapper around CUDA to create
> > a proper BLAS library, but it only gives a marginal performance gain over
> > the CPU because of the memory transfer overhead.
> >
> > This slide from my talk
> >
> >   http://fommil.github.io/scalax14/#/11/2
> >
> > says it all. X axis is matrix size, Y axis is logarithmic time to do
> > DGEMM. Black line is the "cheating" time for the GPU and the green line
> > is after copying the memory to/from the GPU memory. APUs have the
> > potential to eliminate the green line.
> >
> > Best regards,
> > Sam
> >
> >
> >
> > "Ulanov, Alexander" <[hidden email]> writes:
> >
> >> Evan, thank you for the summary. I would like to add some more
> >> observations. The GPU that I used is 2.5 times cheaper than the CPU (CPU $250
> >> vs GPU $100). They are both 3 years old. I also did a small test with modern
> >> hardware, and the new GPU, an nVidia Titan, was slightly more than 1 order of
> >> magnitude faster than an Intel E5-2650 v2 for the same tests. However, it
> >> costs as much as the CPU (~$1200). My takeaway is that GPUs are making better
> >> price/performance progress.
> >>
> >>
> >>
> >> Xiangrui, I was also surprised that BIDMat-cublas was faster than
> >> netlib-cublas, and the most reasonable explanation is that it holds the result
> >> in GPU memory, as Sam suggested. At the same time, that is OK because you can
> >> copy the result back from the GPU only when needed. However, to be sure, I am
> >> going to ask the developer of BIDMat at his upcoming talk.
> >>
> >>
> >>
> >> Best regards, Alexander
> >>
> >>
Reply | Threaded
Open this post in threaded view
|

Re: Using CUDA within Spark / boosting linear algebra

Xiangrui Meng
Hey Sam,

The running times are not "big O" estimates:

> The CPU version finished in 12 seconds.
> The CPU->GPU->CPU version finished in 2.2 seconds.
> The GPU version finished in 1.7 seconds.

I think there is something wrong with the netlib/cublas combination.
Sam already mentioned that cuBLAS doesn't implement the CPU BLAS
interfaces. I checked the CUDA doc and it seems that to use GPU BLAS
through the CPU BLAS interface we need to use NVBLAS, which intercepts
some Level 3 CPU BLAS calls (including GEMM). So we need to load
nvblas.so first and then some CPU BLAS library in JNI. I wonder
whether the setup was correct.
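
If I read the NVBLAS docs correctly, the usual wiring is to LD_PRELOAD (or put
first on the library path) libnvblas.so and to point it at a real CPU BLAS via
an nvblas.conf file (the NVBLAS_CPU_BLAS_LIB entry), so that the Level 3 calls
go to the GPU and everything else falls back to the CPU library. If the
netlib-cublas experiments were run without that interception layer, the
timings may not be measuring the GPU at all.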

Alexander, could you check whether GPU is used in the netlib-cublas
experiments? You can tell it by watching CPU/GPU usage.

Best,
Xiangrui

On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday <[hidden email]> wrote:

> Don't use "big O" estimates, always measure. It used to work back in the
> days when double multiplication was a bottleneck. The computation cost is
> effectively free on both the CPU and GPU and you're seeing pure copying
> costs. Also, I'm dubious that cublas is doing what you think it is. Can you
> link me to the source code for DGEMM?
>
> I show all of this in my talk, with explanations, I can't stress enough how
> much I recommend that you watch it if you want to understand high
> performance hardware acceleration for linear algebra :-)
>
> On 27 Feb 2015 01:42, "Xiangrui Meng" <[hidden email]> wrote:
>>
>> The copying overhead should be quadratic on n, while the computation
>> cost is cubic on n. I can understand that netlib-cublas is slower than
>> netlib-openblas on small problems. But I'm surprised to see that it is
>> still 20x slower on 10000x10000. I did the following on a g2.2xlarge
>> instance with BIDMat:
>>
>> val n = 10000
>>
>> val f = rand(n, n)
>> flip; f*f; val rf = flop
>>
>> flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
>>
>> flip; g*g; val rgg = flop
>>
>> The CPU version finished in 12 seconds.
>> The CPU->GPU->CPU version finished in 2.2 seconds.
>> The GPU version finished in 1.7 seconds.
>>
>> I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas
>> path. But based on the result, the data copying overhead is definitely
>> not as big as 20x at n = 10000.
>>
>> Best,
>> Xiangrui
>> >>>>>
>> >>>>> -----Original Message-----
>> >>>>> From: Ulanov, Alexander
>> >>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>> >>>>> To: Evan R. Sparks
>> >>>>> Cc: Joseph Bradley;
>> >>>>> [hidden email]<mailto:[hidden email]>
>> >>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>> >>>>>
>> >>>>> Thanks, Evan! It seems that ticket was marked as duplicate though
>> >>>>> the
>> >>>>> original one discusses slightly different topic. I was able to link
>> >>>>> netlib
>> >>>>> with MKL from BIDMat binaries. Indeed, MKL is statically linked
>> >>>>> inside a
>> >>>>> 60MB library.
>> >>>>>
>> >>>>> |A*B  size | BIDMat MKL | Breeze+Netlib-MKL  from BIDMat|
>> >>>>> Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas |
>> >>>>>
>> >>>>> +-----------------------------------------------------------------------+
>> >>>>> |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
>> >>>>> |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557
>> >>>>> |1,638475459 |
>> >>>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697 |445,0935211 |
>> >>>>> 1569,233228 |
>> >>>>>
>> >>>>> It turn out that pre-compiled MKL is faster than precompiled
>> >>>>> OpenBlas on
>> >>>>> my machine. Probably, I’ll add two more columns with locally
>> >>>>> compiled
>> >>>>> openblas and cuda.
>> >>>>>
>> >>>>> Alexander
>> >>>>>
>> >>>>> From: Evan R. Sparks
>> >>>>> [mailto:[hidden email]<mailto:[hidden email]>]
>> >>>>> Sent: Monday, February 09, 2015 6:06 PM
>> >>>>> To: Ulanov, Alexander
>> >>>>> Cc: Joseph Bradley;
>> >>>>> [hidden email]<mailto:[hidden email]>
>> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>>
>> >>>>> Great - perhaps we can move this discussion off-list and onto a JIRA
>> >>>>> ticket? (Here's one:
>> >>>>> https://issues.apache.org/jira/browse/SPARK-5705)
>> >>>>>
>> >>>>> It seems like this is going to be somewhat exploratory for a while
>> >>>>> (and
>> >>>>> there's probably only a handful of us who really care about fast
>> >>>>> linear
>> >>>>> algebra!)
>> >>>>>
>> >>>>> - Evan
>> >>>>>
>> >>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>> >>>>>
>> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>>
>> >>>>> wrote:
>> >>>>> Hi Evan,
>> >>>>>
>> >>>>> Thank you for explanation and useful link. I am going to build
>> >>>>> OpenBLAS,
>> >>>>> link it with Netlib-java and perform benchmark again.
>> >>>>>
>> >>>>> Do I understand correctly that BIDMat binaries contain statically
>> >>>>> linked
>> >>>>> Intel MKL BLAS? It might be the reason why I am able to run BIDMat
>> >>>>> not
>> >>>>> having MKL BLAS installed on my server. If it is true, I wonder if
>> >>>>> it is OK
>> >>>>> because Intel sells this library. Nevertheless, it seems that in my
>> >>>>> case
>> >>>>> precompiled MKL BLAS performs better than precompiled OpenBLAS given
>> >>>>> that
>> >>>>> BIDMat and Netlib-java are supposed to be on par with JNI overheads.
>> >>>>>
>> >>>>> Though, it might be interesting to link Netlib-java with Intel MKL,
>> >>>>> as
>> >>>>> you suggested. I wonder, are John Canny (BIDMat) and Sam Halliday
>> >>>>> (Netlib-java) interested to compare their libraries.
>> >>>>>
>> >>>>> Best regards, Alexander
>> >>>>>
>> >>>>> From: Evan R. Sparks
>> >>>>> [mailto:[hidden email]<mailto:[hidden email]><mailto:
>> >>>>> [hidden email]<mailto:[hidden email]>>]
>> >>>>> Sent: Friday, February 06, 2015 5:58 PM
>> >>>>>
>> >>>>> To: Ulanov, Alexander
>> >>>>> Cc: Joseph Bradley;
>> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>
>> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>>
>> >>>>> I would build OpenBLAS yourself, since good BLAS performance comes
>> >>>>> from
>> >>>>> getting cache sizes, etc. set up correctly for your particular
>> >>>>> hardware -
>> >>>>> this is often a very tricky process (see, e.g. ATLAS), but we found
>> >>>>> that on
>> >>>>> relatively modern Xeon chips, OpenBLAS builds quickly and yields
>> >>>>> performance competitive with MKL.
>> >>>>>
>> >>>>> To make sure the right library is getting used, you have to make
>> >>>>> sure
>> >>>>> it's first on the search path - export
>> >>>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
>> >>>>>
>> >>>>> For some examples of getting netlib-java setup on an ec2 node and
>> >>>>> some
>> >>>>> example benchmarking code we ran a while back, see:
>> >>>>> https://github.com/shivaram/matrix-bench
>> >>>>>
>> >>>>> In particular - build-openblas-ec2.sh shows you how to build the
>> >>>>> library
>> >>>>> and set up symlinks correctly, and scala/run-netlib.sh shows you how
>> >>>>> to get
>> >>>>> the path setup and get that library picked up by netlib-java.
>> >>>>>
>> >>>>> In this way - you could probably get cuBLAS set up to be used by
>> >>>>> netlib-java as well.
>> >>>>>
>> >>>>> - Evan
>> >>>>>
>> >>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>> >>>>>
>> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>>
>> >>>>> wrote:
>> >>>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>> >>>>> force
>> >>>>> loading the right blas? For netlib, I there are few JVM flags, such
>> >>>>> as
>> >>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so
>> >>>>> I can
>> >>>>> force it to use Java implementation. Not sure I understand how to
>> >>>>> force use
>> >>>>> a specific blas (not specific wrapper for blas).
>> >>>>>
>> >>>>> Btw. I have installed openblas (yum install openblas), so I suppose
>> >>>>> that
>> >>>>> netlib is using it.
>> >>>>>
>> >>>>> From: Evan R. Sparks
>> >>>>> [mailto:[hidden email]<mailto:[hidden email]><mailto:
>> >>>>> [hidden email]<mailto:[hidden email]>>]
>> >>>>> Sent: Friday, February 06, 2015 5:19 PM
>> >>>>> To: Ulanov, Alexander
>> >>>>> Cc: Joseph Bradley;
>> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>
>> >>>>>
>> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>>
>> >>>>> Getting breeze to pick up the right blas library is critical for
>> >>>>> performance. I recommend using OpenBLAS (or MKL, if you already have
>> >>>>> it).
>> >>>>> It might make sense to force BIDMat to use the same underlying BLAS
>> >>>>> library
>> >>>>> as well.
>> >>>>>
>> >>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>> >>>>>
>> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>>
>> >>>>> wrote:
>> >>>>> Hi Evan, Joseph
>> >>>>>
>> >>>>> I did few matrix multiplication test and BIDMat seems to be ~10x
>> >>>>> faster
>> >>>>> than netlib-java+breeze (sorry for weird table formatting):
>> >>>>>
>> >>>>> |A*B  size | BIDMat MKL | Breeze+Netlib-java
>> >>>>> native_system_linux_x86-64|
>> >>>>> Breeze+Netlib-java f2jblas |
>> >>>>>
>> >>>>> +-----------------------------------------------------------------------+
>> >>>>> |100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
>> >>>>> |1000x1000*1000x1000 | 0,018320947 | 0,51803557 |1,638475459 |
>> >>>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211 | 1569,233228 |
>> >>>>>
>> >>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>> >>>>> 19
>> >>>>> Linux, Scala 2.11.
>> >>>>>
>> >>>>> Later I will make tests with Cuda. I need to install new Cuda
>> >>>>> version for
>> >>>>> this purpose.
>> >>>>>
>> >>>>> Do you have any ideas why breeze-netlib with native blas is so much
>> >>>>> slower than BIDMat MKL?
>> >>>>>
>> >>>>> Best regards, Alexander
>> >>>>>
>> >>>>> From: Joseph Bradley
>> >>>>> [mailto:[hidden email]<mailto:[hidden email]><mailto:
>> >>>>> [hidden email]<mailto:[hidden email]>>]
>> >>>>> Sent: Thursday, February 05, 2015 5:29 PM
>> >>>>> To: Ulanov, Alexander
>> >>>>> Cc: Evan R. Sparks;
>> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>
>> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>>
>> >>>>> Hi Alexander,
>> >>>>>
>> >>>>> Using GPUs with Spark would be very exciting.  Small comment:
>> >>>>> Concerning
>> >>>>> your question earlier about keeping data stored on the GPU rather
>> >>>>> than
>> >>>>> having to move it between main memory and GPU memory on each
>> >>>>> iteration, I
>> >>>>> would guess this would be critical to getting good performance.  If
>> >>>>> you
>> >>>>> could do multiple local iterations before aggregating results, then
>> >>>>> the
>> >>>>> cost of data movement to the GPU could be amortized (and I believe
>> >>>>> that is
>> >>>>> done in practice).  Having Spark be aware of the GPU and using it as
>> >>>>> another part of memory sounds like a much bigger undertaking.
>> >>>>>
>> >>>>> Joseph
>> >>>>>
>> >>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>> >>>>>
>> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>>
>> >>>>> wrote:
>> >>>>> Thank you for explanation! I’ve watched the BIDMach presentation by
>> >>>>> John
>> >>>>> Canny and I am really inspired by his talk and comparisons with
>> >>>>> Spark MLlib.
>> >>>>>
>> >>>>> I am very interested to find out what will be better within Spark:
>> >>>>> BIDMat
>> >>>>> or netlib-java with CPU or GPU natives. Could you suggest a fair way
>> >>>>> to
>> >>>>> benchmark them? Currently I do benchmarks on artificial neural
>> >>>>> networks in
>> >>>>> batch mode. While it is not a “pure” test of linear algebra, it
>> >>>>> involves
>> >>>>> some other things that are essential to machine learning.
>> >>>>>
>> >>>>> From: Evan R. Sparks
>> >>>>> [mailto:[hidden email]<mailto:[hidden email]><mailto:
>> >>>>> [hidden email]<mailto:[hidden email]>>]
>> >>>>> Sent: Thursday, February 05, 2015 1:29 PM
>> >>>>> To: Ulanov, Alexander
>> >>>>> Cc:
>> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>
>> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>>
>> >>>>> I'd be surprised of BIDMat+OpenBLAS was significantly faster than
>> >>>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
>> >>>>> data
>> >>>>> layout and fewer levels of indirection - it's definitely a
>> >>>>> worthwhile
>> >>>>> experiment to run. The main speedups I've seen from using it come
>> >>>>> from
>> >>>>> highly optimized GPU code for linear algebra. I know that in the
>> >>>>> past Canny
>> >>>>> has gone as far as to write custom GPU kernels for
>> >>>>> performance-critical
>> >>>>> regions of code.[1]
>> >>>>>
>> >>>>> BIDMach is highly optimized for single node performance or
>> >>>>> performance on
>> >>>>> small clusters.[2] Once data doesn't fit easily in GPU memory (or
>> >>>>> can be
>> >>>>> batched in that way) the performance tends to fall off. Canny argues
>> >>>>> for
>> >>>>> hardware/software codesign and as such prefers machine
>> >>>>> configurations that
>> >>>>> are quite different than what we find in most commodity cluster
>> >>>>> nodes -
>> >>>>> e.g. 10 disk cahnnels and 4 GPUs.
>> >>>>>
>> >>>>> In contrast, MLlib was designed for horizontal scalability on
>> >>>>> commodity
>> >>>>> clusters and works best on very big datasets - order of terabytes.
>> >>>>>
>> >>>>> For the most part, these projects developed concurrently to address
>> >>>>> slightly different use cases. That said, there may be bits of
>> >>>>> BIDMach we
>> >>>>> could repurpose for MLlib - keep in mind we need to be careful about
>> >>>>> maintaining cross-language compatibility for our Java and
>> >>>>> Python-users,
>> >>>>> though.
>> >>>>>
>> >>>>> - Evan
>> >>>>>
>> >>>>> [1] - http://arxiv.org/abs/1409.5402
>> >>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>> >>>>>
>> >
>> > --
>> > Best regards,
>> > Sam
>> >

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Using CUDA within Spark / boosting linear algebra

fommil
Also, check the JNILoader output.

Remember, for netlib-java to use your system libblas, all you need to do is
set up libblas.so.3 the way any native application would expect.

I haven't ever used the cublas "real BLAS"  implementation, so I'd be
interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to check
that all the runtime links are in order.
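
(A minimal way to check this from the JVM side -- a sketch that assumes
netlib-java is on the classpath, e.g. via MLlib's breeze dependency; the
object name is just for illustration:)

import com.github.fommil.netlib.BLAS

object WhichBlas {
  // Prints the BLAS backend that the JNILoader actually bound. A
  // Native*BLAS class name means the system libblas.so.3 was picked up;
  // F2jBLAS means it fell back to the pure-Java implementation.
  def main(args: Array[String]): Unit =
    println("netlib-java BLAS backend: " + BLAS.getInstance().getClass.getName)
}

If this prints F2jBLAS even though ldd looks clean, the JNILoader output is
the place to look for why the native load was skipped.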

Btw, I have some DGEMM wrappers in my netlib-java performance module... and
I also planned to write more in MultiBLAS (until I mothballed the project to
wait for the hardware to catch up, which it probably has, and now I just
need a reason to look at it).
 On 27 Feb 2015 20:26, "Xiangrui Meng" <[hidden email]> wrote:

> Hey Sam,
>
> The running times are not "big O" estimates:
>
> > The CPU version finished in 12 seconds.
> > The CPU->GPU->CPU version finished in 2.2 seconds.
> > The GPU version finished in 1.7 seconds.
>
> I think there is something wrong with the netlib/cublas combination.
> Sam already mentioned that cuBLAS doesn't implement the CPU BLAS
> interfaces. I checked the CUDA doc and it seems that to use GPU BLAS
> through the CPU BLAS interface we need to use NVBLAS, which intercepts
> some Level 3 CPU BLAS calls (including GEMM). So we need to load
> nvblas.so first and then some CPU BLAS library in JNI. I wonder
> whether the setup was correct.
>
> Alexander, could you check whether GPU is used in the netlib-cublas
> experiments? You can tell it by watching CPU/GPU usage.
>
> Best,
> Xiangrui
>
Reply | Threaded
Open this post in threaded view
|

Re: Using CUDA within Spark / boosting linear algebra

Xiangrui Meng
On Fri, Feb 27, 2015 at 12:33 PM, Sam Halliday <[hidden email]> wrote:
> Also, check the JNILoader output.
>
> Remember, for netlib-java to use your system libblas, all you need to do is
> set up libblas.so.3 the way any native application would expect.
>
> I haven't ever used the cublas "real BLAS"  implementation, so I'd be
> interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to check
> that all the runtime links are in order.
>

There are two shared libraries in this hybrid setup: nvblas.so must be
loaded before libblas.so so that it can intercept the Level 3 routines and
run them on the GPU. More details are at:
http://docs.nvidia.com/cuda/nvblas/index.html#Usage
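
(A rough way to check that from the JVM -- a sketch, not the exact setup
used in this thread: it assumes netlib-java is on the classpath and that
nvblas.so plus a CPU BLAS are preloaded as described in the NVBLAS doc
above. Run it while watching nvidia-smi; if NVBLAS is intercepting the
Level 3 calls, the GPU should light up during the multiply, otherwise the
work never left the CPU BLAS.)

import com.github.fommil.netlib.BLAS

object GemmProbe {
  def main(args: Array[String]): Unit = {
    val n = 2048                        // large enough that GPU use is visible
    val rnd = new scala.util.Random(42)
    val a = Array.fill(n * n)(rnd.nextDouble())
    val b = Array.fill(n * n)(rnd.nextDouble())
    val c = new Array[Double](n * n)
    val start = System.nanoTime()
    // column-major GEMM: C = 1.0 * A * B + 0.0 * C
    BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
    val secs = (System.nanoTime() - start) / 1e9
    println(f"${n}x${n} dgemm: $secs%.2f s via " +
      BLAS.getInstance().getClass.getSimpleName)
  }
}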

> Btw, I have some DGEMM wrappers in my netlib-java performance module... and
> I also planned to write more in MultiBLAS (until I mothballed the project to
> wait for the hardware to catch up, which it probably has, and now I just
> need a reason to look at it).
>
> On 27 Feb 2015 20:26, "Xiangrui Meng" <[hidden email]> wrote:
>>
>> Hey Sam,
>>
>> The running times are not "big O" estimates:
>>
>> > The CPU version finished in 12 seconds.
>> > The CPU->GPU->CPU version finished in 2.2 seconds.
>> > The GPU version finished in 1.7 seconds.
>>
>> I think there is something wrong with the netlib/cublas combination.
>> Sam already mentioned that cuBLAS doesn't implement the CPU BLAS
>> interfaces. I checked the CUDA doc and it seems that to use GPU BLAS
>> through the CPU BLAS interface we need to use NVBLAS, which intercepts
>> some Level 3 CPU BLAS calls (including GEMM). So we need to load
>> nvblas.so first and then some CPU BLAS library in JNI. I wonder
>> whether the setup was correct.
>>
>> Alexander, could you check whether GPU is used in the netlib-cublas
>> experiments? You can tell it by watching CPU/GPU usage.
>>
>> Best,
>> Xiangrui
>>
>> On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday <[hidden email]>
>> wrote:
>> > Don't use "big O" estimates, always measure. It used to work back in the
>> > days when double multiplication was a bottleneck. The computation cost
>> > is
>> > effectively free on both the CPU and GPU and you're seeing pure copying
>> > costs. Also, I'm dubious that cublas is doing what you think it is. Can
>> > you
>> > link me to the source code for DGEMM?
>> >
>> > I show all of this in my talk, with explanations, I can't stress enough
>> > how
>> > much I recommend that you watch it if you want to understand high
>> > performance hardware acceleration for linear algebra :-)
>> >
>> > On 27 Feb 2015 01:42, "Xiangrui Meng" <[hidden email]> wrote:
>> >>
>> >> The copying overhead should be quadratic on n, while the computation
>> >> cost is cubic on n. I can understand that netlib-cublas is slower than
>> >> netlib-openblas on small problems. But I'm surprised to see that it is
>> >> still 20x slower on 10000x10000. I did the following on a g2.2xlarge
>> >> instance with BIDMat:
>> >>
>> >> val n = 10000
>> >>
>> >> val f = rand(n, n)
>> >> flip; f*f; val rf = flop
>> >>
>> >> flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg =
>> >> flop
>> >>
>> >> flip; g*g; val rgg = flop
>> >>
>> >> The CPU version finished in 12 seconds.
>> >> The CPU->GPU->CPU version finished in 2.2 seconds.
>> >> The GPU version finished in 1.7 seconds.
>> >>
>> >> I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas
>> >> path. But based on the result, the data copying overhead is definitely
>> >> not as big as 20x at n = 10000.
>> >>
>> >> Best,
>> >> Xiangrui
>> >>
>> >>
>> >> On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday <[hidden email]>
>> >> wrote:
>> >> > I've had some email exchanges with the author of BIDMat: it does
>> >> > exactly
>> >> > what you need to get the GPU benefit and writes higher level
>> >> > algorithms
>> >> > entirely in the GPU kernels so that the memory stays there as long as
>> >> > possible. The restriction with this approach is that it is only
>> >> > offering
>> >> > high-level algorithms so is not a toolkit for applied mathematics
>> >> > research and development --- but it works well as a toolkit for
>> >> > higher
>> >> > level analysis (e.g. for analysts and practitioners).
>> >> >
>> >> > I believe BIDMat's approach is the best way to get performance out of
>> >> > GPU hardware at the moment but I also have strong evidence to suggest
>> >> > that the hardware will catch up and the memory transfer costs between
>> >> > CPU/GPU will disappear meaning that there will be no need for custom
>> >> > GPU
>> >> > kernel implementations. i.e. please continue to use BLAS primitives
>> >> > when
>> >> > writing new algorithms and only go to the GPU for an alternative
>> >> > optimised implementation.
>> >> >
>> >> > Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and
>> >> > offer
>> >> > an API that looks like BLAS but takes pointers to special regions in
>> >> > the
>> >> > GPU memory region. Somebody has written a wrapper around CUDA to
>> >> > create
>> >> > a proper BLAS library but it only gives marginal performance over the
>> >> > CPU because of the memory transfer overhead.
>> >> >
>> >> > This slide from my talk
>> >> >
>> >> >   http://fommil.github.io/scalax14/#/11/2
>> >> >
>> >> > says it all. X axis is matrix size, Y axis is logarithmic time to do
>> >> > DGEMM. Black line is the "cheating" time for the GPU and the green
>> >> > line
>> >> > is after copying the memory to/from the GPU memory. APUs have the
>> >> > potential to eliminate the green line.
>> >> >
>> >> > Best regards,
>> >> > Sam
>> >> >
>> >> >
>> >> >
>> >> > "Ulanov, Alexander" <[hidden email]> writes:
>> >> >
>> >> >> Evan, thank you for the summary. I would like to add some more
>> >> >> observations. The GPU that I used is 2.5 times cheaper than the CPU
>> >> >> ($250 vs
>> >> >> $100). They both are 3 years old. I also did a small test with
>> >> >> modern
>> >> >> hardware, and the new GPU nVidia Titan was slightly more than 1
>> >> >> order of
>> >> >> magnitude faster than Intel E5-2650 v2 for the same tests. However,
>> >> >> it costs
>> >> >> as much as CPU ($1200). My takeaway is that GPU is making a better
>> >> >> price/value progress.
>> >> >>
>> >> >>
>> >> >>
>> >> >> Xiangrui, I was also surprised that BIDMat-cuda was faster than
>> >> >> netlib-cuda and the most reasonable explanation is that it holds the
>> >> >> result
>> >> >> in GPU memory, as Sam suggested. At the same time, it is OK because
>> >> >> you can
>> >> >> copy the result back from GPU only when needed. However, to be sure,
>> >> >> I am
>> >> >> going to ask the developer of BIDMat on his upcoming talk.
>> >> >>
>> >> >>
>> >> >>
>> >> >> Best regards, Alexander
>> >> >>
>> >> >>
>> >> >> From: Sam Halliday [mailto:[hidden email]]
>> >> >> Sent: Thursday, February 26, 2015 1:56 PM
>> >> >> To: Xiangrui Meng
>> >> >> Cc: [hidden email]; Joseph Bradley; Ulanov, Alexander; Evan R.
>> >> >> Sparks
>> >> >> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>
>> >> >>
>> >> >> Btw, I wish people would stop cheating when comparing CPU and GPU
>> >> >> timings for things like matrix multiply :-P
>> >> >>
>> >> >> Please always compare apples with apples and include the time it
>> >> >> takes
>> >> >> to set up the matrices, send it to the processing unit, doing the
>> >> >> calculation AND copying it back to where you need to see the
>> >> >> results.
>> >> >>
>> >> >> Ignoring this method will make you believe that your GPU is
>> >> >> thousands
>> >> >> of times faster than it really is. Again, jump to the end of my talk
>> >> >> for
>> >> >> graphs and more discussion....  especially the bit about me being
>> >> >> keen on
>> >> >> funding to investigate APU hardware further ;-) (I believe it will
>> >> >> solve the
>> >> >> problem)
>> >> >> On 26 Feb 2015 21:16, "Xiangrui Meng"
>> >> >> <[hidden email]<mailto:[hidden email]>> wrote:
>> >> >> Hey Alexander,
>> >> >>
>> >> >> I don't quite understand the part where netlib-cublas is about 20x
>> >> >> slower than netlib-openblas. What is the overhead of using a GPU
>> >> >> BLAS
>> >> >> with netlib-java?
>> >> >>
>> >> >> CC'ed Sam, the author of netlib-java.
>> >> >>
>> >> >> Best,
>> >> >> Xiangrui
>> >> >>
>> >> >> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley
>> >> >> <[hidden email]<mailto:[hidden email]>> wrote:
>> >> >>> Better documentation for linking would be very helpful!  Here's a
>> >> >>> JIRA:
>> >> >>> https://issues.apache.org/jira/browse/SPARK-6019
>> >> >>>
>> >> >>>
>> >> >>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
>> >> >>> <[hidden email]<mailto:[hidden email]>>
>> >> >>> wrote:
>> >> >>>
>> >> >>>> Thanks for compiling all the data and running these benchmarks,
>> >> >>>> Alex.
>> >> >>>> The
>> >> >>>> big takeaways here can be seen with this chart:
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>> >> >>>>
>> >> >>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>> >> >>>> BIDMat+GPU) can provide substantial (but less than an order of
>> >> >>>> magnitude)
>> >> >>>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
>> >> >>>> netlib-java+openblas-compiled).
>> >> >>>> 2) A poorly tuned CPU implementation can be 1-2 orders of
>> >> >>>> magnitude
>> >> >>>> worse
>> >> >>>> than a well-tuned CPU implementation, particularly for larger
>> >> >>>> matrices.
>> >> >>>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib -
>> >> >>>> this
>> >> >>>> basically agrees with the author's own benchmarks (
>> >> >>>> https://github.com/fommil/netlib-java)
>> >> >>>>
>> >> >>>> I think that most of our users are in a situation where using GPUs
>> >> >>>> may not
>> >> >>>> be practical - although we could consider having a good GPU
>> >> >>>> backend
>> >> >>>> available as an option. However, *ALL* users of MLlib could
>> >> >>>> benefit
>> >> >>>> (potentially tremendously) from using a well-tuned CPU-based BLAS
>> >> >>>> implementation. Perhaps we should consider updating the mllib
>> >> >>>> guide
>> >> >>>> with a
>> >> >>>> more complete section for enabling high performance binaries on
>> >> >>>> OSX
>> >> >>>> and
>> >> >>>> Linux? Or better, figure out a way for the system to fetch these
>> >> >>>> automatically.
>> >> >>>>
>> >> >>>> - Evan
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>> >> >>>> [hidden email]<mailto:[hidden email]>> wrote:
>> >> >>>>
>> >> >>>>> Just to summarize this thread, I was finally able to make all
>> >> >>>>> performance
>> >> >>>>> comparisons that we discussed. It turns out that:
>> >> >>>>> BIDMat-cublas>>BIDMat
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> MKL==netlib-mkl==netlib-openblas-compiled>netlib-openblas-yum-repo==netlib-cublas>netlib-blas>f2jblas
>> >> >>>>>
>> >> >>>>> Below is the link to the spreadsheet with full results.
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>> >> >>>>>
>> >> >>>>> One thing still needs exploration: does BIDMat-cublas perform
>> >> >>>>> copying
>> >> >>>>> to/from machine’s RAM?
>> >> >>>>>
>> >> >>>>> -----Original Message-----
>> >> >>>>> From: Ulanov, Alexander
>> >> >>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>> >> >>>>> To: Evan R. Sparks
>> >> >>>>> Cc: Joseph Bradley;
>> >> >>>>> [hidden email]<mailto:[hidden email]>
>> >> >>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>> >> >>>>>
>> >> >>>>> Thanks, Evan! It seems that ticket was marked as duplicate though
>> >> >>>>> the
>> >> >>>>> original one discusses slightly different topic. I was able to
>> >> >>>>> link
>> >> >>>>> netlib
>> >> >>>>> with MKL from BIDMat binaries. Indeed, MKL is statically linked
>> >> >>>>> inside a
>> >> >>>>> 60MB library.
>> >> >>>>>
>> >> >>>>> |A*B size | BIDMat MKL | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>> >> >>>>> +-----------------------------------------------------------------------+
>> >> >>>>> |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
>> >> >>>>> |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557 | 1,638475459 |
>> >> >>>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697 | 445,0935211 | 1569,233228 |
>> >> >>>>>
>> >> >>>>> It turns out that pre-compiled MKL is faster than precompiled
>> >> >>>>> OpenBlas on
>> >> >>>>> my machine. Probably, I’ll add two more columns with locally
>> >> >>>>> compiled
>> >> >>>>> openblas and cuda.
>> >> >>>>>
>> >> >>>>> Alexander
>> >> >>>>>
>> >> >>>>> From: Evan R. Sparks
>> >> >>>>> [mailto:[hidden email]<mailto:[hidden email]>]
>> >> >>>>> Sent: Monday, February 09, 2015 6:06 PM
>> >> >>>>> To: Ulanov, Alexander
>> >> >>>>> Cc: Joseph Bradley;
>> >> >>>>> [hidden email]<mailto:[hidden email]>
>> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>>>>
>> >> >>>>> Great - perhaps we can move this discussion off-list and onto a
>> >> >>>>> JIRA
>> >> >>>>> ticket? (Here's one:
>> >> >>>>> https://issues.apache.org/jira/browse/SPARK-5705)
>> >> >>>>>
>> >> >>>>> It seems like this is going to be somewhat exploratory for a
>> >> >>>>> while
>> >> >>>>> (and
>> >> >>>>> there's probably only a handful of us who really care about fast
>> >> >>>>> linear
>> >> >>>>> algebra!)
>> >> >>>>>
>> >> >>>>> - Evan
>> >> >>>>>
>> >> >>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <[hidden email]> wrote:
>> >> >>>>> Hi Evan,
>> >> >>>>>
>> >> >>>>> Thank you for explanation and useful link. I am going to build
>> >> >>>>> OpenBLAS,
>> >> >>>>> link it with Netlib-java and perform benchmark again.
>> >> >>>>>
>> >> >>>>> Do I understand correctly that BIDMat binaries contain statically
>> >> >>>>> linked
>> >> >>>>> Intel MKL BLAS? It might be the reason why I am able to run
>> >> >>>>> BIDMat
>> >> >>>>> not
>> >> >>>>> having MKL BLAS installed on my server. If it is true, I wonder
>> >> >>>>> if
>> >> >>>>> it is OK
>> >> >>>>> because Intel sells this library. Nevertheless, it seems that in
>> >> >>>>> my
>> >> >>>>> case
>> >> >>>>> precompiled MKL BLAS performs better than precompiled OpenBLAS
>> >> >>>>> given
>> >> >>>>> that
>> >> >>>>> BIDMat and Netlib-java are supposed to be on par with JNI
>> >> >>>>> overheads.
>> >> >>>>>
>> >> >>>>> Though, it might be interesting to link Netlib-java with Intel
>> >> >>>>> MKL,
>> >> >>>>> as
>> >> >>>>> you suggested. I wonder, are John Canny (BIDMat) and Sam Halliday
>> >> >>>>> (Netlib-java) interested to compare their libraries.
>> >> >>>>>
>> >> >>>>> Best regards, Alexander
>> >> >>>>>
>> >> >>>>> From: Evan R. Sparks [mailto:[hidden email]]
>> >> >>>>> Sent: Friday, February 06, 2015 5:58 PM
>> >> >>>>>
>> >> >>>>> To: Ulanov, Alexander
>> >> >>>>> Cc: Joseph Bradley; [hidden email]
>> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>>>>
>> >> >>>>> I would build OpenBLAS yourself, since good BLAS performance
>> >> >>>>> comes
>> >> >>>>> from
>> >> >>>>> getting cache sizes, etc. set up correctly for your particular
>> >> >>>>> hardware -
>> >> >>>>> this is often a very tricky process (see, e.g. ATLAS), but we
>> >> >>>>> found
>> >> >>>>> that on
>> >> >>>>> relatively modern Xeon chips, OpenBLAS builds quickly and yields
>> >> >>>>> performance competitive with MKL.
>> >> >>>>>
>> >> >>>>> To make sure the right library is getting used, you have to make
>> >> >>>>> sure
>> >> >>>>> it's first on the search path - export
>> >> >>>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
>> >> >>>>>
>> >> >>>>> For some examples of getting netlib-java setup on an ec2 node and
>> >> >>>>> some
>> >> >>>>> example benchmarking code we ran a while back, see:
>> >> >>>>> https://github.com/shivaram/matrix-bench
>> >> >>>>>
>> >> >>>>> In particular - build-openblas-ec2.sh shows you how to build the
>> >> >>>>> library
>> >> >>>>> and set up symlinks correctly, and scala/run-netlib.sh shows you
>> >> >>>>> how
>> >> >>>>> to get
>> >> >>>>> the path setup and get that library picked up by netlib-java.
>> >> >>>>>
>> >> >>>>> In this way - you could probably get cuBLAS set up to be used by
>> >> >>>>> netlib-java as well.
>> >> >>>>>
>> >> >>>>> - Evan
>> >> >>>>>
>> >> >>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <[hidden email]> wrote:
>> >> >>>>> Evan, could you elaborate on how to force BIDMat and netlib-java
>> >> >>>>> to
>> >> >>>>> force
>> >> >>>>> loading the right blas? For netlib, I there are few JVM flags,
>> >> >>>>> such
>> >> >>>>> as
>> >> >>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>> >> >>>>> so
>> >> >>>>> I can
>> >> >>>>> force it to use Java implementation. Not sure I understand how to
>> >> >>>>> force use
>> >> >>>>> a specific blas (not specific wrapper for blas).
>> >> >>>>>
>> >> >>>>> Btw. I have installed openblas (yum install openblas), so I
>> >> >>>>> suppose
>> >> >>>>> that
>> >> >>>>> netlib is using it.
>> >> >>>>>
>> >> >>>>> From: Evan R. Sparks [mailto:[hidden email]]
>> >> >>>>> Sent: Friday, February 06, 2015 5:19 PM
>> >> >>>>> To: Ulanov, Alexander
>> >> >>>>> Cc: Joseph Bradley; [hidden email]
>> >> >>>>>
>> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>>>>
>> >> >>>>> Getting breeze to pick up the right blas library is critical for
>> >> >>>>> performance. I recommend using OpenBLAS (or MKL, if you already
>> >> >>>>> have
>> >> >>>>> it).
>> >> >>>>> It might make sense to force BIDMat to use the same underlying
>> >> >>>>> BLAS
>> >> >>>>> library
>> >> >>>>> as well.
>> >> >>>>>
>> >> >>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <[hidden email]> wrote:
>> >> >>>>> Hi Evan, Joseph
>> >> >>>>>
>> >> >>>>> I did few matrix multiplication test and BIDMat seems to be ~10x
>> >> >>>>> faster
>> >> >>>>> than netlib-java+breeze (sorry for weird table formatting):
>> >> >>>>>
>> >> >>>>> |A*B size | BIDMat MKL | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>> >> >>>>> +-----------------------------------------------------------------------+
>> >> >>>>> |100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
>> >> >>>>> |1000x1000*1000x1000 | 0,018320947 | 0,51803557 | 1,638475459 |
>> >> >>>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211 | 1569,233228 |
>> >> >>>>>
>> >> >>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM,
>> >> >>>>> Fedora
>> >> >>>>> 19
>> >> >>>>> Linux, Scala 2.11.
>> >> >>>>>
>> >> >>>>> Later I will make tests with Cuda. I need to install new Cuda
>> >> >>>>> version for
>> >> >>>>> this purpose.
>> >> >>>>>
>> >> >>>>> Do you have any ideas why breeze-netlib with native blas is so
>> >> >>>>> much
>> >> >>>>> slower than BIDMat MKL?
>> >> >>>>>
>> >> >>>>> Best regards, Alexander
>> >> >>>>>
>> >> >>>>> From: Joseph Bradley [mailto:[hidden email]]
>> >> >>>>> Sent: Thursday, February 05, 2015 5:29 PM
>> >> >>>>> To: Ulanov, Alexander
>> >> >>>>> Cc: Evan R. Sparks; [hidden email]
>> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>>>>
>> >> >>>>> Hi Alexander,
>> >> >>>>>
>> >> >>>>> Using GPUs with Spark would be very exciting.  Small comment:
>> >> >>>>> Concerning
>> >> >>>>> your question earlier about keeping data stored on the GPU rather
>> >> >>>>> than
>> >> >>>>> having to move it between main memory and GPU memory on each
>> >> >>>>> iteration, I
>> >> >>>>> would guess this would be critical to getting good performance.
>> >> >>>>> If
>> >> >>>>> you
>> >> >>>>> could do multiple local iterations before aggregating results,
>> >> >>>>> then
>> >> >>>>> the
>> >> >>>>> cost of data movement to the GPU could be amortized (and I
>> >> >>>>> believe
>> >> >>>>> that is
>> >> >>>>> done in practice).  Having Spark be aware of the GPU and using it
>> >> >>>>> as
>> >> >>>>> another part of memory sounds like a much bigger undertaking.
>> >> >>>>>
>> >> >>>>> Joseph
>> >> >>>>>
>> >> >>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <[hidden email]> wrote:
>> >> >>>>> Thank you for explanation! I’ve watched the BIDMach presentation
>> >> >>>>> by
>> >> >>>>> John
>> >> >>>>> Canny and I am really inspired by his talk and comparisons with
>> >> >>>>> Spark MLlib.
>> >> >>>>>
>> >> >>>>> I am very interested to find out what will be better within
>> >> >>>>> Spark:
>> >> >>>>> BIDMat
>> >> >>>>> or netlib-java with CPU or GPU natives. Could you suggest a fair
>> >> >>>>> way
>> >> >>>>> to
>> >> >>>>> benchmark them? Currently I do benchmarks on artificial neural
>> >> >>>>> networks in
>> >> >>>>> batch mode. While it is not a “pure” test of linear algebra, it
>> >> >>>>> involves
>> >> >>>>> some other things that are essential to machine learning.
>> >> >>>>>
>> >> >>>>> From: Evan R. Sparks [mailto:[hidden email]]
>> >> >>>>> Sent: Thursday, February 05, 2015 1:29 PM
>> >> >>>>> To: Ulanov, Alexander
>> >> >>>>> Cc: [hidden email]
>> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>>>>
>> >> >>>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>> >> >>>>> netlib-java+OpenBLAS, but if it is much faster it's probably due
>> >> >>>>> to
>> >> >>>>> data
>> >> >>>>> layout and fewer levels of indirection - it's definitely a
>> >> >>>>> worthwhile
>> >> >>>>> experiment to run. The main speedups I've seen from using it come
>> >> >>>>> from
>> >> >>>>> highly optimized GPU code for linear algebra. I know that in the
>> >> >>>>> past Canny
>> >> >>>>> has gone as far as to write custom GPU kernels for
>> >> >>>>> performance-critical
>> >> >>>>> regions of code.[1]
>> >> >>>>>
>> >> >>>>> BIDMach is highly optimized for single node performance or
>> >> >>>>> performance on
>> >> >>>>> small clusters.[2] Once data doesn't fit easily in GPU memory (or
>> >> >>>>> can be
>> >> >>>>> batched in that way) the performance tends to fall off. Canny
>> >> >>>>> argues
>> >> >>>>> for
>> >> >>>>> hardware/software codesign and as such prefers machine
>> >> >>>>> configurations that
>> >> >>>>> are quite different than what we find in most commodity cluster
>> >> >>>>> nodes -
>> >> >>>>> e.g. 10 disk cahnnels and 4 GPUs.
>> >> >>>>>
>> >> >>>>> In contrast, MLlib was designed for horizontal scalability on
>> >> >>>>> commodity
>> >> >>>>> clusters and works best on very big datasets - order of
>> >> >>>>> terabytes.
>> >> >>>>>
>> >> >>>>> For the most part, these projects developed concurrently to
>> >> >>>>> address
>> >> >>>>> slightly different use cases. That said, there may be bits of
>> >> >>>>> BIDMach we
>> >> >>>>> could repurpose for MLlib - keep in mind we need to be careful
>> >> >>>>> about
>> >> >>>>> maintaining cross-language compatibility for our Java and
>> >> >>>>> Python-users,
>> >> >>>>> though.
>> >> >>>>>
>> >> >>>>> - Evan
>> >> >>>>>
>> >> >>>>> [1] - http://arxiv.org/abs/1409.5402
>> >> >>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>> >> >>>>>
>> >> >>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>><mailto:
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>>>
>> >> >>>>> wrote:
>> >> >>>>> Hi Evan,
>> >> >>>>>
>> >> >>>>> Thank you for suggestion! BIDMat seems to have terrific speed. Do
>> >> >>>>> you
>> >> >>>>> know what makes them faster than netlib-java?
>> >> >>>>>
>> >> >>>>> The same group has BIDMach library that implements machine
>> >> >>>>> learning.
>> >> >>>>> For
>> >> >>>>> some examples they use Caffe convolutional neural network library
>> >> >>>>> owned by
>> >> >>>>> another group in Berkeley. Could you elaborate on how these all
>> >> >>>>> might be
>> >> >>>>> connected with Spark Mllib? If you take BIDMat for linear algebra
>> >> >>>>> why don’t
>> >> >>>>> you take BIDMach for optimization and learning?
>> >> >>>>>
>> >> >>>>> Best regards, Alexander
>> >> >>>>>
>> >> >>>>> From: Evan R. Sparks
>> >> >>>>>
>> >> >>>>> [mailto:[hidden email]<mailto:[hidden email]><mailto:
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> [hidden email]<mailto:[hidden email]>><mailto:[hidden email]<mailto:[hidden email]><mailto:
>> >> >>>>> [hidden email]<mailto:[hidden email]>>>]
>> >> >>>>> Sent: Thursday, February 05, 2015 12:09 PM
>> >> >>>>> To: Ulanov, Alexander
>> >> >>>>> Cc:
>> >> >>>>>
>> >> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>><mailto:
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>>
>> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>>>>
>> >> >>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>> >> >>>>> blas in
>> >> >>>>> many cases.
>> >> >>>>>
>> >> >>>>> You might consider taking a look at the codepaths that BIDMat (
>> >> >>>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>> >> >>>>> netlib-java/breeze. John Canny et. al. have done a bunch of work
>> >> >>>>> optimizing
>> >> >>>>> to make this work really fast from Scala. I've run it on my
>> >> >>>>> laptop
>> >> >>>>> and
>> >> >>>>> compared to MKL and in certain cases it's 10x faster at matrix
>> >> >>>>> multiply.
>> >> >>>>> There are a lot of layers of indirection here and you really want
>> >> >>>>> to
>> >> >>>>> avoid
>> >> >>>>> data copying as much as possible.
>> >> >>>>>
>> >> >>>>> We could also consider swapping out BIDMat for Breeze, but that
>> >> >>>>> would be
>> >> >>>>> a big project and if we can figure out how to get breeze+cublas
>> >> >>>>> to
>> >> >>>>> comparable performance that would be a big win.
>> >> >>>>>
>> >> >>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>><mailto:
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>>>
>> >> >>>>> wrote:
>> >> >>>>> Dear Spark developers,
>> >> >>>>>
>> >> >>>>> I am exploring how to make linear algebra operations faster
>> >> >>>>> within
>> >> >>>>> Spark.
>> >> >>>>> One way of doing this is to use Scala Breeze library that is
>> >> >>>>> bundled
>> >> >>>>> with
>> >> >>>>> Spark. For matrix operations, it employs Netlib-java that has a
>> >> >>>>> Java
>> >> >>>>> wrapper for BLAS (basic linear algebra subprograms) and LAPACK
>> >> >>>>> native
>> >> >>>>> binaries if they are available on the worker node. It also has
>> >> >>>>> its
>> >> >>>>> own
>> >> >>>>> optimized Java implementation of BLAS. It is worth mentioning,
>> >> >>>>> that
>> >> >>>>> native
>> >> >>>>> binaries provide better performance only for BLAS level 3, i.e.
>> >> >>>>> matrix-matrix operations or general matrix multiplication (GEMM).
>> >> >>>>> This is
>> >> >>>>> confirmed by GEMM test on Netlib-java page
>> >> >>>>> https://github.com/fommil/netlib-java. I also confirmed it with
>> >> >>>>> my
>> >> >>>>> experiments with training of artificial neural network
>> >> >>>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>> >> >>>>> However, I would like to boost performance more.
>> >> >>>>>
>> >> >>>>> GPU is supposed to work fast with linear algebra and there is
>> >> >>>>> Nvidia
>> >> >>>>> CUDA
>> >> >>>>> implementation of BLAS, called cublas. I have one Linux server
>> >> >>>>> with
>> >> >>>>> Nvidia
>> >> >>>>> GPU and I was able to do the following. I linked cublas (instead
>> >> >>>>> of
>> >> >>>>> cpu-based blas) with Netlib-java wrapper and put it into Spark,
>> >> >>>>> so
>> >> >>>>> Breeze/Netlib is using it. Then I did some performance
>> >> >>>>> measurements
>> >> >>>>> with
>> >> >>>>> regards to artificial neural network batch learning in Spark
>> >> >>>>> MLlib
>> >> >>>>> that
>> >> >>>>> involves matrix-matrix multiplications. It turns out that for
>> >> >>>>> matrices of
>> >> >>>>> size less than ~1000x780 GPU cublas has the same speed as CPU
>> >> >>>>> blas.
>> >> >>>>> Cublas
>> >> >>>>> becomes slower for bigger matrices. It worth mentioning that it
>> >> >>>>> is
>> >> >>>>> was not
>> >> >>>>> a test for ONLY multiplication since there are other operations
>> >> >>>>> involved.
>> >> >>>>> One of the reasons for slowdown might be the overhead of copying
>> >> >>>>> the
>> >> >>>>> matrices from computer memory to graphic card memory and back.
>> >> >>>>>
>> >> >>>>> So, few questions:
>> >> >>>>> 1) Do these results with CUDA make sense?
>> >> >>>>> 2) If the problem is with copy overhead, are there any libraries
>> >> >>>>> that
>> >> >>>>> allow to force intermediate results to stay in graphic card
>> >> >>>>> memory
>> >> >>>>> thus
>> >> >>>>> removing the overhead?
>> >> >>>>> 3) Any other options to speed-up linear algebra in Spark?
>> >> >>>>>
>> >> >>>>> Thank you, Alexander
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> ---------------------------------------------------------------------
>> >> >>>>> To unsubscribe, e-mail:
>> >> >>>>>
>> >> >>>>> [hidden email]<mailto:[hidden email]><mailto:
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> [hidden email]<mailto:[hidden email]>><mailto:[hidden email]<mailto:[hidden email]>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> <mailto:[hidden email]<mailto:[hidden email]>>>
>> >> >>>>> For additional commands, e-mail:
>> >> >>>>>
>> >> >>>>> [hidden email]<mailto:[hidden email]><mailto:
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> [hidden email]<mailto:[hidden email]>><mailto:[hidden email]<mailto:[hidden email]><mailto:
>> >> >>>>> [hidden email]<mailto:[hidden email]>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>
>> >> >
>> >> > --
>> >> > Best regards,
>> >> > Sam
>> >> >



RE: Using CUDA within Spark / boosting linear algebra

Ulanov, Alexander
Hi Xiangrui,

Thanks for the link; I am currently trying to use nvblas. It seems that the netlib-java native wrappers are implemented against the C-BLAS interface, and nvblas does not provide C-BLAS, so I wonder how this is going to work. I'll keep you updated.

Alexander
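
(A minimal sketch of how one might verify which BLAS backend netlib-java actually loaded before benchmarking; the only API assumed here is netlib-java's com.github.fommil.netlib.BLAS, and the object name, matrix size and timing are purely illustrative.)

import com.github.fommil.netlib.BLAS

object BlasCheck {
  def main(args: Array[String]): Unit = {
    val blas = BLAS.getInstance()
    // Prints e.g. ...NativeSystemBLAS (the system libblas.so.3) or ...F2jBLAS (pure-Java fallback)
    println(s"netlib-java backend: ${blas.getClass.getName}")

    val n = 2000
    val a = Array.fill(n * n)(math.random)   // column-major n x n operands
    val b = Array.fill(n * n)(math.random)
    val c = new Array[Double](n * n)

    val t0 = System.nanoTime()
    blas.dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)  // C := 1.0 * A * B + 0.0 * C
    println(f"dgemm ${n}x${n} took ${(System.nanoTime() - t0) / 1e9}%.2f s")
  }
}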

-----Original Message-----
From: Xiangrui Meng [mailto:[hidden email]]
Sent: Monday, March 02, 2015 11:42 AM
To: Sam Halliday
Cc: Joseph Bradley; Ulanov, Alexander; dev; Evan R. Sparks
Subject: Re: Using CUDA within Spark / boosting linear algebra

On Fri, Feb 27, 2015 at 12:33 PM, Sam Halliday <[hidden email]> wrote:
> Also, check the JNILoader output.
>
> Remember, for netlib-java to use your system libblas all you need to
> do is set up libblas.so.3 like any native application would expect.
>
> I haven't ever used the cublas "real BLAS" implementation, so I'd be
> interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to
> check that all the runtime links are in order.
>

There are two shared libraries in this hybrid setup: nvblas.so must be loaded before libblas.so so that the Level 3 routines are intercepted and run on the GPU. More details are at: http://docs.nvidia.com/cuda/nvblas/index.html#Usage
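
(To make that loading order concrete, a sketch only: the config directives and environment variables below come from my reading of the NVBLAS usage page linked above and are not verified in this thread; the Scala part simply issues a large Level 3 call through netlib-java so you can watch CPU vs. GPU usage and see whether the interception is actually happening.)

// Assumed setup, shown as comments because it is taken from the NVBLAS docs, not from this thread:
//   nvblas.conf:
//     NVBLAS_CPU_BLAS_LIB  /usr/lib64/libopenblas.so   (CPU BLAS used for everything NVBLAS does not intercept)
//     NVBLAS_GPU_LIST      ALL
//   launch with nvblas loaded ahead of the CPU libblas:
//     NVBLAS_CONFIG_FILE=/path/to/nvblas.conf LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so <java or spark-submit command>
import com.github.fommil.netlib.BLAS

object NvblasProbe {
  def main(args: Array[String]): Unit = {
    val n = 4000
    val a = Array.fill(n * n)(math.random)
    val b = Array.fill(n * n)(math.random)
    val c = new Array[Double](n * n)
    // A Level 3 BLAS call: if interception works, this should show up as GPU load rather than CPU load
    BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
  }
}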

> Btw, I have some DGEMM wrappers in my netlib-java performance
> module... and I also planned to write more in MultiBLAS (until I
> mothballed the project for the hardware to catch up, which it probably
> has, and now I just need a reason to look at it)


RE: Using CUDA within Spark / boosting linear algebra

Ulanov, Alexander
Thanks, Sam, for the suggestion! I should try doing this. I now suppose that netlib-java linked with cuBLAS falls back at run time to the CBLAS library on my system, which is ATLAS. If I remove ATLAS, netlib (linked with cuBLAS) fails with the message "undefined symbol: cblas_dgemm".

In the meantime, I have updated my spreadsheet with BIDMat-cuda results that do copy from main memory to the GPU, multiply, and then copy the result back to main memory (similar to what Xiangrui did). Surprisingly (to me), the copying overhead seems quite small, especially for the bigger matrices.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
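
(For reference, the measurement has the same shape as Xiangrui's BIDMat snippet from earlier in the thread; the version below only adds a wrapper and comments. The imports are my guess at the usual BIDMat preamble and should be double-checked; flip/flop are BIDMat's built-in flop counters/timers and GMat is its GPU matrix type.)

import BIDMat.GMat
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._

object CopyOverhead {
  def main(args: Array[String]): Unit = {
    val n = 10000
    val f = rand(n, n)                              // CPU matrix (BIDMat FMat)

    flip; f * f; val rf = flop                      // CPU-only GEMM

    flip                                            // CPU -> GPU -> CPU round trip, copies included
    val g = GMat(n, n); g.copyFrom(f); (g * g).toFMat(null)
    val rg = flop

    flip; g * g; val rgg = flop                     // GPU-only, operands already resident in GPU memory

    println(s"CPU: $rf  CPU->GPU->CPU: $rg  GPU: $rgg")
  }
}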

-----Original Message-----
From: Sam Halliday [mailto:[hidden email]]
Sent: Monday, March 02, 2015 1:24 PM
To: Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra

That's correct. It's highly unusual for a libblas.so to only provide the Fortran API. Oh well... The CBLAS sources are available in the netlib-java repository, so you could simply compile them and link against whatever libblas.so [Fortran] you like.

On 2 March 2015 at 21:04, Ulanov, Alexander <[hidden email]> wrote:

> Hi Xiangrui,
>
> Thanks for the link, I am currently trying to use nvblas. It seems that netlib wrappers are implemented with C-BLAS interface and nvblas does not have c-blas. I wonder how it is going to work. I'll keep you updated.
>
> Alexander
>
> -----Original Message-----
> From: Xiangrui Meng [mailto:[hidden email]]
> Sent: Monday, March 02, 2015 11:42 AM
> To: Sam Halliday
> Cc: Joseph Bradley; Ulanov, Alexander; dev; Evan R. Sparks
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> On Fri, Feb 27, 2015 at 12:33 PM, Sam Halliday <[hidden email]> wrote:
>> Also, check the JNILoader output.
>>
>> Remember, for netlib-java to use your system libblas all you need to
>> do is setup libblas.so.3 like any native application would expect.
>>
>> I haven't ever used the cublas "real BLAS"  implementation, so I'd be
>> interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to
>> check that all the runtime links are in order.
>>
>
> There are two shared libraries in this hybrid setup. nvblas.so must be
> loaded before libblas.so to intercept level 3 routines using GPU. More
> details are at: http://docs.nvidia.com/cuda/nvblas/index.html#Usage
>
>> Btw, I have some DGEMM wrappers in my netlib-java performance
>> module... and I also planned to write more in MultiBLAS (until I
>> mothballed the project for the hardware to catch up, which is
>> probably has and now I just need a reason to look at it)
>>
>> On 27 Feb 2015 20:26, "Xiangrui Meng" <[hidden email]> wrote:
>>>
>>> Hey Sam,
>>>
>>> The running times are not "big O" estimates:
>>>
>>> > The CPU version finished in 12 seconds.
>>> > The CPU->GPU->CPU version finished in 2.2 seconds.
>>> > The GPU version finished in 1.7 seconds.
>>>
>>> I think there is something wrong with the netlib/cublas combination.
>>> Sam already mentioned that cuBLAS doesn't implement the CPU BLAS
>>> interfaces. I checked the CUDA doc and it seems that to use GPU BLAS
>>> through the CPU BLAS interface we need to use NVBLAS, which
>>> intercepts some Level 3 CPU BLAS calls (including GEMM). So we need
>>> to load nvblas.so first and then some CPU BLAS library in JNI. I
>>> wonder whether the setup was correct.
>>>
>>> Alexander, could you check whether GPU is used in the netlib-cublas
>>> experiments? You can tell it by watching CPU/GPU usage.
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday
>>> <[hidden email]>
>>> wrote:
>>> > Don't use "big O" estimates, always measure. It used to work back
>>> > in the days when double multiplication was a bottleneck. The
>>> > computation cost is effectively free on both the CPU and GPU and
>>> > you're seeing pure copying costs. Also, I'm dubious that cublas is
>>> > doing what you think it is. Can you link me to the source code for
>>> > DGEMM?
>>> >
>>> > I show all of this in my talk, with explanations, I can't stress
>>> > enough how much I recommend that you watch it if you want to
>>> > understand high performance hardware acceleration for linear
>>> > algebra :-)
>>> >
>>> > On 27 Feb 2015 01:42, "Xiangrui Meng" <[hidden email]> wrote:
>>> >>
>>> >> The copying overhead should be quadratic on n, while the
>>> >> computation cost is cubic on n. I can understand that
>>> >> netlib-cublas is slower than netlib-openblas on small problems.
>>> >> But I'm surprised to see that it is still 20x slower on
>>> >> 10000x10000. I did the following on a g2.2xlarge instance with BIDMat:
>>> >>
>>> >> val n = 10000
>>> >>
>>> >> val f = rand(n, n)
>>> >> flip; f*f; val rf = flop
>>> >>
>>> >> flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val
>>> >> rg = flop
>>> >>
>>> >> flip; g*g; val rgg = flop
>>> >>
>>> >> The CPU version finished in 12 seconds.
>>> >> The CPU->GPU->CPU version finished in 2.2 seconds.
>>> >> The GPU version finished in 1.7 seconds.
>>> >>
>>> >> I'm not sure whether my CPU->GPU->CPU code simulates the
>>> >> netlib-cublas path. But based on the result, the data copying
>>> >> overhead is definitely not as big as 20x at n = 10000.
>>> >>
>>> >> Best,
>>> >> Xiangrui
>>> >>
>>> >>
>>> >> On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday
>>> >> <[hidden email]>
>>> >> wrote:
>>> >> > I've had some email exchanges with the author of BIDMat: it
>>> >> > does exactly what you need to get the GPU benefit and writes
>>> >> > higher level algorithms entirely in the GPU kernels so that the
>>> >> > memory stays there as long as possible. The restriction with
>>> >> > this approach is that it is only offering high-level algorithms
>>> >> > so is not a toolkit for applied mathematics research and
>>> >> > development
>>> >> > --- but it works well as a toolkit for higher level analysis
>>> >> > (e.g. for analysts and practitioners).
>>> >> >
>>> >> > I believe BIDMat's approach is the best way to get performance
>>> >> > out of GPU hardware at the moment but I also have strong
>>> >> > evidence to suggest that the hardware will catch up and the
>>> >> > memory transfer costs between CPU/GPU will disappear meaning
>>> >> > that there will be no need for custom GPU kernel
>>> >> > implementations. i.e. please continue to use BLAS primitives
>>> >> > when writing new algorithms and only go to the GPU for an
>>> >> > alternative optimised implementation.
>>> >> >
>>> >> > Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like,
>>> >> > and offer an API that looks like BLAS but takes pointers to
>>> >> > special regions in the GPU memory region. Somebody has written
>>> >> > a wrapper around CUDA to create a proper BLAS library but it
>>> >> > only gives marginal performance over the CPU because of the
>>> >> > memory transfer overhead.
>>> >> >
>>> >> > This slide from my talk
>>> >> >
>>> >> >   http://fommil.github.io/scalax14/#/11/2
>>> >> >
>>> >> > says it all. X axis is matrix size, Y axis is logarithmic time
>>> >> > to do DGEMM. Black line is the "cheating" time for the GPU and
>>> >> > the green line is after copying the memory to/from the GPU
>>> >> > memory. APUs have the potential to eliminate the green line.
>>> >> >
>>> >> > Best regards,
>>> >> > Sam
>>> >> >
>>> >> >
>>> >> >
>>> >> > "Ulanov, Alexander" <[hidden email]> writes:
>>> >> >
>>> >> >> Evan, thank you for the summary. I would like to add some more
>>> >> >> observations. The GPU that I used is 2.5 times cheaper than
>>> >> >> the CPU
>>> >> >> ($250 vs
>>> >> >> $100). They both are 3 years old. I've also did a small test
>>> >> >> with modern hardware, and the new GPU nVidia Titan was
>>> >> >> slightly more than 1 order of magnitude faster than Intel
>>> >> >> E5-2650 v2 for the same tests. However, it costs as much as
>>> >> >> CPU ($1200). My takeaway is that GPU is making a better price/value progress.
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> Xiangrui, I was also surprised that BIDMat-cuda was faster
>>> >> >> than netlib-cuda and the most reasonable explanation is that
>>> >> >> it holds the result in GPU memory, as Sam suggested. At the
>>> >> >> same time, it is OK because you can copy the result back from
>>> >> >> GPU only when needed. However, to be sure, I am going to ask
>>> >> >> the developer of BIDMat on his upcoming talk.
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> Best regards, Alexander
>>> >> >>
>>> >> >>
>>> >> >> From: Sam Halliday [mailto:[hidden email]]
>>> >> >> Sent: Thursday, February 26, 2015 1:56 PM
>>> >> >> To: Xiangrui Meng
>>> >> >> Cc: [hidden email]; Joseph Bradley; Ulanov, Alexander; Evan R.
>>> >> >> Sparks
>>> >> >> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >> >>
>>> >> >>
>>> >> >> Btw, I wish people would stop cheating when comparing CPU and
>>> >> >> GPU timings for things like matrix multiply :-P
>>> >> >>
>>> >> >> Please always compare apples with apples and include the time
>>> >> >> it takes to set up the matrices, send it to the processing
>>> >> >> unit, doing the calculation AND copying it back to where you
>>> >> >> need to see the results.
>>> >> >>
>>> >> >> Ignoring this method will make you believe that your GPU is
>>> >> >> thousands of times faster than it really is. Again, jump to
>>> >> >> the end of my talk for graphs and more discussion....  
>>> >> >> especially the bit about me being keen on funding to
>>> >> >> investigate APU hardware further ;-) (I believe it will solve
>>> >> >> the
>>> >> >> problem)
>>> >> >> On 26 Feb 2015 21:16, "Xiangrui Meng"
>>> >> >> <[hidden email]<mailto:[hidden email]>> wrote:
>>> >> >> Hey Alexander,
>>> >> >>
>>> >> >> I don't quite understand the part where netlib-cublas is about
>>> >> >> 20x slower than netlib-openblas. What is the overhead of using
>>> >> >> a GPU BLAS with netlib-java?
>>> >> >>
>>> >> >> CC'ed Sam, the author of netlib-java.
>>> >> >>
>>> >> >> Best,
>>> >> >> Xiangrui
>>> >> >>
>>> >> >> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley
>>> >> >> <[hidden email]<mailto:[hidden email]>> wrote:
>>> >> >>> Better documentation for linking would be very helpful!
>>> >> >>> Here's a
>>> >> >>> JIRA:
>>> >> >>> https://issues.apache.org/jira/browse/SPARK-6019
>>> >> >>>
>>> >> >>>
>>> >> >>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
>>> >> >>> <[hidden email]<mailto:[hidden email]>>
>>> >> >>> wrote:
>>> >> >>>
>>> >> >>>> Thanks for compiling all the data and running these
>>> >> >>>> benchmarks, Alex.
>>> >> >>>> The
>>> >> >>>> big takeaways here can be seen with this chart:
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh
>>> >> >>>> 4
>>> >> >>>> StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=intera
>>> >> >>>> c
>>> >> >>>> tive
>>> >> >>>>
>>> >> >>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>> >> >>>> BIDMat+GPU) can provide substantial (but less than an order
>>> >> >>>> BIDMat+of
>>> >> >>>> magnitude)
>>> >> >>>> benefit over a well-tuned CPU implementation (e.g.
>>> >> >>>> BIDMat+MKL or
>>> >> >>>> netlib-java+openblas-compiled).
>>> >> >>>> 2) A poorly tuned CPU implementation can be 1-2 orders of
>>> >> >>>> magnitude worse than a well-tuned CPU implementation,
>>> >> >>>> particularly for larger matrices.
>>> >> >>>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib
>>> >> >>>> - this basically agrees with the authors own benchmarks (
>>> >> >>>> https://github.com/fommil/netlib-java)
>>> >> >>>>
>>> >> >>>> I think that most of our users are in a situation where
>>> >> >>>> using GPUs may not be practical - although we could consider
>>> >> >>>> having a good GPU backend available as an option. However,
>>> >> >>>> *ALL* users of MLlib could benefit (potentially
>>> >> >>>> tremendously) from using a well-tuned CPU-based BLAS
>>> >> >>>> implementation. Perhaps we should consider updating the
>>> >> >>>> mllib guide with a more complete section for enabling high
>>> >> >>>> performance binaries on OSX and Linux? Or better, figure out
>>> >> >>>> a way for the system to fetch these automatically.
>>> >> >>>>
>>> >> >>>> - Evan
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>>> >> >>>> [hidden email]<mailto:[hidden email]>> wrote:
>>> >> >>>>
>>> >> >>>>> Just to summarize this thread, I was finally able to make
>>> >> >>>>> all performance comparisons that we discussed. It turns out
>>> >> >>>>> that:
>>> >> >>>>> BIDMat-cublas>>BIDMat
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>> MKL==netlib-mkl==netlib-openblas-compiled>netlib-openblas-y
>>> >> >>>>> u m-repo==netlib-cublas>netlib-blas>f2jblas
>>> >> >>>>>
>>> >> >>>>> Below is the link to the spreadsheet with full results.
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>> [Rest of the quoted thread history trimmed; the same messages appear in full later in this thread.]
>>> >> >
>>> >> > --
>>> >> > Best regards,
>>> >> > Sam
>>> >> >

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: Using CUDA within Spark / boosting linear algebra

fommil
In reply to this post by Xiangrui Meng
BTW, is anybody on this list going to the London Meetup in a few weeks?

https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community

Would be nice to meet other people working on the guts of Spark! :-)


Xiangrui Meng <[hidden email]> writes:

> Hey Alexander,
>
> I don't quite understand the part where netlib-cublas is about 20x
> slower than netlib-openblas. What is the overhead of using a GPU BLAS
> with netlib-java?
>
> CC'ed Sam, the author of netlib-java.
>
> Best,
> Xiangrui
>
> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <[hidden email]> wrote:
>> Better documentation for linking would be very helpful!  Here's a JIRA:
>> https://issues.apache.org/jira/browse/SPARK-6019
>>
>>
>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <[hidden email]>
>> wrote:
>>
>>> Thanks for compiling all the data and running these benchmarks, Alex. The
>>> big takeaways here can be seen with this chart:
>>>
>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>
>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>> BIDMat+GPU) can provide substantial (but less than an order of magnitude)
>>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
>>> netlib-java+openblas-compiled).
>>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse
>>> than a well-tuned CPU implementation, particularly for larger matrices.
>>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
>>> basically agrees with the author's own benchmarks (
>>> https://github.com/fommil/netlib-java)
>>>
>>> I think that most of our users are in a situation where using GPUs may not
>>> be practical - although we could consider having a good GPU backend
>>> available as an option. However, *ALL* users of MLlib could benefit
>>> (potentially tremendously) from using a well-tuned CPU-based BLAS
>>> implementation. Perhaps we should consider updating the mllib guide with a
>>> more complete section for enabling high performance binaries on OSX and
>>> Linux? Or better, figure out a way for the system to fetch these
>>> automatically.
>>>
>>> - Evan
>>>
>>>
>>>
>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>>> [hidden email]> wrote:
>>>
>>>> Just to summarize this thread, I was finally able to make all performance
>>>> comparisons that we discussed. It turns out that:
>>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
>>>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>>>
>>>> Below is the link to the spreadsheet with full results.
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> One thing still needs exploration: does BIDMat-cublas perform copying
>>>> to/from machine’s RAM?
>>>>
>>>> -----Original Message-----
>>>> From: Ulanov, Alexander
>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>> To: Evan R. Sparks
>>>> Cc: Joseph Bradley; [hidden email]
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though the
>>>> original one discusses a slightly different topic. I was able to link netlib
>>>> with MKL from BIDMat binaries. Indeed, MKL is statically linked inside a
>>>> 60MB library.
>>>>
>>>> |A*B size                 | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>>>> +-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>>>> |100x100*100x100          | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>>>> |1000x1000*1000x1000      | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>>>> |10000x10000*10000x10000  | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>>>>
>>>> It turns out that pre-compiled MKL is faster than precompiled OpenBLAS on
>>>> my machine. Probably, I’ll add two more columns with locally compiled
>>>> openblas and cuda.
>>>>
>>>> Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>> Sent: Monday, February 09, 2015 6:06 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; [hidden email]
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Great - perhaps we can move this discussion off-list and onto a JIRA
>>>> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>>>>
>>>> It seems like this is going to be somewhat exploratory for a while (and
>>>> there's probably only a handful of us who really care about fast linear
>>>> algebra!)
>>>>
>>>> - Evan
>>>>
>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>>>> [hidden email]<mailto:[hidden email]>> wrote:
>>>> Hi Evan,
>>>>
>>>> Thank you for explanation and useful link. I am going to build OpenBLAS,
>>>> link it with Netlib-java and perform benchmark again.
>>>>
>>>> Do I understand correctly that BIDMat binaries contain statically linked
>>>> Intel MKL BLAS? It might be the reason why I am able to run BIDMat not
>>>> having MKL BLAS installed on my server. If it is true, I wonder if it is OK
>>>> because Intel sells this library. Nevertheless, it seems that in my case
>>>> precompiled MKL BLAS performs better than precompiled OpenBLAS given that
>>>> BIDMat and Netlib-java are supposed to be on par with JNI overheads.
>>>>
>>>> Though, it might be interesting to link Netlib-java with Intel MKL, as
>>>> you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
>>>> (Netlib-java) would be interested in comparing their libraries.
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:
>>>> [hidden email]>]
>>>> Sent: Friday, February 06, 2015 5:58 PM
>>>>
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; [hidden email]<mailto:[hidden email]>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I would build OpenBLAS yourself, since good BLAS performance comes from
>>>> getting cache sizes, etc. set up correctly for your particular hardware -
>>>> this is often a very tricky process (see, e.g. ATLAS), but we found that on
>>>> relatively modern Xeon chips, OpenBLAS builds quickly and yields
>>>> performance competitive with MKL.
>>>>
>>>> To make sure the right library is getting used, you have to make sure
>>>> it's first on the search path - export
>>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
>>>>
>>>> For some examples of getting netlib-java setup on an ec2 node and some
>>>> example benchmarking code we ran a while back, see:
>>>> https://github.com/shivaram/matrix-bench
>>>>
>>>> In particular - build-openblas-ec2.sh shows you how to build the library
>>>> and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
>>>> the path setup and get that library picked up by netlib-java.
>>>>
>>>> In this way - you could probably get cuBLAS set up to be used by
>>>> netlib-java as well.
>>>>
>>>> - Evan
>>>>
>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>>>> [hidden email]<mailto:[hidden email]>> wrote:
>>>> Evan, could you elaborate on how to force BIDMat and netlib-java to load
>>>> the right blas? For netlib, there are a few JVM flags, such as
>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can
>>>> force it to use the Java implementation. I am not sure I understand how to
>>>> force the use of a specific blas (as opposed to a specific wrapper for blas).
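A minimal Scala sketch of pinning or inspecting the netlib-java backend. The system property is the one named above; the getInstance() call is standard netlib-java usage, but treat the exact class names printed as assumptions to verify on your own setup:

    // Sketch: select or inspect the netlib-java BLAS wrapper.
    // The property must be set before the first BLAS call; with Spark it can be
    // passed via spark.driver.extraJavaOptions / spark.executor.extraJavaOptions.
    object BlasCheck {
      def main(args: Array[String]): Unit = {
        // Uncomment to force the pure-Java fallback:
        // System.setProperty("com.github.fommil.netlib.BLAS", "com.github.fommil.netlib.F2jBLAS")
        val blas = com.github.fommil.netlib.BLAS.getInstance()
        // Typically prints NativeSystemBLAS when a system libblas is picked up,
        // or F2jBLAS when netlib-java falls back to Java.
        println(s"netlib-java BLAS implementation: ${blas.getClass.getName}")
      }
    }

Note that this only chooses the netlib-java wrapper; which concrete native library (OpenBLAS, MKL, reference BLAS) actually sits behind NativeSystemBLAS is decided by the system's libblas symlinks and LD_LIBRARY_PATH, which is exactly the wrapper-versus-blas distinction raised here.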
>>>>
>>>> Btw. I have installed openblas (yum install openblas), so I suppose that
>>>> netlib is using it.
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:
>>>> [hidden email]>]
>>>> Sent: Friday, February 06, 2015 5:19 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; [hidden email]<mailto:[hidden email]>
>>>>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Getting breeze to pick up the right blas library is critical for
>>>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
>>>> It might make sense to force BIDMat to use the same underlying BLAS library
>>>> as well.
>>>>
>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>>>> [hidden email]<mailto:[hidden email]>> wrote:
>>>> Hi Evan, Joseph
>>>>
>>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x faster
>>>> than netlib-java+breeze (sorry for the weird table formatting):
>>>>
>>>> |A*B size                 | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>>>> +-------------------------+-------------+-----------------------------------------------+----------------------------+
>>>> |100x100*100x100          | 0,00205596  | 0,03810324                                    | 0,002556                   |
>>>> |1000x1000*1000x1000      | 0,018320947 | 0,51803557                                    | 1,638475459                |
>>>> |10000x10000*10000x10000  | 23,78046632 | 445,0935211                                   | 1569,233228                |
>>>>
>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
>>>> Linux, Scala 2.11.
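For reference, a rough sketch of the kind of timing loop behind numbers like these, using Breeze (which delegates to netlib-java). The sizes, warm-up and repetition counts here are illustrative assumptions, not the exact harness used for the table above:

    import breeze.linalg._

    object GemmBench {
      // Time an n x n double-precision matrix multiply, averaged over reps runs.
      def timeGemm(n: Int, reps: Int = 5): Double = {
        val a = DenseMatrix.rand(n, n)
        val b = DenseMatrix.rand(n, n)
        var sink = 0.0                                        // keep the JIT from dropping the work
        (1 to 2).foreach(_ => sink += (a * b).apply(0, 0))    // warm-up
        val start = System.nanoTime()
        (1 to reps).foreach(_ => sink += (a * b).apply(0, 0))
        val secs = (System.nanoTime() - start) / 1e9 / reps
        println(f"$n%d x $n%d GEMM: $secs%.6f s per multiply (sink=$sink%.2f)")
        secs
      }

      def main(args: Array[String]): Unit =
        Seq(100, 1000, 2000).foreach(timeGemm(_))
    }

Running it once with -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS and once against a native libblas gives roughly the CPU spread discussed in this thread.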
>>>>
>>>> Later I will make tests with Cuda. I need to install new Cuda version for
>>>> this purpose.
>>>>
>>>> Do you have any ideas why breeze-netlib with native blas is so much
>>>> slower than BIDMat MKL?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Joseph Bradley [mailto:[hidden email]<mailto:
>>>> [hidden email]>]
>>>> Sent: Thursday, February 05, 2015 5:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Evan R. Sparks; [hidden email]<mailto:[hidden email]>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment: Concerning
>>>> your question earlier about keeping data stored on the GPU rather than
>>>> having to move it between main memory and GPU memory on each iteration, I
>>>> would guess this would be critical to getting good performance.  If you
>>>> could do multiple local iterations before aggregating results, then the
>>>> cost of data movement to the GPU could be amortized (and I believe that is
>>>> done in practice).  Having Spark be aware of the GPU and using it as
>>>> another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph
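A rough sketch of that amortization pattern in Spark terms. The helper below and its localStep/combine arguments are hypothetical placeholders, not MLlib APIs; whether data actually stays on the GPU inside localStep is up to whatever native library runs there:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Run k local refinement passes per partition before a single aggregation,
    // so any host <-> GPU transfer inside localStep is paid once per k passes
    // instead of once per pass.
    def iterateLocally[M: ClassTag](data: RDD[Array[Double]], init: M, k: Int)
                                   (localStep: (M, Iterator[Array[Double]]) => M)
                                   (combine: (M, M) => M): M = {
      val bcast = data.sparkContext.broadcast(init)
      data.mapPartitions { part =>
        val cached = part.toArray            // materialise the partition for reuse
        var model  = bcast.value
        for (_ <- 1 to k) model = localStep(model, cached.iterator)
        Iterator.single(model)
      }.reduce(combine)
    }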
>>>>
>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>>>> [hidden email]<mailto:[hidden email]>> wrote:
>>>> Thank you for explanation! I’ve watched the BIDMach presentation by John
>>>> Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>>>>
>>>> I am very interested to find out what will be better within Spark: BIDMat
>>>> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
>>>> benchmark them? Currently I do benchmarks on artificial neural networks in
>>>> batch mode. While it is not a “pure” test of linear algebra, it involves
>>>> some other things that are essential to machine learning.
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:
>>>> [hidden email]>]
>>>> Sent: Thursday, February 05, 2015 1:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc: [hidden email]<mailto:[hidden email]>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
>>>> layout and fewer levels of indirection - it's definitely a worthwhile
>>>> experiment to run. The main speedups I've seen from using it come from
>>>> highly optimized GPU code for linear algebra. I know that in the past Canny
>>>> has gone as far as to write custom GPU kernels for performance-critical
>>>> regions of code.[1]
>>>>
>>>> BIDMach is highly optimized for single node performance or performance on
>>>> small clusters.[2] Once data doesn't fit easily in GPU memory (or can be
>>>> batched in that way) the performance tends to fall off. Canny argues for
>>>> hardware/software codesign and as such prefers machine configurations that
>>>> are quite different than what we find in most commodity cluster nodes -
>>>> e.g. 10 disk channels and 4 GPUs.
>>>>
>>>> In contrast, MLlib was designed for horizontal scalability on commodity
>>>> clusters and works best on very big datasets - order of terabytes.
>>>>
>>>> For the most part, these projects developed concurrently to address
>>>> slightly different use cases. That said, there may be bits of BIDMach we
>>>> could repurpose for MLlib - keep in mind we need to be careful about
>>>> maintaining cross-language compatibility for our Java and Python users,
>>>> though.
>>>>
>>>> - Evan
>>>>
>>>> [1] - http://arxiv.org/abs/1409.5402
>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>>
>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>>>> [hidden email]<mailto:[hidden email]><mailto:
>>>> [hidden email]<mailto:[hidden email]>>> wrote:
>>>> Hi Evan,
>>>>
>>>> Thank you for suggestion! BIDMat seems to have terrific speed. Do you
>>>> know what makes them faster than netlib-java?
>>>>
>>>> The same group has BIDMach library that implements machine learning. For
>>>> some examples they use Caffe convolutional neural network library owned by
>>>> another group in Berkeley. Could you elaborate on how these all might be
>>>> connected with Spark Mllib? If you take BIDMat for linear algebra why don’t
>>>> you take BIDMach for optimization and learning?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:
>>>> [hidden email]><mailto:[hidden email]<mailto:
>>>> [hidden email]>>]
>>>> Sent: Thursday, February 05, 2015 12:09 PM
>>>> To: Ulanov, Alexander
>>>> Cc: [hidden email]<mailto:[hidden email]><mailto:
>>>> [hidden email]<mailto:[hidden email]>>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in
>>>> many cases.
>>>>
>>>> You might consider taking a look at the codepaths that BIDMat (
>>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>>> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
>>>> to make this work really fast from Scala. I've run it on my laptop and
>>>> compared to MKL and in certain cases it's 10x faster at matrix multiply.
>>>> There are a lot of layers of indirection here and you really want to avoid
>>>> data copying as much as possible.
>>>>
>>>> We could also consider swapping out BIDMat for Breeze, but that would be
>>>> a big project and if we can figure out how to get breeze+cublas to
>>>> comparable performance that would be a big win.
>>>>
>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>>>> [hidden email]<mailto:[hidden email]><mailto:
>>>> [hidden email]<mailto:[hidden email]>>> wrote:
>>>> Dear Spark developers,
>>>>
>>>> I am exploring how to make linear algebra operations faster within Spark.
>>>> One way of doing this is to use Scala Breeze library that is bundled with
>>>> Spark. For matrix operations, it employs Netlib-java that has a Java
>>>> wrapper for BLAS (basic linear algebra subprograms) and LAPACK native
>>>> binaries if they are available on the worker node. It also has its own
>>>> optimized Java implementation of BLAS. It is worth mentioning, that native
>>>> binaries provide better performance only for BLAS level 3, i.e.
>>>> matrix-matrix operations or general matrix multiplication (GEMM). This is
>>>> confirmed by GEMM test on Netlib-java page
>>>> https://github.com/fommil/netlib-java. I also confirmed it with my
>>>> experiments with training of artificial neural network
>>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>>> However, I would like to boost performance more.
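A quick worked ratio behind that level-3-only observation (general BLAS reasoning, not a measurement from this thread): an n x n GEMM does about 2n^3 flops over roughly 3n^2 numbers, while a level-2 GEMV does about 2n^2 flops over about n^2 numbers, so only GEMM has enough work per element to amortize the JNI and memory-traffic overhead of a native call:

    // Approximate flops per element touched; for n = 1000, GEMM reuses each
    // element ~667 times while GEMV manages only ~2.
    def gemmFlopsPerElement(n: Double): Double = 2 * n * n * n / (3 * n * n)   // ~ 2n/3
    def gemvFlopsPerElement(n: Double): Double = 2 * n * n / (n * n + 2 * n)   // ~ 2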
>>>>
>>>> GPU is supposed to work fast with linear algebra and there is Nvidia CUDA
>>>> implementation of BLAS, called cublas. I have one Linux server with Nvidia
>>>> GPU and I was able to do the following. I linked cublas (instead of
>>>> cpu-based blas) with Netlib-java wrapper and put it into Spark, so
>>>> Breeze/Netlib is using it. Then I did some performance measurements with
>>>> regards to artificial neural network batch learning in Spark MLlib that
>>>> involves matrix-matrix multiplications. It turns out that for matrices of
>>>> size less than ~1000x780 GPU cublas has the same speed as CPU blas. Cublas
>>>> becomes slower for bigger matrices. It is worth mentioning that it was not
>>>> a test of ONLY multiplication, since there are other operations involved.
>>>> One of the reasons for slowdown might be the overhead of copying the
>>>> matrices from computer memory to graphic card memory and back.
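A back-of-the-envelope check of that copy-overhead explanation. All constants below (PCIe bandwidth, GPU throughput) are illustrative assumptions, not measurements from this thread:

    // GEMM moves O(n^2) doubles but does O(n^3) flops, so per-call transfers
    // matter most for small and medium matrices.
    def copyVsCompute(n: Long, pcieGBperSec: Double = 6.0, gpuGflops: Double = 500.0): Unit = {
      val flops    = 2.0 * n * n * n                  // multiply-adds in an n x n GEMM
      val bytes    = 3.0 * n * n * 8.0                // A and B in, C out, double precision
      val compute  = flops / (gpuGflops * 1e9)        // seconds on the device
      val transfer = bytes / (pcieGBperSec * 1e9)     // seconds across the bus
      println(f"n=$n%d: compute ~${compute * 1000}%.2f ms, host<->GPU copy ~${transfer * 1000}%.2f ms")
    }

    Seq(100L, 1000L, 10000L).foreach(copyVsCompute(_))

With these illustrative numbers the copies dominate at n=100, roughly match the compute at n=1000, and fade at n=10000, so per-call copying can explain parity at small sizes; if cublas actually gets slower for the biggest matrices, something beyond the per-call copies (extra copies per operation, precision, or a fallback path) is probably also involved.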
>>>>
>>>> So, few questions:
>>>> 1) Do these results with CUDA make sense?
>>>> 2) If the problem is the copy overhead, are there any libraries that
>>>> allow forcing intermediate results to stay in graphics card memory, thus
>>>> removing the overhead?
>>>> 3) Any other options to speed-up linear algebra in Spark?
>>>>
>>>> Thank you, Alexander
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [hidden email]<mailto:
>>>> [hidden email]><mailto:[hidden email]
>>>> <mailto:[hidden email]>>
>>>> For additional commands, e-mail: [hidden email]<mailto:
>>>> [hidden email]><mailto:[hidden email]<mailto:
>>>> [hidden email]>>
>>>>
>>>>
>>>>
>>>>
>>>

--
Best regards,
Sam

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Using CUDA within Spark / boosting linear algebra

Ulanov, Alexander
Hi Everyone, I've updated the benchmark as Xiangrui suggested: I added a comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for Double in the current source code) and ran the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander
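Since that Float-versus-Double point matters for any CPU/GPU comparison, here is a hedged sketch of running the same Breeze GEMM timing in single precision. It assumes the Breeze build in use dispatches DenseMatrix[Float] multiplication to SGEMM via netlib-java; that is worth verifying before trusting the numbers:

    import breeze.linalg._

    // Single-precision variant of the earlier timing sketch: halves the data
    // volume and, on most GPUs, runs much faster than double precision.
    def timeFloatGemm(n: Int, reps: Int = 5): Double = {
      val a = DenseMatrix.rand(n, n).mapValues(_.toFloat)
      val b = DenseMatrix.rand(n, n).mapValues(_.toFloat)
      var sink = 0.0f
      (1 to 2).foreach(_ => sink += (a * b).apply(0, 0))   // warm-up
      val start = System.nanoTime()
      (1 to reps).foreach(_ => sink += (a * b).apply(0, 0))
      (System.nanoTime() - start) / 1e9 / reps
    }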

-----Original Message-----
From: Sam Halliday [mailto:[hidden email]]
Sent: Tuesday, March 03, 2015 1:54 PM
To: Xiangrui Meng; Joseph Bradley
Cc: Evan R. Sparks; Ulanov, Alexander; [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra

BTW, is anybody on this list going to the London Meetup in a few weeks?

https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community

Would be nice to meet other people working on the guts of Spark! :-)


Xiangrui Meng <[hidden email]> writes:

> Hey Alexander,
>
> I don't quite understand the part where netlib-cublas is about 20x
> slower than netlib-openblas. What is the overhead of using a GPU BLAS
> with netlib-java?
>
> CC'ed Sam, the author of netlib-java.
>
> Best,
> Xiangrui
>
> [Rest of the quoted thread trimmed; the same messages appear in full earlier in this thread.]

--
Best regards,
Sam

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

RE: Using CUDA within Spark / boosting linear algebra

fommil
Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on
various pieces of hardware...
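If such a chart ever gets organised, even a very small harness that writes one CSV row per run would make results comparable across machines. A hedged sketch (field choices are arbitrary; nothing here is an agreed format):

    import java.io.{File, FileWriter}

    // Append one benchmark observation per line: machine and backend plus the timing.
    case class BenchRow(host: String, cpu: String, blas: String,
                        n: Int, seconds: Double)

    def appendRow(row: BenchRow, path: String = "blas-bench.csv"): Unit = {
      val file   = new File(path)
      val header = !file.exists()
      val out    = new FileWriter(file, true)                 // append mode
      try {
        if (header) out.write("host,cpu,blas,n,seconds\n")
        out.write(s"${row.host},${row.cpu},${row.blas},${row.n},${row.seconds}\n")
      } finally out.close()
    }

Combined with a GEMM timing loop like the one sketched earlier in the thread, that would be enough to start charting backends (f2jblas, OpenBLAS, MKL, cublas) across hardware.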
On 9 Mar 2015 21:08, "Ulanov, Alexander" <[hidden email]> wrote:

> Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the
> comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the
> support of Double in the current source code), did the test with BIDMat and
> CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Best regards, Alexander
>
> -----Original Message-----
> From: Sam Halliday [mailto:[hidden email]]
> Sent: Tuesday, March 03, 2015 1:54 PM
> To: Xiangrui Meng; Joseph Bradley
> Cc: Evan R. Sparks; Ulanov, Alexander; [hidden email]
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> BTW, is anybody on this list going to the London Meetup in a few weeks?
>
>
> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>
> Would be nice to meet other people working on the guts of Spark! :-)
>
>
> Xiangrui Meng <[hidden email]> writes:
>
> > Hey Alexander,
> >
> > I don't quite understand the part where netlib-cublas is about 20x
> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
> > with netlib-java?
> >
> > CC'ed Sam, the author of netlib-java.
> >
> > Best,
> > Xiangrui
> >
> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <[hidden email]>
> wrote:
> >> Better documentation for linking would be very helpful!  Here's a JIRA:
> >> https://issues.apache.org/jira/browse/SPARK-6019
> >>
> >>
> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
> >> <[hidden email]>
> >> wrote:
> >>
> >>> Thanks for compiling all the data and running these benchmarks,
> >>> Alex. The big takeaways here can be seen with this chart:
> >>>
> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZ
> >>> Hl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
> >>>
> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
> >>> BIDMat+GPU) can provide substantial (but less than an order of
> >>> magnitude)
> >>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
> >>> netlib-java+openblas-compiled).
> >>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
> >>> worse than a well-tuned CPU implementation, particularly for larger
> matrices.
> >>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
> >>> basically agrees with the author's own benchmarks (
> >>> https://github.com/fommil/netlib-java)
> >>>
> >>> I think that most of our users are in a situation where using GPUs
> >>> may not be practical - although we could consider having a good GPU
> >>> backend available as an option. However, *ALL* users of MLlib could
> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
> >>> BLAS implementation. Perhaps we should consider updating the mllib
> >>> guide with a more complete section for enabling high performance
> >>> binaries on OSX and Linux? Or better, figure out a way for the
> >>> system to fetch these automatically.
> >>>
> >>> - Evan
> >>>
> >>>
> >>>
> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
> >>> [hidden email]> wrote:
> >>>
> >>>> Just to summarize this thread, I was finally able to make all
> >>>> performance comparisons that we discussed. It turns out that:
> >>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
> >>>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
> >>>>
> >>>> Below is the link to the spreadsheet with full results.
> >>>>
> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx
> >>>> 378T9J5r7kwKSPkY/edit?usp=sharing
> >>>>
> >>>> One thing still needs exploration: does BIDMat-cublas perform
> >>>> copying to/from machine’s RAM?
> >>>>
> >>>> -----Original Message-----
> >>>> From: Ulanov, Alexander
> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
> >>>> To: Evan R. Sparks
> >>>> Cc: Joseph Bradley; [hidden email]
> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Thanks, Evan! It seems that ticket was marked as duplicate though
> >>>> the original one discusses slightly different topic. I was able to
> >>>> link netlib with MKL from BIDMat binaries. Indeed, MKL is
> >>>> statically linked inside a 60MB library.
> >>>>
> >>>> |A*B  size | BIDMat MKL | Breeze+Netlib-MKL  from BIDMat|
> >>>> Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas |
> >>>>
> +-----------------------------------------------------------------------+
> >>>> |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
> >>>> |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557
> >>>> |1,638475459 |
> >>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697 |445,0935211 |
> >>>> 1569,233228 |
> >>>>
> >>>> It turn out that pre-compiled MKL is faster than precompiled
> >>>> OpenBlas on my machine. Probably, I’ll add two more columns with
> >>>> locally compiled openblas and cuda.
> >>>>
> >>>> Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]]
> >>>> Sent: Monday, February 09, 2015 6:06 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; [hidden email]
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Great - perhaps we can move this discussion off-list and onto a
> >>>> JIRA ticket? (Here's one:
> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
> >>>>
> >>>> It seems like this is going to be somewhat exploratory for a while
> >>>> (and there's probably only a handful of us who really care about
> >>>> fast linear
> >>>> algebra!)
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
> >>>> [hidden email]<mailto:[hidden email]>> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for explanation and useful link. I am going to build
> >>>> OpenBLAS, link it with Netlib-java and perform benchmark again.
> >>>>
> >>>> Do I understand correctly that BIDMat binaries contain statically
> >>>> linked Intel MKL BLAS? It might be the reason why I am able to run
> >>>> BIDMat not having MKL BLAS installed on my server. If it is true, I
> >>>> wonder if it is OK because Intel sells this library. Nevertheless,
> >>>> it seems that in my case precompiled MKL BLAS performs better than
> >>>> precompiled OpenBLAS given that BIDMat and Netlib-java are supposed
> to be on par with JNI overheads.
> >>>>
> >>>> Though, it might be interesting to link Netlib-java with Intel MKL,
> >>>> as you suggested. I wonder, are John Canny (BIDMat) and Sam
> >>>> Halliday
> >>>> (Netlib-java) interested to compare their libraries.
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:
> >>>> [hidden email]>]
> >>>> Sent: Friday, February 06, 2015 5:58 PM
> >>>>
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley;
> >>>> [hidden email]<mailto:[hidden email]>
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
> >>>> from getting cache sizes, etc. set up correctly for your particular
> >>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
> >>>> quickly and yields performance competitive with MKL.
> >>>>
> >>>> To make sure the right library is getting used, you have to make
> >>>> sure it's first on the search path - export
> >>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
> >>>>
> >>>> For some examples of getting netlib-java setup on an ec2 node and
> >>>> some example benchmarking code we ran a while back, see:
> >>>> https://github.com/shivaram/matrix-bench
> >>>>
> >>>> In particular - build-openblas-ec2.sh shows you how to build the
> >>>> library and set up symlinks correctly, and scala/run-netlib.sh
> >>>> shows you how to get the path setup and get that library picked up by
> netlib-java.
> >>>>
> >>>> In this way - you could probably get cuBLAS set up to be used by
> >>>> netlib-java as well.
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
> >>>> [hidden email]<mailto:[hidden email]>> wrote:
> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
> >>>> force loading the right blas? For netlib, I there are few JVM
> >>>> flags, such as
> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
> >>>> so I can force it to use Java implementation. Not sure I understand
> how to force use a specific blas (not specific wrapper for blas).
> >>>>
> >>>> Btw. I have installed openblas (yum install openblas), so I suppose
> >>>> that netlib is using it.
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:
> >>>> [hidden email]>]
> >>>> Sent: Friday, February 06, 2015 5:19 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley;
> >>>> [hidden email]<mailto:[hidden email]>
> >>>>
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Getting breeze to pick up the right blas library is critical for
> >>>> performance. I recommend using OpenBLAS (or MKL, if you already have
> it).
> >>>> It might make sense to force BIDMat to use the same underlying BLAS
> >>>> library as well.
> >>>>
> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
> >>>> [hidden email]<mailto:[hidden email]>> wrote:
> >>>> Hi Evan, Joseph
> >>>>
> >>>> I did few matrix multiplication test and BIDMat seems to be ~10x
> >>>> faster than netlib-java+breeze (sorry for weird table formatting):
> >>>>
> >>>> |A*B  size | BIDMat MKL | Breeze+Netlib-java
> >>>> |native_system_linux_x86-64|
> >>>> Breeze+Netlib-java f2jblas |
> >>>>
> +-----------------------------------------------------------------------+
> >>>> |100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
> >>>> |1000x1000*1000x1000 | 0,018320947 | 0,51803557 |1,638475459 |
> >>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211 | 1569,233228
> >>>> ||
> >>>>
> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
> >>>> 19 Linux, Scala 2.11.
> >>>>
> >>>> Later I will make tests with Cuda. I need to install new Cuda
> >>>> version for this purpose.
> >>>>
> >>>> Do you have any ideas why breeze-netlib with native blas is so much
> >>>> slower than BIDMat MKL?
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Joseph Bradley [mailto:[hidden email]<mailto:
> >>>> [hidden email]>]
> >>>> Sent: Thursday, February 05, 2015 5:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Evan R. Sparks;
> >>>> [hidden email]<mailto:[hidden email]>
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Hi Alexander,
> >>>>
> >>>> Using GPUs with Spark would be very exciting.  Small comment:
> >>>> Concerning your question earlier about keeping data stored on the
> >>>> GPU rather than having to move it between main memory and GPU
> >>>> memory on each iteration, I would guess this would be critical to
> >>>> getting good performance.  If you could do multiple local
> >>>> iterations before aggregating results, then the cost of data
> >>>> movement to the GPU could be amortized (and I believe that is done
> >>>> in practice).  Having Spark be aware of the GPU and using it as
> another part of memory sounds like a much bigger undertaking.
> >>>>
> >>>> Joseph
> >>>>
> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
> >>>> [hidden email]<mailto:[hidden email]>> wrote:
> >>>> Thank you for explanation! I’ve watched the BIDMach presentation by
> >>>> John Canny and I am really inspired by his talk and comparisons with
> Spark MLlib.
> >>>>
> >>>> I am very interested to find out what will be better within Spark:
> >>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
> >>>> fair way to benchmark them? Currently I do benchmarks on artificial
> >>>> neural networks in batch mode. While it is not a “pure” test of
> >>>> linear algebra, it involves some other things that are essential to
> machine learning.
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:
> >>>> [hidden email]>]
> >>>> Sent: Thursday, February 05, 2015 1:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: [hidden email]<mailto:[hidden email]>
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd be surprised of BIDMat+OpenBLAS was significantly faster than
> >>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
> >>>> netlib-java+data
> >>>> layout and fewer levels of indirection - it's definitely a
> >>>> worthwhile experiment to run. The main speedups I've seen from
> >>>> using it come from highly optimized GPU code for linear algebra. I
> >>>> know that in the past Canny has gone as far as to write custom GPU
> >>>> kernels for performance-critical regions of code.[1]
> >>>>
> >>>> BIDMach is highly optimized for single node performance or
> >>>> performance on small clusters.[2] Once data doesn't fit easily in
> >>>> GPU memory (or can be batched in that way) the performance tends to
> >>>> fall off. Canny argues for hardware/software codesign and as such
> >>>> prefers machine configurations that are quite different than what
> >>>> we find in most commodity cluster nodes - e.g. 10 disk cahnnels and 4
> GPUs.
> >>>>
> >>>> In contrast, MLlib was designed for horizontal scalability on
> >>>> commodity clusters and works best on very big datasets - order of
> terabytes.
> >>>>
> >>>> For the most part, these projects developed concurrently to address
> >>>> slightly different use cases. That said, there may be bits of
> >>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
> >>>> careful about maintaining cross-language compatibility for our Java
> >>>> and Python-users, though.
> >>>>
> >>>> - Evan
> >>>>
> >>>> [1] - http://arxiv.org/abs/1409.5402 [2] -
> >>>> http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
> >>>>
> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
> >>>> [hidden email]<mailto:[hidden email]><mailto:
> >>>> [hidden email]<mailto:[hidden email]>>> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for suggestion! BIDMat seems to have terrific speed. Do
> >>>> you know what makes them faster than netlib-java?
> >>>>
> >>>> The same group has BIDMach library that implements machine
> >>>> learning. For some examples they use Caffe convolutional neural
> >>>> network library owned by another group in Berkeley. Could you
> >>>> elaborate on how these all might be connected with Spark Mllib? If
> >>>> you take BIDMat for linear algebra why don’t you take BIDMach for
> optimization and learning?
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:
> >>>> [hidden email]><mailto:[hidden email]<mailto:
> >>>> [hidden email]>>]
> >>>> Sent: Thursday, February 05, 2015 12:09 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: [hidden email]<mailto:[hidden email]><mailto:
> >>>> [hidden email]<mailto:[hidden email]>>
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
> >>>> blas in many cases.
> >>>>
> >>>> You might consider taking a look at the codepaths that BIDMat (
> >>>> https://github.com/BIDData/BIDMat) takes and comparing them to
> >>>> netlib-java/breeze. John Canny et. al. have done a bunch of work
> >>>> optimizing to make this work really fast from Scala. I've run it on
> >>>> my laptop and compared to MKL and in certain cases it's 10x faster at
> matrix multiply.
> >>>> There are a lot of layers of indirection here and you really want
> >>>> to avoid data copying as much as possible.
> >>>>
> >>>> We could also consider swapping out BIDMat for Breeze, but that
> >>>> would be a big project and if we can figure out how to get
> >>>> breeze+cublas to comparable performance that would be a big win.
> >>>>
> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
> >>>> [hidden email]<mailto:[hidden email]><mailto:
> >>>> [hidden email]<mailto:[hidden email]>>> wrote:
> >>>> Dear Spark developers,
> >>>>
> >>>> I am exploring how to make linear algebra operations faster within
> Spark.
> >>>> One way of doing this is to use Scala Breeze library that is
> >>>> bundled with Spark. For matrix operations, it employs Netlib-java
> >>>> that has a Java wrapper for BLAS (basic linear algebra subprograms)
> >>>> and LAPACK native binaries if they are available on the worker
> >>>> node. It also has its own optimized Java implementation of BLAS. It
> >>>> is worth mentioning, that native binaries provide better performance
> only for BLAS level 3, i.e.
> >>>> matrix-matrix operations or general matrix multiplication (GEMM).
> >>>> This is confirmed by GEMM test on Netlib-java page
> >>>> https://github.com/fommil/netlib-java. I also confirmed it with my
> >>>> experiments with training of artificial neural network
> >>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
> >>>> However, I would like to boost performance more.
> >>>>
> >>>> GPU is supposed to work fast with linear algebra and there is
> >>>> Nvidia CUDA implementation of BLAS, called cublas. I have one Linux
> >>>> server with Nvidia GPU and I was able to do the following. I linked
> >>>> cublas (instead of cpu-based blas) with Netlib-java wrapper and put
> >>>> it into Spark, so Breeze/Netlib is using it. Then I did some
> >>>> performance measurements with regards to artificial neural network
> >>>> batch learning in Spark MLlib that involves matrix-matrix
> >>>> multiplications. It turns out that for matrices of size less than
> >>>> ~1000x780 GPU cublas has the same speed as CPU blas. Cublas becomes
> >>>> slower for bigger matrices. It worth mentioning that it is was not a
> test for ONLY multiplication since there are other operations involved.
> >>>> One of the reasons for slowdown might be the overhead of copying
> >>>> the matrices from computer memory to graphic card memory and back.
> >>>>
> >>>> So, few questions:
> >>>> 1) Do these results with CUDA make sense?
> >>>> 2) If the problem is with copy overhead, are there any libraries
> >>>> that allow to force intermediate results to stay in graphic card
> >>>> memory thus removing the overhead?
> >>>> 3) Any other options to speed-up linear algebra in Spark?
> >>>>
> >>>> Thank you, Alexander
> >>>>
> >>>> -------------------------------------------------------------------
> >>>> -- To unsubscribe, e-mail: [hidden email]<mailto:
> >>>> [hidden email]><mailto:[hidden email]
> >>>> he.org <mailto:[hidden email]>>
> >>>> For additional commands, e-mail: [hidden email]<mailto:
> >>>> [hidden email]><mailto:[hidden email]<mailto:
> >>>> [hidden email]>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
>
> --
> Best regards,
> Sam
>

RE: Using CUDA within Spark / boosting linear algebra

Ulanov, Alexander
I can run the benchmark on another machine with an Nvidia Titan GPU and an Intel Xeon E5-2650 v2, although it runs Windows, so I would have to run the Linux tests in VirtualBox.

It would also be interesting to add results for netlib+nvblas; however, I am not sure I understand in detail how to build this and would appreciate any help from you ☺

From: Sam Halliday [mailto:[hidden email]]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra


Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...
On 9 Mar 2015 21:08, "Ulanov, Alexander" <[hidden email]> wrote:
Hi Everyone, I've updated the benchmark as Xiangrui suggested: added a comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for Double in the current source code), and did the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander


Re: Using CUDA within Spark / boosting linear algebra

Shivaram Venkataraman
I have run some BLAS comparison benchmarks on different EC2 instance sizes
and also on NERSC supercomputers. I can put together a GitHub-backed
website where we can host the latest benchmark results and update them over
time.
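
For concreteness, a minimal version of such a timing harness, using Breeze, might look like the sketch below. The matrix sizes, repetition counts and the GFLOP/s arithmetic are only illustrative, and which BLAS backend actually runs still depends on how netlib-java is configured on the machine:

  import breeze.linalg.DenseMatrix

  object GemmBench {
    // Time an n x n double-precision GEMM, averaged over `reps` runs after one warm-up call.
    def timeGemm(n: Int, reps: Int): Double = {
      val a = DenseMatrix.rand(n, n)
      val b = DenseMatrix.rand(n, n)
      var c = a * b                               // warm-up: loads the native BLAS and JITs the call path
      val start = System.nanoTime()
      var i = 0
      while (i < reps) { c = a * b; i += 1 }
      val secs = (System.nanoTime() - start) / 1e9 / reps
      val gflops = 2.0 * n * n * n / secs / 1e9   // GEMM performs roughly 2*n^3 floating-point operations
      println(s"n=$n  avg=${secs}s  ~$gflops GFLOP/s")
      secs
    }

    def main(args: Array[String]): Unit =
      Seq(100, 1000, 2000).foreach(n => timeGemm(n, reps = 5))
  }
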

Sam -- Does that sound like what you had in mind?

Thanks
Shivaram


Re: Using CUDA within Spark / boosting linear algebra

jfcanny
If you're contemplating GPU acceleration in Spark, it's important to look beyond BLAS. Dense BLAS probably accounts for only 10% of the cycles in the datasets we've tested in BIDMach, and we've tried to make them representative of industry machine learning workloads. Unless you're crunching images or audio, the majority of data will be very sparse and power-law distributed. You need a good sparse BLAS, and in practice it seems like you need a sparse BLAS tailored for power-law data. We had to write our own, since the NVIDIA libraries didn't perform well on typical power-law data. Intel's MKL sparse BLAS routines also have issues, and we only use some of them.
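
To make "sparse BLAS" concrete, here is a minimal CSR sparse matrix-vector multiply in plain Scala -- just an illustration of the kind of kernel involved, not BIDMach's actual implementation (CsrMatrix and spmv are made-up names). The per-row inner loop length follows the data's power-law degree distribution, which is exactly what makes naive one-thread-per-row GPU schedules balance badly:

  // CSR storage: rowPtr has length rows+1; colIdx and values have length nnz.
  final case class CsrMatrix(rows: Int, cols: Int,
                             rowPtr: Array[Int],
                             colIdx: Array[Int],
                             values: Array[Double])

  // y = A * x
  def spmv(a: CsrMatrix, x: Array[Double]): Array[Double] = {
    val y = new Array[Double](a.rows)
    var r = 0
    while (r < a.rows) {
      var sum = 0.0
      var k = a.rowPtr(r)
      val end = a.rowPtr(r + 1)     // row length varies wildly for power-law data
      while (k < end) {
        sum += a.values(k) * x(a.colIdx(k))
        k += 1
      }
      y(r) = sum
      r += 1
    }
    y
  }
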

You also need 2D reductions, scan operations, slicing, element-wise transcendental functions and operators, many kinds of sort, random number generators, etc., and some kind of memory management strategy. Some of this was layered on top of Thrust in BIDMat, but most had to be written from scratch. It's all been rooflined, typically to the memory throughput of current GPUs (around 200 GB/s).
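[Illustrative aside: "rooflined" here refers to the roofline model, where a kernel's attainable throughput is bounded by min(peak compute, arithmetic intensity x memory bandwidth). A tiny Scala sketch of that bound; the device numbers are assumptions, apart from the ~200 GB/s bandwidth cited above.

object RooflineSketch {
  // Attainable GFLOP/s under the roofline model.
  //   flopsPerByte: arithmetic intensity of the kernel (flops per byte moved)
  //   peakGflops:   peak compute throughput of the device
  //   bwGBps:       peak memory bandwidth of the device
  def roofline(flopsPerByte: Double, peakGflops: Double, bwGBps: Double): Double =
    math.min(peakGflops, flopsPerByte * bwGBps)

  def main(args: Array[String]): Unit = {
    val gpuPeakGflops = 1000.0 // assumed peak compute
    val gpuBwGBps     = 200.0  // streaming memory bandwidth, as cited above

    // An element-wise op on doubles does ~1 flop per 16 bytes moved (read + write),
    // so it is memory bound; dense GEMM reuses data heavily and is compute bound.
    println(f"element-wise op: ${roofline(1.0 / 16, gpuPeakGflops, gpuBwGBps)}%.1f GFLOP/s (memory bound)")
    println(f"dense GEMM     : ${roofline(32.0, gpuPeakGflops, gpuBwGBps)}%.1f GFLOP/s (compute bound)")
  }
}
]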

When you have all this, you can write learning algorithms in the same high-level primitives available in Breeze or Numpy/Scipy. It's literally the same in BIDMat, since the generic matrix operations are implemented on both CPU and GPU, so the same code runs on either platform.

A lesser-known fact is that GPUs are around 10x faster for *all* those operations, not just dense BLAS. It's mostly due to faster streaming memory speeds, but some kernels (random number generation and transcendentals) are more than an order of magnitude faster, thanks to specialized hardware for power series on the GPU chip.

When you have all this, there is no need to move data back and forth across the PCI bus. The CPU only has to pull chunks of data off disk, unpack them, and feed them to the available GPUs. Most models fit comfortably in GPU memory these days (4-12 GB). With minibatch algorithms you can push TBs of data through the GPU this way.
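[Illustrative aside: a schematic of the streaming pattern described above, in plain Scala. The DeviceMatrix type and the helpers are hypothetical stand-ins (they just use host arrays), not BIDMat's or any real GPU library's API; the point is only the shape of the loop: the model stays resident on the device, and the only PCI traffic is each minibatch going in.

object MinibatchStreamingSketch {
  final case class DeviceMatrix(data: Array[Float]) // pretend this lives in GPU memory

  def readChunk(offset: Long, rows: Int, cols: Int): Array[Float] =
    Array.fill(rows * cols)(0.0f)      // stand-in for "pull a chunk off disk and unpack it"

  def toDevice(host: Array[Float]): DeviceMatrix =
    DeviceMatrix(host.clone())         // stand-in for the host -> device copy

  def updateModel(model: DeviceMatrix, batch: DeviceMatrix): Unit =
    ()                                 // stand-in for one minibatch update, done on the GPU

  def main(args: Array[String]): Unit = {
    val totalRows = 1000000L
    val batchRows = 10000
    val cols      = 100
    val model     = DeviceMatrix(new Array[Float](cols)) // stays "on the device" throughout
    var offset    = 0L
    while (offset < totalRows) {
      val host  = readChunk(offset, batchRows, cols) // CPU work only
      val batch = toDevice(host)                     // the only transfer per step
      updateModel(model, batch)
      offset += batchRows
    }
    println(s"streamed $offset rows in minibatches of $batchRows")
  }
}
]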
Reply | Threaded
Open this post in threaded view
|

Re: Using CUDA within Spark / boosting linear algebra

rxin
Thanks for chiming in, John. I missed your meetup last night - do you have
any writeups or slides about roofline design? In particular, I'm curious
about what optimizations are available for power-law dense * sparse? (I
don't have any background in optimizations)



On Thu, Mar 12, 2015 at 8:50 PM, jfcanny <[hidden email]> wrote:

> If you're contemplating GPU acceleration in Spark, its important to look
> beyond BLAS. Dense BLAS probably account for only 10% of the cycles in the
> datasets we've tested in BIDMach, and we've tried to make them
> representative of industry machine learning workloads. Unless you're
> crunching images or audio, the majority of data will be very sparse and
> power law distributed. You need a good sparse BLAS, and in practice it
> seems
> like you need a sparse BLAS tailored for power-law data. We had to write
> our
> own since the NVIDIA libraries didnt perform well on typical power-law
> data.
> Intel MKL sparse BLAS also have issues and we only use some of them.
>
> You also need 2D reductions, scan operations, slicing, element-wise
> transcendental functions and operators, many kinds of sort, random number
> generators etc, and some kind of memory management strategy. Some of this
> was layered on top of Thrust in BIDMat, but most had to be written from
> scratch. Its all been rooflined, typically to memory throughput of current
> GPUs (around 200 GB/s).
>
> When you have all this you can write Learning Algorithms in the same
> high-level primitives available in Breeze or Numpy/Scipy. Its literally the
> same in BIDMat, since the generic matrix operations are implemented on both
> CPU and GPU, so the same code runs on either platform.
>
> A lesser known fact is that GPUs are around 10x faster for *all* those
> operations, not just dense BLAS. Its mostly due to faster streaming memory
> speeds, but some kernels (random number generation and transcendentals) are
> more than an order of magnitude thanks to some specialized hardware for
> power series on the GPU chip.
>
> When you have all this there is no need to move data back and forth across
> the PCI bus. The CPU only has to pull chunks of data off disk, unpack them,
> and feed them to the available GPUs. Most models fit comfortably in GPU
> memory these days (4-12 GB). With minibatch algorithms you can push TBs of
> data through the GPU this way.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-tp10481p11021.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Using CUDA within Spark / boosting linear algebra

Chester Chen-2
Reynold,

    Prof. Canny gave me the slides yesterday. I will post the link to the slides to both the SF Big Analytics and SF Machine Learning meetups.

Chester

Sent from my iPad

On Mar 12, 2015, at 22:53, Reynold Xin <[hidden email]> wrote:

> Thanks for chiming in, John. I missed your meetup last night - do you have
> any writeups or slides about roofline design? In particular, I'm curious
> about what optimizations are available for power-law dense * sparse? (I
> don't have any background in optimizations)
>
>
>
> On Thu, Mar 12, 2015 at 8:50 PM, jfcanny <[hidden email]> wrote:
>
>> If you're contemplating GPU acceleration in Spark, its important to look
>> beyond BLAS. Dense BLAS probably account for only 10% of the cycles in the
>> datasets we've tested in BIDMach, and we've tried to make them
>> representative of industry machine learning workloads. Unless you're
>> crunching images or audio, the majority of data will be very sparse and
>> power law distributed. You need a good sparse BLAS, and in practice it
>> seems
>> like you need a sparse BLAS tailored for power-law data. We had to write
>> our
>> own since the NVIDIA libraries didnt perform well on typical power-law
>> data.
>> Intel MKL sparse BLAS also have issues and we only use some of them.
>>
>> You also need 2D reductions, scan operations, slicing, element-wise
>> transcendental functions and operators, many kinds of sort, random number
>> generators etc, and some kind of memory management strategy. Some of this
>> was layered on top of Thrust in BIDMat, but most had to be written from
>> scratch. Its all been rooflined, typically to memory throughput of current
>> GPUs (around 200 GB/s).
>>
>> When you have all this you can write Learning Algorithms in the same
>> high-level primitives available in Breeze or Numpy/Scipy. Its literally the
>> same in BIDMat, since the generic matrix operations are implemented on both
>> CPU and GPU, so the same code runs on either platform.
>>
>> A lesser known fact is that GPUs are around 10x faster for *all* those
>> operations, not just dense BLAS. Its mostly due to faster streaming memory
>> speeds, but some kernels (random number generation and transcendentals) are
>> more than an order of magnitude thanks to some specialized hardware for
>> power series on the GPU chip.
>>
>> When you have all this there is no need to move data back and forth across
>> the PCI bus. The CPU only has to pull chunks of data off disk, unpack them,
>> and feed them to the available GPUs. Most models fit comfortably in GPU
>> memory these days (4-12 GB). With minibatch algorithms you can push TBs of
>> data through the GPU this way.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-tp10481p11021.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

1234