Large DataStructure to Broadcast

Large DataStructure to Broadcast

purav aggarwal
Hi all,

I have a large file (> 5 GB) against which I need to perform lookups. Since each
slave needs to perform the search operation on the hashmap (built out of the file)
in parallel, I need to broadcast the file. I was wondering whether broadcasting
such a huge file is really a good idea. Do we have any benchmarks for
broadcast variables? I am on a standalone cluster, and machine configuration
is not a problem at the moment.
Has anyone exploited broadcast to such an extent?
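
For concreteness, roughly what I have in mind (a sketch; it assumes an existing
SparkContext sc, an RDD of lookup keys called keysRdd, and an illustrative
tab-separated file):

  // Build the hashmap once on the driver from the large file.
  val table: Map[String, String] =
    scala.io.Source.fromFile("/path/to/large-file.tsv")
      .getLines()
      .map { line =>
        val Array(k, v) = line.split("\t", 2)
        k -> v
      }
      .toMap

  // Broadcast it so every slave receives one read-only copy,
  // instead of shipping the map with every task closure.
  val lookup = sc.broadcast(table)

  // Each task then performs purely local, in-memory lookups.
  val results = keysRdd.map(k => (k, lookup.value.get(k)))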

Thanks,
Purav

Re: Large DataStructure to Broadcast

Mosharaf Chowdhury
You should try out TorrentBroadcast (NOT BitTorrentBroadcast) from the
0.8.1 branch.
In your config file, set spark.broadcast.factory to
org.apache.spark.broadcast.TorrentBroadcastFactory.
It should perform significantly better than HttpBroadcast (some benchmarks
here: https://github.com/apache/incubator-spark/pull/68); I expect a 10X
improvement over the default.
Make sure you have enough memory in the slaves.
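
For example, a minimal sketch, assuming the 0.8.x convention of reading
configuration from Java system properties (the master URL and app name are
placeholders):

  import org.apache.spark.SparkContext

  // The property must be set before the SparkContext is created,
  // since the broadcast factory is wired up at context startup.
  System.setProperty("spark.broadcast.factory",
    "org.apache.spark.broadcast.TorrentBroadcastFactory")

  val sc = new SparkContext("spark://master:7077", "torrent-broadcast-test")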

--
Mosharaf Chowdhury
http://www.mosharaf.com/



Re: Large DataStructure to Broadcast

Christopher Nguyen
Purav, depending on the access pattern, you should also consider the
trade-offs of setting up a lookup service (using, e.g., memcached, egad!),
which may end up being more efficient overall.

The general point is not to restrict yourself to only Spark APIs when
considering the overall architecture.
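
As a sketch of that direction, assuming a memcached instance already populated
with the key/value pairs and the spymemcached client on the classpath (the
host, port, and keysRdd are illustrative):

  import java.net.InetSocketAddress
  import net.spy.memcached.MemcachedClient

  val results = keysRdd.mapPartitions { keys =>
    // One client per partition, so the connection cost is paid
    // once per task rather than once per key.
    val client =
      new MemcachedClient(new InetSocketAddress("memcached-host", 11211))
    // Materialize the lookups before shutting the client down.
    val looked = keys.map(k => (k, Option(client.get(k)))).toList
    client.shutdown()
    looked.iterator
  }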
--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen




Re: Large DataStructure to Broadcast

purav aggarwal
Thanks.
Broadcasting such huge entities does not seem like a feasible solution.
Serialization/deserialization and the network seem to impose a huge overhead
for large files.

Before I consider moving to an external lookup service (as Christopher
rightly suggested), I was wondering whether I could make each slave load the
large file into memory and do the lookup operations in parallel.

*I am stuck on how to make each slave load the file just once and then serve
the lookups.*

I tried a hack where I check whether the object is initialised and, if not,
initialise it. The problem is that with multiple threads running on a
single slave, I need a global object (specific to the JVM on that slave) that
holds off the other threads, using "synchronized", while one of them loads
the large file.
Any suggestions on what that object, unique to that particular JVM, could be?
Is SparkContext an option?




Re: Large DataStructure to Broadcast

Christopher Nguyen
Purav, look up the Singleton pattern, which is what you seem to be
describing.

The strategy you describe does not sound like a good idea, however. It
couples the "lookup" service rather strongly (and serially) to its
data-processing clients. This is usually, though not always, less robust and
less efficient.
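
For reference, in Scala the pattern boils down to an object with a lazy val;
the language synchronizes lazy-val initialization, so the file is loaded at
most once per JVM. A sketch, with the local path, parsing, and keysRdd purely
illustrative:

  object LookupTable {
    // Initialized at most once per slave JVM; Scala guards lazy val
    // initialization internally, so concurrent tasks block until the
    // first one finishes loading.
    lazy val table: Map[String, String] =
      scala.io.Source.fromFile("/local/path/large-file.tsv")
        .getLines()
        .map { line =>
          val Array(k, v) = line.split("\t", 2)
          k -> v
        }
        .toMap
  }

  // The first access on each slave loads the file; later tasks in
  // the same JVM reuse the already-loaded map.
  val results = keysRdd.map(k => (k, LookupTable.table.get(k)))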

Sent while mobile. Pls excuse typos etc.