Joining 3 tables with 17 billion records

Joining 3 tables with 17 billion records

Chetan Khatri
Hello Spark Developers,

I have 3 tables that I am reading from HBase and want to join, then save the result to an external Hive table in Parquet format. Currently the join is failing with a container-failed error. The steps are as follows (a rough code sketch appears after the list):

1. Read table A from HBase, ~17 billion records.
2. Repartition table A on its primary key.
3. Create a temp view of the table A DataFrame.
4. Read table B from HBase, ~4 billion records.
5. Repartition table B on its primary key.
6. Create a temp view of the table B DataFrame.
7. Join the views of A and B to create DataFrame C.
8. Join DataFrame C with table D.
9. coalesce(20) to reduce the number of files created from the already repartitioned DataFrame.
10. Finally, store the result to an external Hive table partitioned by skey.
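
A minimal Spark/Scala sketch of these steps, assuming dfA, dfB and dfD have already been loaded from HBase (the HBase connector API differs per distribution, so the read step is omitted) and that "pk" and "skey" are placeholder column names:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("hbase-three-table-join")      // hypothetical app name
  .enableHiveSupport()
  .getOrCreate()

// dfA (~17B rows), dfB (~4B rows) and dfD are assumed to have been read
// from HBase already; the connector-specific load code is not shown.
def runPipeline(dfA: DataFrame, dfB: DataFrame, dfD: DataFrame): Unit = {
  // Steps 2 and 5: repartition A and B on the join key.
  val a = dfA.repartition(col("pk"))
  val b = dfB.repartition(col("pk"))

  // Steps 3 and 6: temp views (optional if the DataFrame API is used directly).
  a.createOrReplaceTempView("table_a")
  b.createOrReplaceTempView("table_b")

  // Steps 7 and 8: join A with B, then the result with D.
  // Joining on Seq("pk") keeps a single copy of the key column.
  val c = a.join(b, Seq("pk"))
  val result = c.join(dfD, Seq("pk"))

  // Steps 9 and 10: reduce the number of output files and write a Parquet
  // table partitioned by skey ("analytics.result_c" is a placeholder name;
  // an explicit path would be needed for a truly external table).
  result.coalesce(20)
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("skey")
    .saveAsTable("analytics.result_c")
}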

If you have any suggestions or resources for optimizing this, please share them.

Thanks
Chetan

Re: Joining 3 tables with 17 billion records

Jörn Franke
Hi,

Do you have a more detailed log/error message?
Also, can you please provide details on the tables (number of rows, columns, size, etc.)?
Is this a one-time thing or a regular job?
If it is a one-time thing, then I would lean more towards putting each table into HDFS (Parquet or ORC) first and then joining them; a rough sketch of that approach follows below.
Which Hive and Spark versions are you using?
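
A minimal sketch of that staging approach (dfA, dfB and dfD are placeholder DataFrames already read from HBase, spark is the SparkSession, and "pk" is a placeholder join key):

// Hypothetical staging step: materialize each HBase-sourced DataFrame to
// HDFS as Parquet once, then run the joins against the on-disk copies
// instead of re-reading HBase on every attempt.
dfA.write.mode("overwrite").parquet("/staging/table_a")
dfB.write.mode("overwrite").parquet("/staging/table_b")
dfD.write.mode("overwrite").parquet("/staging/table_d")

val a = spark.read.parquet("/staging/table_a")
val b = spark.read.parquet("/staging/table_b")
val d = spark.read.parquet("/staging/table_d")

val joined = a.join(b, Seq("pk")).join(d, Seq("pk"))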

Best regards


Re: Joining 3 tables with 17 billion records

Chetan Khatri
Jorn,

This is a one-time load from historical data into an analytical Hive engine. The Hive version is 1.2.1 and the Spark version is 2.0.1, on a MapR distribution.

Writing every table to Parquet and reading it back could be very time-consuming. Currently the entire job can take ~8 hours on an 8-node cluster with 100 GB of RAM and 20 cores per node, which is shared with a larger team rather than used only by me.

Thanks




Re: Joining 3 tables with 17 billion records

Jörn Franke
Well, that sounds like a lot for “only” 17 billion records. However, you can limit the resources of the job so that it does not need to take all of them (it might just run a little longer).
Alternatively, did you try mapping the HBase tables directly in Hive as external tables and doing a simple CTAS? That works better if Hive is on Tez, but it might also be worth a try with MR as the engine.
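
One hedged way to cap the job's footprint when it is launched (all values are purely illustrative):

import org.apache.spark.sql.SparkSession

// Illustrative caps only: fix the executor count and size so the job does
// not occupy the whole shared cluster. The same limits can be passed to
// spark-submit via --num-executors, --executor-memory and --executor-cores.
val spark = SparkSession.builder()
  .appName("historical-load")                // hypothetical app name
  .config("spark.executor.instances", "20")  // effective when dynamic allocation is off
  .config("spark.executor.memory", "16g")
  .config("spark.executor.cores", "4")
  .enableHiveSupport()
  .getOrCreate()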
