spark sql versus interactive hive versus hive

spark sql versus interactive hive versus hive

Saikat Kanjilal

Folks,

I'm embarking on a project to build a POC around Spark SQL. I was wondering if anyone has experience comparing Spark SQL with Hive or interactive Hive, and has data points on the types of queries suited to each. I am naively assuming that Spark SQL will beat Hive on all queries, given that computations are mostly done in memory, but I want to hear more data points about queries that may be problematic in Spark SQL. Also, are there debugging tools people ordinarily use with Spark SQL to troubleshoot performance-related issues?


I look forward to hearing from the community.

Regards


Re: spark sql versus interactive hive versus hive

Jörn Franke
I think this is a rather simplistic view. In the end, all of these tools do their computation in memory. For certain types of computation and usage patterns it makes sense to keep the data in memory. For example, most machine learning approaches need to use the same data in several iterative calculations. This is what Spark was designed for. Most aggregations/precalculations use the data in memory only once. This is where Hive+Tez and, to a limited extent, Spark can help. The third pattern, where users query the data interactively, i.e. many concurrent users query the same or similar data very frequently, is addressed by Hive on Tez + LLAP, Hive on Tez + Ignite, or Spark + Ignite (and there are other tools).

So it is important to understand what your users want to do.

Then, there is a lot of benchmark data on the web to compare against. However, I always recommend generating or using data that matches the data your company actually uses. Also keep in mind that time is needed to convert this data into an efficient format.
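As an illustration of benchmarking on your own data rather than relying on published numbers, here is a minimal timing harness. It is only a sketch: the names `time_query` and `compare_engines` are made up for this example, and each engine is represented by a hypothetical zero-argument callable that you would replace with code that submits your real query to Spark SQL, Hive on Tez, and so on. The placeholder workloads at the bottom do not talk to any engine.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[], object],
               warmups: int = 1,
               repeats: int = 5) -> List[float]:
    """Run `run_query` and return the wall-clock duration of each measured run."""
    for _ in range(warmups):
        run_query()  # unmeasured runs, so caches and JIT are warm before timing
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations


def compare_engines(engines: Dict[str, Callable[[], object]],
                    warmups: int = 1,
                    repeats: int = 5) -> Dict[str, float]:
    """Map each engine name to its median run time in seconds."""
    return {
        name: statistics.median(time_query(fn, warmups, repeats))
        for name, fn in engines.items()
    }


if __name__ == "__main__":
    # Placeholder workloads (assumptions for illustration only). Replace each
    # callable with one that runs the same query against a real engine.
    def placeholder_engine_a() -> int:
        return sum(range(100_000))

    def placeholder_engine_b() -> int:
        return sum(x * x for x in range(100_000))

    medians = compare_engines({
        "engine_a": placeholder_engine_a,
        "engine_b": placeholder_engine_b,
    })
    for name, seconds in sorted(medians.items(), key=lambda item: item[1]):
        print(f"{name}: median {seconds:.6f} s")
```

Reporting the median over several runs, after at least one warm-up run, reduces the influence of one-off effects such as cold caches. For the interactive pattern described above, you would also want to measure with several concurrent clients, which this sketch does not do.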

(In reply to Saikat Kanjilal's message of 10 Feb 2017, 20:36.)


Re: spark sql versus interactive hive versus hive

Saikat Kanjilal
Thanks Jörn for the input. Our users want to run queries that perform large aggregations of data from different tables, as well as simple ad hoc queries over a single table. The tables are all in ORC format. They're currently using the Hive plus Tez architecture that you mention but are experiencing performance issues. One of the things we're considering is moving them to Spark SQL where it makes sense, which is why I wanted to know people's experience with the various tools.


(In reply to Jörn Franke's message of Feb 11, 2017, 12:22 AM.)