A question about creating persistent table when in-memory catalog is used

A question about creating persistent table when in-memory catalog is used

Shuai Lin
Hi all,

Currently when the in-memory catalog is used, e.g. through `--conf spark.sql.catalogImplementation=in-memory`, we can create a persistent table, but inserting into this table fails with the error message "Hive support is required to insert into the following tables..":

    sql("create table t1 (id int, name string, dept string)") // OK
    sql("insert into t1 values (1, 'name1', 'dept1')")  // ERROR


This doesn't make sense to me: if we can't insert into the table, it will always be empty and thus of no use. But I wonder if there are other good reasons for the current logic that I'm missing. If not, I would propose raising an error when the table is created in the first place.

Thanks!

Regards,
Shuai Lin (@lins05)

Re: A question about creating persistent table when in-memory catalog is used

rxin
I think this is something we are going to change to completely decouple the Hive support and catalog. 


Re: A question about creating persistent table when in-memory catalog is used

Xiao Li
We have a pending PR to block users from creating Hive serde tables when using InMemoryCatalog. See https://github.com/apache/spark/pull/16587, which I believe answers your question.

BTW, we can still create regular data source tables and insert data into them. The major difference between the two catalog implementations is whether the table metadata is persistently stored or not.
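
For example, a minimal sketch (table and column names are just for illustration): with `USING parquet` this is a data source table rather than a Hive serde table, so both statements work under the in-memory catalog:

    sql("create table t2 (id int, name string) using parquet")  // OK
    sql("insert into t2 values (1, 'name1')")                   // OK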

Thanks,

Xiao Li

Re: A question about creating persistent table when in-memory catalog is used

rxin
To be clear, there are two separate "Hive"s we are talking about here. One is the catalog, and the other is the Hive serde and UDF support. We want to get to a point where the choice of catalog does not impact any functionality in Spark other than where the catalog metadata is stored.
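
A minimal sketch of the distinction (table names are illustrative, assuming default settings): DDL with Hive's `STORED AS` syntax defines a Hive serde table, while `USING <format>` defines a data source table that does not depend on Hive at all:

    sql("create table hive_t (id int) stored as parquet")  // Hive serde table: needs Hive support
    sql("create table ds_t (id int) using parquet")        // data source table: works with either catalog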


Re: A question about creating persistent table when in-memory catalog is used

Xiao Li
Agree. : )

Re: A question about creating persistent table when in-memory catalog is used

Shuai Lin
Cool, thanks for the info.

> I think this is something we are going to change to completely decouple the Hive support and catalog.

Is there a ticket for this? I searched JIRA and only found "SPARK-16275: Implement all the Hive fallback functions", which seems related to it.


Re: A question about creating persistent table when in-memory catalog is used

Xiao Li
Reynold mentioned the direction we are heading in. Many of the PRs the community has submitted are toward this target, but there is still a lot of work to do to achieve it.

For example, for some serdes the Hive metastore can infer the table schema when one is not provided, but our InMemoryCatalog does not have such a capability, so we need to work out how to resolve this.
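
To illustrate (a hypothetical sketch: the table name and schema URL are made up, and this assumes Hive's Avro serde): the column list can be omitted entirely, because the metastore derives it from the Avro schema:

    sql("""
      create table events
      stored as avro
      tblproperties ('avro.schema.url' = 'hdfs:///schemas/events.avsc')
    """)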

Hopefully this answers your question. BTW, the issue you mentioned at the beginning has been resolved; please fetch the latest master. You can no longer create such a Hive serde table without Hive support.

Thanks, 

Xiao Li


Re: A question about creating persistent table when in-memory catalog is used

Shuai Lin
I see, thanks for the info!
