[DISCUSS] Add RocksDB StateStore

[DISCUSS] Add RocksDB StateStore

Liang-Chi Hsieh
Hi devs,

In Spark Structured Streaming, we need a state store for state management for
stateful operators such as streaming aggregations, joins, etc. We have one and
only one state store implementation now. It is an in-memory hash map which is
backed up to an HDFS-compliant file system at the end of every micro-batch.
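
To make the design described above concrete, here is a minimal toy sketch in Python. It is not Spark's actual implementation: the class name, the JSON delta-file layout, and the use of a local temporary directory in place of an HDFS-compliant file system are all assumptions made for illustration, and deletes are not handled.

```python
import json
import os
import tempfile


class InMemoryStateStore:
    """Toy model (hypothetical): state lives in a dict; each commit writes a delta file."""

    def __init__(self, checkpoint_dir):
        self.checkpoint_dir = checkpoint_dir
        self.version = 0
        self.state = {}    # all state is held in memory
        self._delta = {}   # changes made during the current micro-batch

    def put(self, key, value):
        self.state[key] = value
        self._delta[key] = value

    def get(self, key, default=None):
        return self.state.get(key, default)

    def commit(self):
        """End of a micro-batch: persist this batch's changes as a new version."""
        self.version += 1
        path = os.path.join(self.checkpoint_dir, f"{self.version}.delta")
        with open(path, "w") as f:
            json.dump(self._delta, f)
        self._delta = {}
        return self.version

    @classmethod
    def restore(cls, checkpoint_dir, version):
        """Rebuild the in-memory map by replaying delta files 1..version."""
        store = cls(checkpoint_dir)
        for v in range(1, version + 1):
            with open(os.path.join(checkpoint_dir, f"{v}.delta")) as f:
                store.state.update(json.load(f))
        store.version = version
        return store


def run_demo():
    """Run two micro-batches of a running word count, then recover from files."""
    with tempfile.TemporaryDirectory() as ckpt:
        store = InMemoryStateStore(ckpt)
        for batch in (["a", "b", "a"], ["b", "c"]):
            for word in batch:
                store.put(word, store.get(word, 0) + 1)
            store.commit()
        recovered = InMemoryStateStore.restore(ckpt, store.version)
        return store.state, recovered.state
```

The point of the sketch is that the whole `state` dict has to fit in memory, while the files only serve recovery.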

As it basically uses an in-memory map to store state, memory consumption is a
serious issue and the state store size is limited by the size of the executor
memory. Moreover, a state store that uses more memory may hurt the
performance of task execution, which requires memory too.

Internally we see more streaming applications that require large state in
stateful operations. For such requirements, we need a StateStore that does not
rely on memory to store state.

This seems to be also true externally as several other major streaming
frameworks already use RocksDB for state management. RocksDB is an embedded
DB and streaming engines can use it to store state instead of memory
storage.

So it seems to me that it is proven to be a good choice for large state usage. But
Spark SS still lacks a built-in state store for this requirement.

Previously there was one attempt, SPARK-28120, to add a RocksDB StateStore to
Spark SS. IIUC, it was pushed back due to two concerns: the extra code
maintenance cost and the RocksDB dependency it introduces.

For the first concern, as more users need the feature, it should become
heavily used code in SS and more developers will look at it. For the second
one, we propose (SPARK-34198) adding it as an external module to relieve the
dependency concern.

Because it was pushed back previously, I'm raising this discussion to
learn what people think about it now, before submitting any code.

I think there might be some possible opinions:

1. okay to add RocksDB StateStore into sql core module
2. not okay for 1, but okay to add RocksDB StateStore as external module
3. either 1 or 2 is okay
4. not okay to add RocksDB StateStore, no matter into sql core or as
external module

Please let us know if you have some thoughts.

Thank you.

Liang-Chi Hsieh




--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: [DISCUSS] Add RocksDB StateStore

redsk
Hi,

FYI, I have been using the project at
https://github.com/chermenin/spark-states
for a few months and it has been working well for me.

-Nico



Re: [DISCUSS] Add RocksDB StateStore

Yikun Jiang
In reply to this post by Liang-Chi Hsieh
I have done some work on RocksDB multi-arch support and version upgrades in
Kafka/Storm/Flink [1][2][3]. To avoid the same issues happening again in Spark, I want to
give some input here on RocksDB version selection from a multi-arch support
point of view. Hope it helps.

RocksDB added Arm64 support [4] in version 6.4.6, and also backported all Arm64-related
commits to 5.18.4, releasing a version that supports all platforms.

So, from a multi-arch support point of view, the better RocksDB version is v6.4.6 or later,
or v5.18.4 in the 5.x line.
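
The rule of thumb above can be written as a small check. This is only a sketch of the guidance in this message: the function names are hypothetical, and it assumes plain dotted version strings with an optional leading "v".

```python
def parse_version(text):
    """Turn 'v6.4.6' or '5.18.4' into a tuple of ints such as (6, 4, 6)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def has_arm64_support(version_text):
    """Apply the rule of thumb from this message (an assumption, not an official API)."""
    version = parse_version(version_text)
    if version[0] == 5:
        # 5.x line: the Arm64-related commits were backported to 5.18.4
        return version >= (5, 18, 4)
    # Otherwise: Arm64 support was added in 6.4.6
    return version >= (6, 4, 6)
```

For example, `has_arm64_support("5.18.3")` is false under this rule, while `has_arm64_support("v6.4.6")` is true.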


Regards,
Yikun

Re: [DISCUSS] Add RocksDB StateStore

Liang-Chi Hsieh
Thank you for the input, Yikun! Let's take it into account when we are ready to
add a RocksDB state store to Spark SS.


Yikun Jiang wrote

> I have done some work on RocksDB multi-arch support and version upgrades in
> Kafka/Storm/Flink [1][2][3]. To avoid the same issues happening again in Spark,
> I want to give some input here on RocksDB version selection from a multi-arch
> support point of view. Hope it helps.
>
> RocksDB added Arm64 support [4] in version 6.4.6, and also backported all
> Arm64-related commits to 5.18.4, releasing a version that supports all platforms.
>
> So, from a multi-arch support point of view, the better RocksDB version is
> v6.4.6 or later, or v5.18.4 in the 5.x line.
>
> [1] https://issues.apache.org/jira/browse/STORM-3599
> [2] https://github.com/apache/kafka/pull/8284
> [3] https://issues.apache.org/jira/browse/FLINK-13598
> [4] https://github.com/facebook/rocksdb/pull/6250
>
> Regards,
> Yikun





Re: [DISCUSS] Add RocksDB StateStore

Jacek Laskowski
In reply to this post by Liang-Chi Hsieh
Hi,

I'm "okay to add RocksDB StateStore as external module". See no reason not to.

Re: [DISCUSS] Add RocksDB StateStore

Dongjoon Hyun-2
Thank you, Liang-chi and all.

+1 for (2) external module design because it can deliver the new feature in a safe way.

Bests,
Dongjoon

Re: [DISCUSS] Add RocksDB StateStore

Cheng Su-2

+1 for (2), adding it as an external module.

I think this feature is useful and popular in practice, and option 2 does not conflict with the previous concern about the dependency.

Thanks,
Cheng Su

Re: [DISCUSS] Add RocksDB StateStore

Holden Karau
+1 for an external module.

--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
Re: [DISCUSS] Add RocksDB StateStore

Gabor Somogyi
+1 for adding it either way.

Re: [DISCUSS] Add RocksDB StateStore

DB Tsai-3
+1 to add it as an external module so people can test it out and give
feedback more easily.

--
Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

Re: [DISCUSS] Add RocksDB StateStore

Jungtaek Lim-2
In reply to this post by Gabor Somogyi
+1 to adding it, whether under sql-core or as an external module.

My rationale:

* The discussion thread and the voices here show strong demand for a RocksDB state store out of the box.
* There is no out-of-the-box workaround for the huge state store problem. Direct competitors among streaming frameworks have provided one for years.
* Maintenance cost is the major concern when evaluating whether to add something, but it doesn't apply here, as contributors/committers from various companies are willing to contribute.
* The Apache Bahir project is no longer actively maintained: the last release was in September 2019, based on Spark 2.4.0. We can no longer easily say "let's add it to Bahir instead".



Re: [DISCUSS] Add RocksDB StateStore

Hyukjin Kwon
In reply to this post by DB Tsai-3
I'm good with this too.

2021년 2월 9일 (화) 오후 4:16, DB Tsai <[hidden email]>님이 작성:
+1 to add it as an external module so people can test it out and give
feedback easier.

On Mon, Feb 8, 2021 at 10:22 PM Gabor Somogyi <[hidden email]> wrote:
>
> +1 adding it any way.
>
> On Mon, 8 Feb 2021, 21:54 Holden Karau, <[hidden email]> wrote:
>>
>> +1 for an external module.
>>
>> On Mon, Feb 8, 2021 at 11:51 AM Cheng Su <[hidden email]> wrote:
>>>
>>> +1 for (2) adding to external module.
>>>
>>> I think this feature is useful and popular in practice, and option 2 is not conflict with previous concern for dependency.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Cheng Su
>>>
>>>
>>>
>>> From: Dongjoon Hyun <[hidden email]>
>>> Date: Monday, February 8, 2021 at 10:39 AM
>>> To: Jacek Laskowski <[hidden email]>
>>> Cc: Liang-Chi Hsieh <[hidden email]>, dev <[hidden email]>
>>> Subject: Re: [DISCUSS] Add RocksDB StateStore
>>>
>>>
>>>
>>> Thank you, Liang-chi and all.
>>>
>>>
>>>
>>> +1 for (2) external module design because it can deliver the new feature in a safe way.
>>>
>>>
>>>
>>> Bests,
>>>
>>> Dongjoon
>>>
>>>
>>>
>>> On Mon, Feb 8, 2021 at 9:00 AM Jacek Laskowski <[hidden email]> wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> I'm "okay to add RocksDB StateStore as external module". See no reason not to.
>>>
>>>
>>> Pozdrawiam,
>>>
>>> Jacek Laskowski
>>>
>>> ----
>>>
>>> https://about.me/JacekLaskowski
>>>
>>> "The Internals Of" Online Books
>>>
>>> Follow me on https://twitter.com/jaceklaskowski
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau



--
Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1



Re: [DISCUSS] Add RocksDB StateStore

Hyukjin Kwon
To clarify, I mean I am okay with adding it as an external module :-)


Re: [DISCUSS] Add RocksDB StateStore

Liang-Chi Hsieh
Hi devs,

Thanks for all the input. Overall, I think the Spark community is positive
about having a RocksDB state store as an external module. So let's move
forward in this direction to improve Structured Streaming. I will keep
the JIRA SPARK-34198 updated.

Thanks again, all, for the input and discussion.

Liang-Chi Hsieh







Re: [DISCUSS] Add RocksDB StateStore

rxin
Late +1



