KryoSerializer Implementation - Not using KryoPool

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

KryoSerializer Implementation - Not using KryoPool

Patrick Brown
Hi,

I am wondering about the implementation of KryoSerializer, specifically the lack of use of KryoPool, which is recommended by Kryo themselves.

Looking at the code, it seems that frequently KryoSerializer.newInstance is called, followed by a serialize and then this instance goes out of scope, this seems like it causes frequent creation of Kryo instances, something which the Kryo documentation says is expensive.

By doing flame graphs on our own running software (it processes a lot of small jobs) it seems like a good amount of time is spent on this.

I have a small patch we are using internally which implements a reused KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to avoid the creation of many Kryo instances. I am wonder if I am missing something as to why this isn't done already. If not I am wondering if this might be a patch that Spark would be interested in merging in, and how I might go about that.

Thanks,

Patrick
Reply | Threaded
Open this post in threaded view
|

Re: KryoSerializer Implementation - Not using KryoPool

Sean Owen-2
I don't know; possibly just because it wasn't available whenever Kryo
was first used in the project.

Skimming the code, the KryoSerializerInstance looks like a wrapper
that provides a Kryo object to do work. It already maintains a 'pool'
of just 1 instance. Is the point that KryoSerializer can share a
KryoPool across KryoSerializerInstances that provides them with a Kryo
rather than allocate a new one? makes sense, though I believe the
concern is always whether that somehow shares state or config in a way
that breaks something. I see there's already a reset() call in here to
try to avoid that.

Well, seems worth a PR, especially if you can demonstrate some
performance gains.

On Wed, Oct 24, 2018 at 3:09 PM Patrick Brown
<[hidden email]> wrote:

>
> Hi,
>
> I am wondering about the implementation of KryoSerializer, specifically the lack of use of KryoPool, which is recommended by Kryo themselves.
>
> Looking at the code, it seems that frequently KryoSerializer.newInstance is called, followed by a serialize and then this instance goes out of scope, this seems like it causes frequent creation of Kryo instances, something which the Kryo documentation says is expensive.
>
> By doing flame graphs on our own running software (it processes a lot of small jobs) it seems like a good amount of time is spent on this.
>
> I have a small patch we are using internally which implements a reused KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to avoid the creation of many Kryo instances. I am wonder if I am missing something as to why this isn't done already. If not I am wondering if this might be a patch that Spark would be interested in merging in, and how I might go about that.
>
> Thanks,
>
> Patrick

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: KryoSerializer Implementation - Not using KryoPool

Patrick Brown
Based on my, limited, read through of the code that uses this, it seems like often a new KryoSerializerInstance is created for whatever task and then it falls out of scope, instead of being reused. I did notice that comment about a pool size of 1, however if the use is generally how I just described, I don't think that really does anything.

I would be happy to open a PR but, pardon my ignorance, how would I go about doing that properly? Do I need to open a JIRA issue first? Also how would I demonstrate performance gains? Do you guys use something like ScalaMeter?

Thanks for your help!

On Wed, Oct 24, 2018 at 2:37 PM Sean Owen <[hidden email]> wrote:
I don't know; possibly just because it wasn't available whenever Kryo
was first used in the project.

Skimming the code, the KryoSerializerInstance looks like a wrapper
that provides a Kryo object to do work. It already maintains a 'pool'
of just 1 instance. Is the point that KryoSerializer can share a
KryoPool across KryoSerializerInstances that provides them with a Kryo
rather than allocate a new one? makes sense, though I believe the
concern is always whether that somehow shares state or config in a way
that breaks something. I see there's already a reset() call in here to
try to avoid that.

Well, seems worth a PR, especially if you can demonstrate some
performance gains.

On Wed, Oct 24, 2018 at 3:09 PM Patrick Brown
<[hidden email]> wrote:
>
> Hi,
>
> I am wondering about the implementation of KryoSerializer, specifically the lack of use of KryoPool, which is recommended by Kryo themselves.
>
> Looking at the code, it seems that frequently KryoSerializer.newInstance is called, followed by a serialize and then this instance goes out of scope, this seems like it causes frequent creation of Kryo instances, something which the Kryo documentation says is expensive.
>
> By doing flame graphs on our own running software (it processes a lot of small jobs) it seems like a good amount of time is spent on this.
>
> I have a small patch we are using internally which implements a reused KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to avoid the creation of many Kryo instances. I am wonder if I am missing something as to why this isn't done already. If not I am wondering if this might be a patch that Spark would be interested in merging in, and how I might go about that.
>
> Thanks,
>
> Patrick
Reply | Threaded
Open this post in threaded view
|

Re: KryoSerializer Implementation - Not using KryoPool

Sean Owen-2
It's not so much the KryoSerializerInstance that's the problem, but
that it will always make a new Kryo (although at most 1). You mean to
supply it with a reference to a pool instead, shared across all
KryoSerializerInstance? plausible yeah.

See https://spark.apache.org/contributing.html for guidance but
basically you'd make a JIRA and then a pull request at apache/spark,
with the JIRA number in the title and some other conventions.
On Thu, Oct 25, 2018 at 11:51 AM Patrick Brown
<[hidden email]> wrote:

>
> Based on my, limited, read through of the code that uses this, it seems like often a new KryoSerializerInstance is created for whatever task and then it falls out of scope, instead of being reused. I did notice that comment about a pool size of 1, however if the use is generally how I just described, I don't think that really does anything.
>
> I would be happy to open a PR but, pardon my ignorance, how would I go about doing that properly? Do I need to open a JIRA issue first? Also how would I demonstrate performance gains? Do you guys use something like ScalaMeter?
>
> Thanks for your help!
>
> On Wed, Oct 24, 2018 at 2:37 PM Sean Owen <[hidden email]> wrote:
>>
>> I don't know; possibly just because it wasn't available whenever Kryo
>> was first used in the project.
>>
>> Skimming the code, the KryoSerializerInstance looks like a wrapper
>> that provides a Kryo object to do work. It already maintains a 'pool'
>> of just 1 instance. Is the point that KryoSerializer can share a
>> KryoPool across KryoSerializerInstances that provides them with a Kryo
>> rather than allocate a new one? makes sense, though I believe the
>> concern is always whether that somehow shares state or config in a way
>> that breaks something. I see there's already a reset() call in here to
>> try to avoid that.
>>
>> Well, seems worth a PR, especially if you can demonstrate some
>> performance gains.
>>
>> On Wed, Oct 24, 2018 at 3:09 PM Patrick Brown
>> <[hidden email]> wrote:
>> >
>> > Hi,
>> >
>> > I am wondering about the implementation of KryoSerializer, specifically the lack of use of KryoPool, which is recommended by Kryo themselves.
>> >
>> > Looking at the code, it seems that frequently KryoSerializer.newInstance is called, followed by a serialize and then this instance goes out of scope, this seems like it causes frequent creation of Kryo instances, something which the Kryo documentation says is expensive.
>> >
>> > By doing flame graphs on our own running software (it processes a lot of small jobs) it seems like a good amount of time is spent on this.
>> >
>> > I have a small patch we are using internally which implements a reused KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to avoid the creation of many Kryo instances. I am wonder if I am missing something as to why this isn't done already. If not I am wondering if this might be a patch that Spark would be interested in merging in, and how I might go about that.
>> >
>> > Thanks,
>> >
>> > Patrick

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]