Performance of VectorizedRleValuesReader

Performance of VectorizedRleValuesReader

Chang Chen
Hi experts,

It looks like there is a hot spot in VectorizedRleValuesReader#readNextGroup():

case PACKED:
  int numGroups = header >>> 1;
  this.currentCount = numGroups * 8;

  if (this.currentBuffer.length < this.currentCount) {
    this.currentBuffer = new int[this.currentCount];
  }
  currentBufferIdx = 0;
  int valueIndex = 0;
  while (valueIndex < this.currentCount) {
    // values are bit packed 8 at a time, so reading bitWidth will always work
    ByteBuffer buffer = in.slice(bitWidth);
    this.packer.unpack8Values(buffer, buffer.position(), this.currentBuffer, valueIndex);
    valueIndex += 8;
  }

Per my profile, readNextGroup() spends about 30% of its time in slice(). Why can't we call slice() outside the loop?
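For reference, a PACKED run stores its values in groups of 8, with each group occupying bitWidth bytes, which is why the loop above advances valueIndex by 8 while slicing bitWidth bytes per iteration. A minimal sketch of that layout (the unpacker below is a hypothetical stand-in for Parquet's generated BytePacker, hard-coded for bitWidth = 2; values are packed LSB-first):

```java
public class PackedLayoutSketch {
    // Hypothetical stand-in for Parquet's BytePacker, hard-coded for
    // bitWidth = 2. Values are packed LSB-first within each group.
    static int[] unpack8Values(byte[] data, int offset) {
        int[] out = new int[8];
        for (int i = 0; i < 8; i++) {
            int bit = i * 2; // bit position of value i within the group
            out[i] = (data[offset + bit / 8] >>> (bit % 8)) & 0x3;
        }
        return out;
    }

    public static void main(String[] args) {
        int header = 0x5;             // low bit 1 = packed run
        int numGroups = header >>> 1; // 2 groups of 8 values
        int bitWidth = 2;             // each group occupies bitWidth bytes
        byte[] run = {(byte) 0xE4, (byte) 0xE4, (byte) 0xFF, 0x00};
        int[] values = new int[numGroups * 8];
        for (int g = 0; g < numGroups; g++) {
            // one group = bitWidth bytes = 8 values; this is the stride
            // the per-iteration slice(bitWidth) walks in readNextGroup()
            System.arraycopy(unpack8Values(run, g * bitWidth), 0, values, g * 8, 8);
        }
        System.out.println(java.util.Arrays.toString(values));
        // [0, 1, 2, 3, 0, 1, 2, 3, 3, 3, 3, 3, 0, 0, 0, 0]
    }
}
```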

Re: Performance of VectorizedRleValuesReader

Sean Owen
It certainly can't be called once - it's reading different data each time.
There might be a faster way to do it, I don't know. Do you have ideas?

On Sun, Sep 13, 2020 at 9:25 PM Chang Chen <[hidden email]> wrote:


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Performance of VectorizedRleValuesReader

Chang Chen
I think we can copy all of the encoded data into a ByteBuffer once and unpack the values in the loop:

while (valueIndex < this.currentCount) {
  // values are bit packed 8 at a time, so reading bitWidth will always work;
  // each group of 8 values occupies bitWidth bytes, so the byte offset
  // advances by bitWidth per group
  this.packer.unpack8Values(buffer, buffer.position() + (valueIndex / 8) * bitWidth, this.currentBuffer, valueIndex);
  valueIndex += 8;
}

Sean Owen <[hidden email]> wrote on Mon, Sep 14, 2020 at 10:40 AM:

Re: Performance of VectorizedRleValuesReader

Sean Owen
Ryan, do you happen to have any opinion here? That particular section was introduced in the Parquet 1.10 update:
https://github.com/apache/spark/commit/cac9b1dea1bb44fa42abf77829c05bf93f70cf20
It looks like it didn't use to create a ByteBuffer each time, but read from `in` directly.

On Sun, Sep 13, 2020 at 10:48 PM Chang Chen <[hidden email]> wrote:


Re: Performance of VectorizedRleValuesReader

Ryan Blue
Before, the input was a byte array so we could read from it directly. Now, the input is a `ByteBufferInputStream` so that Parquet can choose how to allocate buffers. For example, we use vectored reads from S3 that pull back multiple buffers in parallel.

Now that the input is a stream based on possibly multiple byte buffers, it provides a method to get a buffer of a certain length. In most cases, that will create a ByteBuffer with the same backing byte array, but it may need to copy if the request spans multiple buffers in the stream. Most of the time, the call to `slice` only requires duplicating the buffer and setting its limit, but a read that spans multiple buffers is expensive. It would be helpful to know whether the time spent is copying data, which would indicate the backing buffers are too small, or whether it is spent duplicating the backing byte buffer.
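For illustration, here is a simplified sketch of the cheap single-buffer path described above (a hypothetical stand-in, not Parquet's actual SingleBufferInputStream): each call shares the backing array, so no data is copied, but a fresh ByteBuffer is still allocated per call via duplicate():

```java
import java.nio.ByteBuffer;

// Simplified sketch of slicing from a single backing buffer: no data copy,
// but every call allocates a new ByteBuffer (a duplicate with its limit set
// to the slice end) and advances the stream position.
public class SliceSketch {
    private final ByteBuffer buffer;

    public SliceSketch(byte[] bytes) {
        this.buffer = ByteBuffer.wrap(bytes);
    }

    public ByteBuffer slice(int length) {
        ByteBuffer copy = buffer.duplicate();        // shares the backing array
        copy.limit(buffer.position() + length);      // bound the slice
        buffer.position(buffer.position() + length); // advance the stream
        return copy;
    }

    public static void main(String[] args) {
        SliceSketch in = new SliceSketch(new byte[]{1, 2, 3, 4});
        ByteBuffer first = in.slice(2);
        ByteBuffer second = in.slice(2);
        System.out.println(first.get() + " " + second.get()); // prints "1 3"
    }
}
```

A stream backed by multiple buffers cannot take this shortcut when a request spans a buffer boundary; there it must allocate and copy, which is the expensive case.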

On Mon, Sep 14, 2020 at 5:29 AM Sean Owen <[hidden email]> wrote:

--
Ryan Blue

Re: Performance of VectorizedRleValuesReader

Chang Chen
I see.

In our case we use SingleBufferInputStream, so the time is spent duplicating the backing byte buffer.
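For the single-buffer case, one possible direction (an untested sketch, not a patch: readPackedRun and the bitWidth-2 unpack8Values below are hypothetical stand-ins for the reader's loop and Parquet's BytePacker) is to duplicate the run once and advance a byte offset per group, avoiding a ByteBuffer allocation per iteration:

```java
import java.nio.ByteBuffer;

// Sketch: slice the whole packed run once, then index into it per group
// instead of calling slice(bitWidth) on every loop iteration.
public class HoistedSliceSketch {
    static final int BIT_WIDTH = 2; // hard-coded for this sketch

    // Hypothetical stand-in for BytePacker.unpack8Values at bitWidth = 2:
    // reads BIT_WIDTH bytes starting at `offset`, LSB-first.
    static void unpack8Values(ByteBuffer buf, int offset, int[] out, int outIndex) {
        for (int i = 0; i < 8; i++) {
            int bit = i * BIT_WIDTH;
            out[outIndex + i] = (buf.get(offset + bit / 8) >>> (bit % 8)) & 0x3;
        }
    }

    static int[] readPackedRun(ByteBuffer in, int numGroups) {
        int[] values = new int[numGroups * 8];
        // One duplicate for the whole run instead of one slice per group.
        ByteBuffer run = in.duplicate();
        int base = run.position();
        for (int g = 0; g < numGroups; g++) {
            unpack8Values(run, base + g * BIT_WIDTH, values, g * 8);
        }
        return values;
    }

    public static void main(String[] args) {
        ByteBuffer in = ByteBuffer.wrap(
            new byte[]{(byte) 0xE4, (byte) 0xE4, (byte) 0xFF, (byte) 0xFF});
        System.out.println(java.util.Arrays.toString(readPackedRun(in, 2)));
        // [0, 1, 2, 3, 0, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3]
    }
}
```

This only works when the run is known to sit in one backing buffer; a multi-buffer stream would still need the copy path.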


Thanks,
Chang

Ryan Blue <[hidden email]> wrote on Tue, Sep 15, 2020 at 2:04 AM: