RE: Partitions at DataSource API V2

RE: Partitions at DataSource API V2

JOAQUIN GUANTER GONZALBEZ
I'd like to bump this. I agree with Carlos that there is very little information available at the DataSourceWriter/DataSourceReader level. To me, ideally, the DataSourceWriter/Reader should have as much information as possible: not only the number of partitions, but ideally the whole execution plan as well.

This would not only enable things like automatic creation of Kafka topics with the correct number of partitions (as Carlos mentioned), but would also allow advanced DataSources that, for example, analyze the execution plan to choose the correct parameters for implementing differential privacy.

CC'ing Ryan, since he is leading the DataSourceV2 workgroup (sorry I can't join the sync meetings, but I'm in CET and the time logistics of that meeting don't work for Europe).

Ryan, do you think it would be a good idea to provide extra information at the DataSourceWriter/Reader level to enable more advanced data sources? Would a PR contribution with these changes be a welcome addition?

Thanks,
Ximo

-----Original Message-----
From: CARLOS DEL PRADO MOTA <[hidden email]>
Sent: Thursday, March 7, 2019 10:19
To: [hidden email]
Subject: Partitions at DataSource API V2

Hello, I’m Carlos del Prado, a developer at Telefonica.

We are working with Spark's DataSource API V2, building a custom Kafka connector that creates the topic upon write. To do that, we need to know the number of partitions at the DataSourceWriter level, before the data in each partition is written.

Is there any way for us to do that?

Kind regards,
Carlos.



Re: Partitions at DataSource API V2

Joseph Torres
The reader necessarily knows the number of partitions, since it's responsible for generating its output partitions in the first place. I won't speak for everyone, but it would make sense to me to pass in a Partitioning instance to the writer, since it's already part of the v2 interface through the reader's SupportsReportPartitioning.
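
To make that suggestion concrete, here is a minimal Java sketch of what handing a partitioning to the writer could look like. None of this is existing Spark API: `Partitioning` below is a simplified stand-in that models only `numPartitions()`, and `PartitioningAwareWriter`, `TopicCreatingWriter`, and `plannedTopicPartitions` are hypothetical names invented for illustration.

```java
import java.util.OptionalInt;

class PartitioningSketch {

    /** Simplified stand-in for a reader-side partitioning; only the count is modeled. */
    interface Partitioning {
        int numPartitions();
    }

    /** Hypothetical hook: the engine would call it once, before any partition is written. */
    interface PartitioningAwareWriter {
        void setInputPartitioning(Partitioning partitioning);
    }

    /** Hypothetical writer that sizes a Kafka topic from the input partitioning. */
    static class TopicCreatingWriter implements PartitioningAwareWriter {
        private final String topic;
        private OptionalInt plannedPartitions = OptionalInt.empty();

        TopicCreatingWriter(String topic) {
            this.topic = topic;
        }

        @Override
        public void setInputPartitioning(Partitioning partitioning) {
            int n = partitioning.numPartitions();
            if (n <= 0) {
                throw new IllegalArgumentException("numPartitions must be positive: " + n);
            }
            // A real connector would create the topic here; this sketch only records the size.
            plannedPartitions = OptionalInt.of(n);
        }

        String topic() {
            return topic;
        }

        OptionalInt plannedTopicPartitions() {
            return plannedPartitions;
        }
    }

    public static void main(String[] args) {
        TopicCreatingWriter writer = new TopicCreatingWriter("events");
        writer.setInputPartitioning(() -> 8); // pretend the input has 8 partitions
        System.out.println(writer.topic() + ": " + writer.plannedTopicPartitions().getAsInt());
    }
}
```

The only design point the sketch is meant to show is ordering: the partition count reaches the writer before any per-partition write begins, which is what the topic-creation use case needs.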

I don't think we can expose execution plans to the data source v2 interface; the exact Java structure of execution plans isn't stable across even maintenance releases. Even if we could, I don't really see what the use case would be - what information does the writer need that can't be made available through either the input data or the input partitioning? (The built-in Kafka sink, for example, handles metadata such as topic switching by just accepting topic name as a column along with the data.)
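
The idea of carrying metadata in the data itself can be illustrated without any Spark dependency. The Java sketch below is only a stand-in model, not the built-in sink's code: the `Row` class and the `groupByTopic` helper are hypothetical. It just shows that a per-row topic field is enough to route records, with no plan-level information involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopicColumnSketch {

    /** Hypothetical row: the destination topic travels alongside the payload. */
    static final class Row {
        final String topic;
        final String value;

        Row(String topic, String value) {
            this.topic = topic;
            this.value = value;
        }
    }

    /** Routes each row by its own topic field; topics keep first-seen order. */
    static Map<String, List<String>> groupByTopic(List<Row> rows) {
        Map<String, List<String>> byTopic = new LinkedHashMap<>();
        for (Row row : rows) {
            byTopic.computeIfAbsent(row.topic, t -> new ArrayList<>()).add(row.value);
        }
        return byTopic;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("clicks", "c1"),
                new Row("views", "v1"),
                new Row("clicks", "c2"));
        // Each record lands under the topic named in its own row.
        System.out.println(groupByTopic(rows));
    }
}
```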

On Wed, Mar 13, 2019 at 1:39 AM JOAQUIN GUANTER GONZALBEZ <[hidden email]> wrote:
> I'd like to bump this. [...]