Documenting the various DataFrame/SQL join types

Nicholas Chammas

The documentation for DataFrame.join() lists all the join types we support:

  • inner
  • cross
  • outer
  • full
  • full_outer
  • left
  • left_outer
  • right
  • right_outer
  • left_semi
  • left_anti

Some of these join types are also listed on the SQL Programming Guide.

Is it obvious to everyone what all these different join types are? For example, I had never heard of a LEFT ANTI join until stumbling on it in the PySpark docs. It’s quite handy! But I had to experiment with it a bit just to understand what it does.
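To make the less familiar types concrete, here is a minimal sketch of what inner, left_semi, and left_anti joins return. It is a plain-Python model over lists of tuples, not Spark code; the function names and sample rows are made up for illustration, and it ignores null-key handling.

```python
# Plain-Python model of three join types. Illustrative only: these helper
# names and sample rows are hypothetical and are not part of the Spark API.

left = [(1, "alice"), (2, "bob"), (3, "carol")]   # (id, name)
right = [(1, "NYC"), (3, "SF"), (3, "LA")]        # (id, city)


def inner_join(left_rows, right_rows):
    """One output row per matching (left, right) pair, with columns from both sides."""
    return [
        (key, name, city)
        for key, name in left_rows
        for right_key, city in right_rows
        if key == right_key
    ]


def left_semi_join(left_rows, right_rows):
    """Left rows that have at least one match; left columns only, no duplication."""
    right_keys = {key for key, _ in right_rows}
    return [row for row in left_rows if row[0] in right_keys]


def left_anti_join(left_rows, right_rows):
    """Left rows that have no match on the right; left columns only."""
    right_keys = {key for key, _ in right_rows}
    return [row for row in left_rows if row[0] not in right_keys]


print(inner_join(left, right))      # carol appears twice, once per matching city
print(left_semi_join(left, right))  # alice and carol, each exactly once
print(left_anti_join(left, right))  # only bob, who has no city
```

In this model a left_semi join behaves like a filter ("keep left rows that have a match"), and left_anti is its complement ("keep left rows that have none"), which is why neither one carries any columns from the right side.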

I think it would be a good service to our users if we either documented these join types clearly ourselves, or linked to an external resource that documents them sufficiently. I’m happy to file a JIRA about this and do the work myself. It would be great if the documentation could be expressed as a series of simple doc tests, but brief prose describing how each join works would still be valuable.

Does this seem worthwhile to folks here? And does anyone want to offer guidance on how best to provide this kind of documentation so that it’s easy to find by users, regardless of the language they’re using?

Nick


Re: Documenting the various DataFrame/SQL join types

rxin
Would be great to document. Probably best with examples. 

(In reply to Nicholas Chammas's message of Tue, May 8, 2018 at 6:13 AM.)


Re: Documenting the various DataFrame/SQL join types

Nicholas Chammas

OK great, I’m happy to take this on.

Does it make sense to approach this by adding an example for each join type here (and perhaps also in the matching areas for Scala, Java, and R), and then referencing the examples from the SQL Programming Guide using include_example tags?

e.g.:

<div data-lang="python"  markdown="1">
{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
</div>

And would this let me implement simple tests for the examples? It’s not clear to me whether the comment blocks in that example file are used for testing somehow.
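For context on the comment blocks, a rough sketch of how label-based snippet extraction could work is below. It rests on an assumption not confirmed in this thread: that the example files delimit each snippet with `$example on:<label>$` / `$example off:<label>$` comments, and that the docs build copies only the lines between them. The `extract_example` helper, the `left_anti_join` label, and the sample source are all hypothetical; this is a simplified Python stand-in, not the docs plugin itself.

```python
import textwrap

# Hypothetical example-file content. It is only ever handled as text here, so
# the names inside it (customers, orders, spark) never need to exist.
SAMPLE_SOURCE = '''
def left_anti_join_example(spark):
    # $example on:left_anti_join$
    unmatched = customers.join(orders, "id", "left_anti")
    unmatched.show()
    # $example off:left_anti_join$
'''


def extract_example(source, label):
    """Return the dedented lines between the on/off markers for `label`.

    Simplified illustration of marker-based extraction; the marker format is
    an assumption about the example files, not a documented contract.
    """
    on_marker = "$example on:" + label + "$"
    off_marker = "$example off:" + label + "$"
    keeping = False
    kept = []
    for line in source.splitlines():
        if on_marker in line:
            keeping = True
        elif off_marker in line:
            keeping = False
        elif keeping:
            kept.append(line)
    return textwrap.dedent("\n".join(kept))


print(extract_example(SAMPLE_SOURCE, "left_anti_join"))
```

If that assumption holds, the markers only select which lines appear in the rendered guide. They would not test anything by themselves, so any checking of the examples would have to come from running the example files separately.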

Just looking for some high level guidance.

Nick


(In reply to Reynold Xin's message of Tue, May 8, 2018 at 11:42 AM.)