Spark SQL upgrade / migration guide: discoverability and content organization


Josh Rosen
I'd like to discuss the Spark SQL migration / upgrade guides in the Spark documentation: these are valuable resources and I think we could increase that value by making these docs easier to discover and by adding a bit more structure to the existing content.

For folks who aren't familiar with these docs: the Spark docs have a "SQL Migration Guide" which lists the deprecations and changes of behavior in each release.
A lot of community work went into crafting this doc and I really appreciate those efforts.

This doc is a little hard to find, though, because it's not consistently linked from release notes pages: the 2.4.0 page links it under "Changes of Behavior" (https://spark.apache.org/releases/spark-release-2-4-0.html#changes-of-behavior) but subsequent maintenance releases do not link to it (https://spark.apache.org/releases/spark-release-2-4-1.html). It's also not very cross-linked from the rest of the Spark docs (e.g. the Overview doc, doc drop-down menus, etc).

I'm also concerned that the doc may be overwhelming to end users (as opposed to Spark developers): 
  • Entries aren't grouped by component, so users need to read the entire document to spot changes relevant to their use of Spark (for example, PySpark changes are not grouped together).
  • Entries aren't ordered by size / risk of change, e.g. performance impact vs. loud behavior change (stopping with an explicit exception) vs. silent behavior changes (e.g. changing default rounding behavior). If we assume limited reader attention then it may be important to prioritize the order in which we list entries, putting the highest-expected-impact / lowest-organic-discoverability changes first.
  • We don't link JIRAs, forcing users to do their own archaeology to learn more about a specific change.
The existing ML migration guide addresses some of these issues, so maybe we can emulate it in the SQL guide: https://spark.apache.org/docs/latest/ml-guide.html#migration-guide
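To make the JIRA-linking point concrete: one could imagine a small lint script run at release time that flags guide entries missing a JIRA reference. This is purely an illustrative sketch, not existing Spark tooling, and the sample entries (including SPARK-12345) are hypothetical:

```python
import re

# A migration-guide entry is considered "linked" if it mentions a SPARK-NNNN JIRA.
JIRA_RE = re.compile(r"SPARK-\d+")

def entries_missing_jira(entries):
    """Return the migration-guide entries that contain no SPARK-NNNN reference."""
    return [e for e in entries if not JIRA_RE.search(e)]

# Hypothetical entries for illustration only.
entries = [
    "Since Spark 3.0, the default rounding behavior changed (SPARK-12345).",
    "CSV parsing now fails fast on malformed rows.",
]
print(entries_missing_jira(entries))
```

A check like this could run in CI against the migration-guide markdown so that unlinked entries are caught before a release, rather than during a reader's archaeology.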

I think that documentation clarity is especially important with Spark 3.0 around the corner: many folks will seek out this information when they upgrade, so improving this guide can be a high-leverage, high-impact activity.

What do folks think? Does anyone have examples from other projects which do a notably good job of crafting release notes / migration guides? I'd be glad to help with pre-release editing after we decide on a structure and style.

Cheers,
Josh
Re: Spark SQL upgrade / migration guide: discoverability and content organization

Xiao Li-2
Yeah, Josh! All these ideas sound good to me. All the top commercial database products have very detailed, easy-to-find guides about version upgrades.

Currently, only the SQL and ML modules have migration or upgrade guides. Since the Spark 2.3 release, we have strictly required PR authors to document all behavior changes in the SQL component. I would suggest doing the same in the other modules, for example Spark Core and Structured Streaming. Any objection?

Cheers,

Xiao



Re: Spark SQL upgrade / migration guide: discoverability and content organization

Dongjoon Hyun-2
Thank you, Josh and Xiao. That sounds great.

Do you think we can land some of these improvements in the `2.4.4` documentation first, since that is the very next release?

Bests,
Dongjoon.

Re: Spark SQL upgrade / migration guide: discoverability and content organization

Jungtaek Lim
As one of the contributors to Structured Streaming, I would vote for having a migration guide doc for Structured Streaming as well, once we decide on a standard format for the migration guide.

In Spark 3.0.0 there are some breaking changes even in the SS area. One example is SPARK-28199, for which Sean took care of leaving a release note, but a migration guide would better help users moving from 2.4.x to 3.0.x, since the release note is bound only to 3.0.0.

-Jungtaek Lim (HeartSaVioR)


