[OSS DIGEST] The major changes of Apache Spark from Mar 25 to Apr 7

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

[OSS DIGEST] The major changes of Apache Spark from Mar 25 to Apr 7

Xiao Li-2
Hi all,

This is the bi-weekly Apache Spark digest from the Databricks OSS team.
For each API/configuration/behavior change, an [API] tag is added in the title.

CORE

[3.0][SPARK-30623][CORE] Spark external shuffle allow disable of separate event loop group (+66, -33)>

PR#22173 introduced a perf regression in shuffle, even if we disable the feature flag spark.shuffle.server.chunkFetchHandlerThreadsPercent. To fix the perf regression, this PR refactors the related code to completely disable this feature by default.

[3.0][SPARK-31314][CORE] Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly (+10, -71)>

PR#25962 introduced a perf regression in shuffle, which may create empty files unnecessarily. This PR reverts it.

[API][3.1][SPARK-29154][CORE] Update Spark scheduler for stage level scheduling (+704, -218)>

This PR updates the DAG scheduler to schedule tasks to match the resource profile. It's for the stage level scheduling.

[API][3.1][SPARK-29153][CORE] Add ability to merge resource profiles within a stage with Stage Level Scheduling (+304, -15)>

Add the ability to optionally merged resource profiles if they are specified on multiple RDDs within a Stage. The feature is part of Stage Level Scheduling. There is a config spark.scheduler.resourceProfile.mergeConflicts to enable this feature, the config if off by default.

spark.scheduler.resource.profileMergeConflicts (Default: false)

  • If set to true, Spark will merge ResourceProfiles when different profiles are specified in RDDs that get combined into a single stage. When they are merged, Spark chooses the maximum of each resource and creates a new ResourceProfile. The default of false results in Spark throwing an exception if multiple different ResourceProfiles are found in RDDs going into the same stage.

[API][3.1][SPARK-31208][CORE] Add an experimental API: cleanShuffleDependencies (+158, -71)>

Add a new experimental developer API RDD.cleanShuffleDependencies(blocking: Boolean) to allow explicitly clean up shuffle files. This could help dynamic scaling of K8s backend since the backend only recycles executors without shuffle files.

  /**
   * :: Experimental ::
   * Removes an RDD's shuffles and it's non-persisted ancestors.
   * When running without a shuffle service, cleaning up shuffle files enables downscaling.
   * If you use the RDD after this call, you should checkpoint and materialize it first.
   * If you are uncertain of what you are doing, please do not use this feature.
   * Additional techniques for mitigating orphaned shuffle files:
   *   * Tuning the driver GC to be more aggressive, so the regular context cleaner is triggered
   *   * Setting an appropriate TTL for shuffle files to be auto cleaned
   */
  @Experimental
  @DeveloperApi
  @Since("3.1.0")
  def cleanShuffleDependencies(blocking: Boolean = false): Unit

[3.1][SPARK-31179] Fast fail the connection while last connection failed in fast fail time window (+68, -12)>

In TransportFactory, if a connection to the destination address fails, the new connection requests [that are created within a time window] fail fast for avoiding too many retries. This time window size is set to 95% of the IO retry wait time (spark.io.shuffle.retryWait whose default is 5 seconds).

SQL

[API][3.0][SPARK-25556][SPARK-17636][SPARK-31026][SPARK-31060][SQL] Nested Column Predicate Pushdown for Parquet (+852, -480)>

This PR extends the data source framework to support filter pushdown with nested fields, and implement it in the Parquet data source.

spark.sql.optimizer.nestedPredicatePushdown.enabled [Default: true]

  • When true, Spark tries to push down predicates for nested columns and or names containing dots to data sources. Currently, Parquet implements both optimizations while ORC only supports predicates for names containing dots. The other data sources don't support this feature yet.

[3.0][SPARK-30822][SQL] Remove semicolon at the end of a sql query (+14, -7)>

This PR updates the SQL parser to ignore the trailing semicolons in the SQL queries. Now sql("select 1;") works.

[API][3.0][SPARK-31086][SQL] Add Back the Deprecated SQLContext methods (+403, -0)>

This PR adds back the following APIs whose maintenance costs are relatively small.

  • SQLContext.applySchema
  • SQLContext.parquetFile
  • SQLContext.jsonFile
  • SQLContext.jsonRDD
  • SQLContext.load
  • SQLContext.jdbc

[API][3.0][SPARK-31087][SQL] Add Back Multiple Removed APIs (+458, -15)>

This PR adds back the following APIs whose maintenance costs are relatively small.

  • functions.toDegrees/toRadians
  • functions.approxCountDistinct
  • functions.monotonicallyIncreasingId
  • Column.!==
  • Dataset.explode
  • Dataset.registerTempTable
  • SQLContext.getOrCreate, setActive, clearActive, constructors

[API][3.0][SPARK-31088][SQL] Add back HiveContext and createExternalTable (+532, -11)>

This PR adds back the following APIs whose maintenance costs are relatively small.

  • HiveContext
  • createExternalTable APIs

[API][3.0][SPARK-31147][SQL] Forbid CHAR type in non-Hive-Serde tables (+134, -20)>

Spark introduced CHAR type for hive compatibility but it only works for hive tables. CHAR type is never documented and is treated as a STRING type for non-Hive tables. It violates the SQL standard and is very confusing. This PR forbids CHAR type in non-Hive tables as it's not supported correctly.

[3.0][SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir (+159, -66)>

This PR fixes Spark CLI to respect the warehouse config(spark.sql.warehousr.dir or hive.metastore.warehourse.dir) and the hive-site.xml.

[API][3.0][SPARK-31201][SQL] Add an individual config for skewed partition threshold (+21, -4)>

Skew join handling comes with an overhead: we need to read some data repeatedly. We should treat a partition as skewed if it's large enough so that it's beneficial to do so.

Currently the size threshold is the advisory partition size, which is 64 MB by default. This is not large enough for the skewed partition size threshold. Thus, a new conf is added.

spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes (Default: 256 MB)

  • A partition is considered as skewed if its size in bytes is larger than this threshold and also larger than 'spark.sql.adaptive.skewJoin.skewedPartitionFactor' multiplying the median partition size. Ideally this config should be set larger than 'spark.sql.adaptive.advisoryPartitionSizeInBytes'

[3.0][SPARK-31227][SQL] Non-nullable null type in complex types should not coerce to nullable type (+73, -29)>

This PR targets for non-nullable null type not to coerce to nullable type in complex types, to fix queries like concat(array(), array(1)).

[API][2.4][SPARK-31234][SQL] ResetCommand should not affect static SQL Configuration (+24, -4)>

Currently, ResetCommand clears all configurations, including runtime SQL configs, static SQL configs and spark context level configs. This PR fixes it to only clear the runtime SQL configs, as other configs are not supposed to change at runtime.

[3.0][SPARK-31238][SQL] Rebase dates to/from Julian calendar in write/read for ORC datasource (+153, -9)>

Spark 3.0 switches to the Proleptic Gregorian calendar, which is different from the ORC format. This PR rebases the datetime values when reading/writing ORC files, to adjust the calendar changes.

[3.0][SPARK-31297][SQL] Speed up dates rebasing (+174, -91)>

This PR replaces the current implementation of the rebaseGregorianToJulianDays() and rebaseJulianToGregorianDays() functions in DateTimeUtils by a new one which is based on the fact that difference between Proleptic Gregorian and the hybrid (Julian+Gregorian) calendars was changed only 14 times for entire supported range of valid dates [0001-01-01, 9999-12-31]. This brings about 3x speedup for fixing the performance regression caused by the calendar switch.

[3.0][SPARK-31312][SQL] Cache Class instance for the UDF instance in HiveFunctionWrapper (+203, -52)>

This PR caches the Class instance for the UDF instance in HiveFunctionWrapper to fix the case where Hive simple UDF is somehow transformed (expression is copied) and evaluated later with another classloader (for the case current thread context classloader is somehow changed). In this case, Spark throws CNFE as of now.

[3.1][SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project (+172, -18)>

This patch prunes unnecessary nested fields from Generate which has no Project on top of it.

[3.1][SPARK-31253][SQL] Add metrics to AQE shuffle reader (+229, -106)>

This PR adds SQL metrics to the AQE shuffle reader (CustomShuffleReaderExec), so that it's easier to debug AQE with SQL web UI.

[3.0][SPARK-30532] DataFrameStatFunctions to work with TABLE.COLUMN syntax (+31, -8)>

This PR makes approxQuantilefreqItemscov and corr in DataFrameStatFunctions to support fully qualified column names. For example, df2.as("t2").crossJoin(df1.as("t1")).stat.approxQuantile("t1.num", Array(0.1), 0.0) now works.

[API][3.0][SPARK-31113][SQL] Add SHOW VIEWS command (+605, -10)>

Add the SHOW VIEWS command to list the metadata for views. The SHOW VIEWS statement returns all the views for an optionally specified database. Additionally, the output of this statement may be filtered by an optional matching pattern. If no database is specified then the views are returned from the current database. If the specified database is global temporary view database, we will list global temporary views. Note that the command also lists local temporary views regardless of a given database.

For example:

-- List all views in default database
SHOW VIEWS;
  +-------------+------------+--------------+--+
  | namespace   | viewName   | isTemporary  |
  +-------------+------------+--------------+--+
  | default     | sam        | false        |
  | default     | sam1       | false        |
  | default     | suj        | false        |
  |             | temp2      | true         |
  +-------------+------------+--------------+--+

-- List all views from userdb database 
SHOW VIEWS FROM userdb;
  +-------------+------------+--------------+--+
  | namespace   | viewName   | isTemporary  |
  +-------------+------------+--------------+--+
  | userdb      | user1      | false        |
  | userdb      | user2      | false        |
  |             | temp2      | true         |
  +-------------+------------+--------------+--+

[3.0][SPARK-31224][SQL] Add view support to SHOW CREATE TABLE (+151, -133)>

This PR fixes the regression introduced in Spark 3.0. Now, both SHOW CREATE TABLE and SHOW CREATE TABLE AS SERDE can support view.

[API][3.0][SPARK-31318][SQL] Split Parquet/Avro configs for rebasing dates/timestamps in read and in write (+65, -33)>

The PR replaces the configs spark.sql.legacy.parquet.rebaseDateTime.enabled and spark.sql.legacy.avro.rebaseDateTime.enabled by the new configs spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabledspark.sql.legacy.parquet.rebaseDateTimeInRead.enabledspark.sql.legacy.avro.rebaseDateTimeInWrite.enabledspark.sql.legacy.avro.rebaseDateTimeInRead.enabled. Thus, users are able to load dates/timestamps saved by Spark 2.4/3.0, and save to Parquet/Avro files which are compatible with Spark 3.0/2.4 without rebasing.

spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled (Default: false)

  • When true, rebase dates/timestamps from Proleptic Gregorian calendar to the hybrid calendar (Julian + Gregorian) in writing. The rebasing is performed by converting micros/millis/days to a local date/timestamp in the source calendar, interpreting the resulted date/timestamp in the target calendar, and getting the number of micros/millis/days since the epoch 1970-01-01 00:00:00Z.

spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled (Default: false)

  • When true, rebase dates/timestamps from the hybrid calendar to the Proleptic Gregorian calendar in read. The rebasing is performed by converting micros/millis/days to a local date/timestamp in the source calendar, interpreting the resulted date/timestamp in the target calendar, and getting the number of micros/millis/days since the epoch 1970-01-01 00:00:00Z.

spark.sql.legacy.avro.rebaseDateTimeInWrite.enabled (Default: false)

  • When true, rebase dates/timestamps from Proleptic Gregorian calendar to the hybrid calendar (Julian + Gregorian) in writing. The rebasing is performed by converting micros/millis/days to a local date/timestamp in the source calendar, interpreting the resulted date/timestamp in the target calendar, and getting the number of micros/millis/days since the epoch 1970-01-01 00:00:00Z.

spark.sql.legacy.avro.rebaseDateTimeInRead.enabled (Default: false)

  • When true, rebase dates/timestamps from the hybrid calendar to the Proleptic Gregorian calendar in read. The rebasing is performed by converting micros/millis/days to a local date/timestamp in the source calendar, interpreting the resulted date/timestamp in the target calendar, and getting the number of micros/millis/days since the epoch 1970-01-01 00:00:00Z.

[3.0][SPARK-31322][SQL] rename QueryPlan.collectInPlanAndSubqueries to collectWithSubqueries (+5, -5)>

Rename QueryPlan.collectInPlanAndSubqueries [which was introduced in the unreleased Spark 3.0] to collectWithSubqueriesQueryPlan's APIs are internal but they are the core of catalyst and we'd better make the API name clearer before we release it.

[3.0][SPARK-31327][SQL] Write Spark version into Avro file metadata (+116, -5)>

Write Spark version into Avro file metadata. The version info is very useful for backward compatibility.

[3.0][SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time (+120, -90)>

The PR fixes the bug of losing DST [Daily Saving Time] offset info in rebasing timestamps via local date-time. This bug could lead to DateTimeUtils.rebaseGregorianToJulianMicros() and DateTimeUtils.rebaseJulianToGregorianMicros() returning wrong results when handling the daylight saving offset.

ML

[3.1][SPARK-31222][ML] Make ANOVATest Sparsity-Aware (+173, -112)>

[3.1][SPARK-31223][ML] Set seed in np.random to regenerate test data (+282, -256)>

[3.1][SPARK-31283][ML] Simplify ChiSq by adding a common method (+68, -93)>

[API][3.1][SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR (+484, -36)>

Add SparkR wrapper for FMClassifier.

PYTHON

[2.4][SPARK-31186][SPARK-31441][PYSPARK][SQL] toPandas should not fail on duplicate column names (+56, -10)> >

This PR fixes the toPandas API to make it work on dataframe with duplicate column names.

[3.0][SPARK-30921][PYSPARK] Predicates on python udf should not be pushdown through Aggregate (+20, -2)>

This PR is to fix a new regression in Spark 3.0, in which the predicates using PythonUDFs will be pushed down through Aggregate. Since PythonUDFs can't be evaluated on Filter, the rule "predicate pushdown through Aggregate" should skip the predicates that contain PythonUDFs.

[3.1][SPARK-31308][PYSPARK] Merging pyFiles to files argument for Non-PySpark applications (+6, -4)>

Add python dependencies in SparkSubmit even if it is not a Python application. For some Spark applications(e.g. Livy remote SparkContext application, which is actually an embedded REPL for Scala/Python/R), they require not only jar dependencies, but also Python dependencies.

UI

[API][3.1][SPARK-31325][SQL][WEB UI] Control the plan explain mode in the events of SQL listeners via SQLConf (+100, -3)>

Add a new config spark.sql.ui.explainMode to control the query explain mode used in the Spark SQL UI. The default value of this config is formatted, which would display the query details content in a formatted way.

spark.sql.ui.explainMode (default: "formatted")

  • Configures the query explain mode used in the Spark SQL UI. The value can be 'simple', 'extended', 'codegen', 'cost', or 'formatted'.

R

[API][3.0][SPARK-31290][R] Add back the deprecated R APIs (+200, -4)>

This PR is to add back the following R APIs which were removed in the unreleased Spark 3.0:

  • sparkR.init
  • sparkRSQL.init
  • sparkRHive.init
  • registerTempTable
  • createExternalTable
  • dropTempTable


--
Reply | Threaded
Open this post in threaded view
|

Re: [OSS DIGEST] The major changes of Apache Spark from Mar 25 to Apr 7

Jiang Xingbo
Thank you so much for doing this, Xiao!

On Wed, Apr 29, 2020 at 11:09 AM Xiao Li <[hidden email]> wrote:
Hi all,

This is the bi-weekly Apache Spark digest from the Databricks OSS team.
For each API/configuration/behavior change, an [API] tag is added in the title.

CORE

[3.0][SPARK-30623][CORE] Spark external shuffle allow disable of separate event loop group (+66, -33)>

PR#22173 introduced a perf regression in shuffle, even if we disable the feature flag spark.shuffle.server.chunkFetchHandlerThreadsPercent. To fix the perf regression, this PR refactors the related code to completely disable this feature by default.

[3.0][SPARK-31314][CORE] Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly (+10, -71)>

PR#25962 introduced a perf regression in shuffle, which may create empty files unnecessarily. This PR reverts it.

[API][3.1][SPARK-29154][CORE] Update Spark scheduler for stage level scheduling (+704, -218)>

This PR updates the DAG scheduler to schedule tasks to match the resource profile. It's for the stage level scheduling.

[API][3.1][SPARK-29153][CORE] Add ability to merge resource profiles within a stage with Stage Level Scheduling (+304, -15)>

Add the ability to optionally merged resource profiles if they are specified on multiple RDDs within a Stage. The feature is part of Stage Level Scheduling. There is a config spark.scheduler.resourceProfile.mergeConflicts to enable this feature, the config if off by default.

spark.scheduler.resource.profileMergeConflicts (Default: false)

  • If set to true, Spark will merge ResourceProfiles when different profiles are specified in RDDs that get combined into a single stage. When they are merged, Spark chooses the maximum of each resource and creates a new ResourceProfile. The default of false results in Spark throwing an exception if multiple different ResourceProfiles are found in RDDs going into the same stage.

[API][3.1][SPARK-31208][CORE] Add an experimental API: cleanShuffleDependencies (+158, -71)>

Add a new experimental developer API RDD.cleanShuffleDependencies(blocking: Boolean) to allow explicitly clean up shuffle files. This could help dynamic scaling of K8s backend since the backend only recycles executors without shuffle files.

  /**
   * :: Experimental ::
   * Removes an RDD's shuffles and it's non-persisted ancestors.
   * When running without a shuffle service, cleaning up shuffle files enables downscaling.
   * If you use the RDD after this call, you should checkpoint and materialize it first.
   * If you are uncertain of what you are doing, please do not use this feature.
   * Additional techniques for mitigating orphaned shuffle files:
   *   * Tuning the driver GC to be more aggressive, so the regular context cleaner is triggered
   *   * Setting an appropriate TTL for shuffle files to be auto cleaned
   */
  @Experimental
  @DeveloperApi
  @Since("3.1.0")
  def cleanShuffleDependencies(blocking: Boolean = false): Unit

[3.1][SPARK-31179] Fast fail the connection while last connection failed in fast fail time window (+68, -12)>

In TransportFactory, if a connection to the destination address fails, the new connection requests [that are created within a time window] fail fast for avoiding too many retries. This time window size is set to 95% of the IO retry wait time (spark.io.shuffle.retryWait whose default is 5 seconds).

SQL

[API][3.0][SPARK-25556][SPARK-17636][SPARK-31026][SPARK-31060][SQL] Nested Column Predicate Pushdown for Parquet (+852, -480)>

This PR extends the data source framework to support filter pushdown with nested fields, and implement it in the Parquet data source.

spark.sql.optimizer.nestedPredicatePushdown.enabled [Default: true]

  • When true, Spark tries to push down predicates for nested columns and or names containing dots to data sources. Currently, Parquet implements both optimizations while ORC only supports predicates for names containing dots. The other data sources don't support this feature yet.

[3.0][SPARK-30822][SQL] Remove semicolon at the end of a sql query (+14, -7)>

This PR updates the SQL parser to ignore the trailing semicolons in the SQL queries. Now sql("select 1;") works.

[API][3.0][SPARK-31086][SQL] Add Back the Deprecated SQLContext methods (+403, -0)>

This PR adds back the following APIs whose maintenance costs are relatively small.

  • SQLContext.applySchema
  • SQLContext.parquetFile
  • SQLContext.jsonFile
  • SQLContext.jsonRDD
  • SQLContext.load
  • SQLContext.jdbc

[API][3.0][SPARK-31087][SQL] Add Back Multiple Removed APIs (+458, -15)>

This PR adds back the following APIs whose maintenance costs are relatively small.

  • functions.toDegrees/toRadians
  • functions.approxCountDistinct
  • functions.monotonicallyIncreasingId
  • Column.!==
  • Dataset.explode
  • Dataset.registerTempTable
  • SQLContext.getOrCreate, setActive, clearActive, constructors

[API][3.0][SPARK-31088][SQL] Add back HiveContext and createExternalTable (+532, -11)>

This PR adds back the following APIs whose maintenance costs are relatively small.

  • HiveContext
  • createExternalTable APIs

[API][3.0][SPARK-31147][SQL] Forbid CHAR type in non-Hive-Serde tables (+134, -20)>

Spark introduced CHAR type for hive compatibility but it only works for hive tables. CHAR type is never documented and is treated as a STRING type for non-Hive tables. It violates the SQL standard and is very confusing. This PR forbids CHAR type in non-Hive tables as it's not supported correctly.

[3.0][SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir (+159, -66)>

This PR fixes Spark CLI to respect the warehouse config(spark.sql.warehousr.dir or hive.metastore.warehourse.dir) and the hive-site.xml.

[API][3.0][SPARK-31201][SQL] Add an individual config for skewed partition threshold (+21, -4)>

Skew join handling comes with an overhead: we need to read some data repeatedly. We should treat a partition as skewed if it's large enough so that it's beneficial to do so.

Currently the size threshold is the advisory partition size, which is 64 MB by default. This is not large enough for the skewed partition size threshold. Thus, a new conf is added.

spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes (Default: 256 MB)

  • A partition is considered as skewed if its size in bytes is larger than this threshold and also larger than 'spark.sql.adaptive.skewJoin.skewedPartitionFactor' multiplying the median partition size. Ideally this config should be set larger than 'spark.sql.adaptive.advisoryPartitionSizeInBytes'

[3.0][SPARK-31227][SQL] Non-nullable null type in complex types should not coerce to nullable type (+73, -29)>

This PR targets for non-nullable null type not to coerce to nullable type in complex types, to fix queries like concat(array(), array(1)).

[API][2.4][SPARK-31234][SQL] ResetCommand should not affect static SQL Configuration (+24, -4)>

Currently, ResetCommand clears all configurations, including runtime SQL configs, static SQL configs and spark context level configs. This PR fixes it to only clear the runtime SQL configs, as other configs are not supposed to change at runtime.

[3.0][SPARK-31238][SQL] Rebase dates to/from Julian calendar in write/read for ORC datasource (+153, -9)>

Spark 3.0 switches to the Proleptic Gregorian calendar, which is different from the ORC format. This PR rebases the datetime values when reading/writing ORC files, to adjust the calendar changes.

[3.0][SPARK-31297][SQL] Speed up dates rebasing (+174, -91)>

This PR replaces the current implementation of the rebaseGregorianToJulianDays() and rebaseJulianToGregorianDays() functions in DateTimeUtils by a new one which is based on the fact that difference between Proleptic Gregorian and the hybrid (Julian+Gregorian) calendars was changed only 14 times for entire supported range of valid dates [0001-01-01, 9999-12-31]. This brings about 3x speedup for fixing the performance regression caused by the calendar switch.

[3.0][SPARK-31312][SQL] Cache Class instance for the UDF instance in HiveFunctionWrapper (+203, -52)>

This PR caches the Class instance for the UDF instance in HiveFunctionWrapper to fix the case where Hive simple UDF is somehow transformed (expression is copied) and evaluated later with another classloader (for the case current thread context classloader is somehow changed). In this case, Spark throws CNFE as of now.

[3.1][SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project (+172, -18)>

This patch prunes unnecessary nested fields from Generate which has no Project on top of it.

[3.1][SPARK-31253][SQL] Add metrics to AQE shuffle reader (+229, -106)>

This PR adds SQL metrics to the AQE shuffle reader (CustomShuffleReaderExec), so that it's easier to debug AQE with SQL web UI.

[3.0][SPARK-30532] DataFrameStatFunctions to work with TABLE.COLUMN syntax (+31, -8)>

This PR makes approxQuantilefreqItemscov and corr in DataFrameStatFunctions to support fully qualified column names. For example, df2.as("t2").crossJoin(df1.as("t1")).stat.approxQuantile("t1.num", Array(0.1), 0.0) now works.

[API][3.0][SPARK-31113][SQL] Add SHOW VIEWS command (+605, -10)>

Add the SHOW VIEWS command to list the metadata for views. The SHOW VIEWS statement returns all the views for an optionally specified database. Additionally, the output of this statement may be filtered by an optional matching pattern. If no database is specified then the views are returned from the current database. If the specified database is global temporary view database, we will list global temporary views. Note that the command also lists local temporary views regardless of a given database.

For example:

-- List all views in default database
SHOW VIEWS;
  +-------------+------------+--------------+--+
  | namespace   | viewName   | isTemporary  |
  +-------------+------------+--------------+--+
  | default     | sam        | false        |
  | default     | sam1       | false        |
  | default     | suj        | false        |
  |             | temp2      | true         |
  +-------------+------------+--------------+--+

-- List all views from userdb database 
SHOW VIEWS FROM userdb;
  +-------------+------------+--------------+--+
  | namespace   | viewName   | isTemporary  |
  +-------------+------------+--------------+--+
  | userdb      | user1      | false        |
  | userdb      | user2      | false        |
  |             | temp2      | true         |
  +-------------+------------+--------------+--+

[3.0][SPARK-31224][SQL] Add view support to SHOW CREATE TABLE (+151, -133)>

This PR fixes the regression introduced in Spark 3.0. Now, both SHOW CREATE TABLE and SHOW CREATE TABLE AS SERDE can support view.

[API][3.0][SPARK-31318][SQL] Split Parquet/Avro configs for rebasing dates/timestamps in read and in write (+65, -33)>

The PR replaces the configs spark.sql.legacy.parquet.rebaseDateTime.enabled and spark.sql.legacy.avro.rebaseDateTime.enabled by the new configs spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabledspark.sql.legacy.parquet.rebaseDateTimeInRead.enabledspark.sql.legacy.avro.rebaseDateTimeInWrite.enabledspark.sql.legacy.avro.rebaseDateTimeInRead.enabled. Thus, users are able to load dates/timestamps saved by Spark 2.4/3.0, and save to Parquet/Avro files which are compatible with Spark 3.0/2.4 without rebasing.

spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled (Default: false)

  • When true, rebase dates/timestamps from Proleptic Gregorian calendar to the hybrid calendar (Julian + Gregorian) in writing. The rebasing is performed by converting micros/millis/days to a local date/timestamp in the source calendar, interpreting the resulted date/timestamp in the target calendar, and getting the number of micros/millis/days since the epoch 1970-01-01 00:00:00Z.

spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled (Default: false)

  • When true, rebase dates/timestamps from the hybrid calendar to the Proleptic Gregorian calendar in read. The rebasing is performed by converting micros/millis/days to a local date/timestamp in the source calendar, interpreting the resulted date/timestamp in the target calendar, and getting the number of micros/millis/days since the epoch 1970-01-01 00:00:00Z.

spark.sql.legacy.avro.rebaseDateTimeInWrite.enabled (Default: false)

  • When true, rebase dates/timestamps from Proleptic Gregorian calendar to the hybrid calendar (Julian + Gregorian) in writing. The rebasing is performed by converting micros/millis/days to a local date/timestamp in the source calendar, interpreting the resulted date/timestamp in the target calendar, and getting the number of micros/millis/days since the epoch 1970-01-01 00:00:00Z.

spark.sql.legacy.avro.rebaseDateTimeInRead.enabled (Default: false)

  • When true, rebase dates/timestamps from the hybrid calendar to the Proleptic Gregorian calendar in read. The rebasing is performed by converting micros/millis/days to a local date/timestamp in the source calendar, interpreting the resulted date/timestamp in the target calendar, and getting the number of micros/millis/days since the epoch 1970-01-01 00:00:00Z.

[3.0][SPARK-31322][SQL] rename QueryPlan.collectInPlanAndSubqueries to collectWithSubqueries (+5, -5)>

Rename QueryPlan.collectInPlanAndSubqueries [which was introduced in the unreleased Spark 3.0] to collectWithSubqueriesQueryPlan's APIs are internal but they are the core of catalyst and we'd better make the API name clearer before we release it.

[3.0][SPARK-31327][SQL] Write Spark version into Avro file metadata (+116, -5)>

Write Spark version into Avro file metadata. The version info is very useful for backward compatibility.

[3.0][SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time (+120, -90)>

The PR fixes the bug of losing DST [Daily Saving Time] offset info in rebasing timestamps via local date-time. This bug could lead to DateTimeUtils.rebaseGregorianToJulianMicros() and DateTimeUtils.rebaseJulianToGregorianMicros() returning wrong results when handling the daylight saving offset.

ML

[3.1][SPARK-31222][ML] Make ANOVATest Sparsity-Aware (+173, -112)>

[3.1][SPARK-31223][ML] Set seed in np.random to regenerate test data (+282, -256)>

[3.1][SPARK-31283][ML] Simplify ChiSq by adding a common method (+68, -93)>

[API][3.1][SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR (+484, -36)>

Add SparkR wrapper for FMClassifier.

PYTHON

[2.4][SPARK-31186][SPARK-31441][PYSPARK][SQL] toPandas should not fail on duplicate column names (+56, -10)> >

This PR fixes the toPandas API to make it work on dataframe with duplicate column names.

[3.0][SPARK-30921][PYSPARK] Predicates on python udf should not be pushdown through Aggregate (+20, -2)>

This PR is to fix a new regression in Spark 3.0, in which the predicates using PythonUDFs will be pushed down through Aggregate. Since PythonUDFs can't be evaluated on Filter, the rule "predicate pushdown through Aggregate" should skip the predicates that contain PythonUDFs.

[3.1][SPARK-31308][PYSPARK] Merging pyFiles to files argument for Non-PySpark applications (+6, -4)>

Add python dependencies in SparkSubmit even if it is not a Python application. For some Spark applications(e.g. Livy remote SparkContext application, which is actually an embedded REPL for Scala/Python/R), they require not only jar dependencies, but also Python dependencies.

UI

[API][3.1][SPARK-31325][SQL][WEB UI] Control the plan explain mode in the events of SQL listeners via SQLConf (+100, -3)>

Add a new config spark.sql.ui.explainMode to control the query explain mode used in the Spark SQL UI. The default value of this config is formatted, which would display the query details content in a formatted way.

spark.sql.ui.explainMode (default: "formatted")

  • Configures the query explain mode used in the Spark SQL UI. The value can be 'simple', 'extended', 'codegen', 'cost', or 'formatted'.

R

[API][3.0][SPARK-31290][R] Add back the deprecated R APIs (+200, -4)>

This PR is to add back the following R APIs which were removed in the unreleased Spark 3.0:

  • sparkR.init
  • sparkRSQL.init
  • sparkRHive.init
  • registerTempTable
  • createExternalTable
  • dropTempTable


--