Timestamp formatting in partitioned directory output: "YYYY-MM-dd HH%3Amm%3Ass" vs "YYYY-MM-ddTHH%3Amm%3Ass"


dataeng88

I have a feature request / suggestion:

Spark 2.1 currently generates partitioned directory names like "timestamp=2015-06-20 08%3A00%3A00".

I recommend that it use the "T" delimiter between the date and time portions instead of a space, e.g. "timestamp=2015-06-20T08%3A00%3A00".

Two reasons:
1) The official ISO-8601 standard specifies a "T" delimiter between the date and time portions. RFC 3339, which builds on ISO-8601, notes that a space separator is sometimes used for readability, but AFAIK that is not part of the official ISO-8601 spec.

2) URIs can't contain unescaped spaces. "s3://mybucket/data/timestamp=YYYY-MM-ddTHH%3Amm%3Ass" is a valid URI, while the space-delimited variant is not. Spark is already URI-escaping the colon characters as "%3A"; it should likewise use the URI-compliant "T" character rather than a space. (Both points are demonstrated in the snippet after this list.)
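
For what it's worth, both points are easy to check from spark-shell. The snippet below is just an illustration (plain java.time and java.net, nothing Spark-specific): the ISO-8601 extended format emits "T", and java.net.URI rejects the space-delimited path while accepting the "T"-delimited one.

```
import java.net.URI
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// ISO-8601 extended format puts "T" between the date and time portions.
val ts = LocalDateTime.of(2015, 6, 20, 8, 0)
println(ts.format(DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")))
// 2015-06-20T08:00:00

// The space-delimited directory name is not a parseable URI...
try {
  new URI("s3://mybucket/data/timestamp=2015-06-20 08%3A00%3A00")
} catch {
  case e: java.net.URISyntaxException => println(s"rejected: ${e.getMessage}")
}

// ...while the "T"-delimited variant parses cleanly.
val ok = new URI("s3://mybucket/data/timestamp=2015-06-20T08%3A00%3A00")
println(ok.getPath) // /data/timestamp=2015-06-20T08:00:00 (decoded)
```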

This also applies to reading existing data. If I load a DataFrame from a directory layout whose timestamp partition values use Spark's current space delimiter between date and time, Spark automatically infers the partition column as a timestamp. If the directory names use the ISO-8601 "T" delimiter instead, Spark infers the column as a generic string.
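
A quick way to observe the difference, as a sketch (it assumes the data has already been written to "test/" by the reproduction snippet further below; the directory rename is a hypothetical manual step):

```
// Partition discovery parses the space-delimited values and infers
// the partition column as TimestampType:
spark.read.parquet("test/").printSchema()

// Hypothetically rename the partition directories to the "T" form, e.g.
//   mv 'test/timestamp=2015-06-20 08%3A00%3A00' test/timestamp=2015-06-20T08%3A00%3A00
// Re-reading the same path then infers the column as StringType.
```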

Below is a short code snippet that can be pasted into spark-shell to reproduce the write-side behavior:

```
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._
import java.time.LocalDateTime

// Schema with a TimestampType column that will be used for partitioning.
val simpleSchema = StructType(
    StructField("id", IntegerType) ::
    StructField("name", StringType) ::
    StructField("value", StringType) ::
    StructField("timestamp", TimestampType) :: Nil)

// Three distinct timestamp values, so the write produces three partitions.
val data = List(
    Row(1, "Alice", "C101", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 20, 8, 0))),
    Row(2, "Bob", "C101", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 20, 8, 0))),
    Row(3, "Bob", "C102", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 20, 9, 0))),
    Row(4, "Bob", "C101", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 21, 9, 0)))
)

val df = spark.createDataFrame(data.asJava, simpleSchema)
df.printSchema()
df.show()

// Partition by the timestamp column; directory names use a space delimiter.
df.write.partitionBy("timestamp").save("test/")
```

```
~ find test -type d
test
test/timestamp=2015-06-20 08%3A00%3A00
test/timestamp=2015-06-20 09%3A00%3A00
test/timestamp=2015-06-21 09%3A00%3A00
```
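
In the meantime, a possible workaround (my own sketch, not part of the request; the "ts_iso" column name and "test_iso/" path are made up) is to partition on a pre-formatted ISO-8601 string column. This yields "T"-delimited directory names, at the cost of the automatic timestamp inference described above:

```
import org.apache.spark.sql.functions.{col, date_format}

// Format the timestamp as an ISO-8601 "T"-delimited string and
// partition on that column instead; Spark still escapes the colons,
// so directories look like "ts_iso=2015-06-20T08%3A00%3A00".
val dfIso = df.withColumn("ts_iso",
  date_format(col("timestamp"), "yyyy-MM-dd'T'HH:mm:ss"))
dfIso.write.partitionBy("ts_iso").save("test_iso/")
```

On read-back the partition column comes out as a string, which is exactly the inference gap this request is about.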