Timestamp formatting in partitioned directory output: "YYYY-MM-dd HH%3Amm%3Ass" vs "YYYY-MM-ddTHH%3Amm%3Ass"
I have a feature request or suggestion:
Spark 2.1 currently generates partitioned directory names like "timestamp=2015-06-20 08%3A00%3A00".
I request and recommend that it use the "T" delimiter between the date and time portions rather than a space character, for example "timestamp=2015-06-20T08%3A00%3A00".
1) The official ISO-8601 formatting standard specifies a "T" delimiter. RFC 3339, which builds on ISO-8601, says that a space character is also acceptable, but AFAIK that is not part of the official ISO-8601 spec.
2) URIs can't have spaces in them. "s3://mybucket/data/timestamp=YYYY-MM-ddTHH%3Amm%3Ass" is a valid URI, while the space character variant is not. Spark is already URI-escaping the colon characters as "%3A". Spark should use the URI-compliant "T" character rather than a space.
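The URI point in 2) can be checked from spark-shell with the JDK's own parser. This is only a minimal sketch: the bucket name and path are the illustrative ones used above, and java.net.URI stands in for whatever URI handling a downstream tool actually does.

```scala
import java.net.URI
import scala.util.Try

// Illustrative paths only; "mybucket" is not a real bucket.
val withT = "s3://mybucket/data/timestamp=2015-06-20T08%3A00%3A00"
val withSpace = "s3://mybucket/data/timestamp=2015-06-20 08%3A00%3A00"

// Try captures the URISyntaxException instead of aborting the shell session.
val parsedT = Try(new URI(withT))
val parsedSpace = Try(new URI(withSpace))

println(s"T delimiter parses as a URI: ${parsedT.isSuccess}")
println(s"space delimiter parses as a URI: ${parsedSpace.isSuccess}")
```

The space variant only becomes a valid URI if the space is itself escaped (for example as "%20"), which Spark does not do when it writes the directory name.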
This also applies to reading existing data. If I load a DataFrame from directories partitioned by timestamp that use the Spark standard space delimiter between date and time, Spark automatically recognizes the field as a timestamp. If the directory name uses the ISO-8601 standard "T" delimiter between date and time, Spark does not recognize the field as a timestamp and instead treats it as a generic string.
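One possible explanation for the read-side behavior, stated here as an assumption and not verified against the Spark source, is that partition value inference relies on java.sql.Timestamp.valueOf, which only accepts the space-delimited form. The sketch below demonstrates the JDK behavior only, using the already-unescaped values:

```scala
import java.sql.Timestamp
import scala.util.Try

// Timestamp.valueOf expects "yyyy-[m]m-[d]d hh:mm:ss[.f...]" with a space delimiter.
val spaceForm = Try(Timestamp.valueOf("2015-06-20 08:00:00"))

// The ISO-8601 "T" form is rejected with an IllegalArgumentException.
val tForm = Try(Timestamp.valueOf("2015-06-20T08:00:00"))

println(s"space form parsed: ${spaceForm.isSuccess}")
println(s"T form parsed: ${tForm.isSuccess}")
```

If that assumption holds, accepting the "T" form on read would need an additional parsing path rather than just a change to the output format.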
Below is a short code snippet that can be pasted into spark-shell to reproduce this issue: