Re: Does dataframe spark API write/create a single file instead of directory as a result of write operation.


Re: Does dataframe spark API write/create a single file instead of directory as a result of write operation.

rahul c
Hi Kshitij,

There are options to suppress the metadata files from being created. Set the properties below and try:

1) To disable Spark's transactional commit logs, set spark.sql.sources.commitProtocolClass = org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol. This suppresses the _committed_<TID> and _started_<TID> files, but the _SUCCESS, _common_metadata and _metadata files will still be generated.

2) The _common_metadata and _metadata files (written for Parquet output) can be disabled with parquet.enable.summary-metadata=false.

3) The _SUCCESS file can be disabled with mapreduce.fileoutputcommitter.marksuccessfuljobs=false.
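
A minimal PySpark sketch of applying all three properties (my own illustration, not from this thread; the last two are Hadoop-level settings, so they go on the Hadoop configuration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("suppress-metadata-files").getOrCreate()

# 1) Disable the transactional commit protocol (_committed_*/_started_* files).
spark.conf.set(
    "spark.sql.sources.commitProtocolClass",
    "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
)

# 2) and 3) Hadoop-level settings for the remaining marker files.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("parquet.enable.summary-metadata", "false")  # _common_metadata/_metadata
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")  # _SUCCESS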

On Sat, 22 Feb, 2020, 10:51 AM Kshitij, <[hidden email]> wrote:

Hi,

There seems to be no Spark DataFrame API that writes/creates a single file, instead of a directory, as the result of a write operation.

Both of the options below create a directory containing a part file with a random name:

df.coalesce(1).write.csv(<path>)
df.write.csv(<path>)

Instead of a directory with the standard marker files (_SUCCESS, _committed, _started), I want a single file with the file name I specify.


Thanks
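
A commonly used workaround for this request (my own sketch, not from this thread; the paths are hypothetical) is to write a single-partition directory and then rename the lone part file via the Hadoop FileSystem API:

tmp_dir = "/data/out_tmp"        # hypothetical temporary output directory
final_path = "/data/result.csv"  # hypothetical desired single file

df.coalesce(1).write.csv(tmp_dir, header=True)

# Locate the single part file and rename it to the desired name.
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(
    spark.sparkContext._jsc.hadoopConfiguration()
)
part_file = [
    s.getPath()
    for s in fs.listStatus(jvm.org.apache.hadoop.fs.Path(tmp_dir))
    if s.getPath().getName().startswith("part-")
][0]
fs.rename(part_file, jvm.org.apache.hadoop.fs.Path(final_path))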


Re: Does dataframe spark API write/create a single file instead of directory as a result of write operation.

rahul c
Hi,

df.write.csv() will give you CSV output that can be used in further processing. I am not familiar with the raw_csv function of pandas.

On Sat, 22 Feb, 2020, 4:09 PM Kshitij, <[hidden email]> wrote:
Is there any way to save it as a raw CSV file, as we do in pandas? I have a script that uses the CSV file for further processing.
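
For results small enough to fit in driver memory, one option (my own suggestion, not from this thread; the path is hypothetical) is to collect to pandas and write a single named file exactly as pandas does:

# Assumes the DataFrame fits in driver memory.
df.toPandas().to_csv("/data/result.csv", index=False)  # hypothetical path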



Re: Does dataframe spark API write/create a single file instead of directory as a result of write operation.

Kshitij
I am talking about Spark here.
