from_csv


from_csv

Maxim Gekk
Hi All,

I would like to propose new function from_csv() for parsing columns containing strings in CSV format. Here is my PR: https://github.com/apache/spark/pull/22379

A use case is loading a dataset from external storage, a DBMS, or a system like Kafka, where CSV content was dumped as one of the columns/fields. Other columns may contain related information such as timestamps, ids, data sources, etc. The column with CSV strings can be parsed by the existing csv() method of DataFrameReader, but in that case we have to "clean up" the dataset and drop the other columns, since csv() requires a Dataset[String]. Joining the parsing result back to the original dataset by position is expensive and inconvenient. Instead, users parse CSV columns with string functions. That approach is usually error prone, especially for quoted values and other special cases.
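For illustration, the fragile string-function workaround might look like the sketch below (the table and column names are hypothetical); it misparses any quoted field that contains the delimiter:

  -- naive parsing with split(); breaks on quoted values such as "a,b",c
  SELECT split(value, ',')[0] AS event_time,
         split(value, ',')[1] AS source
  FROM events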

The methods proposed in the PR should provide a better user experience for parsing CSV-like columns. Please share your thoughts.
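To make the intent concrete, here is a minimal sketch of the intended DataFrame usage, assuming the signature mirrors from_json (column, schema, options); the kafkaDf name and its columns are hypothetical:

  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types._

  // Parse the CSV-bearing `value` column in place, keeping the other columns.
  val schema = new StructType().add("time", TimestampType)
  val parsed = kafkaDf.select(
    col("key"),
    col("timestamp"),
    from_csv(col("value").cast("string"), schema,
             Map("timestampFormat" -> "dd/MM/yyyy")).as("record"))

No joining back by position is needed: the parsed struct lands next to the original columns.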

--
Maxim Gekk
Technical Solutions Lead
Databricks Inc.
[hidden email]
databricks.com

Re: from_csv

rxin
makes sense - i'd make this as consistent as to_json / from_json as possible. 

how would this work in sql? i.e. how would passing options in work?

--
excuse the brevity and lower case due to wrist injury



Re: from_csv

Maxim Gekk
Hi Reynold,

> i'd make this as consistent as to_json / from_json as possible

Sure, the new function from_csv() has the same signature as from_json().

> how would this work in sql? i.e. how would passing options in work?

The options are passed to the function via a map, for example:

select from_csv('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'))
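For comparison, the existing from_json call takes the same shape in SQL, which is the consistency Reynold asked about (the literals here are illustrative):

  select from_json('{"time": "26/08/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'))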



Re: from_csv

Hyukjin Kwon
+1 for this idea since text parsing in CSV/JSON is quite common.

One thing to consider is schema inference, as with the JSON functionality. For JSON we added schema_of_json, and the same approach should apply to CSV too.
If we see more need for it, we can consider a function like schema_of_csv as well.
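For reference, the JSON counterpart already works like this in SQL; a CSV analogue (purely hypothetical at this point, with csv_col as a placeholder column) could feed its result straight into from_csv:

  -- existing:
  select schema_of_json('{"id": 1, "name": "a"}')
  -- hypothetical CSV analogue, inferring the schema from a sample record:
  select from_csv(csv_col, schema_of_csv('1,abc'), map())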



Re: from_csv

Dongjin Lee
Another +1.

I have already run into this case several times.

--
Dongjin Lee

A hitchhiker in the mathematical world.
Re: from_csv

Ted Yu
+1

Re: from_csv

John Zhuge
+1



--
John Zhuge