renaming SchemaRDD -> DataFrame

renaming SchemaRDD -> DataFrame

rxin
Hi,

We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
get the community's opinion.

The context is that SchemaRDD is becoming a common data format used for
bringing data into Spark from external systems, and used for various
components of Spark, e.g. MLlib's new pipeline API. We also expect more and
more users to be programming directly against the SchemaRDD API rather than the
core RDD API. Through its less commonly used DSL, originally designed for
writing test cases, SchemaRDD has always had a data-frame-like API. In 1.3, we
are redesigning that API to make it usable for end users.


There are two motivations for the renaming:

1. DataFrame seems to be a more self-evident name than SchemaRDD.

2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
though it will still offer some RDD-like functions such as map, flatMap, etc.),
and calling it Schema*RDD* when it is not an RDD is highly confusing. Instead,
DataFrame.rdd will return the underlying RDD for all RDD methods.


My understanding is that very few users program directly against the
SchemaRDD API at the moment, because it is not well documented. However, to
maintain backward compatibility, we can create a type alias named SchemaRDD
that points to DataFrame. This will maintain source compatibility for
Scala. That said, we will have to update all existing materials to use
DataFrame rather than SchemaRDD.
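In Scala, such an alias can be as small as one line. A minimal, self-contained sketch of the trick (the names and placement here are purely illustrative, not the actual Spark source):

```scala
// Minimal sketch of source compatibility via a type alias.
// SketchApp, DataFrame's body, and count() are illustrative only.
object SketchApp {
  class DataFrame {
    def count(): Long = 0L
  }

  // The old name simply points at the new class, so existing
  // source that says "SchemaRDD" keeps compiling unchanged.
  type SchemaRDD = DataFrame

  def main(args: Array[String]): Unit = {
    val df: SchemaRDD = new DataFrame // old-style declaration still works
    println(df.count())
  }
}
```

One caveat worth noting: a type alias preserves source compatibility but not binary compatibility, since the alias is erased at compile time, so precompiled jars and Java code referring to the old class would still need attention.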

Re: renaming SchemaRDD -> DataFrame

Patrick Wendell
One thing potentially not clear from this e-mail: there will be a 1:1
correspondence, so you can get an RDD to/from a DataFrame.
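A rough sketch of that round trip, assuming the API shape discussed in this thread (the method names here, e.g. createDataFrame, are assumptions rather than a confirmed final API):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.StructType

// Assumed API sketch: round-tripping between an RDD of Rows and a DataFrame.
def roundTrip(sqlContext: SQLContext,
              rows: RDD[Row],
              schema: StructType): RDD[Row] = {
  val df: DataFrame = sqlContext.createDataFrame(rows, schema) // RDD -> DataFrame
  df.rdd                                                       // DataFrame -> RDD
}
```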

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <[hidden email]> wrote:

> Hi,
>
> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> get the community's opinion.
>
> The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API. We also expect more and
> more users to be programming directly against SchemaRDD API rather than the
> core RDD API. SchemaRDD, through its less commonly used DSL originally
> designed for writing test cases, always has the data-frame like API. In
> 1.3, we are redesigning the API to make the API usable for end users.
>
>
> There are two motivations for the renaming:
>
> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>
> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> though it would contain some RDD functions like map, flatMap, etc), and
> calling it Schema*RDD* while it is not an RDD is highly confusing. Instead.
> DataFrame.rdd will return the underlying RDD for all RDD methods.
>
>
> My understanding is that very few users program directly against the
> SchemaRDD API at the moment, because they are not well documented. However,
> oo maintain backward compatibility, we can create a type alias DataFrame
> that is still named SchemaRDD. This will maintain source compatibility for
> Scala. That said, we will have to update all existing materials to use
> DataFrame rather than SchemaRDD.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: renaming SchemaRDD -> DataFrame

Michael Malak
On the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup video on YouTube contains a wealth of background information on this idea (mostly from Patrick and Reynold :-).

https://www.youtube.com/watch?v=YWppYPWznSQ

________________________________
From: Patrick Wendell <[hidden email]>
To: Reynold Xin <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Sent: Monday, January 26, 2015 4:01 PM
Subject: Re: renaming SchemaRDD -> DataFrame




Re: renaming SchemaRDD -> DataFrame

Koert Kuipers
"The context is that SchemaRDD is becoming a common data format used for
bringing data into Spark from external systems, and used for various
components of Spark, e.g. MLlib's new pipeline API."

i agree. this to me also implies it belongs in spark core, not sql

On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
[hidden email]> wrote:


Re: renaming SchemaRDD -> DataFrame

Koert Kuipers
what i am trying to say is: structured data != sql

On Mon, Jan 26, 2015 at 7:26 PM, Koert Kuipers <[hidden email]> wrote:


Re: renaming SchemaRDD -> DataFrame

Matei Zaharia
Administrator
In reply to this post by Koert Kuipers
While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL though is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, same as Spark Streaming is a library.

Matei

> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <[hidden email]> wrote:


Re: renaming SchemaRDD -> DataFrame

Matei Zaharia
Administrator
(Actually when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.)

Matei

> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <[hidden email]> wrote:


Re: renaming SchemaRDD -> DataFrame

Sandy Ryza
Both SchemaRDD and DataFrame sound fine to me, though I like the former
slightly better because it's more descriptive.

Even if SchemaRDD needs to rely on Spark SQL under the covers, it would
be clearer from a user-facing perspective to at least choose a package
name for it that omits "sql".

I would also be in favor of adding a separate Spark Schema module for Spark
SQL to rely on, but I imagine that might be too large a change at this
point?

-Sandy

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <[hidden email]>
wrote:


Re: renaming SchemaRDD -> DataFrame

Kushal Datta
I want to address the issue Matei raised about the heavy lifting required
for full SQL support. It is amazing that even after 30 years of research
there is not a single good open-source columnar database like Vertica. There
is a column-store option in MySQL, but it is not nearly as sophisticated as
Vertica or MonetDB. Yet there is a true need for such a system; I wonder why
that is, and it's high time to change it.
On Jan 26, 2015 5:47 PM, "Sandy Ryza" <[hidden email]> wrote:


Re: renaming SchemaRDD -> DataFrame

Dirceu Semighini Filho
Can't SchemaRDD remain the same but deprecated, to be removed in, say,
release 1.5 (+/- 1), with the new code added to DataFrame? That way we
wouldn't impact existing code for the next few releases.
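A sketch of what that staged deprecation could look like in Scala (the version strings, annotation message, and names are illustrative assumptions, not a committed plan):

```scala
// Hypothetical staged deprecation, per the suggestion above.
object DeprecationSketch {
  class DataFrame // new user-facing class (body elided in this sketch)

  // 1.3: the old name is kept, but referencing it emits a compiler warning.
  @deprecated("SchemaRDD has been renamed to DataFrame", "1.3.0")
  type SchemaRDD = DataFrame

  // Around 1.5 the alias would be deleted, leaving only DataFrame, so
  // users get a couple of releases of warnings before anything breaks.
}
```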



2015-01-27 0:02 GMT-02:00 Kushal Datta <[hidden email]>:

> > > >>> From: Patrick Wendell <[hidden email]>
> > > >>> To: Reynold Xin <[hidden email]>
> > > >>> Cc: "[hidden email]" <[hidden email]>
> > > >>> Sent: Monday, January 26, 2015 4:01 PM
> > > >>> Subject: Re: renaming SchemaRDD -> DataFrame
> > > >>>
> > > >>>
> > > >>> One thing potentially not clear from this e-mail, there will be a
> 1:1
> > > >>> correspondence where you can get an RDD to/from a DataFrame.
> > > >>>
> > > >>>
> > > >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <[hidden email]>
> > > wrote:
> > > >>>> Hi,
> > > >>>>
> > > >>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and
> > wanted
> > > to
> > > >>>> get the community's opinion.
> > > >>>>
> > > >>>> The context is that SchemaRDD is becoming a common data format
> used
> > > for
> > > >>>> bringing data into Spark from external systems, and used for
> various
> > > >>>> components of Spark, e.g. MLlib's new pipeline API. We also expect
> > > more
> > > >>> and
> > > >>>> more users to be programming directly against SchemaRDD API rather
> > > than
> > > >>> the
> > > >>>> core RDD API. SchemaRDD, through its less commonly used DSL
> > originally
> > > >>>> designed for writing test cases, always has the data-frame like
> API.
> > > In
> > > >>>> 1.3, we are redesigning the API to make the API usable for end
> > users.
> > > >>>>
> > > >>>>
> > > >>>> There are two motivations for the renaming:
> > > >>>>
> > > >>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> > > >>>>
> > > >>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
> > > (even
> > > >>>> though it would contain some RDD functions like map, flatMap,
> etc),
> > > and
> > > >>>> calling it Schema*RDD* while it is not an RDD is highly confusing.
> > > >>> Instead,
> > > >>>> DataFrame.rdd will return the underlying RDD for all RDD methods.
> > > >>>>
> > > >>>>
> > > >>>> My understanding is that very few users program directly against
> the
> > > >>>> SchemaRDD API at the moment, because they are not well documented.
> > > >>> However,
> > > >>>> to maintain backward compatibility, we can create a type alias
> > > DataFrame
> > > >>>> that is still named SchemaRDD. This will maintain source
> > compatibility
> > > >>> for
> > > >>>> Scala. That said, we will have to update all existing materials to
> > use
> > > >>>> DataFrame rather than SchemaRDD.

Re: renaming SchemaRDD -> DataFrame

Evan R. Sparks
In reply to this post by Matei Zaharia
I'm +1 on this, although a little worried about unknowingly introducing
Spark SQL dependencies every time someone wants to use this. It would be
great if the interface could be abstract and the implementation (in this
case, the Spark SQL backend) could be swapped out.

One alternative suggestion on the name - why not call it DataTable?
DataFrame seems like a name carried over from pandas (and by extension, R),
and it's never been obvious to me what a "Frame" is.
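A hedged sketch of the kind of separation suggested above — the names here (DataTable, InMemoryTable) are entirely hypothetical and only illustrate keeping the user-facing interface free of Spark SQL types:

```scala
// Hypothetical user-facing abstraction with no Spark SQL imports.
trait DataTable {
  def schema: Seq[(String, String)] // (column name, type name) pairs
  def select(cols: String*): DataTable
  def collect(): Seq[Seq[Any]]
}

// One possible backend; a Spark SQL-based implementation would live in a
// separate module and be swapped in behind the same trait.
class InMemoryTable(
    val schema: Seq[(String, String)],
    rows: Seq[Seq[Any]]) extends DataTable {

  def select(cols: String*): DataTable = {
    val idx = cols.map(c => schema.indexWhere(_._1 == c))
    new InMemoryTable(idx.map(schema), rows.map(r => idx.map(r)))
  }

  def collect(): Seq[Seq[Any]] = rows
}
```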




Re: renaming SchemaRDD -> DataFrame

Koert Kuipers
In reply to this post by Matei Zaharia
hey matei,
i think that stuff such as SchemaRDD, columnar storage and perhaps also
query planning can be re-used by many systems that do analysis on
structured data. i can imagine pandas-like systems, but also datalog or
scalding-like ones (which we use at tresata and i might rebase on SchemaRDD
at some point). SchemaRDD should become the interface for all these, and
columnar storage abstractions should be re-used between all of them.

currently the sql tie-in goes way beyond just the (perhaps unfortunate)
naming convention. for example, a core part of the SchemaRDD abstraction is
Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone
who wants to build on top of SchemaRDD to dig into catalyst, a SQL parser
(if i understand it correctly, i have not used catalyst, but it looks
neat). i should not need to include a SQL parser just to use structured
data in, say, a pandas-like framework.

best, koert



Re: renaming SchemaRDD -> DataFrame

Mark Hamstra
In master, Reynold has already taken care of moving Row
into org.apache.spark.sql; so, even though the implementation of Row (and
GenericRow et al.) is in Catalyst (which is more optimizer than parser),
that needn't be of concern to users of the API in its most recent state.
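To make that concrete, user code after the refactoring only touches the public package — a sketch against the 1.3-era API (the exact factory and accessor methods may differ):

```scala
// No org.apache.spark.sql.catalyst._ import is needed anymore.
import org.apache.spark.sql.Row

object RowExample {
  val r: Row = Row("alice", 42)     // Row factory in the public package
  val name: String = r.getString(0) // positional, typed accessors
  val age: Int = r.getInt(1)
}
```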


Re: renaming SchemaRDD -> DataFrame

Michael Malak-2
In reply to this post by Evan R. Sparks
I personally have no preference between DataFrame and DataTable, but only wish to lay out the history and etymology, simply because I'm into that sort of thing.

"Frame" comes from Marvin Minsky's 1970s AI construct: "slots" and the data that go in them. The S programming language (precursor to R) adopted this terminology in 1991. R, of course, became popular with the rise of data science around 2012.
http://www.google.com/trends/explore#q=%22data%20science%22%2C%20%22r%20programming%22&cmpt=q&tz=

"DataFrame" would carry the implication that it comes along with its own metadata, whereas "DataTable" might carry the implication that metadata is stored in a central metadata repository.

"DataFrame" is thus technically more correct for SchemaRDD, but it is a less familiar (and so less accessible) term for those not immersed in data science or AI, and may therefore have narrower appeal.




Re: renaming SchemaRDD -> DataFrame

rxin
In reply to this post by Dirceu Semighini Filho
Dirceu,

That is not possible, because one cannot overload a method on its return type alone.

SQLContext.parquetFile (and many other methods) needs to return some type,
and that type cannot be both SchemaRDD and DataFrame.

In 1.3, we will create a type alias for DataFrame called SchemaRDD to not
break source compatibility for Scala.
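The constraint here is a JVM/Scala one — two overloads cannot differ only in their return type — while a type alias makes the two names denote the very same type, so old call sites keep compiling. A minimal standalone sketch (not the actual SQLContext source; names and the path are illustrative):

```scala
class DataFrame // stand-in for the new class

object SqlContextSketch {
  // Adding a second definition that differs only in return type, e.g.
  //   def parquetFile(path: String): SchemaRDD = ...
  // would be rejected: "method parquetFile is defined twice".
  def parquetFile(path: String): DataFrame = new DataFrame
}

object Compat {
  // The 1.3 approach: SchemaRDD becomes another name for the same type.
  type SchemaRDD = DataFrame
}

object OldCallSite {
  import Compat.SchemaRDD
  // Code that ascribes the old name still compiles unchanged:
  val df: SchemaRDD = SqlContextSketch.parquetFile("example.parquet")
}
```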


On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
[hidden email]> wrote:

> Can't the SchemaRDD remain the same, but deprecated, and be removed in the
> release 1.5(+/- 1)  for example, and the new code been added to DataFrame?
> With this, we don't impact in existing code for the next few releases.
>
>
>
> 2015-01-27 0:02 GMT-02:00 Kushal Datta <[hidden email]>:
>
> > I want to address the issue that Matei raised about the heavy lifting
> > required for a full SQL support. It is amazing that even after 30 years
> of
> > research there is not a single good open source columnar database like
> > Vertica. There is a column store option in MySQL, but it is not nearly as
> > sophisticated as Vertica or MonetDB. But there's a true need for such a
> > system. I wonder why so and it's high time to change that.
> > On Jan 26, 2015 5:47 PM, "Sandy Ryza" <[hidden email]> wrote:
> >
> > > Both SchemaRDD and DataFrame sound fine to me, though I like the former
> > > slightly better because it's more descriptive.
> > >
> > > Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
> would
> > > be more clear from a user-facing perspective to at least choose a
> package
> > > name for it that omits "sql".
> > >
> > > I would also be in favor of adding a separate Spark Schema module for
> > Spark
> > > SQL to rely on, but I imagine that might be too large a change at this
> > > point?
> > >
> > > -Sandy
> > >
> > > On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> [hidden email]>
> > > wrote:
> > >
> > > > (Actually when we designed Spark SQL we thought of giving it another
> > > name,
> > > > like Spark Schema, but we decided to stick with SQL since that was
> the
> > > most
> > > > obvious use case to many users.)
> > > >
> > > > Matei
> > > >
> > > > > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
> [hidden email]>
> > > > wrote:
> > > > >
> > > > > While it might be possible to move this concept to Spark Core
> > > long-term,
> > > > supporting structured data efficiently does require quite a bit of
> the
> > > > infrastructure in Spark SQL, such as query planning and columnar
> > storage.
> > > > The intent of Spark SQL though is to be more than a SQL server --
> it's
> > > > meant to be a library for manipulating structured data. Since this is
> > > > possible to build over the core API, it's pretty natural to organize
> it
> > > > that way, same as Spark Streaming is a library.
> > > > >
> > > > > Matei
> > > > >
> > > > >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <[hidden email]>
> > wrote:
> > > > >>
> > > > >> "The context is that SchemaRDD is becoming a common data format
> used
> > > for
> > > > >> bringing data into Spark from external systems, and used for
> various
> > > > >> components of Spark, e.g. MLlib's new pipeline API."
> > > > >>
> > > > >> i agree. this to me also implies it belongs in spark core, not sql
> > > > >>
> > > > >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> > > > >> [hidden email]> wrote:
> > > > >>
Reply | Threaded
Open this post in threaded view
|

Re: renaming SchemaRDD -> DataFrame

rxin
In reply to this post by Koert Kuipers
Koert,

As Mark said, I have already refactored the API so that nothing in catalyst
is exposed (and users won't need it anyway). The data types and Row interface
are both outside the catalyst package, in org.apache.spark.sql.
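As a toy sketch of that separation (hypothetical names, not Spark's actual source): users build and read rows through a small public type and never import anything from the internal package.

```scala
// Toy sketch (hypothetical, not Spark's real code): the public API exposes
// a stable Row type, so callers never touch catalyst-style internals.
object PublicSql {
  // Stand-in for org.apache.spark.sql.Row -- the only type users see.
  final class Row(private val values: IndexedSeq[Any]) {
    def get(i: Int): Any = values(i)
    def getInt(i: Int): Int = values(i).asInstanceOf[Int]
    def getString(i: Int): String = values(i).asInstanceOf[String]
    def length: Int = values.length
  }
  object Row {
    def apply(values: Any*): Row = new Row(values.toIndexedSeq)
  }
}
```

With this, a caller can write `PublicSql.Row(1, "a").getInt(0)` without any dependency on an internal expressions package.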

On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers <[hidden email]> wrote:

> hey matei,
> i think that stuff such as SchemaRDD, columnar storage and perhaps also
> query planning can be re-used by many systems that do analysis on
> structured data. i can imagine panda-like systems, but also datalog or
> scalding-like (which we use at tresata and i might rebase on SchemaRDD at
> some point). SchemaRDD should become the interface for all these. and
> columnar storage abstractions should be re-used between all these.
>
> currently the sql tie in is way beyond just the (perhaps unfortunate)
> naming convention. for example a core part of the SchemaRDD abstraction is
> Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone
> that wants to build on top of SchemaRDD to dig into catalyst, a SQL parser
> (if i understand it correctly, i have not used catalyst, but it looks
> neat). i should not need to include a SQL parser just to use structured
> data in say a panda-like framework.
>
> best, koert
>
>
> On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia <[hidden email]>
> wrote:
>
>> While it might be possible to move this concept to Spark Core long-term,
>> supporting structured data efficiently does require quite a bit of the
>> infrastructure in Spark SQL, such as query planning and columnar storage.
>> The intent of Spark SQL though is to be more than a SQL server -- it's
>> meant to be a library for manipulating structured data. Since this is
>> possible to build over the core API, it's pretty natural to organize it
>> that way, same as Spark Streaming is a library.
>>
>> Matei
>>
>> > On Jan 26, 2015, at 4:26 PM, Koert Kuipers <[hidden email]> wrote:
>> >
>> > "The context is that SchemaRDD is becoming a common data format used for
>> > bringing data into Spark from external systems, and used for various
>> > components of Spark, e.g. MLlib's new pipeline API."
>> >
>> > i agree. this to me also implies it belongs in spark core, not sql
>> >
>> > On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>> > [hidden email]> wrote:
>> >
>> >> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
>> Area
>> >> Spark Meetup YouTube contained a wealth of background information on
>> this
>> >> idea (mostly from Patrick and Reynold :-).
>> >>
>> >> https://www.youtube.com/watch?v=YWppYPWznSQ
>> >>
Reply | Threaded
Open this post in threaded view
|

Re: renaming SchemaRDD -> DataFrame

Koert Kuipers
that's great. guess i was looking at a somewhat stale master branch...

Reply | Threaded
Open this post in threaded view
|

Re: renaming SchemaRDD -> DataFrame

Dmitriy Lyubimov
In reply to this post by rxin
It has been pretty evident for some time that's what it is, hasn't it?

Yes that's a better name IMO.

Reply | Threaded
Open this post in threaded view
|

Re: renaming SchemaRDD -> DataFrame

Dirceu Semighini Filho
In reply to this post by rxin
Reynold,
But with a type alias we will have the same problem, right?
If the methods don't receive SchemaRDD anymore, we will have to change our
code to migrate from SchemaRDD to DataFrame, unless we have an implicit
conversion between DataFrame and SchemaRDD.
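A minimal sketch of that implicit-conversion idea (hypothetical classes and method names, not the real Spark API), assuming SchemaRDD were kept as a separate legacy type:

```scala
import scala.language.implicitConversions

// Hypothetical sketch: if SchemaRDD stayed a distinct (deprecated) class,
// an implicit view could let old call sites accept the new DataFrame.
object ImplicitBridge {
  class DataFrame(val numRows: Int)
  class SchemaRDD(val numRows: Int) // imagine this carries @deprecated

  // The bridge: any DataFrame can be used where a SchemaRDD is expected.
  implicit def dataFrameToSchemaRDD(df: DataFrame): SchemaRDD =
    new SchemaRDD(df.numRows)

  // Stand-in for SQLContext.parquetFile: one fixed return type only.
  def parquetFile(path: String): DataFrame = new DataFrame(3)

  // Old-style code written against SchemaRDD keeps compiling:
  def oldStyle(rdd: SchemaRDD): Int = rdd.numRows
}
```

Here `ImplicitBridge.oldStyle(ImplicitBridge.parquetFile("x.parquet"))` compiles because the implicit view converts the returned DataFrame at the call site.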



2015-01-27 17:18 GMT-02:00 Reynold Xin <[hidden email]>:

> Dirceu,
>
> That is not possible because one cannot overload return types.
>
> SQLContext.parquetFile (and many other methods) needs to return some type,
> and that type cannot be both SchemaRDD and DataFrame.
>
> In 1.3, we will create a type alias for DataFrame called SchemaRDD to not
> break source compatibility for Scala.
>
>
> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> [hidden email]> wrote:
>
>> Can't the SchemaRDD remain the same, but deprecated, and be removed in
>> release 1.5 (+/- 1), for example, with the new code added to DataFrame?
>> With this, we wouldn't impact existing code for the next few releases.
>>
>>
>>
>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <[hidden email]>:
>>
>> > I want to address the issue that Matei raised about the heavy lifting
>> > required for a full SQL support. It is amazing that even after 30 years
>> of
>> > research there is not a single good open source columnar database like
>> > Vertica. There is a column store option in MySQL, but it is not nearly
>> as
>> > sophisticated as Vertica or MonetDB. But there's a true need for such a
>> > system. I wonder why that is, and it's high time to change it.
>> > On Jan 26, 2015 5:47 PM, "Sandy Ryza" <[hidden email]> wrote:
>> >
>> > > Both SchemaRDD and DataFrame sound fine to me, though I like the
>> former
>> > > slightly better because it's more descriptive.
>> > >
>> > > Even if SchemaRDD needs to rely on Spark SQL under the covers, it
>> would
>> > > be more clear from a user-facing perspective to at least choose a
>> package
>> > > name for it that omits "sql".
>> > >
>> > > I would also be in favor of adding a separate Spark Schema module for
>> > Spark
>> > > SQL to rely on, but I imagine that might be too large a change at this
>> > > point?
>> > >
>> > > -Sandy
>> > >
>> > > On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>> [hidden email]>
>> > > wrote:
>> > >
>> > > > (Actually when we designed Spark SQL we thought of giving it another
>> > > name,
>> > > > like Spark Schema, but we decided to stick with SQL since that was
>> the
>> > > most
>> > > > obvious use case to many users.)
>> > > >
>> > > > Matei
>> > > >
Reply | Threaded
Open this post in threaded view
|

Re: renaming SchemaRDD -> DataFrame

Matei Zaharia
Administrator
The type alias means your methods can specify either type and they will work. It's just another name for the same type. But Scaladocs and such will show DataFrame as the type.

Matei
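A minimal sketch of how the alias behaves (hypothetical method names, not the real SQLContext API):

```scala
// Sketch of the 1.3 compatibility shim: SchemaRDD is just a second name
// for DataFrame, so either annotation compiles and no conversion ever runs.
object AliasSketch {
  class DataFrame(rows: Int) { def count(): Long = rows.toLong }
  type SchemaRDD = DataFrame // the proposed type alias

  // One fixed return type; callers may still annotate with SchemaRDD.
  def parquetFile(path: String): DataFrame = new DataFrame(2)
}
```

A call site like `val sr: AliasSketch.SchemaRDD = AliasSketch.parquetFile("p.parquet")` compiles unchanged, and `sr` is the very same object a `DataFrame`-typed binding would hold.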
