Spark Improvement Proposals

Spark Improvement Proposals

Cody Koeninger-2
I love Spark.  3 or 4 years ago it was the first distributed computing
environment that felt usable, and the community was welcoming.

But I just got back from the Reactive Summit, and this is what I observed:

- Industry leaders on stage making fun of Spark's streaming model
- Open source project leaders saying they looked at Spark's governance
as a model to avoid
- Users saying they chose Flink because it was technically superior
and they couldn't get any answers on the Spark mailing lists

Whether you agree with the substance of any of this, when this stuff
gets repeated enough people will believe it.

Right now Spark is suffering from its own success, and I think
something needs to change.

- We need a clear process for planning significant changes to the codebase.
I'm not saying you need to adopt Kafka Improvement Proposals exactly,
but you need a documented process with a clear outcome (e.g. a vote).
Passing around google docs after an implementation has largely been
decided on doesn't cut it.

- All technical communication needs to be public.
Things getting decided in private chat, or when 1/3 of the committers
work for the same company and can just talk to each other...
Yes, it's convenient, but it's ultimately detrimental to the health of
the project.
The way structured streaming has played out has shown that there are
significant technical blind spots (myself included).
One way to address that is to get the people who have domain knowledge
involved, and listen to them.

- We need more committers, and more committer diversity.
Per committer there are, what, more than 20 contributors and 10 new
jira tickets a month?  It's too much.
There are people (I am _not_ referring to myself) who have been around
for years, contributed thousands of lines of code, helped educate the
public around Spark... and yet are never going to be voted in.

- We need a clear process for managing volunteer work.
Too many tickets sit around unowned, unclosed, uncertain.
If someone proposed something and it isn't up to snuff, tell them and
close it.  It may be blunt, but it's clearer than "silent no".
If someone wants to work on something, let them own the ticket and set
a deadline. If they don't meet it, close it or reassign it.

This is not me putting on an Apache Bureaucracy hat.  This is me
saying, as a fellow hacker and loyal dissenter, something is wrong
with the culture and process.

Please, let's change it.

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Spark Improvement Proposals

Dean Wampler
I was there, too. I agree with Cody's assessments and recommendations.

Dean

Sent from my rotary phone.


> On Oct 6, 2016, at 9:51 PM, Cody Koeninger <[hidden email]> wrote:

Re: Spark Improvement Proposals

Matei Zaharia
In reply to this post by Cody Koeninger-2
Hey Cody,

Thanks for bringing these things up. You're talking about quite a few different things here, but let me get to them each in turn.

1) About technical / design discussion -- I fully agree that everything big should go through a lot of review, and I like the idea of a more formal way to propose and comment on larger features. So far, all of this has been done through JIRA, but as a start, maybe marking JIRAs as large (we often use Umbrella for this) and also opening a thread on the list about each such JIRA would help. For Structured Streaming in particular, FWIW, there was a pretty complete doc on the proposed semantics at https://issues.apache.org/jira/browse/SPARK-8360 since March. But it's true that other things such as the Kafka source for it didn't have as much design on JIRA. Nonetheless, this component is still early on and there's still a lot of time to change it, which is happening.

2) About what people say at Reactive Summit -- there will always be trolls, but just ignore them and build a great project. Those of us involved in the project for a while have long seen similar stuff, e.g. a prominent company saying Spark doesn't scale past 100 nodes when there were many documented instances to the contrary, and the best answer is to just make the project better. This same company, if you read their website now, recommends Apache Spark for most anything. For streaming in particular, there is a lot of confusion because many of the concepts aren't well-defined (e.g. what is "at least once", etc), and it's also a crowded space. But Spark Streaming prioritizes a few things that it does very well: correctness (you can easily tell what the app will do, and it does the same thing despite failures), ease of programming (which also requires correctness), and scalability. We should of course both explain what it does in more places and work on improving it where needed (e.g. adding a higher level API with Structured Streaming and built-in primitives for external timestamps).

3) About number and diversity of committers -- the PMC is always working to expand these, and you should email people on the PMC (or even the whole list) if you have people you'd like to propose. In general I think nearly all committers added in the past year were from organizations that haven't long been involved in Spark, and the number of committers continues to grow pretty fast.

4) Finally, about better organizing JIRA, marking dead issues, etc, this would be great and I think we just need a concrete proposal for how to do it. It would be best to point to an existing process that someone else has used here BTW so that we can see it in action.

Matei

> On Oct 6, 2016, at 7:51 PM, Cody Koeninger <[hidden email]> wrote:

Re: Spark Improvement Proposals

Xiao Li
Let us continue to improve Apache Spark!

I volunteer to go through all the SQL-related open JIRAs.

Xiao Li

2016-10-06 21:14 GMT-07:00 Matei Zaharia <[hidden email]>:


Re: Spark Improvement Proposals

Sean Owen
In reply to this post by Matei Zaharia
Suggested actions way at the bottom.

On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia <[hidden email]> wrote:
> since March. But it's true that other things such as the Kafka source for it didn't have as much design on JIRA. Nonetheless, this component is still early on and there's still a lot of time to change it, which is happening.

It's hard to drive design discussions in OSS. Even when diligently publishing design docs, the doc happens after brainstorming, and that happens inside someone's head or in chats.

The lazy consensus model that works for small changes doesn't work well here. If a committer wants a change, that change will basically be made modulo small edits; vetoes are for dire disagreement. (Otherwise we'd get nothing done.) However this model means it's hard to significantly change a design after draft 1.

I've heard this complaint a few times, and it has never been down to bad faith. We should err further towards over-including early and often. I've seen some great discussions start more with a problem statement and an RFC, not a design doc. Keeping regular contributors enfranchised is essential, so that they're willing and able to participate when design time comes. (See below.)

 
> 2) About what people say at Reactive Summit -- there will always be trolls, but just ignore them and build a great project. Those of us involved in the project for a while have long seen similar stuff, e.g. a

The hype cycle may be turning against Spark, as is normal for this stage of maturity. People idealize technologies they don't really use as greener grass; it's the things they use and need to work that they love to hate.

I would not dismiss this as just trolling. Customer anecdotes I see suggest that Spark underperforms their (inflated) expectations, and generally does not Just Work. It takes expertise, tuning, patience, workarounds. And then it gets great things done. I do see a gap between how the group here talks about the technology, and how the users I see talk about it. The gap manifests in attention given to making yet more things, and attention given to fixing and project mechanics.

I would also not dismiss criticism of governance. We can recognize some big problems that were resolved over even the past 3 months. Usually I hear, well, we do better than most projects, right? and that is true. But, Spark is bigger and busier than most any other project. Exceptional projects need exceptional governance and we have merely "good". See next.
 

> 3) About number and diversity of committers -- the PMC is always working to expand these, and you should email people on the PMC (or even the whole list) if you have people you'd like to propose. In

If you're suggesting that it's mostly a matter of asking, then this doesn't match my experience. I have seen a few people consistently soft-reject most proposals. The reasons given usually sound like "concerns about quality", which is probably the right answer to a somewhat wrong question.

We should probably be asking primarily who will net-net add efficiency to some part of the project's mechanics. Per above, it wouldn't hurt to ask who would expand coverage and add diversity of perspective too. 

I disagree that committers are being added at a sufficient rate. Overall committer-attention hours are dropping as the project grows -- am I the only one who perceives that many regular committers aren't working nearly as much as before on the project?

I call it a problem because we have IMHO people who 'qualify', and not giving them some stake is going to cost the project down the road. Always Be Recruiting. This is what I would worry about, since the governance and enfranchisement issues above kind of stem from this.

 
> 4) Finally, about better organizing JIRA, marking dead issues, etc, this would be great and I think we just need a concrete proposal for how to do it. It would be best to point to an existing process that someone else has used here BTW so that we can see it in action.

I don't think we're wanting for proposals. I went on and on about it last year, and don't think anyone disagreed about actions. I wouldn't suggest that clearing out dead issues is more complex than just putting in time to do it. It's just grunt work and understandably not appealing. (Thank you Xiao for your recent run at SQL JIRAs.)

It requires saying 'no', which is hard, because it requires some conviction. I have encountered reluctance to do this in Spark and think that culture should change. Is it weird to say that a broader group of gatekeepers can actually with more confidence and efficiency tackle the triage issue? that pushing back on 'bad' contribution actually increases the rate of 'good'?

FWIW I also find the project unpleasant to deal with day to day, mostly because of the scale of the triage, and think we could use all the qualified help we can get. I am looking to do less with the project over time, which is no big deal in itself, but is a big deal if these several factors are adding up to discourage fresh blood from joining the fray. Cody makes me think there are, at least, 2 of us. 

Concrete steps?

Go to spark-prs.com. Look at "Users". Look at your open PRs. Are any stale? can you close them or advance them? 

Look at the Stale PRs tab and sort by last updated. Do any look dead? can you ask the author to update or close? does the parent JIRA look like it's not otherwise relevant?

Go download JIRA Client at http://almworks.com/jiraclient/download.html Go look at all open JIRAs sorted by last update. Are any pretty obviously obsolete? 

If you don't feel comfortable acting, feel free to at least propose a list to dev@ for a look.

Re: Spark Improvement Proposals

Cody Koeninger-2
Sean, that was very eloquently put, and I 100% agree.  If I ever meet
you in person, I'll buy you multiple rounds of beverages of your
choice ;)
This is probably reiterating some of what you said in a less clear
manner, but I'll throw more of my 2 cents in.

- Design.
Yes, design by committee doesn't work.  The best designs are when a
person who understands the problem builds something that works for
them, shares with others, and most importantly iterates when it
doesn't work for others.  This iteration only works if you're willing
to change interfaces, but committer and user goals are not aligned
here.  Users want something that is clearly documented and helps them
get their job done.  Committers (not all) want to minimize interface
change, even at the expense of users being able to do their jobs.  In
this situation, it is critical that you understand early what users
need to be able to do.  This is what the improvement proposal process
should focus on: Goals, non-goals, possible solutions, rejected
solutions.  Not class-level design.  Most importantly, it needs a
clear, unambiguous outcome that is visible to the public.

- Trolling
It's not just trolling.  Event time and kafka are technically
important and should not be ignored.  I've been banging this drum for
years.  These concerns haven't been fully heard and understood by
committers.  This is one example of why diversity of enfranchised users
is important and governance concerns shouldn't be ignored.

- Jira
Concretely, automate closing stale jiras after X amount of time.  It's
really surprising to me how much reluctance a community of programmers
have shown towards automating their own processes around stuff like
this (not to mention automatic code formatting of modified files).  I
understand the arguments against, but the current alternative doesn't
work.
Concretely, clearly reject and close jiras.  I have a backlog of 50+
kafka jiras, many of which are irrelevant at this point, but I do not
feel that I have the political power to close them.
Concretely, make it clear who is working on something.  This can be as
simple as just "I'm working on this", assign it to me, if I don't
follow up in X amount of time, close it or reassign.  That doesn't
mean there can't be competing work, but it does mean those people
should talk to each other.  Conversely, if committers currently don't
have time to work on something that is important, make that clear in
the ticket.
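
To make the "X amount of time" rule above concrete, here is a minimal sketch in Python. It is an illustration only, not an existing Spark or JIRA tool: the Ticket record, the triage function, and the 180-day threshold are all hypothetical, and it works on an in-memory list rather than calling any JIRA API. Idle tickets with no owner are proposed for closing; idle tickets with an owner are proposed for reassignment or closing after a follow-up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Ticket:
    # Hypothetical in-memory record, not a JIRA API object.
    key: str
    last_updated: datetime
    assignee: Optional[str] = None


def triage(tickets: List[Ticket], now: datetime,
           max_idle: timedelta) -> Tuple[List[str], List[str]]:
    """Return (keys to close, keys to reassign) for tickets idle past max_idle."""
    to_close: List[str] = []
    to_reassign: List[str] = []
    for ticket in tickets:
        if now - ticket.last_updated <= max_idle:
            continue  # touched recently enough, leave it alone
        if ticket.assignee is None:
            to_close.append(ticket.key)      # idle and unowned
        else:
            to_reassign.append(ticket.key)   # idle but someone claimed it
    return to_close, to_reassign


# Example run with made-up ticket keys and an assumed 180-day threshold.
now = datetime(2016, 10, 7)
tickets = [
    Ticket("EXAMPLE-1", datetime(2015, 1, 1)),
    Ticket("EXAMPLE-2", datetime(2016, 10, 1), "alice"),
    Ticket("EXAMPLE-3", datetime(2016, 1, 1), "bob"),
]
print(triage(tickets, now, timedelta(days=180)))
```

The point of returning two lists instead of closing anything directly is that a human still gets to say the explicit "no" on each ticket; the automation only produces the list to act on.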


On Fri, Oct 7, 2016 at 5:34 AM, Sean Owen <[hidden email]> wrote:

> Suggestion actions way at the bottom.
>
> On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia <[hidden email]>
> wrote:
>>
>> since March. But it's true that other things such as the Kafka source for
>> it didn't have as much design on JIRA. Nonetheless, this component is still
>> early on and there's still a lot of time to change it, which is happening.
>
>
> It's hard to drive design discussions in OSS. Even when diligently
> publishing design docs, the doc happens after brainstorming, and that
> happens inside someone's head or in chats.
>
> The lazy consensus model that works for small changes doesn't work well
> here. If a committer wants a change, that change will basically be made
> modulo small edits; vetoes are for dire disagreement. (Otherwise we'd get
> nothing done.) However this model means it's hard to significantly change a
> design after draft 1.
>
> I've heard this complaint a few times, and it has never been down to bad
> faith. We should err further towards over-including early and often. I've
> seen some great discussions start more with a problem statement and an RFC,
> not a design doc. Keeping regular contributors enfranchised is essential, so
> that they're willing and able to participate when design time comes. (See
> below.)
>
>
>>
>> 2) About what people say at Reactive Summit -- there will always be
>> trolls, but just ignore them and build a great project. Those of us involved
>> in the project for a while have long seen similar stuff, e.g. a
>
>
> The hype cycle may be turning against Spark, as is normal for this stage of
> maturity. People idealize technologies they don't really use as greener
> grass; it's the things they use and need to work that they love to hate.
>
> I would not dismiss this as just trolling. Customer anecdotes I see suggest
> that Spark underperforms their (inflated) expectations, and generally does
> not Just Work. It takes expertise, tuning, patience, workarounds. And then
> it gets great things done. I do see a gap between how the group here talks
> about the technology, and how the users I see talk about it. The gap
> manifests in attention given to making yet more things, and attention given
> to fixing and project mechanics.
>
> I would also not dismiss criticism of governance. We can recognize some big
> problems that were resolved over even the past 3 months. Usually I hear,
> well, we do better than most projects, right? and that is true. But, Spark
> is bigger and busier than most any other project. Exceptional projects need
> exceptional governance and we have merely "good". See next.
>
>
>> 3) About number and diversity of committers -- the PMC is always working
>> to expand these, and you should email people on the PMC (or even the whole
>> list) if you have people you'd like to propose. In
>
>
> If you're suggesting that it's mostly a matter of asking, then this doesn't
> match my experience. I have seen a few people consistently soft-reject most
> proposals. The reasons given usually sound like "concerns about quality",
> which is probably the right answer to a somewhat wrong question.
>
> We should probably be asking primarily who will net-net add efficiency to
> some part of the project's mechanics. Per above, it wouldn't hurt to ask who
> would expand coverage and add diversity of perspective too.
>
> I disagree that committers are being added at a sufficient rate. The overall
> committer-attention hours is dropping as the project grows -- am I the only
> one that perceives many regular committers aren't working nearly as much as
> before on the project?
>
> I call it a problem because we have IMHO people who 'qualify', and not
> giving them some stake is going to cost the project down the road. Always Be
> Recruiting. This is what I would worry about, since the governance and
> enfranchisement issues above kind of stem from this.
>
>
>>
>> 4) Finally, about better organizing JIRA, marking dead issues, etc, this
>> would be great and I think we just need a concrete proposal for how to do
>> it. It would be best to point to an existing process that someone else has
>> used here BTW so that we can see it in action.
>
>
> I don't think we're wanting for proposals. I went on and on about it last
> year, and don't think anyone disagreed about actions. I wouldn't suggest
> that clearing out dead issues is more complex than just putting in time to
> do it. It's just grunt work and understandably not appealing. (Thank you
> Xiao for your recent run at SQL JIRAs.)
>
> It requires saying 'no', which is hard, because it requires some conviction.
> I have encountered reluctance to do this in Spark and think that culture
> should change. Is it weird to say that a broader group of gatekeepers can
> actually with more confidence and efficiency tackle the triage issue? that
> pushing back on 'bad' contribution actually increases the rate of 'good'?
>
> FWIW I also find the project unpleasant to deal with day to day, mostly
> because of the scale of the triage, and think we could use all the qualified
> help we can get. I am looking to do less with the project over time, which
> is no big deal in itself, but is a big deal if these several factors are
> adding up to discourage fresh blood from joining the fray. Cody makes me
> think there are, at least, 2 of us.
>
> Concrete steps?
>
> Go to spark-prs.com. Look at "Users". Look at your open PRs. Are any stale?
> can you close them or advance them?
>
> Look at the Stale PRs tab and sort by last updated. Do any look dead? can
> you ask the author to update or close? does the parent JIRA look like it's
> not otherwise relevant?
>
> Go download JIRA Client at http://almworks.com/jiraclient/download.html Go
> look at all open JIRAs sorted by last update. Are any pretty obviously
> obsolete?
>
> If you don't feel comfortable acting, feel free to at least propose a list
> to dev@ for a look.

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Spark Improvement Proposals

Holden Karau
First off, thanks Cody for taking the time to put together these proposals - I think it has kicked off some wonderful discussion.

I think dismissing people's complaints about Spark as largely trolling does us a disservice. It’s important for us to recognize our own shortcomings - otherwise we are blind to the weak spots where we need to improve and instead focus on new features. Parts of the Python community seem to be actively looking for alternatives, and I’d obviously like Spark to continue to be the place where we come together and collaborate from different languages.

I’d be more than happy to do a review of the outstanding Python PRs (I’ve been keeping on top of the new ones but largely haven’t looked at the older ones) and if there is a committer (maybe Davies or Sean?) who would be able to help out with merging them once they are ready that would be awesome. I’m at PyData DC this weekend but I’ll also start going through some of the older Python JIRAs and seeing if they are still relevant, already fixed, or something we are unlikely to be interested in bringing into Spark.

I’m giving a talk later this month at OSCON London on how to get started contributing to Apache Spark, and when I’ve given this talk before I’ve had to include a fair number of warnings about the challenges that can face a new contributor. I’d love to be able to drop those in future versions :)

P.S.

As one of the non-committers who has been working on Spark for several years (see http://bit.ly/hkspmg ) I have strong feelings around the current process being used for committers - but since I’m not on the PMC (catch-22 style) it's difficult to have any visibility into the process, so someone who does will have to weigh in on that :)


On Fri, Oct 7, 2016 at 8:00 AM, Cody Koeninger <[hidden email]> wrote:
Sean, that was very eloquently put, and I 100% agree.  If I ever meet
you in person, I'll buy you multiple rounds of beverages of your
choice ;)
This is probably reiterating some of what you said in a less clear
manner, but I'll throw more of my 2 cents in.

- Design.
Yes, design by committee doesn't work.  The best designs are when a
person who understands the problem builds something that works for
them, shares with others, and most importantly iterates when it
doesn't work for others.  This iteration only works if you're willing
to change interfaces, but committer and user goals are not aligned
here.  Users want something that is clearly documented and helps them
get their job done.  Committers (not all) want to minimize interface
change, even at the expense of users being able to do their jobs.  In
this situation, it is critical that you understand early what users
need to be able to do.  This is what the improvement proposal process
should focus on: Goals, non-goals, possible solutions, rejected
solutions.  Not class-level design.  Most importantly, it needs a
clear, unambiguous outcome that is visible to the public.

- Trolling
It's not just trolling.  Event time and kafka are technically
important and should not be ignored.  I've been banging this drum for
years.  These concerns haven't been fully heard and understood by
committers.  This is one example of why diversity of enfranchised users
is important and governance concerns shouldn't be ignored.

- Jira
Concretely, automate closing stale jiras after X amount of time.  It's
really surprising to me how much reluctance a community of programmers
has shown towards automating its own processes around stuff like
this (not to mention automatic code formatting of modified files).  I
understand the arguments against, but the current alternative doesn't
work.
Concretely, clearly reject and close jiras.  I have a backlog of 50+
kafka jiras, many of which are irrelevant at this point, but I do not
feel that I have the political power to close them.
Concretely, make it clear who is working on something.  This can be as
simple as just "I'm working on this", assign it to me, if I don't
follow up in X amount of time, close it or reassign.  That doesn't
mean there can't be competing work, but it does mean those people
should talk to each other.  Conversely, if committers currently don't
have time to work on something that is important, make that clear in
the ticket.


On Fri, Oct 7, 2016 at 5:34 AM, Sean Owen <[hidden email]> wrote:
> Suggested actions way at the bottom.
>
> On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia <[hidden email]>
> wrote:
>>
>> since March. But it's true that other things such as the Kafka source for
>> it didn't have as much design on JIRA. Nonetheless, this component is still
>> early on and there's still a lot of time to change it, which is happening.
>
>
> It's hard to drive design discussions in OSS. Even when diligently
> publishing design docs, the doc happens after brainstorming, and that
> happens inside someone's head or in chats.
>
> The lazy consensus model that works for small changes doesn't work well
> here. If a committer wants a change, that change will basically be made
> modulo small edits; vetoes are for dire disagreement. (Otherwise we'd get
> nothing done.) However this model means it's hard to significantly change a
> design after draft 1.
>
> I've heard this complaint a few times, and it has never been down to bad
> faith. We should err further towards over-including early and often. I've
> seen some great discussions start more with a problem statement and an RFC,
> not a design doc. Keeping regular contributors enfranchised is essential, so
> that they're willing and able to participate when design time comes. (See
> below.)
>
>
>>
>> 2) About what people say at Reactive Summit -- there will always be
>> trolls, but just ignore them and build a great project. Those of us involved
>> in the project for a while have long seen similar stuff, e.g. a
>
>
> The hype cycle may be turning against Spark, as is normal for this stage of
> maturity. People idealize technologies they don't really use as greener
> grass; it's the things they use and need to work that they love to hate.
>
> I would not dismiss this as just trolling. Customer anecdotes I see suggest
> that Spark underperforms their (inflated) expectations, and generally does
> not Just Work. It takes expertise, tuning, patience, workarounds. And then
> it gets great things done. I do see a gap between how the group here talks
> about the technology, and how the users I see talk about it. The gap
> manifests in attention given to making yet more things, and attention given
> to fixing and project mechanics.
>
> I would also not dismiss criticism of governance. We can recognize some big
> problems that were resolved over even the past 3 months. Usually I hear,
> well, we do better than most projects, right? and that is true. But, Spark
> is bigger and busier than most any other project. Exceptional projects need
> exceptional governance and we have merely "good". See next.
>
>
>> 3) About number and diversity of committers -- the PMC is always working
>> to expand these, and you should email people on the PMC (or even the whole
>> list) if you have people you'd like to propose. In
>
>
> If you're suggesting that it's mostly a matter of asking, then this doesn't
> match my experience. I have seen a few people consistently soft-reject most
> proposals. The reasons given usually sound like "concerns about quality",
> which is probably the right answer to a somewhat wrong question.
>
> We should probably be asking primarily who will net-net add efficiency to
> some part of the project's mechanics. Per above, it wouldn't hurt to ask who
> would expand coverage and add diversity of perspective too.
>
> I disagree that committers are being added at a sufficient rate. The overall
> number of committer-attention hours is dropping as the project grows -- am I
> the only one that perceives many regular committers aren't working nearly as
> much as before on the project?
>
> I call it a problem because we have IMHO people who 'qualify', and not
> giving them some stake is going to cost the project down the road. Always Be
> Recruiting. This is what I would worry about, since the governance and
> enfranchisement issues above kind of stem from this.
>
>
>>
>> 4) Finally, about better organizing JIRA, marking dead issues, etc, this
>> would be great and I think we just need a concrete proposal for how to do
>> it. It would be best to point to an existing process that someone else has
>> used here BTW so that we can see it in action.
>
>
> I don't think we're wanting for proposals. I went on and on about it last
> year, and don't think anyone disagreed about actions. I wouldn't suggest
> that clearing out dead issues is more complex than just putting in time to
> do it. It's just grunt work and understandably not appealing. (Thank you
> Xiao for your recent run at SQL JIRAs.)
>
> It requires saying 'no', which is hard, because it requires some conviction.
> I have encountered reluctance to do this in Spark and think that culture
> should change. Is it weird to say that a broader group of gatekeepers can
> actually with more confidence and efficiency tackle the triage issue? that
> pushing back on 'bad' contribution actually increases the rate of 'good'?
>
> FWIW I also find the project unpleasant to deal with day to day, mostly
> because of the scale of the triage, and think we could use all the qualified
> help we can get. I am looking to do less with the project over time, which
> is no big deal in itself, but is a big deal if these several factors are
> adding up to discourage fresh blood from joining the fray. Cody makes me
> think there are, at least, 2 of us.
>
> Concrete steps?
>
> Go to spark-prs.com. Look at "Users". Look at your open PRs. Are any stale?
> can you close them or advance them?
>
> Look at the Stale PRs tab and sort by last updated. Do any look dead? can
> you ask the author to update or close? does the parent JIRA look like it's
> not otherwise relevant?
>
> Go download JIRA Client at http://almworks.com/jiraclient/download.html and
> look at all open JIRAs sorted by last update. Are any pretty obviously
> obsolete?
>
> If you don't feel comfortable acting, feel free to at least propose a list
> to dev@ for a look.
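
A minimal sketch of the triage pass in the concrete steps above, assuming the open issues have already been exported into a plain list. The `Issue` type, the `DEMO-*` keys, and the 180-day threshold are hypothetical and are not part of JIRA Client or spark-prs.com. It only produces a list that could be proposed to dev@ and closes nothing itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Issue:
    key: str                 # hypothetical ticket id, e.g. "DEMO-1"
    summary: str
    last_updated: datetime


def stale_candidates(issues, now, max_idle_days=180):
    """Return issues idle longer than max_idle_days, oldest update first."""
    cutoff = now - timedelta(days=max_idle_days)
    idle = [issue for issue in issues if issue.last_updated < cutoff]
    return sorted(idle, key=lambda issue: issue.last_updated)


if __name__ == "__main__":
    now = datetime(2016, 10, 7)
    issues = [
        Issue("DEMO-1", "Old streaming bug", datetime(2015, 1, 15)),
        Issue("DEMO-2", "Recent SQL question", datetime(2016, 9, 30)),
        Issue("DEMO-3", "Abandoned feature request", datetime(2014, 6, 1)),
    ]
    # Print the candidates a person would then review by hand.
    for issue in stale_candidates(issues, now):
        idle_days = (now - issue.last_updated).days
        print(f"{issue.key}: idle {idle_days} days - {issue.summary}")
```

Sorting by last update puts the most likely dead issues first, so a reviewer can stop as soon as the entries start to look relevant again.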


Re: Spark Improvement Proposals

Matei Zaharia
I think people misunderstood my comment about trolls a bit -- I'm not saying to just dismiss what people say, but to focus on what improves the project instead of being upset that people criticize stuff. This stuff happens all the time to any project in a "hot" area, as Sean said. I don't think there's anyone that wants to stop adding features to streaming for example, or stop listening to users, etc, or who thinks the project is already perfect (I certainly spend much of my time looking at how to improve it).

Just to comment on a few things:

On Oct 7, 2016, at 9:16 AM, Holden Karau <[hidden email]> wrote:

First off, thanks Cody for taking the time to put together these proposals - I think it has kicked off some wonderful discussion.

I think dismissing people's complaints about Spark as largely trolling does us a disservice. It’s important for us to recognize our own shortcomings - otherwise we are blind to the weak spots where we need to improve and instead focus on new features. Parts of the Python community seem to be actively looking for alternatives, and I’d obviously like Spark to continue to be the place where we come together and collaborate from different languages.

I’d be more than happy to do a review of the outstanding Python PRs (I’ve been keeping on top of the new ones but largely haven’t looked at the older ones) and if there is a committer (maybe Davies or Sean?) who would be able to help out with merging them once they are ready that would be awesome. I’m at PyData DC this weekend but I’ll also start going through some of the older Python JIRAs and seeing if they are still relevant, already fixed, or something we are unlikely to be interested in bringing into Spark.

It would be great to also hear why people are looking for other stuff at a high level -- are there just many small issues in Python, or are there some bigger things missing? For example, one thing I'd like to see is easy installation of PySpark using pip install pyspark. Another idea would be making startup time and initialization easy enough that people use Spark regularly on a single machine, as a replacement for multiprocessing.

- Design.
Yes, design by committee doesn't work.  The best designs are when a
person who understands the problem builds something that works for
them, shares with others, and most importantly iterates when it
doesn't work for others.  This iteration only works if you're willing
to change interfaces, but committer and user goals are not aligned
here.  Users want something that is clearly documented and helps them
get their job done.  Committers (not all) want to minimize interface
change, even at the expense of users being able to do their jobs.  In
this situation, it is critical that you understand early what users
need to be able to do.  This is what the improvement proposal process
should focus on: Goals, non-goals, possible solutions, rejected
solutions.  Not class-level design.  Most importantly, it needs a
clear, unambiguous outcome that is visible to the public.

Love the idea of a more visible "Spark Improvement Proposal" process that solicits user input on new APIs. For what it's worth, I don't think committers are trying to minimize their own work -- every committer cares about making the software useful for users. However, it is always hard to get user input and so it helps to have this kind of process. I've certainly looked at the *IPs a lot in other software I use just to see the biggest things on the roadmap.

When you're talking about "changing interfaces", are you talking about public or internal APIs? I do think many people hate changing public APIs and I actually think that's for the best of the project. That's a technical debate, but basically, the worst thing when you're using a piece of software is that the developers constantly ask you to rewrite your app to update to a new version (and thus benefit from bug fixes, etc). Cue anyone who's used Protobuf, or Guava. The "let's get everyone to change their code this release" model works well within a single large company, but doesn't work well for a community, which is why nearly all *very* widely used programming interfaces (I'm talking things like Java standard library, Windows API, etc) almost *never* break backwards compatibility. All this is done within reason though, e.g. we do change things in major releases (2.x, 3.x, etc).

- Trolling
It's not just trolling.  Event time and kafka are technically
important and should not be ignored.  I've been banging this drum for
years.  These concerns haven't been fully heard and understood by
committers.  This is one example of why diversity of enfranchised users
is important and governance concerns shouldn't be ignored.

I agree about empowering people interested here to contribute, but I'm wondering, do you think there are technical things that people don't want to work on, or is it a matter of what there's been time to do? Everyone I know does want great Kafka support, event time, etc, it's just a question of working out the details and of course of getting the coding done. This is also an area where I'd love to see more contributions -- in the past, people have done similar-scale contributions in other areas (e.g. better integration with Hive, on-the-wire encryption, etc).

FWIW, I think there are three things going on with streaming.

1) Structured Streaming, which is meant to provide a much higher-level new API. This was meant from the beginning to include event time, various complex forms of windows, and great data source and sink support in a unified framework. It's also, IMHO, much simpler than most existing APIs for this stuff (i.e. look at the number of concepts you have to learn for those versus for this). However, this project is still very early on -- only the bare minimum API came out in 2.0. It's marked as alpha and it's precisely the type of system where I'd expect the API to improve in response to feedback. As with other APIs, such as Spark SQL's SchemaRDD and DataFrame, I think it's good to get it in front of *users* quickly and receive feedback -- even developers discussing among themselves can't anticipate all user needs.

2) Adding things in Spark Streaming. I haven't personally worked much on this lately, but it is a very reasonable thing that I'd love to see the project do to help current users. For example, consider adding an aggregate-by-event-time operator to Spark Streaming (it can be done using mapWithState), or a sessionization operator, etc.

3) Another thing that I think is possible is just lowering the latency of both Spark Streaming and Structured Streaming by 10x -- a few folks at Berkeley have been working on this (https://spark-summit.org/2016/events/low-latency-execution-for-apache-spark/). Happy to fork off a thread about how to do it. Their current system requires some new concepts in the Spark scheduler, but from measuring stuff it also seems that you can get somewhere with less intensive changes (most of the overhead is in RPCs, not in the scheduling logic or task execution).
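
To make the aggregate-by-event-time idea in point 2 concrete, here is a rough sketch in plain Python. It does not use Spark or the mapWithState API; it only illustrates the kind of keyed state update such an operator would perform, and the function names, the record layout `(key, event_time, value)`, and the 60-second tumbling window are all assumptions made for illustration. A real operator would also need a policy for evicting old windows, which is left out here.

```python
from collections import defaultdict


def window_start(event_time, window_seconds):
    """Start of the tumbling window that contains event_time (in seconds)."""
    return event_time - (event_time % window_seconds)


def update_state(state, batch, window_seconds):
    """Fold one micro-batch of (key, event_time, value) records into state.

    state maps (key, window_start) -> running sum. Records are grouped by the
    timestamp they carry, not by the batch they arrived in, so a late record
    still updates the window it belongs to.
    """
    for key, event_time, value in batch:
        state[(key, window_start(event_time, window_seconds))] += value
    return state


if __name__ == "__main__":
    state = defaultdict(int)
    # Two arrival batches; the second carries a late record for window [0, 60).
    batches = [
        [("sensor-a", 5, 1), ("sensor-a", 61, 2)],
        [("sensor-a", 59, 10), ("sensor-b", 130, 4)],
    ]
    for batch in batches:
        update_state(state, batch, window_seconds=60)
    for (key, start), total in sorted(state.items()):
        print(key, start, total)
```

The point of keeping state per (key, window) rather than per batch is that arrival order stops mattering: the late `sensor-a` record at time 59 is added to the same window as the record at time 5.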

- Jira
Concretely, automate closing stale jiras after X amount of time.  It's
really surprising to me how much reluctance a community of programmers
has shown towards automating its own processes around stuff like
this (not to mention automatic code formatting of modified files).  I
understand the arguments against, but the current alternative doesn't
work.
Concretely, clearly reject and close jiras.  I have a backlog of 50+
kafka jiras, many of which are irrelevant at this point, but I do not
feel that I have the political power to close them.
Concretely, make it clear who is working on something.  This can be as
simple as just "I'm working on this", assign it to me, if I don't
follow up in X amount of time, close it or reassign.  That doesn't
mean there can't be competing work, but it does mean those people
should talk to each other.  Conversely, if committers currently don't
have time to work on something that is important, make that clear in
the ticket.

Definitely agree with marking who's working on something early on, and timing it out if inactive. For closing JIRAs, I think the best way I've seen is for people to go through them once in a while. Automated closing is too impersonal IMO -- if I opened a JIRA on a project and it was closed automatically without anybody having looked at it, I'd actively feel ignored. If you do that, you'll see people on stage saying "I reported a bug for Spark and some bot just closed it after 3 months", which is not ideal.

Matei
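
One possible shape for the "mark who is working on it, and time it out if inactive" rule discussed above, sketched in plain Python. The `Claim` type, the ticket ids, and the 30-day timeout are hypothetical and do not correspond to any JIRA API. In line with the concern about impersonal bots, this sketch only releases the assignment and never closes a ticket.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Claim:
    ticket: str                  # hypothetical ticket id
    assignee: Optional[str]      # None means nobody has claimed the ticket
    last_activity: datetime


def expire_inactive_claims(claims, now, timeout_days=30):
    """Unassign claims with no activity for timeout_days; return their ids.

    Nothing is closed here: expired tickets are only released so that someone
    else can pick them up, or a person can decide what to do with them.
    """
    timeout = timedelta(days=timeout_days)
    released = []
    for claim in claims:
        if claim.assignee is not None and now - claim.last_activity > timeout:
            claim.assignee = None
            released.append(claim.ticket)
    return released


if __name__ == "__main__":
    now = datetime(2016, 10, 7)
    claims = [
        Claim("T-1", "alice", datetime(2016, 8, 1)),   # idle for over 30 days
        Claim("T-2", "bob", datetime(2016, 10, 1)),    # recently active
        Claim("T-3", None, datetime(2015, 1, 1)),      # never claimed
    ]
    print(expire_inactive_claims(claims, now))
```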






Re: Spark Improvement Proposals

Nicholas Chammas

There are several important discussions happening simultaneously. Should we perhaps split them up into separate threads? Otherwise it’s really difficult to follow.

It seems like the discussion about having a more formal “Spark Improvement Proposal” process should take priority here.

Other discussions that could be fleshed out in separate threads are:

  • Better managing “organic” community contributions (i.e. PRs, JIRA issues, etc).
  • Adjusting Spark’s governance model / adding more committers.
  • Discussing / addressing competition to Spark coming out of the Python community.

Nick

On Fri, Oct 7, 2016 at 1:04 PM Matei Zaharia matei.zaharia@... wrote:

I think people misunderstood my comment about trolls a bit -- I'm not saying to just dismiss what people say, but to focus on what improves the project instead of being upset that people criticize stuff. This stuff happens all the time to any project in a "hot" area, as Sean said. I don't think there's anyone that wants to stop adding features to streaming for example, or stop listening to users, etc, or who thinks the project is already perfect (I certainly spend much of my time looking at how to improve it).

Just to comment on a few things:

On Oct 7, 2016, at 9:16 AM, Holden Karau <[hidden email]> wrote:

First off, thanks Cody for taking the time to put together these proposals - I think it has kicked off some wonderful discussion.

I think dismissing people's complaints about Spark as largely trolling does us a disservice. It’s important for us to recognize our own shortcomings - otherwise we are blind to the weak spots where we need to improve and instead focus on new features. Parts of the Python community seem to be actively looking for alternatives, and I’d obviously like Spark to continue to be the place where we come together and collaborate from different languages.

I’d be more than happy to do a review of the outstanding Python PRs (I’ve been keeping on top of the new ones but largely haven’t looked at the older ones) and if there is a committer (maybe Davies or Sean?) who would be able to help out with merging them once they are ready that would be awesome. I’m at PyData DC this weekend but I’ll also start going through some of the older Python JIRAs and seeing if they are still relevant, already fixed, or something we are unlikely to be interested in bringing into Spark.

It would be great to also hear why people are looking for other stuff at a high level -- are there just many small issues in Python, or are there some bigger things missing? For example, one thing I'd like to see is easy installation of PySpark using pip install pyspark. Another idea would be making startup time and initialization easy enough that people use Spark regularly on a single machine, as a replacement for multiprocessing.

- Design.
Yes, design by committee doesn't work.  The best designs are when a
person who understands the problem builds something that works for
them, shares with others, and most importantly iterates when it
doesn't work for others.  This iteration only works if you're willing
to change interfaces, but committer and user goals are not aligned
here.  Users want something that is clearly documented and helps them
get their job done.  Committers (not all) want to minimize interface
change, even at the expense of users being able to do their jobs.  In
this situation, it is critical that you understand early what users
need to be able to do.  This is what the improvement proposal process
should focus on: Goals, non-goals, possible solutions, rejected
solutions.  Not class-level design.  Most importantly, it needs a
clear, unambiguous outcome that is visible to the public.

Love the idea of a more visible "Spark Improvement Proposal" process that solicits user input on new APIs. For what it's worth, I don't think committers are trying to minimize their own work -- every committer cares about making the software useful for users. However, it is always hard to get user input and so it helps to have this kind of process. I've certainly looked at the *IPs a lot in other software I use just to see the biggest things on the roadmap.

When you're talking about "changing interfaces", are you talking about public or internal APIs? I do think many people hate changing public APIs and I actually think that's for the best of the project. That's a technical debate, but basically, the worst thing when you're using a piece of software is that the developers constantly ask you to rewrite your app to update to a new version (and thus benefit from bug fixes, etc). Cue anyone who's used Protobuf, or Guava. The "let's get everyone to change their code this release" model works well within a single large company, but doesn't work well for a community, which is why nearly all *very* widely used programming interfaces (I'm talking things like Java standard library, Windows API, etc) almost *never* break backwards compatibility. All this is done within reason though, e.g. we do change things in major releases (2.x, 3.x, etc).

- Trolling
It's not just trolling.  Event time and kafka are technically
important and should not be ignored.  I've been banging this drum for
years.  These concerns haven't been fully heard and understood by
committers.  This is one example of why diversity of enfranchised users
is important and governance concerns shouldn't be ignored.

I agree about empowering people interested here to contribute, but I'm wondering, do you think there are technical things that people don't want to work on, or is it a matter of what there's been time to do? Everyone I know does want great Kafka support, event time, etc, it's just a question of working out the details and of course of getting the coding done. This is also an area where I'd love to see more contributions -- in the past, people have done similar-scale contributions in other areas (e.g. better integration with Hive, on-the-wire encryption, etc).

FWIW, I think there are three things going on with streaming.

1) Structured Streaming, which is meant to provide a much higher-level new API. This was meant from the beginning to include event time, various complex forms of windows, and great data source and sink support in a unified framework. It's also, IMHO, much simpler than most existing APIs for this stuff (i.e. look at the number of concepts you have to learn for those versus for this). However, this project is still very early on -- only the bare minimum API came out in 2.0. It's marked as alpha and it's precisely the type of system where I'd expect the API to improve in response to feedback. As with other APIs, such as Spark SQL's SchemaRDD and DataFrame, I think it's good to get it in front of *users* quickly and receive feedback -- even developers discussing among themselves can't anticipate all user needs.

2) Adding things in Spark Streaming. I haven't personally worked much on this lately, but it is a very reasonable thing that I'd love to see the project do to help current users. For example, consider adding an aggregate-by-event-time operator to Spark Streaming (it can be done using mapWithState), or a sessionization operator, etc.

3) Another thing that I think is possible is just lowering the latency of both Spark Streaming and Structured Streaming by 10x -- a few folks at Berkeley have been working on this (https://spark-summit.org/2016/events/low-latency-execution-for-apache-spark/). Happy to fork off a thread about how to do it. Their current system requires some new concepts in the Spark scheduler, but from measuring stuff it also seems that you can get somewhere with less intensive changes (most of the overhead is in RPCs, not in the scheduling logic or task execution).

- Jira
Concretely, automate closing stale jiras after X amount of time.  It's
really surprising to me how much reluctance a community of programmers
have shown towards automating their own processes around stuff like
this (not to mention automatic code formatting of modified files).  I
understand the arguments against, but the current alternative doesn't
work.
Concretely, clearly reject and close jiras.  I have a backlog of 50+
kafka jiras, many of which are irrelevant at this point, but I do not
feel that I have the political power to close them.
Concretely, make it clear who is working on something.  This can be as
simple as just "I'm working on this", assign it to me, if I don't
follow up in X amount of time, close it or reassign.  That doesn't
mean there can't be competing work, but it does mean those people
should talk to each other.  Conversely, if committers currently don't
have time to work on something that is important, make that clear in
the ticket.

Definitely agree with marking who's working on something early on, and timing it out if inactive. For closing JIRAs, I think the best way I've seen is for people to go through them once in a while. Automated closing is too impersonal IMO -- if I opened a JIRA on a project, nobody looked at it, and then it was closed automatically, I'd actively feel ignored. If you do that, you'll see people on stage saying "I reported a bug for Spark and some bot just closed it after 3 months", which is not ideal.

Matei




On Fri, Oct 7, 2016 at 5:34 AM, Sean Owen <[hidden email]> wrote:
> Suggested actions way at the bottom.
>
> On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia <[hidden email]>
> wrote:
>>
>> since March. But it's true that other things such as the Kafka source for
>> it didn't have as much design on JIRA. Nonetheless, this component is still
>> early on and there's still a lot of time to change it, which is happening.
>
>
> It's hard to drive design discussions in OSS. Even when diligently
> publishing design docs, the doc happens after brainstorming, and that
> happens inside someone's head or in chats.
>
> The lazy consensus model that works for small changes doesn't work well
> here. If a committer wants a change, that change will basically be made
> modulo small edits; vetoes are for dire disagreement. (Otherwise we'd get
> nothing done.) However this model means it's hard to significantly change a
> design after draft 1.
>
> I've heard this complaint a few times, and it has never been down to bad
> faith. We should err further towards over-including early and often. I've
> seen some great discussions start more with a problem statement and an RFC,
> not a design doc. Keeping regular contributors enfranchised is essential, so
> that they're willing and able to participate when design time comes. (See
> below.)
>
>
>>
>> 2) About what people say at Reactive Summit -- there will always be
>> trolls, but just ignore them and build a great project. Those of us involved
>> in the project for a while have long seen similar stuff, e.g. a
>
>
> The hype cycle may be turning against Spark, as is normal for this stage of
> maturity. People idealize technologies they don't really use as greener
> grass; it's the things they use and need to work that they love to hate.
>
> I would not dismiss this as just trolling. Customer anecdotes I see suggest
> that Spark underperforms their (inflated) expectations, and generally does
> not Just Work. It takes expertise, tuning, patience, workarounds. And then
> it gets great things done. I do see a gap between how the group here talks
> about the technology, and how the users I see talk about it. The gap
> manifests in attention given to making yet more things, and attention given
> to fixing and project mechanics.
>
> I would also not dismiss criticism of governance. We can recognize some big
> problems that were resolved over even the past 3 months. Usually I hear,
> well, we do better than most projects, right? and that is true. But, Spark
> is bigger and busier than most any other project. Exceptional projects need
> exceptional governance and we have merely "good". See next.
>
>
>> 3) About number and diversity of committers -- the PMC is always working
>> to expand these, and you should email people on the PMC (or even the whole
>> list) if you have people you'd like to propose. In
>
>
> If you're suggesting that it's mostly a matter of asking, then this doesn't
> match my experience. I have seen a few people consistently soft-reject most
> proposals. The reasons given usually sound like "concerns about quality",
> which is probably the right answer to a somewhat wrong question.
>
> We should probably be asking primarily who will net-net add efficiency to
> some part of the project's mechanics. Per above, it wouldn't hurt to ask who
> would expand coverage and add diversity of perspective too.
>
> I disagree that committers are being added at a sufficient rate. The overall
> committer-attention hours is dropping as the project grows -- am I the only
> one that perceives many regular committers aren't working nearly as much as
> before on the project?
>
> I call it a problem because we have IMHO people who 'qualify', and not
> giving them some stake is going to cost the project down the road. Always Be
> Recruiting. This is what I would worry about, since the governance and
> enfranchisement issues above kind of stem from this.
>
>
>>
>> 4) Finally, about better organizing JIRA, marking dead issues, etc, this
>> would be great and I think we just need a concrete proposal for how to do
>> it. It would be best to point to an existing process that someone else has
>> used here BTW so that we can see it in action.
>
>
> I don't think we're wanting for proposals. I went on and on about it last
> year, and don't think anyone disagreed about actions. I wouldn't suggest
> that clearing out dead issues is more complex than just putting in time to
> do it. It's just grunt work and understandably not appealing. (Thank you
> Xiao for your recent run at SQL JIRAs.)
>
> It requires saying 'no', which is hard, because it requires some conviction.
> I have encountered reluctance to do this in Spark and think that culture
> should change. Is it weird to say that a broader group of gatekeepers can
> actually with more confidence and efficiency tackle the triage issue? that
> pushing back on 'bad' contribution actually increases the rate of 'good'?
>
> FWIW I also find the project unpleasant to deal with day to day, mostly
> because of the scale of the triage, and think we could use all the qualified
> help we can get. I am looking to do less with the project over time, which
> is no big deal in itself, but is a big deal if these several factors are
> adding up to discourage fresh blood from joining the fray. Cody makes me
> think there are, at least, 2 of us.
>
> Concrete steps?
>
> Go to spark-prs.com. Look at "Users". Look at your open PRs. Are any stale?
> can you close them or advance them?
>
> Look at the Stale PRs tab and sort by last updated. Do any look dead? can
> you ask the author to update or close? does the parent JIRA look like it's
> not otherwise relevant?
>
> Go download JIRA Client at http://almworks.com/jiraclient/download.html Go
> look at all open JIRAs sorted by last update. Are any pretty obviously
> obsolete?
>
> If you don't feel comfortable acting, feel free to at least propose a list
> to dev@ for a look.

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]




--
Cell : 425-233-8271


Re: Spark Improvement Proposals

rxin
In reply to this post by Matei Zaharia
I called Cody last night and talked about some of the topics in his email. It became clear to me Cody genuinely cares about the project.

Some of the frustrations come from the success of the project itself becoming very "hot", and it is difficult to get clarity from people who don't dedicate all their time to Spark. In fact, it is in some ways similar to scaling an engineering team in a successful startup: old processes that worked well might not work so well when it gets to a certain size, cultures can get diluted, building culture vs building process, etc.

I also really like the idea of a more visible process for larger changes, especially major user-facing API changes. Historically we have uploaded design docs for major changes, but this is not always done consistently, and it is difficult to ensure the quality of the docs, due to the volunteer nature of the organization.

Some of the more concrete ideas we discussed focus on building a culture to improve clarity:

- Process: Large changes should have design docs posted on JIRA. One thing Cody and I didn't discuss, but an idea that just came to me, is that we should create a design doc template for the project and ask everybody to follow it. The design doc template should also explicitly list goals and non-goals, to make design docs more consistent.

- Process: Email dev@ to solicit feedback. We have done this with some changes, but again very inconsistently. Just posting something on JIRA isn't sufficient, because there are simply too many JIRAs and the signal gets lost in the noise. While this is generally impossible to enforce because we can't force all volunteers to conform to a process (or they might not even be aware of this), those who are more familiar with the project can help by emailing dev@ when they see something that hasn't been posted there.

- Culture: The design doc author(s) should be open to feedback. A design doc should serve as the base for discussion and is by no means the final design. Of course, this does not mean the author has to accept every piece of feedback. They should also be comfortable accepting / rejecting ideas on technical grounds.

- Process / Culture: For major ongoing projects, it can be useful to have some monthly Google hangouts that are open to the world. I am actually not sure how well this will work, because of the volunteer nature of the project and the need to adjust for timezones for people across the globe, but it seems worth trying.

- Culture: Contributors (including committers) should be more direct in setting expectations, including whether they are working on a specific issue, whether they will be working on a specific issue, and whether an issue or pr or jira should be rejected. Most people I know in this community are nice and don't enjoy telling other people no, but it is often more annoying to a contributor to not know anything than getting a no.


On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia <[hidden email]> wrote:

Love the idea of a more visible "Spark Improvement Proposal" process that solicits user input on new APIs. For what it's worth, I don't think committers are trying to minimize their own work -- every committer cares about making the software useful for users. However, it is always hard to get user input and so it helps to have this kind of process. I've certainly looked at the *IPs a lot in other software I use just to see the biggest things on the roadmap.

When you're talking about "changing interfaces", are you talking about public or internal APIs? I do think many people hate changing public APIs and I actually think that's for the best of the project. That's a technical debate, but basically, the worst thing when you're using a piece of software is that the developers constantly ask you to rewrite your app to update to a new version (and thus benefit from bug fixes, etc). Cue anyone who's used Protobuf, or Guava. The "let's get everyone to change their code this release" model works well within a single large company, but doesn't work well for a community, which is why nearly all *very* widely used programming interfaces (I'm talking things like Java standard library, Windows API, etc) almost *never* break backwards compatibility. All this is done within reason though, e.g. we do change things in major releases (2.x, 3.x, etc).



Re: Spark Improvement Proposals

Matei Zaharia
Administrator
For the improvement proposals, I think one major point was to make them really visible to users who are not contributors, so we should do more than sending stuff to dev@. One very lightweight idea is to have a new type of JIRA called a SIP and have a link to a filter that shows all such JIRAs from http://spark.apache.org. I also like the idea of SIP and design doc templates (in fact many projects have them).

Matei





Re: Spark Improvement Proposals

Hyukjin Kwon
In reply to this post by Holden Karau
I am glad that I was not the only one thinking this.
I also agree with Holden, Sean and Cody. Everything I wanted to say has already been said.



2016-10-08 1:16 GMT+09:00 Holden Karau <[hidden email]>:
First off, thanks Cody for taking the time to put together these proposals - I think it has kicked off some wonderful discussion.

I think dismissing people's complaints with Spark as largely trolls does us a disservice. It’s important for us to recognize our own shortcomings - otherwise we are blind to the weak spots where we need to improve and instead focus on new features. Parts of the Python community seem to be actively looking for alternatives, and I’d obviously like Spark to continue to be the place where we come together and collaborate from different languages.

I’d be more than happy to do a review of the outstanding Python PRs (I’ve been keeping on top of the new ones but largely haven’t looked at the older ones) and if there is a committer (maybe Davies or Sean?) who would be able to help out with merging them once they are ready that would be awesome. I’m at PyData DC this weekend but I’ll also start going through some of the older Python JIRAs and seeing if they are still relevant, already fixed, or something we are unlikely to be interested in bringing into Spark.

I’m giving a talk later on this month on how to get started contributing to Apache Spark at OSCON London, and when I’ve given this talk before I’ve had to include a fair number of warnings about the challenges that can face a new contributor. I’d love to be able to drop those in future versions :)

P.S.

As one of the non-committers who has been working on Spark for several years (see http://bit.ly/hkspmg ) I have strong feelings around the current process being used for committers - but since I’m not on the PMC (catch-22 style) it's difficult to have any visibility into the process, so someone who does will have to weigh in on that :)


On Fri, Oct 7, 2016 at 8:00 AM, Cody Koeninger <[hidden email]> wrote:
Sean, that was very eloquently put, and I 100% agree.  If I ever meet
you in person, I'll buy you multiple rounds of beverages of your
choice ;)
This is probably reiterating some of what you said in a less clear
manner, but I'll throw more of my 2 cents in.

- Design.
Yes, design by committee doesn't work.  The best designs are when a
person who understands the problem builds something that works for
them, shares with others, and most importantly iterates when it
doesn't work for others.  This iteration only works if you're willing
to change interfaces, but committer and user goals are not aligned
here.  Users want something that is clearly documented and helps them
get their job done.  Committers (not all) want to minimize interface
change, even at the expense of users being able to do their jobs.  In
this situation, it is critical that you understand early what users
need to be able to do.  This is what the improvement proposal process
should focus on: Goals, non-goals, possible solutions, rejected
solutions.  Not class-level design.  Most importantly, it needs a
clear, unambiguous outcome that is visible to the public.

- Trolling
It's not just trolling.  Event time and kafka are technically
important and should not be ignored.  I've been banging this drum for
years.  These concerns haven't been fully heard and understood by
committers.  This is one example of why diversity of enfranchised users
is important and governance concerns shouldn't be ignored.

- Jira
Concretely, automate closing stale jiras after X amount of time.  It's
really surprising to me how much reluctance a community of programmers
have shown towards automating their own processes around stuff like
this (not to mention automatic code formatting of modified files).  I
understand the arguments against, but the current alternative doesn't
work.
Concretely, clearly reject and close jiras.  I have a backlog of 50+
kafka jiras, many of which are irrelevant at this point, but I do not
feel that I have the political power to close them.
Concretely, make it clear who is working on something.  This can be as
simple as just "I'm working on this", assign it to me, if I don't
follow up in X amount of time, close it or reassign.  That doesn't
mean there can't be competing work, but it does mean those people
should talk to each other.  Conversely, if committers currently don't
have time to work on something that is important, make that clear in
the ticket.






--
Cell : 425-233-8271


Re: Spark Improvement Proposals

rxin
In reply to this post by Matei Zaharia
I like the lightweight proposal to add a SIP label.

During Spark 2.0 development, Tom (Graves) and I suggested using the wiki to track the list of major changes, but that never really materialized due to the overhead. Adding a SIP label on major JIRAs and then linking to them prominently on the Spark website makes a lot of sense.


On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia <[hidden email]> wrote:
For the improvement proposals, I think one major point was to make them really visible to users who are not contributors, so we should do more than sending stuff to dev@. One very lightweight idea is to have a new type of JIRA called a SIP and have a link to a filter that shows all such JIRAs from http://spark.apache.org. I also like the idea of SIP and design doc templates (in fact many projects have them).

Matei

On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]> wrote:

I called Cody last night and talked about some of the topics in his email. It became clear to me Cody genuinely cares about the project.

Some of the frustrations come from the success of the project itself becoming very "hot", and it is difficult to get clarity from people who don't dedicate all their time to Spark. In fact, it is in some ways similar to scaling an engineering team in a successful startup: old processes that worked well might not work so well when it gets to a certain size, cultures can get diluted, building culture vs building process, etc.

I also really like to have a more visible process for larger changes, especially major user facing API changes. Historically we upload design docs for major changes, but it is not always consistent and difficult to quality of the docs, due to the volunteering nature of the organization.

Some of the more concrete ideas we discussed focus on building a culture to improve clarity:

- Process: Large changes should have design docs posted on JIRA. One thing Cody and I didn't discuss but an idea that just came to me is we should create a design doc template for the project and ask everybody to follow. The design doc template should also explicitly list goals and non-goals, to make design doc more consistent.

- Process: Email dev@ to solicit feedback. We have some this with some changes, but again very inconsistent. Just posting something on JIRA isn't sufficient, because there are simply too many JIRAs and the signal get lost in the noise. While this is generally impossible to enforce because we can't force all volunteers to conform to a process (or they might not even be aware of this),  those who are more familiar with the project can help by emailing the dev@ when they see something that hasn't been.

- Culture: The design doc author(s) should be open to feedback. A design doc should serve as the base for discussion and is by no means the final design. Of course, this does not mean the author has to accept every feedback. They should also be comfortable accepting / rejecting ideas on technical grounds.

- Process / Culture: For major ongoing projects, it can be useful to have some monthly Google hangouts that are open to the world. I am actually not sure how well this will work, because of the volunteering nature and we need to adjust for timezones for people across the globe, but it seems worth trying.

- Culture: Contributors (including committers) should be more direct in setting expectations, including whether they are working on a specific issue, whether they will be working on a specific issue, and whether an issue or PR or JIRA should be rejected. Most people I know in this community are nice and don't enjoy telling other people no, but it is often more annoying to a contributor to not know anything than to get a no.


On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia <[hidden email]> wrote:

Love the idea of a more visible "Spark Improvement Proposal" process that solicits user input on new APIs. For what it's worth, I don't think committers are trying to minimize their own work -- every committer cares about making the software useful for users. However, it is always hard to get user input and so it helps to have this kind of process. I've certainly looked at the *IPs a lot in other software I use just to see the biggest things on the roadmap.

When you're talking about "changing interfaces", are you talking about public or internal APIs? I do think many people hate changing public APIs and I actually think that's for the best of the project. That's a technical debate, but basically, the worst thing when you're using a piece of software is that the developers constantly ask you to rewrite your app to update to a new version (and thus benefit from bug fixes, etc). Cue anyone who's used Protobuf, or Guava. The "let's get everyone to change their code this release" model works well within a single large company, but doesn't work well for a community, which is why nearly all *very* widely used programming interfaces (I'm talking things like Java standard library, Windows API, etc) almost *never* break backwards compatibility. All this is done within reason though, e.g. we do change things in major releases (2.x, 3.x, etc).





Re: Spark Improvement Proposals

nsalian
This post has NOT been accepted by the mailing list yet.
All solid points by folks on this thread.
FWIW, I do see the Improvement Proposals path working well at projects like Flink and Kafka.
Neelesh S. Salian  
Cloudera

Re: Spark Improvement Proposals

Cody Koeninger-2
In reply to this post by rxin
+1 to adding an SIP label and linking it from the website.  I think it needs:

- a template that focuses it towards soliciting user goals / non-goals
- a clear resolution as to which strategy was chosen to pursue.  I'd
recommend a vote.

Matei asked me to clarify what I meant by changing interfaces.  I think
it's directly relevant to the SIP idea, so I'll clarify here and split
off a thread for the other discussion per Nicholas' request.

I meant changing public user interfaces.  I think the first design is
unlikely to be right, because it's done at a time when you have the
least information.  As a user, I find it considerably more frustrating
to be unable to use a tool to get my job done, than I do having to
make minor changes to my code in order to take advantage of features.
I've seen committers be seriously reluctant to allow changes to
@experimental code that are needed in order for it to really work
right.  You need to be able to iterate, and if people on both sides of
the fence aren't going to respect that some newer APIs are subject to
change, then why even mark them as such?

Ideally a finished SIP should give me a checklist of things that an
implementation must do, and things that it doesn't need to do.
Contributors/committers should be seriously discouraged from putting
out a version 0.1 that doesn't have at least a prototype
implementation of all those things, especially if they're then going
to argue against interface changes necessary to get the rest of
the things done in the 0.2 version.


On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden email]> wrote:

> I like the lightweight proposal to add a SIP label.
>
> During Spark 2.0 development, Tom (Graves) and I suggested using wiki to
> track the list of major changes, but that never really materialized due to
> the overhead. Adding a SIP label on major JIRAs and then linking to them
> prominently on the Spark website makes a lot of sense.
>
>
> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia <[hidden email]>
> wrote:
>>
>> For the improvement proposals, I think one major point was to make them
>> really visible to users who are not contributors, so we should do more than
>> sending stuff to dev@. One very lightweight idea is to have a new type of
>> JIRA called a SIP and have a link to a filter that shows all such JIRAs from
>> http://spark.apache.org. I also like the idea of SIP and design doc
>> templates (in fact many projects have them).
>>
>> Matei
>>
>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]> wrote:
>>
>> I called Cody last night and talked about some of the topics in his email.
>> It became clear to me Cody genuinely cares about the project.

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Spark Improvement Proposals

Cody Koeninger-2
Yeah, in case it wasn't clear, I was talking about SIPs for major user-facing or cross-cutting changes, not minor feature adds.

On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos <[hidden email]> wrote:
+1 to the SIP label as long as it does not slow things down and it targets optimizing efforts, coordination etc. For example, really small features (assuming they don't touch public interfaces) or refactorings should not need to go through this process, and I hope it will be kept this way. So a guideline doc should be provided, as in the KIP case.

IMHO, aside from tagging things and linking them elsewhere, simply having design docs and prototype implementations in PRs is not something that has failed to work so far. What is really a pain in many projects out there is discontinuity in the progress of PRs, missing features, and slow reviews, which is understandable to some extent. It is not only about Spark, but things can surely be improved for this project in particular, as already stated.





--
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p: +30 6977967274


Re: Spark Improvement Proposals

rxin
Alright, looks like there is quite a bit of support. We should wait to hear from more people too.

To push this forward, Cody and I will be working together in the next couple of weeks to come up with a concrete, detailed proposal on what this entails, and then we can discuss the specific proposal as well.




Re: Spark Improvement Proposals

Matei Zaharia
Administrator
In reply to this post by Cody Koeninger-2
Sounds good. Just to comment on the compatibility part:

> I meant changing public user interfaces.  I think the first design is
> unlikely to be right, because it's done at a time when you have the
> least information.  As a user, I find it considerably more frustrating
> to be unable to use a tool to get my job done, than I do having to
> make minor changes to my code in order to take advantage of features.
> I've seen committers be seriously reluctant to allow changes to
> @experimental code that are needed in order for it to really work
> right.  You need to be able to iterate, and if people on both sides of
> the fence aren't going to respect that some newer apis are subject to
> change, then why even mark them as such?
>
> Ideally a finished SIP should give me a checklist of things that an
> implementation must do, and things that it doesn't need to do.
> Contributors/committers should be seriously discouraged from putting
> out a version 0.1 that doesn't have at least a prototype
> implementation of all those things, especially if they're then going
> to argue against interface changes necessary to get the rest of
> the things done in the 0.2 version.

Experimental APIs and alpha components are indeed supposed to be changeable (https://cwiki.apache.org/confluence/display/SPARK/Spark+Versioning+Policy). Maybe people are being too conservative in some cases, but I do want to note that regardless of what precise policy we try to write down, this type of issue will ultimately be a judgment call. Is it worth making a small cosmetic change in an API that's marked experimental, but has been used widely for a year? Perhaps not. Is it worth making it in something one month old, or even in an older API as we move to 2.0? Maybe yes. I think we should just discuss each one (start an email thread if resolving it on JIRA is too complex) and perhaps be more religious about making things non-experimental when we think they're done.

Matei


>
>
> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden email]> wrote:
>> I like the lightweight proposal to add a SIP label.
>>
>> During Spark 2.0 development, Tom (Graves) and I suggested using wiki to
>> track the list of major changes, but that never really materialized due to
>> the overhead. Adding a SIP label on major JIRAs and then link to them
>> prominently on the Spark website makes a lot of sense.
>>
>>
>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia <[hidden email]>
>> wrote:
>>>
>>> For the improvement proposals, I think one major point was to make them
>>> really visible to users who are not contributors, so we should do more than
>>> sending stuff to dev@. One very lightweight idea is to have a new type of
>>> JIRA called a SIP and have a link to a filter that shows all such JIRAs from
>>> http://spark.apache.org. I also like the idea of SIP and design doc
>>> templates (in fact many projects have them).
>>>
>>> Matei
>>>
>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]> wrote:
>>>
>>> I called Cody last night and talked about some of the topics in his email.
>>> It became clear to me Cody genuinely cares about the project.
>>>
>>> Some of the frustrations come from the success of the project itself
>>> becoming very "hot", and it is difficult to get clarity from people who
>>> don't dedicate all their time to Spark. In fact, it is in some ways similar
>>> to scaling an engineering team in a successful startup: old processes that
>>> worked well might not work so well when it gets to a certain size, cultures
>>> can get diluted, building culture vs building process, etc.
>>>
>>> I also really like to have a more visible process for larger changes,
>>> especially major user facing API changes. Historically we upload design docs
>>> for major changes, but it is not always consistent and difficult to quality
>>> of the docs, due to the volunteering nature of the organization.
>>>
>>> Some of the more concrete ideas we discussed focus on building a culture
>>> to improve clarity:
>>>
>>> - Process: Large changes should have design docs posted on JIRA. One thing
>>> Cody and I didn't discuss but an idea that just came to me is we should
>>> create a design doc template for the project and ask everybody to follow.
>>> The design doc template should also explicitly list goals and non-goals, to
>>> make design doc more consistent.
>>>
>>> - Process: Email dev@ to solicit feedback. We have some this with some
>>> changes, but again very inconsistent. Just posting something on JIRA isn't
>>> sufficient, because there are simply too many JIRAs and the signal get lost
>>> in the noise. While this is generally impossible to enforce because we can't
>>> force all volunteers to conform to a process (or they might not even be
>>> aware of this),  those who are more familiar with the project can help by
>>> emailing the dev@ when they see something that hasn't been.
>>>
>>> - Culture: The design doc author(s) should be open to feedback. A design
>>> doc should serve as the base for discussion and is by no means the final
>>> design. Of course, this does not mean the author has to accept every
>>> piece of feedback. They should also be comfortable accepting / rejecting
>>> ideas on technical grounds.
>>>
>>> - Process / Culture: For major ongoing projects, it can be useful to have
>>> some monthly Google hangouts that are open to the world. I am actually not
>>> sure how well this will work, because of the volunteering nature and we need
>>> to adjust for timezones for people across the globe, but it seems worth
>>> trying.
>>>
>>> - Culture: Contributors (including committers) should be more direct in
>>> setting expectations, including whether they are working on a specific
>>> issue, whether they will be working on a specific issue, and whether an
>>> issue, PR, or JIRA should be rejected. Most people I know in this community
>>> are nice and don't enjoy telling other people no, but it is often more
>>> annoying to a contributor to not know anything than to get a no.
>>>
>>>
>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia <[hidden email]>
>>> wrote:
>>>>
>>>>
>>>> Love the idea of a more visible "Spark Improvement Proposal" process that
>>>> solicits user input on new APIs. For what it's worth, I don't think
>>>> committers are trying to minimize their own work -- every committer cares
>>>> about making the software useful for users. However, it is always hard to
>>>> get user input and so it helps to have this kind of process. I've certainly
>>>> looked at the *IPs a lot in other software I use just to see the biggest
>>>> things on the roadmap.
>>>>
>>>> When you're talking about "changing interfaces", are you talking about
>>>> public or internal APIs? I do think many people hate changing public APIs
>>>> and I actually think that's for the best of the project. That's a technical
>>>> debate, but basically, the worst thing when you're using a piece of software
>>>> is that the developers constantly ask you to rewrite your app to update to a
>>>> new version (and thus benefit from bug fixes, etc). Cue anyone who's used
>>>> Protobuf, or Guava. The "let's get everyone to change their code this
>>>> release" model works well within a single large company, but doesn't work
>>>> well for a community, which is why nearly all *very* widely used programming
>>>> interfaces (I'm talking things like Java standard library, Windows API, etc)
>>>> almost *never* break backwards compatibility. All this is done within reason
>>>> though, e.g. we do change things in major releases (2.x, 3.x, etc).
>>>
>>>
>>>
>>>
>>



Re: Spark Improvement Proposals

vaquarkhan

+1 for SIP labels, waiting for Reynold's detailed proposal.

Regards,
Vaquar khan


On 8 Oct 2016 16:22, "Matei Zaharia" <[hidden email]> wrote:
Sounds good. Just to comment on the compatibility part:

> I meant changing public user interfaces.  I think the first design is
> unlikely to be right, because it's done at a time when you have the
> least information.  As a user, I find it considerably more frustrating
> to be unable to use a tool to get my job done, than I do having to
> make minor changes to my code in order to take advantage of features.
> I've seen committers be seriously reluctant to allow changes to
> @experimental code that are needed in order for it to really work
> right.  You need to be able to iterate, and if people on both sides of
> the fence aren't going to respect that some newer apis are subject to
> change, then why even mark them as such?
>
> Ideally a finished SIP should give me a checklist of things that an
> implementation must do, and things that it doesn't need to do.
> Contributors/committers should be seriously discouraged from putting
> out a version 0.1 that doesn't have at least a prototype
> implementation of all those things, especially if they're then going
> to argue against interface changes necessary to get the rest of
> the things done in the 0.2 version.

Experimental APIs and alpha components are indeed supposed to be changeable (https://cwiki.apache.org/confluence/display/SPARK/Spark+Versioning+Policy). Maybe people are being too conservative in some cases, but I do want to note that regardless of what precise policy we try to write down, this type of issue will ultimately be a judgment call. Is it worth making a small cosmetic change in an API that's marked experimental, but has been used widely for a year? Perhaps not. Is it worth making it in something one month old, or even in an older API as we move to 2.0? Maybe yes. I think we should just discuss each one (start an email thread if resolving it on JIRA is too complex) and perhaps be more religious about making things non-experimental when we think they're done.

Matei





Re: Spark Improvement Proposals

Cody Koeninger-2
In reply to this post by rxin
Here's my specific proposal (meta-proposal?)

Spark Improvement Proposals (SIP)


Background:

The current problem is that design and implementation of large features are often done in private, before soliciting user feedback.

When feedback is solicited, it often concerns detailed design specifics rather than goals.

When implementation does take place after design, there is often disagreement as to what goals are or are not in scope.

This results in commits that don't fully meet user needs.


Goals:

- Ensure user, contributor, and committer goals are clearly identified and agreed upon, before implementation takes place.

- Ensure that a technically feasible strategy is chosen that is likely to meet the goals.


Rejected Goals:

- SIPs are not for detailed design.  Design by committee doesn't work.

- SIPs are not for every change.  We don't need that much process.


Strategy:

My suggestion is outlined as a Spark Improvement Proposal process documented at

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

Specifics of Jira manipulation are an implementation detail we can figure out.

I'm suggesting voting; the need here is for a _clear_ outcome.


Rejected Strategies:

Having someone who understands the problem implement it first works, but only if significant iteration after user feedback is allowed.

Historically this has been problematic due to pressure to limit public api changes.


On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <[hidden email]> wrote:
Alright, looks like there is quite a bit of support. We should wait to hear from more people too.

To push this forward, Cody and I will be working together in the next couple of weeks to come up with a concrete, detailed proposal on what this entails, and then we can discuss the specific proposal as well.


On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <[hidden email]> wrote:
Yeah, in case it wasn't clear, I was talking about SIPs for major user-facing or cross-cutting changes, not minor feature adds.

On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos <[hidden email]> wrote:
+1 to the SIP label, as long as it does not slow things down and it targets optimizing efforts, coordination, etc. For example, really small features (assuming they don't touch public interfaces) or refactorings should not need to go through this process, and I hope it will be kept this way. So a guideline doc should be provided, as in the KIP case.

IMHO, aside from tagging things and linking them elsewhere, simply having design docs and prototype implementations in PRs is not something that has failed so far. What is really a pain in many projects out there is discontinuity in the progress of PRs, missing features, and slow reviews, which is understandable to some extent. This is not only about Spark, but things can surely be improved for this project in particular, as already stated.

On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <[hidden email]> wrote:
+1 to adding an SIP label and linking it from the website.  I think it needs

- a template that focuses it towards soliciting user goals / non-goals
- a clear resolution as to which strategy was chosen to pursue.  I'd
recommend a vote.

Matei asked me to clarify what I meant by changing interfaces, I think
it's directly relevant to the SIP idea so I'll clarify here, and split
a thread for the other discussion per Nicholas' request.

I meant changing public user interfaces.  I think the first design is
unlikely to be right, because it's done at a time when you have the
least information.  As a user, I find it considerably more frustrating
to be unable to use a tool to get my job done, than I do having to
make minor changes to my code in order to take advantage of features.
I've seen committers be seriously reluctant to allow changes to
@experimental code that are needed in order for it to really work
right.  You need to be able to iterate, and if people on both sides of
the fence aren't going to respect that some newer apis are subject to
change, then why even mark them as such?

Ideally a finished SIP should give me a checklist of things that an
implementation must do, and things that it doesn't need to do.
Contributors/committers should be seriously discouraged from putting
out a version 0.1 that doesn't have at least a prototype
implementation of all those things, especially if they're then going
to argue against interface changes necessary to get the rest of
the things done in the 0.2 version.


On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden email]> wrote:
> I like the lightweight proposal to add a SIP label.
>
> During Spark 2.0 development, Tom (Graves) and I suggested using the wiki to
> track the list of major changes, but that never really materialized due to
> the overhead. Adding a SIP label on major JIRAs and then linking to them
> prominently on the Spark website makes a lot of sense.
>
>
> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia <[hidden email]>
> wrote:
>>
>> For the improvement proposals, I think one major point was to make them
>> really visible to users who are not contributors, so we should do more than
>> sending stuff to dev@. One very lightweight idea is to have a new type of
>> JIRA called a SIP and have a link to a filter that shows all such JIRAs from
>> http://spark.apache.org. I also like the idea of SIP and design doc
>> templates (in fact many projects have them).
>>
>> Matei
>>
>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]> wrote:
>>
>> I called Cody last night and talked about some of the topics in his email.
>> It became clear to me Cody genuinely cares about the project.
>>
>> Some of the frustrations come from the success of the project itself
>> becoming very "hot", and it is difficult to get clarity from people who
>> don't dedicate all their time to Spark. In fact, it is in some ways similar
>> to scaling an engineering team in a successful startup: old processes that
>> worked well might not work so well when it gets to a certain size, cultures
>> can get diluted, building culture vs building process, etc.
>>
>> I would also really like to have a more visible process for larger changes,
>> especially major user-facing API changes. Historically we upload design docs
>> for major changes, but not always consistently, and it is difficult to
>> control the quality of the docs, due to the volunteer nature of the
>> organization.
>>
>> Some of the more concrete ideas we discussed focus on building a culture
>> to improve clarity:
>>
>> - Process: Large changes should have design docs posted on JIRA. One thing
>> Cody and I didn't discuss, but an idea that just came to me, is that we
>> should create a design doc template for the project and ask everybody to
>> follow it. The design doc template should also explicitly list goals and
>> non-goals, to make design docs more consistent.
>>
>> - Process: Email dev@ to solicit feedback. We have done this with some
>> changes, but again very inconsistently. Just posting something on JIRA isn't
>> sufficient, because there are simply too many JIRAs and the signal gets lost
>> in the noise. While this is generally impossible to enforce because we can't
>> force all volunteers to conform to a process (or they might not even be
>> aware of it), those who are more familiar with the project can help by
>> emailing dev@ when they see something that hasn't been sent there.
>>
>> - Culture: The design doc author(s) should be open to feedback. A design
>> doc should serve as the base for discussion and is by no means the final
>> design. Of course, this does not mean the author has to accept every
>> piece of feedback. They should also be comfortable accepting / rejecting
>> ideas on technical grounds.
>>
>> - Process / Culture: For major ongoing projects, it can be useful to have
>> some monthly Google hangouts that are open to the world. I am actually not
>> sure how well this will work, because of the volunteering nature and we need
>> to adjust for timezones for people across the globe, but it seems worth
>> trying.
>>
>> - Culture: Contributors (including committers) should be more direct in
>> setting expectations, including whether they are working on a specific
>> issue, whether they will be working on a specific issue, and whether an
>> issue, PR, or JIRA should be rejected. Most people I know in this community
>> are nice and don't enjoy telling other people no, but it is often more
>> annoying to a contributor to not know anything than to get a no.
>>
>>
>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia <[hidden email]>
>> wrote:
>>>
>>>
>>> Love the idea of a more visible "Spark Improvement Proposal" process that
>>> solicits user input on new APIs. For what it's worth, I don't think
>>> committers are trying to minimize their own work -- every committer cares
>>> about making the software useful for users. However, it is always hard to
>>> get user input and so it helps to have this kind of process. I've certainly
>>> looked at the *IPs a lot in other software I use just to see the biggest
>>> things on the roadmap.
>>>
>>> When you're talking about "changing interfaces", are you talking about
>>> public or internal APIs? I do think many people hate changing public APIs
>>> and I actually think that's for the best of the project. That's a technical
>>> debate, but basically, the worst thing when you're using a piece of software
>>> is that the developers constantly ask you to rewrite your app to update to a
>>> new version (and thus benefit from bug fixes, etc). Cue anyone who's used
>>> Protobuf, or Guava. The "let's get everyone to change their code this
>>> release" model works well within a single large company, but doesn't work
>>> well for a community, which is why nearly all *very* widely used programming
>>> interfaces (I'm talking things like Java standard library, Windows API, etc)
>>> almost *never* break backwards compatibility. All this is done within reason
>>> though, e.g. we do change things in major releases (2.x, 3.x, etc).
>>
>>
>>
>>
>





--
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p:  +30 6977967274


