
Spark Improvement Proposals


Re: Spark Improvement Proposals

Ryan Blue
Proposal submission: I think we should keep this as open as possible. If there is a problem with too many open proposals, then we should tackle that as a fix rather than excluding participation. Perhaps it will end up that way, but I think it's worth trying a more open model first.

Majority vs consensus: My rationale is that I don't think we want to consider a proposal approved if it had objections serious enough that committers down-voted (or PMC depending on who gets a vote). If these proposals are like PEPs, then they represent a significant amount of community effort and I wouldn't want to move forward if up to half of the community thinks it's an untenable idea.

rb

On Mon, Oct 10, 2016 at 12:07 PM, Cody Koeninger <[hidden email]> wrote:
I think this is closer to a procedural issue than a code modification
issue, hence why majority.  If everyone thinks consensus is better, I
don't care.  Again, I don't feel strongly about the way we achieve
clarity, just that we achieve clarity.

On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue <[hidden email]> wrote:
> Sorry, I missed that the proposal includes majority approval. Why majority
> instead of consensus? I think we want to build consensus around these
> proposals and it makes sense to discuss until no one would veto.
>
> rb
>
> On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue <[hidden email]> wrote:
>>
>> +1 to votes to approve proposals. I agree that proposals should have an
>> official mechanism to be accepted, and a vote is an established means of
>> doing that well. I like that it includes a period to review the proposal and
>> I think proposals should have been discussed enough ahead of a vote to
>> survive the possibility of a veto.
>>
>> I also like the names that are short and (mostly) unique, like SEP.
>>
>> Where I disagree is with the requirement that a committer must formally
>> propose an enhancement. I don't see the value of restricting this: if
>> someone has the will to write up a proposal then they should be encouraged
>> to do so and start a discussion about it. Even if there is a political
>> reality as Cody says, what is the value of codifying that in our process? I
>> think restricting who can submit proposals would only undermine them by
>> pushing contributors out. Maybe I'm missing something here?
>>
>> rb
>>
>>
>>
>> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <[hidden email]>
>> wrote:
>>>
>>> Yes, users suggesting SIPs is a good thing and is explicitly called
>>> out in the linked document under the Who? section.  Formally proposing
>>> them, not so much, because of the political realities.
>>>
>>> Yes, implementation strategy definitely affects goals.  There are all
>>> kinds of examples of this, I'll pick one that's my fault so as to
>>> avoid sounding like I'm blaming:
>>>
>>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>>> upon by the community) goals was to make sure people could use the
>>> DStream in whatever way they were already using Kafka at work.  The lack
>>> of explicit agreement on that goal led to all kinds of fighting with
>>> committers, that could have been avoided.  The lack of explicit
>>> up-front strategy discussion led to the DStream not really working
>>> with compacted topics.  I knew about compacted topics, but don't have
>>> a use for them, so had a blind spot there.  If there was explicit
>>> up-front discussion that my strategy was "assume that batches can be
>>> defined on the driver solely by beginning and ending offsets", there's
>>> a greater chance that a user would have seen that and said, "hey, what
>>> about non-contiguous offsets in a compacted topic".
>>>
>>> This kind of thing is only going to happen smoothly if we have a
>>> lightweight user-visible process with clear outcomes.
>>>
>>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>> <[hidden email]> wrote:
>>> > I agree with most of what Cody said.
>>> >
>>> > Two things:
>>> >
>>> > First we can always have other people suggest SIPs but mark them as
>>> > “unreviewed” and have committers basically move them forward. The
>>> > problem is
>>> > that writing a good document takes time. This way we can leverage non
>>> > committers to do some of this work (it is just another way to
>>> > contribute).
>>> >
>>> >
>>> >
>>> > As for strategy, in many cases implementation strategy can affect the
>>> > goals.
>>> > I will give  a small example: In the current structured streaming
>>> > strategy,
>>> > we group by the time to achieve a sliding window. This is definitely an
>>> > implementation decision and not a goal. However, I can think of several
>>> > aggregation functions which have the time inside their calculation
>>> > buffer.
>>> > For example, let’s say we want to return a set of all distinct values.
>>> > One
>>> > way to implement this would be to make the set into a map and have the
>>> > value
>>> > contain the last time seen. Multiplying it across the groupby would
>>> > cost a
>>> > lot in performance. So adding such a strategy would have a great effect
>>> > on
>>> > the type of aggregations and their performance which does affect the
>>> > goal.
>>> > Without adding the strategy, it is easy for whoever goes to the design
>>> > document to not think about these cases. Furthermore, it might be
>>> > decided
>>> > that these cases are rare enough so that the strategy is still good
>>> > enough
>>> > but how would we know it without user feedback?
>>> >
>>> > I believe this example is exactly what Cody was talking about. Since
>>> > many
>>> > times implementation strategies have a large effect on the goal, we
>>> > should
>>> > have it discussed when discussing the goals. In addition, while it is
>>> > often
>>> > easy to throw out completely infeasible goals, it is often much harder
>>> > to
>>> > figure out that the goals are unfeasible without fine tuning.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Assaf.
>>> >
>>> >
>>> >
>>> > From: Cody Koeninger-2 [via Apache Spark Developers List]
>>> > [mailto:[hidden email]]
>>> > Sent: Monday, October 10, 2016 2:25 AM
>>> > To: Mendelson, Assaf
>>> > Subject: Re: Spark Improvement Proposals
>>> >
>>> >
>>> >
>>> > Only committers should formally submit SIPs because in an Apache
>>> > project only committers have explicit political power.  If a user can't
>>> > find a committer willing to sponsor an SIP idea, they have no way to
>>> > get the idea passed in any case.  If I can't find a committer to
>>> > sponsor this meta-SIP idea, I'm out of luck.
>>> >
>>> > I do not believe unrealistic goals can be found solely by inspection.
>>> > We've managed to ignore unrealistic goals even after implementation!
>>> > Focusing on APIs can allow people to think they've solved something,
>>> > when there's really no way of implementing that API while meeting the
>>> > goals.  Rapid iteration is clearly the best way to address this, but
>>> > we've already talked about why that hasn't really worked.  If adding a
>>> > non-binding API section to the template is important to you, I'm not
>>> > against it, but I don't think it's sufficient.
>>> >
>>> > On your PRD vs design doc spectrum, I'm saying this is closer to a
>>> > PRD.  Clear agreement on goals is the most important thing and that's
>>> > why it's the thing I want binding agreement on.  But I cannot agree to
>>> > goals unless I have enough minimal technical info to judge whether the
>>> > goals are likely to actually be accomplished.
>>> >
>>> >
>>> >
>>> > On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]> wrote:
>>> >
>>> >
>>> >> Well, I think there are a few things here that don't make sense.
>>> >> First,
>>> >> why
>>> >> should only committers submit SIPs? Development in the project should
>>> >> be
>>> >> open to all contributors, whether they're committers or not. Second, I
>>> >> think
>>> >> unrealistic goals can be found just by inspecting the goals, and I'm
>>> >> not
>>> >> super worried that we'll accept a lot of SIPs that are then infeasible
>>> >> --
>>> >> we
>>> >> can then submit new ones. But this depends on whether you want this
>>> >> process
>>> >> to be a "design doc lite", where people also agree on implementation
>>> >> strategy, or just a way to agree on goals. This is what I asked
>>> >> earlier
>>> >> about PRDs vs design docs (and I'm open to either one but I'd just
>>> >> like
>>> >> clarity). Finally, both as a user and designer of software, I always
>>> >> want
>>> >> to
>>> >> give feedback on APIs, so I'd really like a culture of having those
>>> >> early.
>>> >> People don't argue about prettiness when they discuss APIs, they argue
>>> >> about
>>> >> the core concepts to expose in order to meet various goals, and then
>>> >> they're
>>> >> stuck maintaining those for a long time.
>>> >>
>>> >> Matei
>>> >>
>>> >> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <[hidden email]> wrote:
>>> >>
>>> >> Users instead of people, sure.  Committers and contributors are (or at
>>> >> least
>>> >> should be) a subset of users.
>>> >>
>>> >> Non goals, sure. I don't care what the name is, but we need to clearly
>>> >> say
>>> >> e.g. 'no we are not maintaining compatibility with XYZ right now'.
>>> >>
>>> >> API, what I care most about is whether it allows me to accomplish the
>>> >> goals.
>>> >> Arguing about how ugly or pretty it is can be saved for design/
>>> >> implementation imho.
>>> >>
>>> >> Strategy, this is necessary because otherwise goals can be out of line
>>> >> with
>>> >> reality.  Don't propose goals you don't have at least some idea of how
>>> >> to
>>> >> implement.
>>> >>
>>> >> Rejected strategies, given that committers are the only ones I'm saying
>>> >> should formally submit SPARKLIs or SIPs, if they put junk in a
>>> >> required
>>> >> section then slap them down for it and tell them to fix it.
>>> >>
>>> >>
>>> >> On Oct 9, 2016 4:36 PM, "Matei Zaharia" <[hidden email]> wrote:
>>> >>>
>>> >>> Yup, this is the stuff that I found unclear. Thanks for clarifying
>>> >>> here,
>>> >>> but we should also clarify it in the writeup. In particular:
>>> >>>
>>> >>> - Goals needs to be about user-facing behavior ("people" is broad)
>>> >>>
>>> >>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig
>>> >>> up
>>> >>> one of these and say "Spark's developers have officially rejected X,
>>> >>> which
>>> >>> our awesome system has".
>>> >>>
>>> >>> - For user-facing stuff, I think you need a section on API. Virtually
>>> >>> all
>>> >>> other *IPs I've seen have that.
>>> >>>
>>> >>> - I'm still not sure why the strategy section is needed if the
>>> >>> purpose is
>>> >>> to define user-facing behavior -- unless this is the strategy for
>>> >>> setting
>>> >>> the goals or for defining the API. That sounds squarely like a design
>>> >>> doc
>>> >>> issue. In some sense, who cares whether the proposal is technically
>>> >>> feasible
>>> >>> right now? If it's infeasible, that will be discovered later during
>>> >>> design
>>> >>> and implementation. Same thing with rejected strategies -- listing
>>> >>> some
>>> >>> of
>>> >>> those is definitely useful sometimes, but if you make this a
>>> >>> *required*
>>> >>> section, people are just going to fill it in with bogus stuff (I've
>>> >>> seen
>>> >>> this happen before).
>>> >>>
>>> >>> Matei
>>> >>>
>>> >
>>> >>> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <[hidden email]> wrote:
>>> >>> >
>>> >>> > So to focus the discussion on the specific strategy I'm suggesting,
>>> >>> > documented at
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>> >>> >
>>> >>> > "Goals: What must this allow people to do, that they can't
>>> >>> > currently?"
>>> >>> >
>>> >>> > Is it unclear that this is focusing specifically on people-visible
>>> >>> > behavior?
>>> >>> >
>>> >>> > Rejected goals -  are important because otherwise people keep
>>> >>> > trying
>>> >>> > to argue about scope.  Of course you can change things later with a
>>> >>> > different SIP and different vote, the point is to focus.
>>> >>> >
>>> >>> > Use cases - are something that people are going to bring up in
>>> >>> > discussion.  If they aren't clearly documented as a goal ("This
>>> >>> > must
>>> >>> > allow me to connect using SSL"), they should be added.
>>> >>> >
>>> >>> > Internal architecture - if the people who need specific behavior
>>> >>> > are
>>> >>> > implementers of other parts of the system, that's fine.
>>> >>> >
>>> >>> > Rejected strategies - If you have none of these, you have no
>>> >>> > evidence
>>> >>> > that the proponent didn't just go with the first thing they had in
>>> >>> > mind (or have already implemented), which is a big problem
>>> >>> > currently.
>>> >>> > Approval isn't binding as to specifics of implementation, so these
>>> >>> > aren't handcuffs.  The goals are the contract, the strategy is
>>> >>> > evidence that contract can actually be met.
>>> >>> >
>>> >>> > Design docs - I'm not touching design docs.  The markdown file I
>>> >>> > linked specifically says of the strategy section "This is not a
>>> >>> > full
>>> >>> > design document."  Is this unclear?  Design docs can be worked on
>>> >>> > obviously, but that's not what I'm concerned with here.
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <[hidden email]>
>>> >>> > wrote:
>>> >>> >> Hi Cody,
>>> >>> >>
>>> >>> >> I think this would be a lot more concrete if we had a more
>>> >>> >> detailed
>>> >>> >> template
>>> >>> >> for SIPs. Right now, it's not super clear what's in scope -- e.g.
>>> >>> >> are
>>> >>> >> they
>>> >>> >> a way to solicit feedback on the user-facing behavior or on the
>>> >>> >> internals?
>>> >>> >> "Goals" can cover both things. I've been thinking of SIPs more as
>>> >>> >> Product
>>> >>> >> Requirements Docs (PRDs), which focus on *what* a code change
>>> >>> >> should
>>> >>> >> do
>>> >>> >> as
>>> >>> >> opposed to how.
>>> >>> >>
>>> >>> >> In particular, here are some things that you may or may not
>>> >>> >> consider
>>> >>> >> in
>>> >>> >> scope for SIPs:
>>> >>> >>
>>> >>> >> - Goals and non-goals: This is definitely in scope, and IMO should
>>> >>> >> focus on
>>> >>> >> user-visible behavior (e.g. "system supports SQL window functions"
>>> >>> >> or
>>> >>> >> "system continues working if one node fails"). BTW I wouldn't say
>>> >>> >> "rejected
>>> >>> >> goals" because some of them might become goals later, so we're not
>>> >>> >> definitively rejecting them.
>>> >>> >>
>>> >>> >> - Public API: Probably should be included in most SIPs unless it's
>>> >>> >> too
>>> >>> >> large
>>> >>> >> to fully specify then (e.g. "let's add an ML library").
>>> >>> >>
>>> >>> >> - Use cases: I usually find this very useful in PRDs to better
>>> >>> >> communicate
>>> >>> >> the goals.
>>> >>> >>
>>> >>> >> - Internal architecture: This is usually *not* a thing users can
>>> >>> >> easily
>>> >>> >> comment on and it sounds more like a design doc item. Of course
>>> >>> >> it's
>>> >>> >> important to show that the SIP is feasible to implement. One
>>> >>> >> exception,
>>> >>> >> however, is that I think we'll have some SIPs primarily on
>>> >>> >> internals
>>> >>> >> (e.g.
>>> >>> >> if somebody wants to refactor Spark's query optimizer or
>>> >>> >> something).
>>> >>> >>
>>> >>> >> - Rejected strategies: I personally wouldn't put this, because
>>> >>> >> what's
>>> >>> >> the
>>> >>> >> point of voting to reject a strategy before you've really begun
>>> >>> >> designing
>>> >>> >> and implementing something? What if you discover that the strategy
>>> >>> >> is
>>> >>> >> actually better when you start doing stuff?
>>> >>> >>
>>> >>> >> At a super high level, it depends on whether you want the SIPs to
>>> >>> >> be
>>> >>> >> PRDs
>>> >>> >> for getting some quick feedback on the goals of a feature before
>>> >>> >> it is
>>> >>> >> designed, or something more like full-fledged design docs (just a
>>> >>> >> more
>>> >>> >> visible design doc for bigger changes). I looked at Kafka's KIPs,
>>> >>> >> and
>>> >>> >> they
>>> >>> >> actually seem to be more like design docs. This can work too but
>>> >>> >> it
>>> >>> >> does
>>> >>> >> require more work from the proposer and it can lead to the same
>>> >>> >> problems you
>>> >>> >> mentioned with people already having a design and implementation
>>> >>> >> in
>>> >>> >> mind.
>>> >>> >>
>>> >>> >> Basically, the question is, are you trying to iterate faster on
>>> >>> >> design
>>> >>> >> by
>>> >>> >> adding a step for user feedback earlier? Or are you just trying to
>>> >>> >> make
>>> >>> >> design docs for key features more visible (and their approval more
>>> >>> >> formal)?
>>> >>> >>
>>> >>> >> BTW note that in either case, I'd like to have a template for
>>> >>> >> design
>>> >>> >> docs
>>> >>> >> too, which should also include goals. I think that would've
>>> >>> >> avoided
>>> >>> >> some of
>>> >>> >> the issues you brought up.
>>> >>> >>
>>> >>> >> Matei
>>> >>> >>
>>> >>> >> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <[hidden email]>
>>> >>> >> wrote:
>>> >>> >>
>>> >>> >> Here's my specific proposal (meta-proposal?)
>>> >>> >>
>>> >>> >> Spark Improvement Proposals (SIP)
>>> >>> >>
>>> >>> >>
>>> >>> >> Background:
>>> >>> >>
>>> >>> >> The current problem is that design and implementation of large
>>> >>> >> features
>>> >>> >> are
>>> >>> >> often done in private, before soliciting user feedback.
>>> >>> >>
>>> >>> >> When feedback is solicited, it is often as to detailed design
>>> >>> >> specifics, not
>>> >>> >> focused on goals.
>>> >>> >>
>>> >>> >> When implementation does take place after design, there is often
>>> >>> >> disagreement as to what goals are or are not in scope.
>>> >>> >>
>>> >>> >> This results in commits that don't fully meet user needs.
>>> >>> >>
>>> >>> >>
>>> >>> >> Goals:
>>> >>> >>
>>> >>> >> - Ensure user, contributor, and committer goals are clearly
>>> >>> >> identified
>>> >>> >> and
>>> >>> >> agreed upon, before implementation takes place.
>>> >>> >>
>>> >>> >> - Ensure that a technically feasible strategy is chosen that is
>>> >>> >> likely
>>> >>> >> to
>>> >>> >> meet the goals.
>>> >>> >>
>>> >>> >>
>>> >>> >> Rejected Goals:
>>> >>> >>
>>> >>> >> - SIPs are not for detailed design.  Design by committee doesn't
>>> >>> >> work.
>>> >>> >>
>>> >>> >> - SIPs are not for every change.  We don't need that much process.
>>> >>> >>
>>> >>> >>
>>> >>> >> Strategy:
>>> >>> >>
>>> >>> >> My suggestion is outlined as a Spark Improvement Proposal process
>>> >>> >> documented
>>> >>> >> at
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>> >>> >>
>>> >>> >> Specifics of Jira manipulation are an implementation detail we can
>>> >>> >> figure
>>> >>> >> out.
>>> >>> >>
>>> >>> >> I'm suggesting voting; the need here is for a _clear_ outcome.
>>> >>> >>
>>> >>> >>
>>> >>> >> Rejected Strategies:
>>> >>> >>
>>> >>> >> Having someone who understands the problem implement it first
>>> >>> >> works,
>>> >>> >> but
>>> >>> >> only if significant iteration after user feedback is allowed.
>>> >>> >>
>>> >>> >> Historically this has been problematic due to pressure to limit
>>> >>> >> public
>>> >>> >> api
>>> >>> >> changes.
>>> >>> >>
>>> >>> >>
>>> >>> >> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <[hidden email]>
>>> >>> >> wrote:
>>> >>> >>>
>>> >>> >>> Alright, looks like there is quite a bit of support. We should
>>> >>> >>> wait
>>> >>> >>> to
>>> >>> >>> hear from more people too.
>>> >>> >>>
>>> >>> >>> To push this forward, Cody and I will be working together in the
>>> >>> >>> next
>>> >>> >>> couple of weeks to come up with a concrete, detailed proposal on
>>> >>> >>> what
>>> >>> >>> this
>>> >>> >>> entails, and then we can discuss the specific proposal as
>>> >>> >>> well.
>>> >>> >>>
>>> >>> >>>
>>> >>> >>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <[hidden email]>
>>> >>> >>> wrote:
>>> >>> >>>>
>>> >>> >>>> Yeah, in case it wasn't clear, I was talking about SIPs for
>>> >>> >>>> major
>>> >>> >>>> user-facing or cross-cutting changes, not minor feature adds.
>>> >>> >>>>
>>> >>> >>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos
>>> >>> >>>> <[hidden email]> wrote:
>>> >>> >>>>>
>>> >>> >>>>> +1 to the SIP label as long as it does not slow down things and
>>> >>> >>>>> it
>>> >>> >>>>> targets optimizing efforts, coordination etc. For example
>>> >>> >>>>> really
>>> >>> >>>>> small
>>> >>> >>>>> features should not need to go through this process (assuming
>>> >>> >>>>> they
>>> >>> >>>>> don't
>>> >>> >>>>> touch public interfaces)  or re-factorings and hope it will be
>>> >>> >>>>> kept
>>> >>> >>>>> this
>>> >>> >>>>> way. So as a guideline doc should be provided, like in the KIP
>>> >>> >>>>> case.
>>> >>> >>>>>
>>> >>> >>>>> IMHO so far aside from tagging things and linking them
>>> >>> >>>>> elsewhere
>>> >>> >>>>> simply
>>> >>> >>>>> having design docs and prototype implementations in PRs is not
>>> >>> >>>>> something
>>> >>> >>>>> that has not worked so far. What is really a pain in many
>>> >>> >>>>> projects
>>> >>> >>>>> out there
>>> >>> >>>>> is discontinuity in progress of PRs, missing features, slow
>>> >>> >>>>> reviews
>>> >>> >>>>> which is
>>> >>> >>>>> understandable to some extent... it is not only about Spark but
>>> >>> >>>>> things can
>>> >>> >>>>> be improved for sure for this project in particular as already
>>> >>> >>>>> stated.
>>> >>> >>>>>
>>> >>> >>>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <[hidden
>>> >>> >>>>> email]>
>>> >>> >>>>> wrote:
>>> >>> >>>>>>
>>> >>> >>>>>> +1 to adding an SIP label and linking it from the website.  I
>>> >>> >>>>>> think
>>> >>> >>>>>> it
>>> >>> >>>>>> needs
>>> >>> >>>>>>
>>> >>> >>>>>> - template that focuses it towards soliciting user goals / non
>>> >>> >>>>>> goals
>>> >>> >>>>>> - clear resolution as to which strategy was chosen to pursue.
>>> >>> >>>>>> I'd
>>> >>> >>>>>> recommend a vote.
>>> >>> >>>>>>
>>> >>> >>>>>> Matei asked me to clarify what I meant by changing interfaces,
>>> >>> >>>>>> I
>>> >>> >>>>>> think
>>> >>> >>>>>> it's directly relevant to the SIP idea so I'll clarify here,
>>> >>> >>>>>> and
>>> >>> >>>>>> split
>>> >>> >>>>>> a thread for the other discussion per Nicholas' request.
>>> >>> >>>>>>
>>> >>> >>>>>> I meant changing public user interfaces.  I think the first
>>> >>> >>>>>> design
>>> >>> >>>>>> is
>>> >>> >>>>>> unlikely to be right, because it's done at a time when you
>>> >>> >>>>>> have
>>> >>> >>>>>> the
>>> >>> >>>>>> least information.  As a user, I find it considerably more
>>> >>> >>>>>> frustrating
>>> >>> >>>>>> to be unable to use a tool to get my job done, than I do
>>> >>> >>>>>> having to
>>> >>> >>>>>> make minor changes to my code in order to take advantage of
>>> >>> >>>>>> features.
>>> >>> >>>>>> I've seen committers be seriously reluctant to allow changes
>>> >>> >>>>>> to
>>> >>> >>>>>> @experimental code that are needed in order for it to really
>>> >>> >>>>>> work
>>> >>> >>>>>> right.  You need to be able to iterate, and if people on both
>>> >>> >>>>>> sides
>>> >>> >>>>>> of
>>> >>> >>>>>> the fence aren't going to respect that some newer apis are
>>> >>> >>>>>> subject
>>> >>> >>>>>> to
>>> >>> >>>>>> change, then why even mark them as such?
>>> >>> >>>>>>
>>> >>> >>>>>> Ideally a finished SIP should give me a checklist of things
>>> >>> >>>>>> that
>>> >>> >>>>>> an
>>> >>> >>>>>> implementation must do, and things that it doesn't need to do.
>>> >>> >>>>>> Contributors/committers should be seriously discouraged from
>>> >>> >>>>>> putting
>>> >>> >>>>>> out a version 0.1 that doesn't have at least a prototype
>>> >>> >>>>>> implementation of all those things, especially if they're then
>>> >>> >>>>>> going
>>> >>> >>>>>> to argue against interface changes necessary to get the
>>> >>> >>>>>> rest
>>> >>> >>>>>> of
>>> >>> >>>>>> the things done in the 0.2 version.
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden email]>
>>> >>> >>>>>> wrote:
>>> >>> >>>>>>> I like the lightweight proposal to add a SIP label.
>>> >>> >>>>>>>
>>> >>> >>>>>>> During Spark 2.0 development, Tom (Graves) and I suggested
>>> >>> >>>>>>> using
>>> >>> >>>>>>> wiki
>>> >>> >>>>>>> to
>>> >>> >>>>>>> track the list of major changes, but that never really
>>> >>> >>>>>>> materialized
>>> >>> >>>>>>> due to
>>> >>> >>>>>>> the overhead. Adding a SIP label on major JIRAs and then linking
>>> >>> >>>>>>> to
>>> >>> >>>>>>> them
>>> >>> >>>>>>> prominently on the Spark website makes a lot of sense.
>>> >>> >>>>>>>
>>> >>> >>>>>>>
>>> >>> >>>>>>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia
>>> >>> >>>>>>> <[hidden email]>
>>> >>> >>>>>>> wrote:
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> For the improvement proposals, I think one major point was
>>> >>> >>>>>>>> to
>>> >>> >>>>>>>> make
>>> >>> >>>>>>>> them
>>> >>> >>>>>>>> really visible to users who are not contributors, so we
>>> >>> >>>>>>>> should
>>> >>> >>>>>>>> do
>>> >>> >>>>>>>> more than
>>> >>> >>>>>>>> sending stuff to dev@. One very lightweight idea is to have
>>> >>> >>>>>>>> a
>>> >>> >>>>>>>> new
>>> >>> >>>>>>>> type of
>>> >>> >>>>>>>> JIRA called a SIP and have a link to a filter that shows all
>>> >>> >>>>>>>> such
>>> >>> >>>>>>>> JIRAs from
>>> >>> >>>>>>>> http://spark.apache.org. I also like the idea of SIP and
>>> >>> >>>>>>>> design
>>> >>> >>>>>>>> doc
>>> >>> >>>>>>>> templates (in fact many projects have them).
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> Matei
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]>
>>> >>> >>>>>>>> wrote:
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> I called Cody last night and talked about some of the topics
>>> >>> >>>>>>>> in
>>> >>> >>>>>>>> his
>>> >>> >>>>>>>> email.
>>> >>> >>>>>>>> It became clear to me Cody genuinely cares about the
>>> >>> >>>>>>>> project.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> Some of the frustrations come from the success of the
>>> >>> >>>>>>>> project
>>> >>> >>>>>>>> itself
>>> >>> >>>>>>>> becoming very "hot", and it is difficult to get clarity from
>>> >>> >>>>>>>> people
>>> >>> >>>>>>>> who
>>> >>> >>>>>>>> don't dedicate all their time to Spark. In fact, it is in
>>> >>> >>>>>>>> some
>>> >>> >>>>>>>> ways
>>> >>> >>>>>>>> similar
>>> >>> >>>>>>>> to scaling an engineering team in a successful startup: old
>>> >>> >>>>>>>> processes that
>>> >>> >>>>>>>> worked well might not work so well when it gets to a certain
>>> >>> >>>>>>>> size,
>>> >>> >>>>>>>> cultures
>>> >>> >>>>>>>> can get diluted, building culture vs building process, etc.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> I also really like to have a more visible process for larger
>>> >>> >>>>>>>> changes,
>>> >>> >>>>>>>> especially major user facing API changes. Historically we
>>> >>> >>>>>>>> upload
>>> >>> >>>>>>>> design docs
>>> >>> >>>>>>>> for major changes, but this is not always consistent, and it is
>>> >>> >>>>>>>> difficult to control the quality of the docs, due to the volunteer nature of the
>>> >>> >>>>>>>> organization.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> Some of the more concrete ideas we discussed focus on
>>> >>> >>>>>>>> building a
>>> >>> >>>>>>>> culture
>>> >>> >>>>>>>> to improve clarity:
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> - Process: Large changes should have design docs posted on
>>> >>> >>>>>>>> JIRA.
>>> >>> >>>>>>>> One
>>> >>> >>>>>>>> thing
>>> >>> >>>>>>>> Cody and I didn't discuss but an idea that just came to me
>>> >>> >>>>>>>> is we
>>> >>> >>>>>>>> should
>>> >>> >>>>>>>> create a design doc template for the project and ask
>>> >>> >>>>>>>> everybody
>>> >>> >>>>>>>> to
>>> >>> >>>>>>>> follow.
>>> >>> >>>>>>>> The design doc template should also explicitly list goals
>>> >>> >>>>>>>> and
>>> >>> >>>>>>>> non-goals, to
>>> >>> >>>>>>>> make design doc more consistent.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> - Process: Email dev@ to solicit feedback. We have done this
>>> >>> >>>>>>>> with
>>> >>> >>>>>>>> some
>>> >>> >>>>>>>> changes, but again very inconsistent. Just posting something
>>> >>> >>>>>>>> on
>>> >>> >>>>>>>> JIRA
>>> >>> >>>>>>>> isn't
>>> >>> >>>>>>>> sufficient, because there are simply too many JIRAs and the
>>> >>> >>>>>>>> signal
>>> >>> >>>>>>>> gets lost
>>> >>> >>>>>>>> in the noise. While this is generally impossible to enforce
>>> >>> >>>>>>>> because
>>> >>> >>>>>>>> we can't
>>> >>> >>>>>>>> force all volunteers to conform to a process (or they might
>>> >>> >>>>>>>> not
>>> >>> >>>>>>>> even
>>> >>> >>>>>>>> be
>>> >>> >>>>>>>> aware of this),  those who are more familiar with the
>>> >>> >>>>>>>> project
>>> >>> >>>>>>>> can
>>> >>> >>>>>>>> help by
>>> >>> >>>>>>>> emailing the dev@ when they see something that hasn't been.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> - Culture: The design doc author(s) should be open to
>>> >>> >>>>>>>> feedback.
>>> >>> >>>>>>>> A
>>> >>> >>>>>>>> design
>>> >>> >>>>>>>> doc should serve as the base for discussion and is by no
>>> >>> >>>>>>>> means
>>> >>> >>>>>>>> the
>>> >>> >>>>>>>> final
>>> >>> >>>>>>>> design. Of course, this does not mean the author has to
>>> >>> >>>>>>>> accept
>>> >>> >>>>>>>> every
>>> >>> >>>>>>>> feedback. They should also be comfortable accepting /
>>> >>> >>>>>>>> rejecting
>>> >>> >>>>>>>> ideas on
>>> >>> >>>>>>>> technical grounds.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> - Process / Culture: For major ongoing projects, it can be
>>> >>> >>>>>>>> useful
>>> >>> >>>>>>>> to
>>> >>> >>>>>>>> have
>>> >>> >>>>>>>> some monthly Google hangouts that are open to the world. I
>>> >>> >>>>>>>> am
>>> >>> >>>>>>>> actually not
>>> >>> >>>>>>>> sure how well this will work, because of the volunteering
>>> >>> >>>>>>>> nature
>>> >>> >>>>>>>> and
>>> >>> >>>>>>>> we need
>>> >>> >>>>>>>> to adjust for timezones for people across the globe, but it
>>> >>> >>>>>>>> seems
>>> >>> >>>>>>>> worth
>>> >>> >>>>>>>> trying.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> - Culture: Contributors (including committers) should be
>>> >>> >>>>>>>> more
>>> >>> >>>>>>>> direct
>>> >>> >>>>>>>> in
>>> >>> >>>>>>>> setting expectations, including whether they are working on
>>> >>> >>>>>>>> a
>>> >>> >>>>>>>> specific
>>> >>> >>>>>>>> issue, whether they will be working on a specific issue, and
>>> >>> >>>>>>>> whether
>>> >>> >>>>>>>> an
>>> >>> >>>>>>>> issue or pr or jira should be rejected. Most people I know
>>> >>> >>>>>>>> in
>>> >>> >>>>>>>> this
>>> >>> >>>>>>>> community
>>> >>> >>>>>>>> are nice and don't enjoy telling other people no, but it is
>>> >>> >>>>>>>> often
>>> >>> >>>>>>>> more
>>> >>> >>>>>>>> annoying to a contributor to not know anything than getting
>>> >>> >>>>>>>> a
>>> >>> >>>>>>>> no.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia
>>> >>> >>>>>>>> <[hidden email]>
>>> >>> >>>>>>>> wrote:
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> Love the idea of a more visible "Spark Improvement Proposal"
>>> >>> >>>>>>>>> process that solicits user input on new APIs. For what it's
>>> >>> >>>>>>>>> worth, I don't think committers are trying to minimize their
>>> >>> >>>>>>>>> own work -- every committer cares about making the software
>>> >>> >>>>>>>>> useful for users. However, it is always hard to get user
>>> >>> >>>>>>>>> input and so it helps to have this kind of process. I've
>>> >>> >>>>>>>>> certainly looked at the *IPs a lot in other software I use
>>> >>> >>>>>>>>> just to see the biggest things on the roadmap.
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> When you're talking about "changing interfaces", are you
>>> >>> >>>>>>>>> talking about public or internal APIs? I do think many people
>>> >>> >>>>>>>>> hate changing public APIs and I actually think that's for the
>>> >>> >>>>>>>>> best of the project. That's a technical debate, but
>>> >>> >>>>>>>>> basically, the worst thing when you're using a piece of
>>> >>> >>>>>>>>> software is that the developers constantly ask you to rewrite
>>> >>> >>>>>>>>> your app to update to a new version (and thus benefit from
>>> >>> >>>>>>>>> bug fixes, etc). Cue anyone who's used Protobuf, or Guava.
>>> >>> >>>>>>>>> The "let's get everyone to change their code this release"
>>> >>> >>>>>>>>> model works well within a single large company, but doesn't
>>> >>> >>>>>>>>> work well for a community, which is why nearly all *very*
>>> >>> >>>>>>>>> widely used programming interfaces (I'm talking things like
>>> >>> >>>>>>>>> Java standard library, Windows API, etc) almost *never* break
>>> >>> >>>>>>>>> backwards compatibility. All this is done within reason
>>> >>> >>>>>>>>> though, e.g. we do change things in major releases (2.x, 3.x,
>>> >>> >>>>>>>>> etc).
>>> >>> >>>>>>>>
>>> >>> >>>>>>>>
>>> >>> >>>>>>>>
>>> >>> >>>>>>>>
>>> >>> >>>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>
>>> >>> >>>>>
>>> >>> >>>>>
>>> >>> >>>>> --
>>> >>> >>>>> Stavros Kontopoulos
>>> >>> >>>>> Senior Software Engineer
>>> >>> >>>>> Lightbend, Inc.
>>> >>> >>>>> p: +30 6977967274
>>> >>> >>>>> e: [hidden email]
>>> >>> >>>>>
>>> >>> >>>>>
>>> >>> >>>>
>>> >>> >>>
>>> >>> >>
>>> >>> >>
>>> >>>
>>> >>
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix



--
Ryan Blue
Software Engineer
Netflix

Re: Spark Improvement Proposals

Cody Koeninger-2
That seems reasonable to me.

I do not want to see lazy consensus used on one of these proposals,
though. I want a clear outcome, i.e. call for a vote, wait at least 72
hours, get three +1s and no vetoes.
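
The rule above can be read as a simple check. What follows is a minimal sketch in Python, not anything the project uses: the names `Vote` and `vote_passes` are hypothetical, a -1 is assumed to count as a veto, and the question of whose votes are binding (committers or PMC) is left open, as it is in this thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Vote:
    voter: str
    value: int  # +1, 0, or -1; a -1 is assumed to be a veto


def vote_passes(opened_at, now, votes, min_hours=72, min_plus_ones=3):
    """Return True only if the voting window has elapsed, there are
    at least `min_plus_ones` +1 votes, and nobody has vetoed."""
    if now - opened_at < timedelta(hours=min_hours):
        return False  # window still open, so no outcome yet
    plus_ones = sum(1 for v in votes if v.value == 1)
    vetoes = sum(1 for v in votes if v.value == -1)
    return plus_ones >= min_plus_ones and vetoes == 0


if __name__ == "__main__":
    opened = datetime(2016, 10, 10, 14, 0)
    ballots = [Vote("a", 1), Vote("b", 1), Vote("c", 1)]
    print(vote_passes(opened, opened + timedelta(hours=72), ballots))
```

The point of the sketch is that every input is explicit, so the outcome is never a matter of interpretation, which is the property lazy consensus lacks.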



On Mon, Oct 10, 2016 at 2:15 PM, Ryan Blue <[hidden email]> wrote:

> Proposal submission: I think we should keep this as open as possible. If
> there is a problem with too many open proposals, then we should tackle that
> as a fix rather than excluding participation. Perhaps it will end up that
> way, but I think it's worth trying a more open model first.
>
> Majority vs consensus: My rationale is that I don't think we want to
> consider a proposal approved if it had objections serious enough that
> committers down-voted (or PMC depending on who gets a vote). If these
> proposals are like PEPs, then they represent a significant amount of
> community effort and I wouldn't want to move forward if up to half of the
> community thinks it's an untenable idea.
>
> rb
>
> On Mon, Oct 10, 2016 at 12:07 PM, Cody Koeninger <[hidden email]> wrote:
>>
>> I think this is closer to a procedural issue than a code modification
>> issue, hence why majority.  If everyone thinks consensus is better, I
>> don't care.  Again, I don't feel strongly about the way we achieve
>> clarity, just that we achieve clarity.
>>
>> On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue <[hidden email]> wrote:
>> > Sorry, I missed that the proposal includes majority approval. Why
>> > majority
>> > instead of consensus? I think we want to build consensus around these
>> > proposals and it makes sense to discuss until no one would veto.
>> >
>> > rb
>> >
>> > On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue <[hidden email]> wrote:
>> >>
>> >> +1 to votes to approve proposals. I agree that proposals should have an
>> >> official mechanism to be accepted, and a vote is an established means
>> >> of
>> >> doing that well. I like that it includes a period to review the
>> >> proposal and
>> >> I think proposals should have been discussed enough ahead of a vote to
>> >> survive the possibility of a veto.
>> >>
>> >> I also like the names that are short and (mostly) unique, like SEP.
>> >>
>> >> Where I disagree is with the requirement that a committer must formally
>> >> propose an enhancement. I don't see the value of restricting this: if
>> >> someone has the will to write up a proposal then they should be
>> >> encouraged
>> >> to do so and start a discussion about it. Even if there is a political
>> >> reality as Cody says, what is the value of codifying that in our
>> >> process? I
>> >> think restricting who can submit proposals would only undermine them by
>> >> pushing contributors out. Maybe I'm missing something here?
>> >>
>> >> rb
>> >>
>> >>
>> >>
>> >> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <[hidden email]>
>> >> wrote:
>> >>>
>> >>> Yes, users suggesting SIPs is a good thing and is explicitly called
>> >>> out in the linked document under the Who? section.  Formally proposing
>> >>> them, not so much, because of the political realities.
>> >>>
>> >>> Yes, implementation strategy definitely affects goals.  There are all
>> >>> kinds of examples of this, I'll pick one that's my fault so as to
>> >>> avoid sounding like I'm blaming:
>> >>>
>> >>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>> >>> upon by the community) goals was to make sure people could use the
>> >>> DStream however they were already using Kafka at work.  The lack
>> >>> of explicit agreement on that goal led to all kinds of fighting with
>> >>> committers, that could have been avoided.  The lack of explicit
>> >>> up-front strategy discussion led to the DStream not really working
>> >>> with compacted topics.  I knew about compacted topics, but don't have
>> >>> a use for them, so had a blind spot there.  If there was explicit
>> >>> up-front discussion that my strategy was "assume that batches can be
>> >>> defined on the driver solely by beginning and ending offsets", there's
>> >>> a greater chance that a user would have seen that and said, "hey, what
>> >>> about non-contiguous offsets in a compacted topic".
>> >>>
>> >>> This kind of thing is only going to happen smoothly if we have a
>> >>> lightweight user-visible process with clear outcomes.
>> >>>
>> >>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>> >>> <[hidden email]> wrote:
>> >>> > I agree with most of what Cody said.
>> >>> >
>> >>> > Two things:
>> >>> >
>> >>> > First we can always have other people suggest SIPs but mark them as
>> >>> > “unreviewed” and have committers basically move them forward. The
>> >>> > problem is
>> >>> > that writing a good document takes time. This way we can leverage
>> >>> > non
>> >>> > committers to do some of this work (it is just another way to
>> >>> > contribute).
>> >>> >
>> >>> >
>> >>> >
>> >>> > As for strategy, in many cases implementation strategy can affect
>> >>> > the goals. I will give a small example: In the current structured
>> >>> > streaming strategy, we group by the time to achieve a sliding
>> >>> > window. This is definitely an implementation decision and not a
>> >>> > goal. However, I can think of several aggregation functions which
>> >>> > have the time inside their calculation buffer. For example, let’s
>> >>> > say we want to return a set of all distinct values. One way to
>> >>> > implement this would be to make the set into a map and have the
>> >>> > value contain the last time seen. Multiplying it across the groupby
>> >>> > would cost a lot in performance. So adding such a strategy would
>> >>> > have a great effect on the type of aggregations and their
>> >>> > performance which does affect the goal. Without adding the
>> >>> > strategy, it is easy for whoever goes to the design document to not
>> >>> > think about these cases. Furthermore, it might be decided that
>> >>> > these cases are rare enough so that the strategy is still good
>> >>> > enough but how would we know it without user feedback?
>> >>> >
>> >>> > I believe this example is exactly what Cody was talking about.
>> >>> > Since implementation strategies often have a large effect on the
>> >>> > goals, we should discuss them when discussing the goals. In
>> >>> > addition, while it is often easy to throw out completely infeasible
>> >>> > goals, it is often much harder to figure out that the goals are
>> >>> > infeasible without fine tuning.
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > Assaf.
>> >>> >
>> >>> >
>> >>> >
>> >>> > From: Cody Koeninger-2 [via Apache Spark Developers List]
>> >>> > [mailto:ml-node+[hidden email]]
>> >>> > Sent: Monday, October 10, 2016 2:25 AM
>> >>> > To: Mendelson, Assaf
>> >>> > Subject: Re: Spark Improvement Proposals
>> >>> >
>> >>> >
>> >>> >
>> >>> > Only committers should formally submit SIPs because in an Apache
>> >>> > project only committers have explicit political power.  If a user
>> >>> > can't find a committer willing to sponsor an SIP idea, they have no
>> >>> > way to get the idea passed in any case.  If I can't find a committer
>> >>> > to sponsor this meta-SIP idea, I'm out of luck.
>> >>> >
>> >>> > I do not believe unrealistic goals can be found solely by
>> >>> > inspection.
>> >>> > We've managed to ignore unrealistic goals even after implementation!
>> >>> > Focusing on APIs can allow people to think they've solved something,
>> >>> > when there's really no way of implementing that API while meeting
>> >>> > the
>> >>> > goals.  Rapid iteration is clearly the best way to address this, but
>> >>> > we've already talked about why that hasn't really worked.  If adding
>> >>> > a
>> >>> > non-binding API section to the template is important to you, I'm not
>> >>> > against it, but I don't think it's sufficient.
>> >>> >
>> >>> > On your PRD vs design doc spectrum, I'm saying this is closer to a
>> >>> > PRD.  Clear agreement on goals is the most important thing and
>> >>> > that's
>> >>> > why it's the thing I want binding agreement on.  But I cannot agree
>> >>> > to
>> >>> > goals unless I have enough minimal technical info to judge whether
>> >>> > the
>> >>> > goals are likely to actually be accomplished.
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]>
>> >>> > wrote:
>> >>> >
>> >>> >
>> >>> >> Well, I think there are a few things here that don't make sense.
>> >>> >> First, why should only committers submit SIPs? Development in the
>> >>> >> project should be open to all contributors, whether they're
>> >>> >> committers or not. Second, I think unrealistic goals can be found
>> >>> >> just by inspecting the goals, and I'm not super worried that we'll
>> >>> >> accept a lot of SIPs that are then infeasible -- we can then submit
>> >>> >> new ones. But this depends on whether you want this process to be a
>> >>> >> "design doc lite", where people also agree on implementation
>> >>> >> strategy, or just a way to agree on goals. This is what I asked
>> >>> >> earlier about PRDs vs design docs (and I'm open to either one but
>> >>> >> I'd just like clarity). Finally, both as a user and designer of
>> >>> >> software, I always want to give feedback on APIs, so I'd really
>> >>> >> like a culture of having those early. People don't argue about
>> >>> >> prettiness when they discuss APIs, they argue about the core
>> >>> >> concepts to expose in order to meet various goals, and then they're
>> >>> >> stuck maintaining those for a long time.
>> >>> >>
>> >>> >> Matei
>> >>> >>
>> >>> >> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <[hidden email]> wrote:
>> >>> >>
>> >>> >> Users instead of people, sure.  Committers and contributors are (or
>> >>> >> at least should be) a subset of users.
>> >>> >>
>> >>> >> Non goals, sure. I don't care what the name is, but we need to
>> >>> >> clearly
>> >>> >> say
>> >>> >> e.g. 'no we are not maintaining compatibility with XYZ right now'.
>> >>> >>
>> >>> >> API, what I care most about is whether it allows me to accomplish
>> >>> >> the
>> >>> >> goals.
>> >>> >> Arguing about how ugly or pretty it is can be saved for design/
>> >>> >> implementation imho.
>> >>> >>
>> >>> >> Strategy, this is necessary because otherwise goals can be out of
>> >>> >> line
>> >>> >> with
>> >>> >> reality.  Don't propose goals you don't have at least some idea of
>> >>> >> how
>> >>> >> to
>> >>> >> implement.
>> >>> >>
>> >>> >> Rejected strategies, given that committers are the only ones I'm
>> >>> >> saying should formally submit SPARKLIs or SIPs, if they put junk in
>> >>> >> a required section then slap them down for it and tell them to fix
>> >>> >> it.
>> >>> >>
>> >>> >>
>> >>> >> On Oct 9, 2016 4:36 PM, "Matei Zaharia" <[hidden email]> wrote:
>> >>> >>>
>> >>> >>> Yup, this is the stuff that I found unclear. Thanks for clarifying
>> >>> >>> here,
>> >>> >>> but we should also clarify it in the writeup. In particular:
>> >>> >>>
>> >>> >>> - Goals needs to be about user-facing behavior ("people" is broad)
>> >>> >>>
>> >>> >>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will
>> >>> >>> dig
>> >>> >>> up
>> >>> >>> one of these and say "Spark's developers have officially rejected
>> >>> >>> X,
>> >>> >>> which
>> >>> >>> our awesome system has".
>> >>> >>>
>> >>> >>> - For user-facing stuff, I think you need a section on API.
>> >>> >>> Virtually
>> >>> >>> all
>> >>> >>> other *IPs I've seen have that.
>> >>> >>>
>> >>> >>> - I'm still not sure why the strategy section is needed if the
>> >>> >>> purpose is to define user-facing behavior -- unless this is the
>> >>> >>> strategy for setting the goals or for defining the API. That
>> >>> >>> sounds squarely like a design doc issue. In some sense, who cares
>> >>> >>> whether the proposal is technically feasible right now? If it's
>> >>> >>> infeasible, that will be discovered later during design and
>> >>> >>> implementation. Same thing with rejected strategies -- listing
>> >>> >>> some of those is definitely useful sometimes, but if you make this
>> >>> >>> a *required* section, people are just going to fill it in with
>> >>> >>> bogus stuff (I've seen this happen before).
>> >>> >>>
>> >>> >>> Matei
>> >>> >>>
>> >>> >
>> >>> >>> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <[hidden email]>
>> >>> >>> > wrote:
>> >>> >>> >
>> >>> >>> > So to focus the discussion on the specific strategy I'm
>> >>> >>> > suggesting,
>> >>> >>> > documented at
>> >>> >>> >
>> >>> >>> >
>> >>> >>> >
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >>> >>> >
>> >>> >>> > "Goals: What must this allow people to do, that they can't
>> >>> >>> > currently?"
>> >>> >>> >
>> >>> >>> > Is it unclear that this is focusing specifically on
>> >>> >>> > people-visible
>> >>> >>> > behavior?
>> >>> >>> >
>> >>> >>> > Rejected goals -  are important because otherwise people keep
>> >>> >>> > trying
>> >>> >>> > to argue about scope.  Of course you can change things later
>> >>> >>> > with a
>> >>> >>> > different SIP and different vote, the point is to focus.
>> >>> >>> >
>> >>> >>> > Use cases - are something that people are going to bring up in
>> >>> >>> > discussion.  If they aren't clearly documented as a goal ("This
>> >>> >>> > must
>> >>> >>> > allow me to connect using SSL"), they should be added.
>> >>> >>> >
>> >>> >>> > Internal architecture - if the people who need specific behavior
>> >>> >>> > are
>> >>> >>> > implementers of other parts of the system, that's fine.
>> >>> >>> >
>> >>> >>> > Rejected strategies - If you have none of these, you have no
>> >>> >>> > evidence
>> >>> >>> > that the proponent didn't just go with the first thing they had
>> >>> >>> > in
>> >>> >>> > mind (or have already implemented), which is a big problem
>> >>> >>> > currently.
>> >>> >>> > Approval isn't binding as to specifics of implementation, so
>> >>> >>> > these
>> >>> >>> > aren't handcuffs.  The goals are the contract, the strategy is
>> >>> >>> > evidence that contract can actually be met.
>> >>> >>> >
>> >>> >>> > Design docs - I'm not touching design docs.  The markdown file I
>> >>> >>> > linked specifically says of the strategy section "This is not a
>> >>> >>> > full
>> >>> >>> > design document."  Is this unclear?  Design docs can be worked
>> >>> >>> > on
>> >>> >>> > obviously, but that's not what I'm concerned with here.
>> >>> >>> >
>> >>> >>> >
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <[hidden email]>
>> >>> >>> > wrote:
>> >>> >>> >> Hi Cody,
>> >>> >>> >>
>> >>> >>> >> I think this would be a lot more concrete if we had a more
>> >>> >>> >> detailed
>> >>> >>> >> template
>> >>> >>> >> for SIPs. Right now, it's not super clear what's in scope --
>> >>> >>> >> e.g.
>> >>> >>> >> are
>> >>> >>> >> they
>> >>> >>> >> a way to solicit feedback on the user-facing behavior or on the
>> >>> >>> >> internals?
>> >>> >>> >> "Goals" can cover both things. I've been thinking of SIPs more
>> >>> >>> >> as
>> >>> >>> >> Product
>> >>> >>> >> Requirements Docs (PRDs), which focus on *what* a code change
>> >>> >>> >> should
>> >>> >>> >> do
>> >>> >>> >> as
>> >>> >>> >> opposed to how.
>> >>> >>> >>
>> >>> >>> >> In particular, here are some things that you may or may not
>> >>> >>> >> consider
>> >>> >>> >> in
>> >>> >>> >> scope for SIPs:
>> >>> >>> >>
>> >>> >>> >> - Goals and non-goals: This is definitely in scope, and IMO
>> >>> >>> >> should
>> >>> >>> >> focus on
>> >>> >>> >> user-visible behavior (e.g. "system supports SQL window
>> >>> >>> >> functions"
>> >>> >>> >> or
>> >>> >>> >> "system continues working if one node fails"). BTW I wouldn't
>> >>> >>> >> say
>> >>> >>> >> "rejected
>> >>> >>> >> goals" because some of them might become goals later, so we're
>> >>> >>> >> not
>> >>> >>> >> definitively rejecting them.
>> >>> >>> >>
>> >>> >>> >> - Public API: Probably should be included in most SIPs unless
>> >>> >>> >> it's
>> >>> >>> >> too
>> >>> >>> >> large
>> >>> >>> >> to fully specify then (e.g. "let's add an ML library").
>> >>> >>> >>
>> >>> >>> >> - Use cases: I usually find this very useful in PRDs to better
>> >>> >>> >> communicate
>> >>> >>> >> the goals.
>> >>> >>> >>
>> >>> >>> >> - Internal architecture: This is usually *not* a thing users
>> >>> >>> >> can
>> >>> >>> >> easily
>> >>> >>> >> comment on and it sounds more like a design doc item. Of course
>> >>> >>> >> it's
>> >>> >>> >> important to show that the SIP is feasible to implement. One
>> >>> >>> >> exception,
>> >>> >>> >> however, is that I think we'll have some SIPs primarily on
>> >>> >>> >> internals
>> >>> >>> >> (e.g.
>> >>> >>> >> if somebody wants to refactor Spark's query optimizer or
>> >>> >>> >> something).
>> >>> >>> >>
>> >>> >>> >> - Rejected strategies: I personally wouldn't put this, because
>> >>> >>> >> what's
>> >>> >>> >> the
>> >>> >>> >> point of voting to reject a strategy before you've really begun
>> >>> >>> >> designing
>> >>> >>> >> and implementing something? What if you discover that the
>> >>> >>> >> strategy
>> >>> >>> >> is
>> >>> >>> >> actually better when you start doing stuff?
>> >>> >>> >>
>> >>> >>> >> At a super high level, it depends on whether you want the SIPs
>> >>> >>> >> to be PRDs for getting some quick feedback on the goals of a
>> >>> >>> >> feature before it is designed, or something more like
>> >>> >>> >> full-fledged design docs (just a more visible design doc for
>> >>> >>> >> bigger changes). I looked at Kafka's KIPs, and they actually
>> >>> >>> >> seem to be more like design docs. This can work too but it does
>> >>> >>> >> require more work from the proposer and it can lead to the same
>> >>> >>> >> problems you mentioned with people already having a design and
>> >>> >>> >> implementation in mind.
>> >>> >>> >>
>> >>> >>> >> Basically, the question is, are you trying to iterate faster on
>> >>> >>> >> design
>> >>> >>> >> by
>> >>> >>> >> adding a step for user feedback earlier? Or are you just trying
>> >>> >>> >> to
>> >>> >>> >> make
>> >>> >>> >> design docs for key features more visible (and their approval
>> >>> >>> >> more
>> >>> >>> >> formal)?
>> >>> >>> >>
>> >>> >>> >> BTW note that in either case, I'd like to have a template for
>> >>> >>> >> design
>> >>> >>> >> docs
>> >>> >>> >> too, which should also include goals. I think that would've
>> >>> >>> >> avoided
>> >>> >>> >> some of
>> >>> >>> >> the issues you brought up.
>> >>> >>> >>
>> >>> >>> >> Matei
>> >>> >>> >>
>> >>> >>> >> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <[hidden email]>
>> >>> >>> >> wrote:
>> >>> >>> >>
>> >>> >>> >> Here's my specific proposal (meta-proposal?)
>> >>> >>> >>
>> >>> >>> >> Spark Improvement Proposals (SIP)
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> Background:
>> >>> >>> >>
>> >>> >>> >> The current problem is that design and implementation of large
>> >>> >>> >> features
>> >>> >>> >> are
>> >>> >>> >> often done in private, before soliciting user feedback.
>> >>> >>> >>
>> >>> >>> >> When feedback is solicited, it is often as to detailed design
>> >>> >>> >> specifics, not
>> >>> >>> >> focused on goals.
>> >>> >>> >>
>> >>> >>> >> When implementation does take place after design, there is
>> >>> >>> >> often
>> >>> >>> >> disagreement as to what goals are or are not in scope.
>> >>> >>> >>
>> >>> >>> >> This results in commits that don't fully meet user needs.
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> Goals:
>> >>> >>> >>
>> >>> >>> >> - Ensure user, contributor, and committer goals are clearly
>> >>> >>> >> identified
>> >>> >>> >> and
>> >>> >>> >> agreed upon, before implementation takes place.
>> >>> >>> >>
>> >>> >>> >> - Ensure that a technically feasible strategy is chosen that is
>> >>> >>> >> likely
>> >>> >>> >> to
>> >>> >>> >> meet the goals.
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> Rejected Goals:
>> >>> >>> >>
>> >>> >>> >> - SIPs are not for detailed design.  Design by committee
>> >>> >>> >> doesn't
>> >>> >>> >> work.
>> >>> >>> >>
>> >>> >>> >> - SIPs are not for every change.  We don't need that much
>> >>> >>> >> process.
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> Strategy:
>> >>> >>> >>
>> >>> >>> >> My suggestion is outlined as a Spark Improvement Proposal
>> >>> >>> >> process
>> >>> >>> >> documented
>> >>> >>> >> at
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >>> >>> >>
>> >>> >>> >> Specifics of Jira manipulation are an implementation detail we
>> >>> >>> >> can
>> >>> >>> >> figure
>> >>> >>> >> out.
>> >>> >>> >>
>> >>> >>> >> I'm suggesting voting; the need here is for a _clear_ outcome.
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> Rejected Strategies:
>> >>> >>> >>
>> >>> >>> >> Having someone who understands the problem implement it first
>> >>> >>> >> works,
>> >>> >>> >> but
>> >>> >>> >> only if significant iteration after user feedback is allowed.
>> >>> >>> >>
>> >>> >>> >> Historically this has been problematic due to pressure to limit
>> >>> >>> >> public
>> >>> >>> >> api
>> >>> >>> >> changes.
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <[hidden email]>
>> >>> >>> >> wrote:
>> >>> >>> >>>
>> >>> >>> >>> Alright, looks like there is quite a bit of support. We should
>> >>> >>> >>> wait to hear from more people too.
>> >>> >>> >>>
>> >>> >>> >>> To push this forward, Cody and I will be working together in
>> >>> >>> >>> the next couple of weeks to come up with a concrete, detailed
>> >>> >>> >>> proposal on what this entails, and then we can discuss the
>> >>> >>> >>> specific proposal as well.
>> >>> >>> >>>
>> >>> >>> >>>
>> >>> >>> >>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <[hidden
>> >>> >>> >>> email]>
>> >>> >>> >>> wrote:
>> >>> >>> >>>>
>> >>> >>> >>>> Yeah, in case it wasn't clear, I was talking about SIPs for
>> >>> >>> >>>> major
>> >>> >>> >>>> user-facing or cross-cutting changes, not minor feature adds.
>> >>> >>> >>>>
>> >>> >>> >>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos
>> >>> >>> >>>> <[hidden email]> wrote:
>> >>> >>> >>>>>
>> >>> >>> >>>>> +1 to the SIP label as long as it does not slow things down
>> >>> >>> >>>>> and it targets optimizing efforts, coordination etc. For
>> >>> >>> >>>>> example, really small features (assuming they don't touch
>> >>> >>> >>>>> public interfaces) or refactorings should not need to go
>> >>> >>> >>>>> through this process, and I hope it will be kept this way.
>> >>> >>> >>>>> So a guideline doc should be provided, like in the KIP case.
>> >>> >>> >>>>>
>> >>> >>> >>>>> IMHO so far aside from tagging things and linking them
>> >>> >>> >>>>> elsewhere
>> >>> >>> >>>>> simply
>> >>> >>> >>>>> having design docs and prototypes implementations in PRs is
>> >>> >>> >>>>> not
>> >>> >>> >>>>> something
>> >>> >>> >>>>> that has not worked so far. What is really a pain in many
>> >>> >>> >>>>> projects
>> >>> >>> >>>>> out there
>> >>> >>> >>>>> is discontinuity in progress of PRs, missing features, slow
>> >>> >>> >>>>> reviews
>> >>> >>> >>>>> which is
>> >>> >>> >>>>> understandable to some extent... it is not only about Spark
>> >>> >>> >>>>> but
>> >>> >>> >>>>> things can
>> >>> >>> >>>>> be improved for sure for this project in particular as
>> >>> >>> >>>>> already
>> >>> >>> >>>>> stated.
>> >>> >>> >>>>>
>> >>> >>> >>>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <[hidden
>> >>> >>> >>>>> email]>
>> >>> >>> >>>>> wrote:
>> >>> >>> >>>>>>
>> >>> >>> >>>>>> +1 to adding an SIP label and linking it from the website.
>> >>> >>> >>>>>> I
>> >>> >>> >>>>>> think
>> >>> >>> >>>>>> it
>> >>> >>> >>>>>> needs
>> >>> >>> >>>>>>
>> >>> >>> >>>>>> - template that focuses it towards soliciting user goals /
>> >>> >>> >>>>>> non
>> >>> >>> >>>>>> goals
>> >>> >>> >>>>>> - clear resolution as to which strategy was chosen to
>> >>> >>> >>>>>> pursue.
>> >>> >>> >>>>>> I'd
>> >>> >>> >>>>>> recommend a vote.
>> >>> >>> >>>>>>
>> >>> >>> >>>>>> Matei asked me to clarify what I meant by changing
>> >>> >>> >>>>>> interfaces,
>> >>> >>> >>>>>> I
>> >>> >>> >>>>>> think
>> >>> >>> >>>>>> it's directly relevant to the SIP idea so I'll clarify
>> >>> >>> >>>>>> here,
>> >>> >>> >>>>>> and
>> >>> >>> >>>>>> split
>> >>> >>> >>>>>> a thread for the other discussion per Nicholas' request.
>> >>> >>> >>>>>>
>> >>> >>> >>>>>> I meant changing public user interfaces.  I think the first
>> >>> >>> >>>>>> design
>> >>> >>> >>>>>> is
>> >>> >>> >>>>>> unlikely to be right, because it's done at a time when you
>> >>> >>> >>>>>> have
>> >>> >>> >>>>>> the
>> >>> >>> >>>>>> least information.  As a user, I find it considerably more
>> >>> >>> >>>>>> frustrating
>> >>> >>> >>>>>> to be unable to use a tool to get my job done, than I do
>> >>> >>> >>>>>> having to
>> >>> >>> >>>>>> make minor changes to my code in order to take advantage of
>> >>> >>> >>>>>> features.
>> >>> >>> >>>>>> I've seen committers be seriously reluctant to allow
>> >>> >>> >>>>>> changes
>> >>> >>> >>>>>> to
>> >>> >>> >>>>>> @experimental code that are needed in order for it to
>> >>> >>> >>>>>> really
>> >>> >>> >>>>>> work
>> >>> >>> >>>>>> right.  You need to be able to iterate, and if people on
>> >>> >>> >>>>>> both
>> >>> >>> >>>>>> sides
>> >>> >>> >>>>>> of
>> >>> >>> >>>>>> the fence aren't going to respect that some newer apis are
>> >>> >>> >>>>>> subject
>> >>> >>> >>>>>> to
>> >>> >>> >>>>>> change, then why even mark them as such?
>> >>> >>> >>>>>>
>> >>> >>> >>>>>> Ideally a finished SIP should give me a checklist of things
>> >>> >>> >>>>>> that
>> >>> >>> >>>>>> an
>> >>> >>> >>>>>> implementation must do, and things that it doesn't need to
>> >>> >>> >>>>>> do.
>> >>> >>> >>>>>> Contributors/committers should be seriously discouraged
>> >>> >>> >>>>>> from
>> >>> >>> >>>>>> putting
>> >>> >>> >>>>>> out a version 0.1 that doesn't have at least a prototype
>> >>> >>> >>>>>> implementation of all those things, especially if they're
>> >>> >>> >>>>>> then
>> >>> >>> >>>>>> going
>> >>> >>> >>>>>> to argue against interface changes necessary to get the
>> >>> >>> >>>>>> rest
>> >>> >>> >>>>>> of
>> >>> >>> >>>>>> the things done in the 0.2 version.
>> >>> >>> >>>>>>
>> >>> >>> >>>>>>
>> >>> >>> >>>>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden
>> >>> >>> >>>>>> email]>
>> >>> >>> >>>>>> wrote:
>> >>> >>> >>>>>>> I like the lightweight proposal to add a SIP label.
>> >>> >>> >>>>>>>
>> >>> >>> >>>>>>> During Spark 2.0 development, Tom (Graves) and I suggested
>> >>> >>> >>>>>>> using
>> >>> >>> >>>>>>> wiki
>> >>> >>> >>>>>>> to
>> >>> >>> >>>>>>> track the list of major changes, but that never really
>> >>> >>> >>>>>>> materialized
>> >>> >>> >>>>>>> due to
>> >>> >>> >>>>>>> the overhead. Adding a SIP label on major JIRAs and then
>> >>> >>> >>>>>>> link
>> >>> >>> >>>>>>> to
>> >>> >>> >>>>>>> them
>> >>> >>> >>>>>>> prominently on the Spark website makes a lot of sense.
>> >>> >>> >>>>>>>
>> >>> >>> >>>>>>>
>> >>> >>> >>>>>>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia <[hidden email]> wrote:
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> For the improvement proposals, I think one major point was to make them
>> >>> >>> >>>>>>>> really visible to users who are not contributors, so we should do more than
>> >>> >>> >>>>>>>> sending stuff to dev@. One very lightweight idea is to have a new type of
>> >>> >>> >>>>>>>> JIRA called a SIP and have a link to a filter that shows all such JIRAs from
>> >>> >>> >>>>>>>> http://spark.apache.org. I also like the idea of SIP and design doc
>> >>> >>> >>>>>>>> templates (in fact many projects have them).
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> Matei
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]> wrote:
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> I called Cody last night and talked about some of the topics in his email.
>> >>> >>> >>>>>>>> It became clear to me Cody genuinely cares about the project.
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> Some of the frustrations come from the success of the project itself
>> >>> >>> >>>>>>>> becoming very "hot", and it is difficult to get clarity from people who
>> >>> >>> >>>>>>>> don't dedicate all their time to Spark. In fact, it is in some ways similar
>> >>> >>> >>>>>>>> to scaling an engineering team in a successful startup: old processes that
>> >>> >>> >>>>>>>> worked well might not work so well when it gets to a certain size, cultures
>> >>> >>> >>>>>>>> can get diluted, building culture vs building process, etc.
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> I would also really like to have a more visible process for larger changes,
>> >>> >>> >>>>>>>> especially major user facing API changes. Historically we upload design docs
>> >>> >>> >>>>>>>> for major changes, but it is not always consistent and it is difficult to
>> >>> >>> >>>>>>>> ensure the quality of the docs, due to the volunteering nature of the
>> >>> >>> >>>>>>>> organization.
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> Some of the more concrete ideas we discussed focus on building a culture
>> >>> >>> >>>>>>>> to improve clarity:
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> - Process: Large changes should have design docs posted on JIRA. One thing
>> >>> >>> >>>>>>>> Cody and I didn't discuss but an idea that just came to me is we should
>> >>> >>> >>>>>>>> create a design doc template for the project and ask everybody to follow it.
>> >>> >>> >>>>>>>> The design doc template should also explicitly list goals and non-goals, to
>> >>> >>> >>>>>>>> make design docs more consistent.
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> - Process: Email dev@ to solicit feedback. We have done this with some
>> >>> >>> >>>>>>>> changes, but again very inconsistently. Just posting something on JIRA isn't
>> >>> >>> >>>>>>>> sufficient, because there are simply too many JIRAs and the signal gets lost
>> >>> >>> >>>>>>>> in the noise. While this is generally impossible to enforce because we can't
>> >>> >>> >>>>>>>> force all volunteers to conform to a process (or they might not even be
>> >>> >>> >>>>>>>> aware of this), those who are more familiar with the project can help by
>> >>> >>> >>>>>>>> emailing dev@ when they see something that hasn't been posted there.
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> - Culture: The design doc author(s) should be open to feedback. A design
>> >>> >>> >>>>>>>> doc should serve as the base for discussion and is by no means the final
>> >>> >>> >>>>>>>> design. Of course, this does not mean the author has to accept every
>> >>> >>> >>>>>>>> piece of feedback. They should also be comfortable accepting / rejecting
>> >>> >>> >>>>>>>> ideas on technical grounds.
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> - Process / Culture: For major ongoing projects, it can be useful to have
>> >>> >>> >>>>>>>> some monthly Google hangouts that are open to the world. I am actually not
>> >>> >>> >>>>>>>> sure how well this will work, because of the volunteering nature and we need
>> >>> >>> >>>>>>>> to adjust for timezones for people across the globe, but it seems worth
>> >>> >>> >>>>>>>> trying.
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> - Culture: Contributors (including committers) should be more direct in
>> >>> >>> >>>>>>>> setting expectations, including whether they are working on a specific
>> >>> >>> >>>>>>>> issue, whether they will be working on a specific issue, and whether an
>> >>> >>> >>>>>>>> issue or pr or jira should be rejected. Most people I know in this community
>> >>> >>> >>>>>>>> are nice and don't enjoy telling other people no, but it is often more
>> >>> >>> >>>>>>>> annoying to a contributor to not know anything than getting a no.
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia <[hidden email]> wrote:
>> >>> >>> >>>>>>>>>
>> >>> >>> >>>>>>>>>
>> >>> >>> >>>>>>>>> Love the idea of a more visible "Spark Improvement Proposal" process that
>> >>> >>> >>>>>>>>> solicits user input on new APIs. For what it's worth, I don't think
>> >>> >>> >>>>>>>>> committers are trying to minimize their own work -- every committer cares
>> >>> >>> >>>>>>>>> about making the software useful for users. However, it is always hard to
>> >>> >>> >>>>>>>>> get user input and so it helps to have this kind of process. I've certainly
>> >>> >>> >>>>>>>>> looked at the *IPs a lot in other software I use just to see the biggest
>> >>> >>> >>>>>>>>> things on the roadmap.
>> >>> >>> >>>>>>>>>
>> >>> >>> >>>>>>>>> When you're talking about "changing interfaces", are you talking about
>> >>> >>> >>>>>>>>> public or internal APIs? I do think many people hate changing public APIs
>> >>> >>> >>>>>>>>> and I actually think that's for the best of the project. That's a technical
>> >>> >>> >>>>>>>>> debate, but basically, the worst thing when you're using a piece of software
>> >>> >>> >>>>>>>>> is that the developers constantly ask you to rewrite your app to update to a
>> >>> >>> >>>>>>>>> new version (and thus benefit from bug fixes, etc). Cue anyone who's used
>> >>> >>> >>>>>>>>> Protobuf, or Guava. The "let's get everyone to change their code this
>> >>> >>> >>>>>>>>> release" model works well within a single large company, but doesn't work
>> >>> >>> >>>>>>>>> well for a community, which is why nearly all *very* widely used programming
>> >>> >>> >>>>>>>>> interfaces (I'm talking things like Java standard library, Windows API, etc)
>> >>> >>> >>>>>>>>> almost *never* break backwards compatibility. All this is done within reason
>> >>> >>> >>>>>>>>> though, e.g. we do change things in major releases (2.x, 3.x, etc).
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>
>> >>> >>> >>>>>>
>> >>> >>> >>>>>>
>> >>> >>> >>>>>>
>> >>> >>> >>>>>>
>> >>> >>> >>>>>>
>> >>> >>> >>>>>> ---------------------------------------------------------------------
>> >>> >>> >>>>>> To unsubscribe e-mail: [hidden email]
>> >>> >>> >>>>>>
>> >>> >>> >>>>>
>> >>> >>> >>>>>
>> >>> >>> >>>>>
>> >>> >>> >>>>> --
>> >>> >>> >>>>> Stavros Kontopoulos
>> >>> >>> >>>>> Senior Software Engineer
>> >>> >>> >>>>> Lightbend, Inc.
>> >>> >>> >>>>> p:  +30 6977967274
>> >>> >>> >>>>> e: [hidden email]
>> >>> >>> >>>>>
>> >>> >>> >>>>>
>> >>> >>> >>>>
>> >>> >>> >>>
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>>
>> >>> >>
>> >>> >
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Ryan Blue
>> >> Software Engineer
>> >> Netflix




Re: Spark Improvement Proposals

Matei Zaharia
Administrator
Agreed with this. As I said before regarding who submits: it's not a normal ASF process to require contributions to only come from committers. Committers are of course the only people who can *commit* stuff. But the whole point of an open source project is that anyone can *contribute* -- indeed, that is how people become committers. For example, in every ASF project, anyone can open JIRAs, submit design docs, submit patches, review patches, and vote on releases. This particular process is very similar to posting a JIRA or a design doc.

I also like consensus with a deadline (e.g. someone says "here is a new SEP, we want to accept it by date X so please comment before").

In general, with this type of stuff, it's better to start with very lightweight processes and then expand them if needed. Adding lots of rules from the beginning makes it confusing and can reduce contributions. Although, as engineers, we believe that anything can be solved using mechanical rules, in practice software development is a social process that ultimately requires humans to tackle things on a case-by-case basis.

Matei
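
The acceptance rule being debated in this thread (call for a vote, wait at least 72 hours, get three +1s and no vetoes, as Cody puts it in the quoted message below) can be written down as a small check. This is only a sketch under assumptions: the names `Vote` and `sip_vote_passes` and the `binding` flag are hypothetical, and the thread has not settled whether committers or the PMC would hold binding votes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MIN_VOTING_PERIOD = timedelta(hours=72)  # "wait at least 72 hours"
MIN_PLUS_ONES = 3                        # "get three +1s"


@dataclass(frozen=True)
class Vote:
    voter: str
    value: int     # +1, 0, or -1; a binding -1 is treated as a veto
    binding: bool  # hypothetical flag: who gets a binding vote is undecided


def sip_vote_passes(votes, opened_at, now):
    """Pass only if the window has elapsed, with enough binding +1s and no binding veto."""
    if now - opened_at < MIN_VOTING_PERIOD:
        return False  # still inside the review period, no outcome yet
    binding = [v for v in votes if v.binding]
    plus_ones = sum(1 for v in binding if v.value == 1)
    vetoes = sum(1 for v in binding if v.value == -1)
    return plus_ones >= MIN_PLUS_ONES and vetoes == 0
```

Under this sketch a simple majority is not enough: a single binding -1 blocks acceptance, which is the difference between the consensus and majority positions in the quoted discussion.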


> On Oct 10, 2016, at 12:19 PM, Cody Koeninger <[hidden email]> wrote:
>
> That seems reasonable to me.
>
> I do not want to see lazy consensus used on one of these proposals
> though; I want a clear outcome, i.e. call for a vote, wait at least 72
> hours, get three +1s and no vetoes.
>
>
>
> On Mon, Oct 10, 2016 at 2:15 PM, Ryan Blue <[hidden email]> wrote:
>> Proposal submission: I think we should keep this as open as possible. If
>> there is a problem with too many open proposals, then we should tackle that
>> as a fix rather than excluding participation. Perhaps it will end up that
>> way, but I think it's worth trying a more open model first.
>>
>> Majority vs consensus: My rationale is that I don't think we want to
>> consider a proposal approved if it had objections serious enough that
>> committers down-voted (or PMC depending on who gets a vote). If these
>> proposals are like PEPs, then they represent a significant amount of
>> community effort and I wouldn't want to move forward if up to half of the
>> community thinks it's an untenable idea.
>>
>> rb
>>
>> On Mon, Oct 10, 2016 at 12:07 PM, Cody Koeninger <[hidden email]> wrote:
>>>
>>> I think this is closer to a procedural issue than a code modification
>>> issue, hence why majority.  If everyone thinks consensus is better, I
>>> don't care.  Again, I don't feel strongly about the way we achieve
>>> clarity, just that we achieve clarity.
>>>
>>> On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue <[hidden email]> wrote:
>>>> Sorry, I missed that the proposal includes majority approval. Why majority
>>>> instead of consensus? I think we want to build consensus around these
>>>> proposals and it makes sense to discuss until no one would veto.
>>>>
>>>> rb
>>>>
>>>> On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue <[hidden email]> wrote:
>>>>>
>>>>> +1 to votes to approve proposals. I agree that proposals should have an
>>>>> official mechanism to be accepted, and a vote is an established means of
>>>>> doing that well. I like that it includes a period to review the proposal and
>>>>> I think proposals should have been discussed enough ahead of a vote to
>>>>> survive the possibility of a veto.
>>>>>
>>>>> I also like the names that are short and (mostly) unique, like SEP.
>>>>>
>>>>> Where I disagree is with the requirement that a committer must formally
>>>>> propose an enhancement. I don't see the value of restricting this: if
>>>>> someone has the will to write up a proposal then they should be encouraged
>>>>> to do so and start a discussion about it. Even if there is a political
>>>>> reality as Cody says, what is the value of codifying that in our process? I
>>>>> think restricting who can submit proposals would only undermine them by
>>>>> pushing contributors out. Maybe I'm missing something here?
>>>>>
>>>>> rb
>>>>>
>>>>>
>>>>> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <[hidden email]> wrote:
>>>>>>
>>>>>> Yes, users suggesting SIPs is a good thing and is explicitly called
>>>>>> out in the linked document under the Who? section.  Formally proposing
>>>>>> them, not so much, because of the political realities.
>>>>>>
>>>>>> Yes, implementation strategy definitely affects goals.  There are all
>>>>>> kinds of examples of this, I'll pick one that's my fault so as to
>>>>>> avoid sounding like I'm blaming:
>>>>>>
>>>>>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>>>>>> upon by the community) goals was to make sure people could use the
>>>>>> DStream however they were already using Kafka at work.  The lack
>>>>>> of explicit agreement on that goal led to all kinds of fighting with
>>>>>> committers, that could have been avoided.  The lack of explicit
>>>>>> up-front strategy discussion led to the DStream not really working
>>>>>> with compacted topics.  I knew about compacted topics, but don't have
>>>>>> a use for them, so had a blind spot there.  If there was explicit
>>>>>> up-front discussion that my strategy was "assume that batches can be
>>>>>> defined on the driver solely by beginning and ending offsets", there's
>>>>>> a greater chance that a user would have seen that and said, "hey, what
>>>>>> about non-contiguous offsets in a compacted topic".
>>>>>>
>>>>>> This kind of thing is only going to happen smoothly if we have a
>>>>>> lightweight user-visible process with clear outcomes.
>>>>>>
>>>>>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson <[hidden email]> wrote:
>>>>>>> I agree with most of what Cody said.
>>>>>>>
>>>>>>> Two things:
>>>>>>>
>>>>>>> First we can always have other people suggest SIPs but mark them as
>>>>>>> “unreviewed” and have committers basically move them forward. The problem is
>>>>>>> that writing a good document takes time. This way we can leverage
>>>>>>> non-committers to do some of this work (it is just another way to contribute).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> As for strategy, in many cases implementation strategy can affect the goals.
>>>>>>> I will give a small example: In the current structured streaming strategy,
>>>>>>> we group by the time to achieve a sliding window. This is definitely an
>>>>>>> implementation decision and not a goal. However, I can think of several
>>>>>>> aggregation functions which have the time inside their calculation buffer.
>>>>>>> For example, let’s say we want to return a set of all distinct values. One
>>>>>>> way to implement this would be to make the set into a map and have the value
>>>>>>> contain the last time seen. Multiplying it across the groupby would cost a
>>>>>>> lot in performance. So adding such a strategy would have a great effect on
>>>>>>> the type of aggregations and their performance which does affect the goal.
>>>>>>> Without adding the strategy, it is easy for whoever goes to the design
>>>>>>> document to not think about these cases. Furthermore, it might be decided
>>>>>>> that these cases are rare enough so that the strategy is still good enough
>>>>>>> but how would we know it without user feedback?
>>>>>>>
>>>>>>> I believe this example is exactly what Cody was talking about. Since many
>>>>>>> times implementation strategies have a large effect on the goal, we should
>>>>>>> have it discussed when discussing the goals. In addition, while it is often
>>>>>>> easy to throw out completely infeasible goals, it is often much harder to
>>>>>>> figure out that the goals are infeasible without fine tuning.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Assaf.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From: Cody Koeninger-2 [via Apache Spark Developers List]
>>>>>>> [mailto:ml-node+[hidden email]]
>>>>>>> Sent: Monday, October 10, 2016 2:25 AM
>>>>>>> To: Mendelson, Assaf
>>>>>>> Subject: Re: Spark Improvement Proposals
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Only committers should formally submit SIPs because in an Apache
>>>>>>> project only committers have explicit political power.  If a user can't
>>>>>>> find a committer willing to sponsor an SIP idea, they have no way to
>>>>>>> get the idea passed in any case.  If I can't find a committer to
>>>>>>> sponsor this meta-SIP idea, I'm out of luck.
>>>>>>>
>>>>>>> I do not believe unrealistic goals can be found solely by inspection.
>>>>>>> We've managed to ignore unrealistic goals even after implementation!
>>>>>>> Focusing on APIs can allow people to think they've solved something,
>>>>>>> when there's really no way of implementing that API while meeting the
>>>>>>> goals.  Rapid iteration is clearly the best way to address this, but
>>>>>>> we've already talked about why that hasn't really worked.  If adding a
>>>>>>> non-binding API section to the template is important to you, I'm not
>>>>>>> against it, but I don't think it's sufficient.
>>>>>>>
>>>>>>> On your PRD vs design doc spectrum, I'm saying this is closer to a
>>>>>>> PRD.  Clear agreement on goals is the most important thing and that's
>>>>>>> why it's the thing I want binding agreement on.  But I cannot agree to
>>>>>>> goals unless I have enough minimal technical info to judge whether the
>>>>>>> goals are likely to actually be accomplished.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]> wrote:
>>>>>>>
>>>>>>>> Well, I think there are a few things here that don't make sense. First, why
>>>>>>>> should only committers submit SIPs? Development in the project should be
>>>>>>>> open to all contributors, whether they're committers or not. Second, I think
>>>>>>>> unrealistic goals can be found just by inspecting the goals, and I'm not
>>>>>>>> super worried that we'll accept a lot of SIPs that are then infeasible -- we
>>>>>>>> can then submit new ones. But this depends on whether you want this process
>>>>>>>> to be a "design doc lite", where people also agree on implementation
>>>>>>>> strategy, or just a way to agree on goals. This is what I asked earlier
>>>>>>>> about PRDs vs design docs (and I'm open to either one but I'd just like
>>>>>>>> clarity). Finally, both as a user and designer of software, I always want to
>>>>>>>> give feedback on APIs, so I'd really like a culture of having those early.
>>>>>>>> People don't argue about prettiness when they discuss APIs, they argue about
>>>>>>>> the core concepts to expose in order to meet various goals, and then they're
>>>>>>>> stuck maintaining those for a long time.
>>>>>>>>
>>>>>>>> Matei
>>>>>>>>
>>>>>>>> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <[hidden email]> wrote:
>>>>>>>>
>>>>>>>> Users instead of people, sure.  Committers and contributors are (or at least
>>>>>>>> should be) a subset of users.
>>>>>>>>
>>>>>>>> Non goals, sure. I don't care what the name is, but we need to clearly say
>>>>>>>> e.g. 'no we are not maintaining compatibility with XYZ right now'.
>>>>>>>>
>>>>>>>> API, what I care most about is whether it allows me to accomplish the goals.
>>>>>>>> Arguing about how ugly or pretty it is can be saved for
>>>>>>>> design/implementation imho.
>>>>>>>>
>>>>>>>> Strategy, this is necessary because otherwise goals can be out of line with
>>>>>>>> reality.  Don't propose goals you don't have at least some idea of how to
>>>>>>>> implement.
>>>>>>>>
>>>>>>>> Rejected strategies, given that committers are the only ones I'm saying
>>>>>>>> should formally submit SPARKLIs or SIPs, if they put junk in a required
>>>>>>>> section then slap them down for it and tell them to fix it.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Oct 9, 2016 4:36 PM, "Matei Zaharia" <[hidden email]> wrote:
>>>>>>>>>
>>>>>>>>> Yup, this is the stuff that I found unclear. Thanks for clarifying here,
>>>>>>>>> but we should also clarify it in the writeup. In particular:
>>>>>>>>>
>>>>>>>>> - Goals needs to be about user-facing behavior ("people" is broad)
>>>>>>>>>
>>>>>>>>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up
>>>>>>>>> one of these and say "Spark's developers have officially rejected X, which
>>>>>>>>> our awesome system has".
>>>>>>>>>
>>>>>>>>> - For user-facing stuff, I think you need a section on API. Virtually all
>>>>>>>>> other *IPs I've seen have that.
>>>>>>>>>
>>>>>>>>> - I'm still not sure why the strategy section is needed if the purpose is
>>>>>>>>> to define user-facing behavior -- unless this is the strategy for setting
>>>>>>>>> the goals or for defining the API. That sounds squarely like a design doc
>>>>>>>>> issue. In some sense, who cares whether the proposal is technically feasible
>>>>>>>>> right now? If it's infeasible, that will be discovered later during design
>>>>>>>>> and implementation. Same thing with rejected strategies -- listing some of
>>>>>>>>> those is definitely useful sometimes, but if you make this a *required*
>>>>>>>>> section, people are just going to fill it in with bogus stuff (I've seen
>>>>>>>>> this happen before).
>>>>>>>>>
>>>>>>>>> Matei
>>>>>>>>>
>>>>>>>
>>>>>>>>>> On Oct 9, 2016, at 2:14 PM, Cody Koeninger <[hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>> So to focus the discussion on the specific strategy I'm suggesting,
>>>>>>>>>> documented at
>>>>>>>>>>
>>>>>>>>>> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>>>>>
>>>>>>>>>> "Goals: What must this allow people to do, that they can't currently?"
>>>>>>>>>>
>>>>>>>>>> Is it unclear that this is focusing specifically on people-visible
>>>>>>>>>> behavior?
>>>>>>>>>>
>>>>>>>>>> Rejected goals - are important because otherwise people keep trying
>>>>>>>>>> to argue about scope.  Of course you can change things later with a
>>>>>>>>>> different SIP and different vote, the point is to focus.
>>>>>>>>>>
>>>>>>>>>> Use cases - are something that people are going to bring up in
>>>>>>>>>> discussion.  If they aren't clearly documented as a goal ("This must
>>>>>>>>>> allow me to connect using SSL"), they should be added.
>>>>>>>>>>
>>>>>>>>>> Internal architecture - if the people who need specific behavior are
>>>>>>>>>> implementers of other parts of the system, that's fine.
>>>>>>>>>>
>>>>>>>>>> Rejected strategies - If you have none of these, you have no evidence
>>>>>>>>>> that the proponent didn't just go with the first thing they had in
>>>>>>>>>> mind (or have already implemented), which is a big problem currently.
>>>>>>>>>> Approval isn't binding as to specifics of implementation, so these
>>>>>>>>>> aren't handcuffs.  The goals are the contract, the strategy is
>>>>>>>>>> evidence that contract can actually be met.
>>>>>>>>>>
>>>>>>>>>> Design docs - I'm not touching design docs.  The markdown file I
>>>>>>>>>> linked specifically says of the strategy section "This is not a full
>>>>>>>>>> design document."  Is this unclear?  Design docs can be worked on
>>>>>>>>>> obviously, but that's not what I'm concerned with here.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <[hidden email]> wrote:
>>>>>>>>>>> Hi Cody,
>>>>>>>>>>>
>>>>>>>>>>> I think this would be a lot more concrete if we had a more detailed template
>>>>>>>>>>> for SIPs. Right now, it's not super clear what's in scope -- e.g. are they
>>>>>>>>>>> a way to solicit feedback on the user-facing behavior or on the internals?
>>>>>>>>>>> "Goals" can cover both things. I've been thinking of SIPs more as Product
>>>>>>>>>>> Requirements Docs (PRDs), which focus on *what* a code change should do as
>>>>>>>>>>> opposed to how.
>>>>>>>>>>>
>>>>>>>>>>> In particular, here are some things that you may or may not consider in
>>>>>>>>>>> scope for SIPs:
>>>>>>>>>>>
>>>>>>>>>>> - Goals and non-goals: This is definitely in scope, and IMO should focus on
>>>>>>>>>>> user-visible behavior (e.g. "system supports SQL window functions" or
>>>>>>>>>>> "system continues working if one node fails"). BTW I wouldn't say "rejected
>>>>>>>>>>> goals" because some of them might become goals later, so we're not
>>>>>>>>>>> definitively rejecting them.
>>>>>>>>>>>
>>>>>>>>>>> - Public API: Probably should be included in most SIPs unless it's too large
>>>>>>>>>>> to fully specify then (e.g. "let's add an ML library").
>>>>>>>>>>>
>>>>>>>>>>> - Use cases: I usually find this very useful in PRDs to better communicate
>>>>>>>>>>> the goals.
>>>>>>>>>>>
>>>>>>>>>>> - Internal architecture: This is usually *not* a thing users can easily
>>>>>>>>>>> comment on and it sounds more like a design doc item. Of course it's
>>>>>>>>>>> important to show that the SIP is feasible to implement. One exception,
>>>>>>>>>>> however, is that I think we'll have some SIPs primarily on internals (e.g.
>>>>>>>>>>> if somebody wants to refactor Spark's query optimizer or something).
>>>>>>>>>>>
>>>>>>>>>>> - Rejected strategies: I personally wouldn't put this, because what's the
>>>>>>>>>>> point of voting to reject a strategy before you've really begun designing
>>>>>>>>>>> and implementing something? What if you discover that the strategy is
>>>>>>>>>>> actually better when you start doing stuff?
>>>>>>>>>>>
>>>>>>>>>>> At a super high level, it depends on whether you want the SIPs to be PRDs
>>>>>>>>>>> for getting some quick feedback on the goals of a feature before it is
>>>>>>>>>>> designed, or something more like full-fledged design docs (just a more
>>>>>>>>>>> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
>>>>>>>>>>> actually seem to be more like design docs. This can work too but it does
>>>>>>>>>>> require more work from the proposer and it can lead to the same problems you
>>>>>>>>>>> mentioned with people already having a design and implementation in mind.
>>>>>>>>>>>
>>>>>>>>>>> Basically, the question is, are you trying to iterate faster on design by
>>>>>>>>>>> adding a step for user feedback earlier? Or are you just trying to make
>>>>>>>>>>> design docs for key features more visible (and their approval more formal)?
>>>>>>>>>>>
>>>>>>>>>>> BTW note that in either case, I'd like to have a template for design docs
>>>>>>>>>>> too, which should also include goals. I think that would've avoided some of
>>>>>>>>>>> the issues you brought up.
>>>>>>>>>>>
>>>>>>>>>>> Matei
>>>>>>>>>>>
>>>>>>>>>>> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <[hidden email]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Here's my specific proposal (meta-proposal?)
>>>>>>>>>>>
>>>>>>>>>>> Spark Improvement Proposals (SIP)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Background:
>>>>>>>>>>>
>>>>>>>>>>> The current problem is that design and implementation of large
>>>>>>>>>>> features are often done in private, before soliciting user
>>>>>>>>>>> feedback.
>>>>>>>>>>>
>>>>>>>>>>> When feedback is solicited, it is often as to detailed design
>>>>>>>>>>> specifics, not focused on goals.
>>>>>>>>>>>
>>>>>>>>>>> When implementation does take place after design, there is
>>>>>>>>>>> often disagreement as to what goals are or are not in scope.
>>>>>>>>>>>
>>>>>>>>>>> This results in commits that don't fully meet user needs.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Goals:
>>>>>>>>>>>
>>>>>>>>>>> - Ensure user, contributor, and committer goals are clearly
>>>>>>>>>>> identified and agreed upon, before implementation takes place.
>>>>>>>>>>>
>>>>>>>>>>> - Ensure that a technically feasible strategy is chosen that is
>>>>>>>>>>> likely to meet the goals.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Rejected Goals:
>>>>>>>>>>>
>>>>>>>>>>> - SIPs are not for detailed design.  Design by committee
>>>>>>>>>>> doesn't work.
>>>>>>>>>>>
>>>>>>>>>>> - SIPs are not for every change.  We don't need that much
>>>>>>>>>>> process.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Strategy:
>>>>>>>>>>>
>>>>>>>>>>> My suggestion is outlined as a Spark Improvement Proposal
>>>>>>>>>>> process documented at
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>>>>>>
>>>>>>>>>>> Specifics of Jira manipulation are an implementation detail we
>>>>>>>>>>> can figure out.
>>>>>>>>>>>
>>>>>>>>>>> I'm suggesting voting; the need here is for a _clear_ outcome.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Rejected Strategies:
>>>>>>>>>>>
>>>>>>>>>>> Having someone who understands the problem implement it first
>>>>>>>>>>> works, but only if significant iteration after user feedback is
>>>>>>>>>>> allowed.
>>>>>>>>>>>
>>>>>>>>>>> Historically this has been problematic due to pressure to limit
>>>>>>>>>>> public api changes.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <[hidden email]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Alright, looks like there is quite a bit of support. We should
>>>>>>>>>>>> wait to hear from more people too.
>>>>>>>>>>>>
>>>>>>>>>>>> To push this forward, Cody and I will be working together in
>>>>>>>>>>>> the next couple of weeks to come up with a concrete, detailed
>>>>>>>>>>>> proposal on what this entails, and then we can discuss the
>>>>>>>>>>>> specific proposal as well.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <[hidden email]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yeah, in case it wasn't clear, I was talking about SIPs for
>>>>>>>>>>>>> major
>>>>>>>>>>>>> user-facing or cross-cutting changes, not minor feature adds.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos
>>>>>>>>>>>>> <[hidden email]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1 to the SIP label as long as it does not slow things down
>>>>>>>>>>>>>> and it targets optimizing efforts, coordination etc. For
>>>>>>>>>>>>>> example, really small features (assuming they don't touch
>>>>>>>>>>>>>> public interfaces) or refactorings should not need to go
>>>>>>>>>>>>>> through this process, and I hope it will be kept this way.
>>>>>>>>>>>>>> So a guideline doc should be provided, like in the KIP case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> IMHO, aside from tagging things and linking them elsewhere,
>>>>>>>>>>>>>> simply having design docs and prototype implementations in
>>>>>>>>>>>>>> PRs is not something that has failed to work so far. What is
>>>>>>>>>>>>>> really a pain in many projects out there is discontinuity in
>>>>>>>>>>>>>> the progress of PRs, missing features, and slow reviews,
>>>>>>>>>>>>>> which is understandable to some extent... It is not only
>>>>>>>>>>>>>> about Spark, but things can surely be improved for this
>>>>>>>>>>>>>> project in particular, as already stated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <[hidden email]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1 to adding an SIP label and linking it from the website.
>>>>>>>>>>>>>>> I think it needs
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - template that focuses it towards soliciting user goals /
>>>>>>>>>>>>>>> non goals
>>>>>>>>>>>>>>> - clear resolution as to which strategy was chosen to
>>>>>>>>>>>>>>> pursue. I'd recommend a vote.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Matei asked me to clarify what I meant by changing
>>>>>>>>>>>>>>> interfaces, I think it's directly relevant to the SIP idea
>>>>>>>>>>>>>>> so I'll clarify here, and split a thread for the other
>>>>>>>>>>>>>>> discussion per Nicholas' request.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I meant changing public user interfaces.  I think the first
>>>>>>>>>>>>>>> design is unlikely to be right, because it's done at a time
>>>>>>>>>>>>>>> when you have the least information.  As a user, I find it
>>>>>>>>>>>>>>> considerably more frustrating to be unable to use a tool to
>>>>>>>>>>>>>>> get my job done, than I do having to make minor changes to
>>>>>>>>>>>>>>> my code in order to take advantage of features. I've seen
>>>>>>>>>>>>>>> committers be seriously reluctant to allow changes to
>>>>>>>>>>>>>>> @experimental code that are needed in order for it to
>>>>>>>>>>>>>>> really work right.  You need to be able to iterate, and if
>>>>>>>>>>>>>>> people on both sides of the fence aren't going to respect
>>>>>>>>>>>>>>> that some newer apis are subject to change, then why even
>>>>>>>>>>>>>>> mark them as such?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ideally a finished SIP should give me a checklist of things
>>>>>>>>>>>>>>> that an implementation must do, and things that it doesn't
>>>>>>>>>>>>>>> need to do. Contributors/committers should be seriously
>>>>>>>>>>>>>>> discouraged from putting out a version 0.1 that doesn't
>>>>>>>>>>>>>>> have at least a prototype implementation of all those
>>>>>>>>>>>>>>> things, especially if they're then going to argue against
>>>>>>>>>>>>>>> interface changes necessary to get the rest of the things
>>>>>>>>>>>>>>> done in the 0.2 version.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden email]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> I like the lightweight proposal to add a SIP label.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> During Spark 2.0 development, Tom (Graves) and I suggested
>>>>>>>>>>>>>>>> using a wiki to track the list of major changes, but that
>>>>>>>>>>>>>>>> never really materialized due to the overhead. Adding a
>>>>>>>>>>>>>>>> SIP label on major JIRAs and then linking to them
>>>>>>>>>>>>>>>> prominently on the Spark website makes a lot of sense.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia
>>>>>>>>>>>>>>>> <[hidden email]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For the improvement proposals, I think one major point
>>>>>>>>>>>>>>>>> was to make them really visible to users who are not
>>>>>>>>>>>>>>>>> contributors, so we should do more than sending stuff to
>>>>>>>>>>>>>>>>> dev@. One very lightweight idea is to have a new type of
>>>>>>>>>>>>>>>>> JIRA called a SIP and have a link to a filter that shows
>>>>>>>>>>>>>>>>> all such JIRAs from http://spark.apache.org. I also like
>>>>>>>>>>>>>>>>> the idea of SIP and design doc templates (in fact many
>>>>>>>>>>>>>>>>> projects have them).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I called Cody last night and talked about some of the
>>>>>>>>>>>>>>>>> topics in his email. It became clear to me Cody genuinely
>>>>>>>>>>>>>>>>> cares about the project.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Some of the frustrations come from the success of the
>>>>>>>>>>>>>>>>> project itself becoming very "hot", and it is difficult
>>>>>>>>>>>>>>>>> to get clarity from people who don't dedicate all their
>>>>>>>>>>>>>>>>> time to Spark. In fact, it is in some ways similar to
>>>>>>>>>>>>>>>>> scaling an engineering team in a successful startup: old
>>>>>>>>>>>>>>>>> processes that worked well might not work so well when it
>>>>>>>>>>>>>>>>> gets to a certain size, cultures can get diluted,
>>>>>>>>>>>>>>>>> building culture vs building process, etc.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would also really like to have a more visible process
>>>>>>>>>>>>>>>>> for larger changes, especially major user-facing API
>>>>>>>>>>>>>>>>> changes. Historically we upload design docs for major
>>>>>>>>>>>>>>>>> changes, but this is not always done consistently and it
>>>>>>>>>>>>>>>>> is difficult to ensure the quality of the docs, due to
>>>>>>>>>>>>>>>>> the volunteer nature of the organization.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Some of the more concrete ideas we discussed focus on
>>>>>>>>>>>>>>>>> building a culture to improve clarity:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Process: Large changes should have design docs posted
>>>>>>>>>>>>>>>>> on JIRA. One thing Cody and I didn't discuss, but an idea
>>>>>>>>>>>>>>>>> that just came to me, is that we should create a design
>>>>>>>>>>>>>>>>> doc template for the project and ask everybody to follow
>>>>>>>>>>>>>>>>> it. The design doc template should also explicitly list
>>>>>>>>>>>>>>>>> goals and non-goals, to make design docs more consistent.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Process: Email dev@ to solicit feedback. We have done
>>>>>>>>>>>>>>>>> this with some changes, but again very inconsistently.
>>>>>>>>>>>>>>>>> Just posting something on JIRA isn't sufficient, because
>>>>>>>>>>>>>>>>> there are simply too many JIRAs and the signal gets lost
>>>>>>>>>>>>>>>>> in the noise. While this is generally impossible to
>>>>>>>>>>>>>>>>> enforce because we can't force all volunteers to conform
>>>>>>>>>>>>>>>>> to a process (or they might not even be aware of this),
>>>>>>>>>>>>>>>>> those who are more familiar with the project can help by
>>>>>>>>>>>>>>>>> emailing dev@ when they see something that hasn't been.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Culture: The design doc author(s) should be open to
>>>>>>>>>>>>>>>>> feedback. A design doc should serve as the base for
>>>>>>>>>>>>>>>>> discussion and is by no means the final design. Of
>>>>>>>>>>>>>>>>> course, this does not mean the author has to accept every
>>>>>>>>>>>>>>>>> piece of feedback. They should also be comfortable
>>>>>>>>>>>>>>>>> accepting / rejecting ideas on technical grounds.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Process / Culture: For major ongoing projects, it can
>>>>>>>>>>>>>>>>> be useful to have some monthly Google hangouts that are
>>>>>>>>>>>>>>>>> open to the world. I am actually not sure how well this
>>>>>>>>>>>>>>>>> will work, because of the volunteering nature and we need
>>>>>>>>>>>>>>>>> to adjust for timezones for people across the globe, but
>>>>>>>>>>>>>>>>> it seems worth trying.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Culture: Contributors (including committers) should be
>>>>>>>>>>>>>>>>> more direct in setting expectations, including whether
>>>>>>>>>>>>>>>>> they are working on a specific issue, whether they will
>>>>>>>>>>>>>>>>> be working on a specific issue, and whether an issue or
>>>>>>>>>>>>>>>>> pr or jira should be rejected. Most people I know in this
>>>>>>>>>>>>>>>>> community are nice and don't enjoy telling other people
>>>>>>>>>>>>>>>>> no, but it is often more annoying to a contributor to not
>>>>>>>>>>>>>>>>> know anything than getting a no.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia
>>>>>>>>>>>>>>>>> <[hidden email]>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Love the idea of a more visible "Spark Improvement
>>>>>>>>>>>>>>>>>> Proposal" process that solicits user input on new APIs.
>>>>>>>>>>>>>>>>>> For what it's worth, I don't think committers are trying
>>>>>>>>>>>>>>>>>> to minimize their own work -- every committer cares
>>>>>>>>>>>>>>>>>> about making the software useful for users. However, it
>>>>>>>>>>>>>>>>>> is always hard to get user input and so it helps to have
>>>>>>>>>>>>>>>>>> this kind of process. I've certainly looked at the *IPs
>>>>>>>>>>>>>>>>>> a lot in other software I use just to see the biggest
>>>>>>>>>>>>>>>>>> things on the roadmap.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When you're talking about "changing interfaces", are you
>>>>>>>>>>>>>>>>>> talking about public or internal APIs? I do think many
>>>>>>>>>>>>>>>>>> people hate changing public APIs and I actually think
>>>>>>>>>>>>>>>>>> that's for the best of the project. That's a technical
>>>>>>>>>>>>>>>>>> debate, but basically, the worst thing when you're using
>>>>>>>>>>>>>>>>>> a piece of software is that the developers constantly
>>>>>>>>>>>>>>>>>> ask you to rewrite your app to update to a new version
>>>>>>>>>>>>>>>>>> (and thus benefit from bug fixes, etc). Cue anyone who's
>>>>>>>>>>>>>>>>>> used Protobuf, or Guava. The "let's get everyone to
>>>>>>>>>>>>>>>>>> change their code this release" model works well within
>>>>>>>>>>>>>>>>>> a single large company, but doesn't work well for a
>>>>>>>>>>>>>>>>>> community, which is why nearly all *very* widely used
>>>>>>>>>>>>>>>>>> programming interfaces (I'm talking things like Java
>>>>>>>>>>>>>>>>>> standard library, Windows API, etc) almost *never* break
>>>>>>>>>>>>>>>>>> backwards compatibility. All this is done within reason
>>>>>>>>>>>>>>>>>> though, e.g. we do change things in major releases (2.x,
>>>>>>>>>>>>>>>>>> 3.x, etc).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>> To unsubscribe e-mail: [hidden email]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Stavros Kontopoulos
>>>>>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>>>>> Lightbend, Inc.
>>>>>>>>>>>>>> p:  +30 6977967274
>>>>>>>>>>>>>> e: [hidden email]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>
>
>





Re: Spark Improvement Proposals

Steve Loughran
This is an interesting process proposal; I think it could work well.

- It's got the flavour of the ASF incubator; maybe some of the processes there (a mentor, regular reporting in) could help, in particular to help stop the -1 at the end of the work.
- It may also aid collaboration to have a medium-lived branch, enabling collaboration with multiple people submitting PRs into the ASF codebase. This can reduce the cost of merging and enable Jenkins to keep on top of it. It also fits in well with the ASF "do in apache infra" community development process.


> On 10 Oct 2016, at 20:26, Matei Zaharia <[hidden email]> wrote:
>
> Agreed with this. As I said before regarding who submits: it's not a normal ASF process to require contributions to only come from committers. Committers are of course the only people who can *commit* stuff. But the whole point of an open source project is that anyone can *contribute* -- indeed, that is how people become committers. For example, in every ASF project, anyone can open JIRAs, submit design docs, submit patches, review patches, and vote on releases. This particular process is very similar to posting a JIRA or a design doc.
>
> I also like consensus with a deadline (e.g. someone says "here is a new SEP, we want to accept it by date X so please comment before").
>
> In general, with this type of stuff, it's better to start with very lightweight processes and then expand them if needed. Adding lots of rules from the beginning makes it confusing and can reduce contributions. Although, as engineers, we believe that anything can be solved using mechanical rules, in practice software development is a social process that ultimately requires humans to tackle things on a case-by-case basis.
>
> Matei
>
>
>> On Oct 10, 2016, at 12:19 PM, Cody Koeninger <[hidden email]> wrote:
>>
>> That seems reasonable to me.
>>
>> I do not want to see lazy consensus used on one of these proposals,
>> though. I want a clear outcome, i.e. call for a vote, wait at least 72
>> hours, get three +1s and no vetoes.
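
The rule described above (at least 72 hours, three +1s, no vetoes) can be sketched roughly as follows. This is only an illustration: the `Vote` type, the `binding` flag, and the outcome labels are assumptions made for the example, and the thread has not settled who actually gets a binding vote.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MIN_VOTING_PERIOD = timedelta(hours=72)  # "wait at least 72 hours"
MIN_PLUS_ONES = 3                        # "get three +1s"


@dataclass(frozen=True)
class Vote:
    voter: str
    value: int            # +1 approve, 0 abstain, -1 veto
    binding: bool = True  # assumption: only binding votes are counted


def vote_outcome(opened_at, now, votes):
    """Return 'open', 'accepted', 'rejected' or 'not accepted' (hypothetical labels)."""
    if now - opened_at < MIN_VOTING_PERIOD:
        return "open"  # too early to call, whatever the tally says
    binding = [v for v in votes if v.binding]
    if any(v.value == -1 for v in binding):
        return "rejected"  # a single binding veto blocks the proposal
    plus_ones = sum(1 for v in binding if v.value == 1)
    return "accepted" if plus_ones >= MIN_PLUS_ONES else "not accepted"
```

Unlike lazy consensus, silence never produces an acceptance here: with fewer than three +1s the result is an explicit "not accepted".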
>>
>>
>>
>> On Mon, Oct 10, 2016 at 2:15 PM, Ryan Blue <[hidden email]> wrote:
>>> Proposal submission: I think we should keep this as open as possible. If
>>> there is a problem with too many open proposals, then we should tackle that
>>> as a fix rather than excluding participation. Perhaps it will end up that
>>> way, but I think it's worth trying a more open model first.
>>>
>>> Majority vs consensus: My rationale is that I don't think we want to
>>> consider a proposal approved if it had objections serious enough that
>>> committers down-voted (or PMC depending on who gets a vote). If these
>>> proposals are like PEPs, then they represent a significant amount of
>>> community effort and I wouldn't want to move forward if up to half of the
>>> community thinks it's an untenable idea.
>>>
>>> rb
>>>
>>> On Mon, Oct 10, 2016 at 12:07 PM, Cody Koeninger <[hidden email]> wrote:
>>>>
>>>> I think this is closer to a procedural issue than a code modification
>>>> issue, hence why majority.  If everyone thinks consensus is better, I
>>>> don't care.  Again, I don't feel strongly about the way we achieve
>>>> clarity, just that we achieve clarity.
>>>>
>>>> On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue <[hidden email]> wrote:
>>>>> Sorry, I missed that the proposal includes majority approval. Why
>>>>> majority instead of consensus? I think we want to build consensus
>>>>> around these proposals and it makes sense to discuss until no one
>>>>> would veto.
>>>>>
>>>>> rb
>>>>>
>>>>> On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue <[hidden email]> wrote:
>>>>>>
>>>>>> +1 to votes to approve proposals. I agree that proposals should have an
>>>>>> official mechanism to be accepted, and a vote is an established means
>>>>>> of doing that well. I like that it includes a period to review the
>>>>>> proposal and I think proposals should have been discussed enough ahead
>>>>>> of a vote to survive the possibility of a veto.
>>>>>>
>>>>>> I also like the names that are short and (mostly) unique, like SEP.
>>>>>>
>>>>>> Where I disagree is with the requirement that a committer must formally
>>>>>> propose an enhancement. I don't see the value of restricting this: if
>>>>>> someone has the will to write up a proposal then they should be
>>>>>> encouraged to do so and start a discussion about it. Even if there is a
>>>>>> political reality as Cody says, what is the value of codifying that in
>>>>>> our process? I think restricting who can submit proposals would only
>>>>>> undermine them by pushing contributors out. Maybe I'm missing something
>>>>>> here?
>>>>>>
>>>>>> rb
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <[hidden email]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Yes, users suggesting SIPs is a good thing and is explicitly called
>>>>>>> out in the linked document under the Who? section.  Formally proposing
>>>>>>> them, not so much, because of the political realities.
>>>>>>>
>>>>>>> Yes, implementation strategy definitely affects goals.  There are all
>>>>>>> kinds of examples of this, I'll pick one that's my fault so as to
>>>>>>> avoid sounding like I'm blaming:
>>>>>>>
>>>>>>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>>>>>>> upon by the community) goals was to make sure people could use the
>>>>>>> DStream however they were already using Kafka at work.  The lack
>>>>>>> of explicit agreement on that goal led to all kinds of fighting with
>>>>>>> committers, that could have been avoided.  The lack of explicit
>>>>>>> up-front strategy discussion led to the DStream not really working
>>>>>>> with compacted topics.  I knew about compacted topics, but don't have
>>>>>>> a use for them, so had a blind spot there.  If there was explicit
>>>>>>> up-front discussion that my strategy was "assume that batches can be
>>>>>>> defined on the driver solely by beginning and ending offsets", there's
>>>>>>> a greater chance that a user would have seen that and said, "hey, what
>>>>>>> about non-contiguous offsets in a compacted topic".
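
To make the blind spot above concrete, here is a minimal sketch that assumes a partition log can be modelled as a plain list of the offsets still present. The `OffsetRange` class and helper functions are hypothetical stand-ins written for this example, not the Spark or Kafka API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    """A batch defined solely by its beginning and ending offsets."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive

    def expected_count(self):
        # Only correct if every offset in the range still exists.
        return self.until_offset - self.from_offset


def read_range(log_offsets, r):
    """Offsets actually present in the log that fall inside the range."""
    return [o for o in log_offsets if r.from_offset <= o < r.until_offset]


def is_contiguous(log_offsets, r):
    """True if the batch contains exactly the records the range implies."""
    return len(read_range(log_offsets, r)) == r.expected_count()
```

With a regular log such as offsets 0 through 9, the range 2..6 yields the four records it implies. With a compacted log where only offsets 0, 3, 4, 7 and 9 remain, the same range yields two records, so anything that counts records or validates a batch as `until_offset - from_offset` no longer holds.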
>>>>>>>
>>>>>>> This kind of thing is only going to happen smoothly if we have a
>>>>>>> lightweight user-visible process with clear outcomes.
>>>>>>>
>>>>>>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>>>>>> <[hidden email]> wrote:
>>>>>>>> I agree with most of what Cody said.
>>>>>>>>
>>>>>>>> Two things:
>>>>>>>>
>>>>>>>> First we can always have other people suggest SIPs but mark them as
>>>>>>>> “unreviewed” and have committers basically move them forward. The
>>>>>>>> problem is that writing a good document takes time. This way we can
>>>>>>>> leverage non-committers to do some of this work (it is just another
>>>>>>>> way to contribute).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> As for strategy, in many cases implementation strategy can affect
>>>>>>>> the goals. I will give a small example: In the current structured
>>>>>>>> streaming strategy, we group by the time to achieve a sliding
>>>>>>>> window. This is definitely an implementation decision and not a
>>>>>>>> goal. However, I can think of several aggregation functions which
>>>>>>>> have the time inside their calculation buffer. For example, let’s
>>>>>>>> say we want to return a set of all distinct values. One way to
>>>>>>>> implement this would be to make the set into a map and have the
>>>>>>>> value contain the last time seen. Multiplying it across the groupby
>>>>>>>> would cost a lot in performance. So adding such a strategy would
>>>>>>>> have a great effect on the type of aggregations and their
>>>>>>>> performance which does affect the goal. Without adding the
>>>>>>>> strategy, it is easy for whoever goes to the design document to not
>>>>>>>> think about these cases. Furthermore, it might be decided that
>>>>>>>> these cases are rare enough so that the strategy is still good
>>>>>>>> enough but how would we know it without user feedback?
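
The "set turned into a map with the last time seen" idea above can be sketched as below. This is a plain-Python illustration of the aggregation being described, not Spark's structured streaming API; the class name, and the choice of numeric seconds for event times, are assumptions made for the example.

```python
class DistinctWithLastSeen:
    """Distinct values over a sliding window, keeping the time in the buffer."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.last_seen = {}  # value -> most recent event time (seconds)

    def update(self, value, event_time):
        # Keep only the latest time each value was observed.
        prev = self.last_seen.get(value)
        if prev is None or event_time > prev:
            self.last_seen[value] = event_time

    def result(self, now):
        # A value is in the window if it was seen after (now - window).
        cutoff = now - self.window_seconds
        return {v for v, t in self.last_seen.items() if t > cutoff}
```

A single buffer like this answers the query for any window position, whereas grouping by the window first would keep one copy of the buffer per overlapping window, which is the cost the paragraph above is pointing at.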
>>>>>>>>
>>>>>>>> I believe this example is exactly what Cody was talking about. Since
>>>>>>>> many times implementation strategies have a large effect on the
>>>>>>>> goal, we should have it discussed when discussing the goals. In
>>>>>>>> addition, while it is often easy to throw out completely infeasible
>>>>>>>> goals, it is often much harder to figure out that the goals are
>>>>>>>> unfeasible without fine tuning.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Assaf.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> From: Cody Koeninger-2 [via Apache Spark Developers List]
>>>>>>>> [mailto:ml-node+[hidden email]]
>>>>>>>> Sent: Monday, October 10, 2016 2:25 AM
>>>>>>>> To: Mendelson, Assaf
>>>>>>>> Subject: Re: Spark Improvement Proposals
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Only committers should formally submit SIPs because in an Apache
>>>>>>>> project only committers have explicit political power.  If a user
>>>>>>>> can't find a committer willing to sponsor an SIP idea, they have no
>>>>>>>> way to get the idea passed in any case.  If I can't find a committer
>>>>>>>> to sponsor this meta-SIP idea, I'm out of luck.
>>>>>>>>
>>>>>>>> I do not believe unrealistic goals can be found solely by
>>>>>>>> inspection. We've managed to ignore unrealistic goals even after
>>>>>>>> implementation! Focusing on APIs can allow people to think they've
>>>>>>>> solved something, when there's really no way of implementing that
>>>>>>>> API while meeting the goals.  Rapid iteration is clearly the best
>>>>>>>> way to address this, but we've already talked about why that hasn't
>>>>>>>> really worked.  If adding a non-binding API section to the template
>>>>>>>> is important to you, I'm not against it, but I don't think it's
>>>>>>>> sufficient.
>>>>>>>>
>>>>>>>> On your PRD vs design doc spectrum, I'm saying this is closer to a
>>>>>>>> PRD.  Clear agreement on goals is the most important thing and
>>>>>>>> that's why it's the thing I want binding agreement on.  But I cannot
>>>>>>>> agree to goals unless I have enough minimal technical info to judge
>>>>>>>> whether the goals are likely to actually be accomplished.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Well, I think there are a few things here that don't make sense.
>>>>>>>>> First, why should only committers submit SIPs? Development in the
>>>>>>>>> project should be open to all contributors, whether they're
>>>>>>>>> committers or not. Second, I think unrealistic goals can be found
>>>>>>>>> just by inspecting the goals, and I'm not super worried that we'll
>>>>>>>>> accept a lot of SIPs that are then infeasible -- we can then submit
>>>>>>>>> new ones. But this depends on whether you want this process to be a
>>>>>>>>> "design doc lite", where people also agree on implementation
>>>>>>>>> strategy, or just a way to agree on goals. This is what I asked
>>>>>>>>> earlier about PRDs vs design docs (and I'm open to either one but
>>>>>>>>> I'd just like clarity). Finally, both as a user and designer of
>>>>>>>>> software, I always want to give feedback on APIs, so I'd really
>>>>>>>>> like a culture of having those early. People don't argue about
>>>>>>>>> prettiness when they discuss APIs, they argue about the core
>>>>>>>>> concepts to expose in order to meet various goals, and then they're
>>>>>>>>> stuck maintaining those for a long time.
>>>>>>>>>
>>>>>>>>> Matei
>>>>>>>>>
>>>>>>>>> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <[hidden email]> wrote:
>>>>>>>>>
>>>>>>>>> Users instead of people, sure.  Committers and contributors are (or
>>>>>>>>> at
>>>>>>>>> least
>>>>>>>>> should be) a subset of users.
>>>>>>>>>
>>>>>>>>> Non goals, sure. I don't care what the name is, but we need to
>>>>>>>>> clearly
>>>>>>>>> say
>>>>>>>>> e.g. 'no we are not maintaining compatibility with XYZ right now'.
>>>>>>>>>
>>>>>>>>> API, what I care most about is whether it allows me to accomplish
>>>>>>>>> the
>>>>>>>>> goals.
>>>>>>>>> Arguing about how ugly or pretty it is can be saved for design/
>>>>>>>>> implementation imho.
>>>>>>>>>
>>>>>>>>> Strategy, this is necessary because otherwise goals can be out of
>>>>>>>>> line
>>>>>>>>> with
>>>>>>>>> reality.  Don't propose goals you don't have at least some idea of
>>>>>>>>> how
>>>>>>>>> to
>>>>>>>>> implement.
>>>>>>>>>
>>>>>>>>> Rejected strategies, given that committers are the only ones I'm
>>>>>>>>> saying
>>>>>>>>> should formally submit SPARKLIs or SIPs, if they put junk in a
>>>>>>>>> required
>>>>>>>>> section then slap them down for it and tell them to fix it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Oct 9, 2016 4:36 PM, "Matei Zaharia" <[hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>> Yup, this is the stuff that I found unclear. Thanks for clarifying
>>>>>>>>>> here,
>>>>>>>>>> but we should also clarify it in the writeup. In particular:
>>>>>>>>>>
>>>>>>>>>> - Goals needs to be about user-facing behavior ("people" is broad)
>>>>>>>>>>
>>>>>>>>>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will
>>>>>>>>>> dig
>>>>>>>>>> up
>>>>>>>>>> one of these and say "Spark's developers have officially rejected
>>>>>>>>>> X,
>>>>>>>>>> which
>>>>>>>>>> our awesome system has".
>>>>>>>>>>
>>>>>>>>>> - For user-facing stuff, I think you need a section on API.
>>>>>>>>>> Virtually
>>>>>>>>>> all
>>>>>>>>>> other *IPs I've seen have that.
>>>>>>>>>>
>>>>>>>>>> - I'm still not sure why the strategy section is needed if the
>>>>>>>>>> purpose is
>>>>>>>>>> to define user-facing behavior -- unless this is the strategy for
>>>>>>>>>> setting
>>>>>>>>>> the goals or for defining the API. That sounds squarely like a
>>>>>>>>>> design
>>>>>>>>>> doc
>>>>>>>>>> issue. In some sense, who cares whether the proposal is
>>>>>>>>>> technically
>>>>>>>>>> feasible
>>>>>>>>>> right now? If it's infeasible, that will be discovered later
>>>>>>>>>> during
>>>>>>>>>> design
>>>>>>>>>> and implementation. Same thing with rejected strategies -- listing
>>>>>>>>>> some
>>>>>>>>>> of
>>>>>>>>>> those is definitely useful sometimes, but if you make this a
>>>>>>>>>> *required*
>>>>>>>>>> section, people are just going to fill it in with bogus stuff
>>>>>>>>>> (I've
>>>>>>>>>> seen
>>>>>>>>>> this happen before).
>>>>>>>>>>
>>>>>>>>>> Matei
>>>>>>>>>>
>>>>>>>>
>>>>>>>>>>> On Oct 9, 2016, at 2:14 PM, Cody Koeninger <[hidden email]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> So to focus the discussion on the specific strategy I'm
>>>>>>>>>>> suggesting,
>>>>>>>>>>> documented at
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>>>>>>
>>>>>>>>>>> "Goals: What must this allow people to do, that they can't
>>>>>>>>>>> currently?"
>>>>>>>>>>>
>>>>>>>>>>> Is it unclear that this is focusing specifically on
>>>>>>>>>>> people-visible
>>>>>>>>>>> behavior?
>>>>>>>>>>>
>>>>>>>>>>> Rejected goals -  are important because otherwise people keep
>>>>>>>>>>> trying
>>>>>>>>>>> to argue about scope.  Of course you can change things later
>>>>>>>>>>> with a
>>>>>>>>>>> different SIP and different vote, the point is to focus.
>>>>>>>>>>>
>>>>>>>>>>> Use cases - are something that people are going to bring up in
>>>>>>>>>>> discussion.  If they aren't clearly documented as a goal ("This
>>>>>>>>>>> must
>>>>>>>>>>> allow me to connect using SSL"), they should be added.
>>>>>>>>>>>
>>>>>>>>>>> Internal architecture - if the people who need specific behavior
>>>>>>>>>>> are
>>>>>>>>>>> implementers of other parts of the system, that's fine.
>>>>>>>>>>>
>>>>>>>>>>> Rejected strategies - If you have none of these, you have no
>>>>>>>>>>> evidence
>>>>>>>>>>> that the proponent didn't just go with the first thing they had
>>>>>>>>>>> in
>>>>>>>>>>> mind (or have already implemented), which is a big problem
>>>>>>>>>>> currently.
>>>>>>>>>>> Approval isn't binding as to specifics of implementation, so
>>>>>>>>>>> these
>>>>>>>>>>> aren't handcuffs.  The goals are the contract, the strategy is
>>>>>>>>>>> evidence that contract can actually be met.
>>>>>>>>>>>
>>>>>>>>>>> Design docs - I'm not touching design docs.  The markdown file I
>>>>>>>>>>> linked specifically says of the strategy section "This is not a
>>>>>>>>>>> full
>>>>>>>>>>> design document."  Is this unclear?  Design docs can be worked
>>>>>>>>>>> on
>>>>>>>>>>> obviously, but that's not what I'm concerned with here.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <[hidden email]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Hi Cody,
>>>>>>>>>>>>
>>>>>>>>>>>> I think this would be a lot more concrete if we had a more
>>>>>>>>>>>> detailed
>>>>>>>>>>>> template
>>>>>>>>>>>> for SIPs. Right now, it's not super clear what's in scope --
>>>>>>>>>>>> e.g.
>>>>>>>>>>>> are
>>>>>>>>>>>> they
>>>>>>>>>>>> a way to solicit feedback on the user-facing behavior or on the
>>>>>>>>>>>> internals?
>>>>>>>>>>>> "Goals" can cover both things. I've been thinking of SIPs more
>>>>>>>>>>>> as
>>>>>>>>>>>> Product
>>>>>>>>>>>> Requirements Docs (PRDs), which focus on *what* a code change
>>>>>>>>>>>> should
>>>>>>>>>>>> do
>>>>>>>>>>>> as
>>>>>>>>>>>> opposed to how.
>>>>>>>>>>>>
>>>>>>>>>>>> In particular, here are some things that you may or may not
>>>>>>>>>>>> consider
>>>>>>>>>>>> in
>>>>>>>>>>>> scope for SIPs:
>>>>>>>>>>>>
>>>>>>>>>>>> - Goals and non-goals: This is definitely in scope, and IMO
>>>>>>>>>>>> should
>>>>>>>>>>>> focus on
>>>>>>>>>>>> user-visible behavior (e.g. "system supports SQL window
>>>>>>>>>>>> functions"
>>>>>>>>>>>> or
>>>>>>>>>>>> "system continues working if one node fails"). BTW I wouldn't
>>>>>>>>>>>> say
>>>>>>>>>>>> "rejected
>>>>>>>>>>>> goals" because some of them might become goals later, so we're
>>>>>>>>>>>> not
>>>>>>>>>>>> definitively rejecting them.
>>>>>>>>>>>>
>>>>>>>>>>>> - Public API: Probably should be included in most SIPs unless
>>>>>>>>>>>> it's
>>>>>>>>>>>> too
>>>>>>>>>>>> large
>>>>>>>>>>>> to fully specify then (e.g. "let's add an ML library").
>>>>>>>>>>>>
>>>>>>>>>>>> - Use cases: I usually find this very useful in PRDs to better
>>>>>>>>>>>> communicate
>>>>>>>>>>>> the goals.
>>>>>>>>>>>>
>>>>>>>>>>>> - Internal architecture: This is usually *not* a thing users
>>>>>>>>>>>> can
>>>>>>>>>>>> easily
>>>>>>>>>>>> comment on and it sounds more like a design doc item. Of course
>>>>>>>>>>>> it's
>>>>>>>>>>>> important to show that the SIP is feasible to implement. One
>>>>>>>>>>>> exception,
>>>>>>>>>>>> however, is that I think we'll have some SIPs primarily on
>>>>>>>>>>>> internals
>>>>>>>>>>>> (e.g.
>>>>>>>>>>>> if somebody wants to refactor Spark's query optimizer or
>>>>>>>>>>>> something).
>>>>>>>>>>>>
>>>>>>>>>>>> - Rejected strategies: I personally wouldn't put this, because
>>>>>>>>>>>> what's
>>>>>>>>>>>> the
>>>>>>>>>>>> point of voting to reject a strategy before you've really begun
>>>>>>>>>>>> designing
>>>>>>>>>>>> and implementing something? What if you discover that the
>>>>>>>>>>>> strategy
>>>>>>>>>>>> is
>>>>>>>>>>>> actually better when you start doing stuff?
>>>>>>>>>>>>
>>>>>>>>>>>> At a super high level, it depends on whether you want the SIPs
>>>>>>>>>>>> to
>>>>>>>>>>>> be
>>>>>>>>>>>> PRDs
>>>>>>>>>>>> for getting some quick feedback on the goals of a feature
>>>>>>>>>>>> before
>>>>>>>>>>>> it is
>>>>>>>>>>>> designed, or something more like full-fledged design docs (just
>>>>>>>>>>>> a
>>>>>>>>>>>> more
>>>>>>>>>>>> visible design doc for bigger changes). I looked at Kafka's
>>>>>>>>>>>> KIPs,
>>>>>>>>>>>> and
>>>>>>>>>>>> they
>>>>>>>>>>>> actually seem to be more like design docs. This can work too
>>>>>>>>>>>> but
>>>>>>>>>>>> it
>>>>>>>>>>>> does
>>>>>>>>>>>> require more work from the proposer and it can lead to the same
>>>>>>>>>>>> problems you
>>>>>>>>>>>> mentioned with people already having a design and
>>>>>>>>>>>> implementation
>>>>>>>>>>>> in
>>>>>>>>>>>> mind.
>>>>>>>>>>>>
>>>>>>>>>>>> Basically, the question is, are you trying to iterate faster on
>>>>>>>>>>>> design
>>>>>>>>>>>> by
>>>>>>>>>>>> adding a step for user feedback earlier? Or are you just trying
>>>>>>>>>>>> to
>>>>>>>>>>>> make
>>>>>>>>>>>> design docs for key features more visible (and their approval
>>>>>>>>>>>> more
>>>>>>>>>>>> formal)?
>>>>>>>>>>>>
>>>>>>>>>>>> BTW note that in either case, I'd like to have a template for
>>>>>>>>>>>> design
>>>>>>>>>>>> docs
>>>>>>>>>>>> too, which should also include goals. I think that would've
>>>>>>>>>>>> avoided
>>>>>>>>>>>> some of
>>>>>>>>>>>> the issues you brought up.
>>>>>>>>>>>>
>>>>>>>>>>>> Matei
>>>>>>>>>>>>
>>>>>>>>>>>> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <[hidden email]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Here's my specific proposal (meta-proposal?)
>>>>>>>>>>>>
>>>>>>>>>>>> Spark Improvement Proposals (SIP)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Background:
>>>>>>>>>>>>
>>>>>>>>>>>> The current problem is that design and implementation of large
>>>>>>>>>>>> features
>>>>>>>>>>>> are
>>>>>>>>>>>> often done in private, before soliciting user feedback.
>>>>>>>>>>>>
>>>>>>>>>>>> When feedback is solicited, it is often as to detailed design
>>>>>>>>>>>> specifics, not
>>>>>>>>>>>> focused on goals.
>>>>>>>>>>>>
>>>>>>>>>>>> When implementation does take place after design, there is
>>>>>>>>>>>> often
>>>>>>>>>>>> disagreement as to what goals are or are not in scope.
>>>>>>>>>>>>
>>>>>>>>>>>> This results in commits that don't fully meet user needs.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Goals:
>>>>>>>>>>>>
>>>>>>>>>>>> - Ensure user, contributor, and committer goals are clearly
>>>>>>>>>>>> identified
>>>>>>>>>>>> and
>>>>>>>>>>>> agreed upon, before implementation takes place.
>>>>>>>>>>>>
>>>>>>>>>>>> - Ensure that a technically feasible strategy is chosen that is
>>>>>>>>>>>> likely
>>>>>>>>>>>> to
>>>>>>>>>>>> meet the goals.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Rejected Goals:
>>>>>>>>>>>>
>>>>>>>>>>>> - SIPs are not for detailed design.  Design by committee
>>>>>>>>>>>> doesn't
>>>>>>>>>>>> work.
>>>>>>>>>>>>
>>>>>>>>>>>> - SIPs are not for every change.  We don't need that much
>>>>>>>>>>>> process.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Strategy:
>>>>>>>>>>>>
>>>>>>>>>>>> My suggestion is outlined as a Spark Improvement Proposal
>>>>>>>>>>>> process
>>>>>>>>>>>> documented
>>>>>>>>>>>> at
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>>>>>>>
>>>>>>>>>>>> Specifics of Jira manipulation are an implementation detail we
>>>>>>>>>>>> can
>>>>>>>>>>>> figure
>>>>>>>>>>>> out.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm suggesting voting; the need here is for a _clear_ outcome.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Rejected Strategies:
>>>>>>>>>>>>
>>>>>>>>>>>> Having someone who understands the problem implement it first
>>>>>>>>>>>> works,
>>>>>>>>>>>> but
>>>>>>>>>>>> only if significant iteration after user feedback is allowed.
>>>>>>>>>>>>
>>>>>>>>>>>> Historically this has been problematic due to pressure to limit
>>>>>>>>>>>> public
>>>>>>>>>>>> api
>>>>>>>>>>>> changes.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <[hidden email]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alright, looks like there is quite a bit of support. We should
>>>>>>>>>>>>> wait
>>>>>>>>>>>>> to
>>>>>>>>>>>>> hear from more people too.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To push this forward, Cody and I will be working together in
>>>>>>>>>>>>> the
>>>>>>>>>>>>> next
>>>>>>>>>>>>> couple of weeks to come up with a concrete, detailed proposal
>>>>>>>>>>>>> on
>>>>>>>>>>>>> what
>>>>>>>>>>>>> this
>>>>>>>>>>>>> entails, and then we can discuss the specific proposal as
>>>>>>>>>>>>> well.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <[hidden
>>>>>>>>>>>>> email]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yeah, in case it wasn't clear, I was talking about SIPs for
>>>>>>>>>>>>>> major
>>>>>>>>>>>>>> user-facing or cross-cutting changes, not minor feature adds.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos
>>>>>>>>>>>>>> <[hidden email]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1 to the SIP label as long as it does not slow down things
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>> targets optimizing efforts, coordination etc. For example, really
>>>>>>>>>>>>>>> small features or refactorings should not need to go through this
>>>>>>>>>>>>>>> process (assuming they don't touch public interfaces), and I hope
>>>>>>>>>>>>>>> it will be kept this way. So a guideline doc should be provided,
>>>>>>>>>>>>>>> like in the KIP case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> IMHO so far aside from tagging things and linking them
>>>>>>>>>>>>>>> elsewhere
>>>>>>>>>>>>>>> simply
>>>>>>>>>>>>>>> having design docs and prototype implementations in PRs is
>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>> that has not worked so far. What is really a pain in many
>>>>>>>>>>>>>>> projects
>>>>>>>>>>>>>>> out there
>>>>>>>>>>>>>>> is discontinuity in progress of PRs, missing features, slow
>>>>>>>>>>>>>>> reviews
>>>>>>>>>>>>>>> which is
>>>>>>>>>>>>>>> understandable to some extent... it is not only about Spark
>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>> things can
>>>>>>>>>>>>>>> be improved for sure for this project in particular as
>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>> stated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <[hidden
>>>>>>>>>>>>>>> email]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +1 to adding an SIP label and linking it from the website.
>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - template that focuses it towards soliciting user goals /
>>>>>>>>>>>>>>>> non
>>>>>>>>>>>>>>>> goals
>>>>>>>>>>>>>>>> - clear resolution as to which strategy was chosen to
>>>>>>>>>>>>>>>> pursue.
>>>>>>>>>>>>>>>> I'd
>>>>>>>>>>>>>>>> recommend a vote.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Matei asked me to clarify what I meant by changing
>>>>>>>>>>>>>>>> interfaces,
>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>> it's directly relevant to the SIP idea so I'll clarify
>>>>>>>>>>>>>>>> here,
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> split
>>>>>>>>>>>>>>>> a thread for the other discussion per Nicholas' request.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I meant changing public user interfaces.  I think the first
>>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> unlikely to be right, because it's done at a time when you
>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> least information.  As a user, I find it considerably more
>>>>>>>>>>>>>>>> frustrating
>>>>>>>>>>>>>>>> to be unable to use a tool to get my job done, than I do
>>>>>>>>>>>>>>>> having to
>>>>>>>>>>>>>>>> make minor changes to my code in order to take advantage of
>>>>>>>>>>>>>>>> features.
>>>>>>>>>>>>>>>> I've seen committers be seriously reluctant to allow
>>>>>>>>>>>>>>>> changes
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> @experimental code that are needed in order for it to
>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>> right.  You need to be able to iterate, and if people on
>>>>>>>>>>>>>>>> both
>>>>>>>>>>>>>>>> sides
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>> the fence aren't going to respect that some newer apis are
>>>>>>>>>>>>>>>> subject
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> change, then why even mark them as such?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ideally a finished SIP should give me a checklist of things
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>> implementation must do, and things that it doesn't need to
>>>>>>>>>>>>>>>> do.
>>>>>>>>>>>>>>>> Contributors/committers should be seriously discouraged
>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>> putting
>>>>>>>>>>>>>>>> out a version 0.1 that doesn't have at least a prototype
>>>>>>>>>>>>>>>> implementation of all those things, especially if they're
>>>>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>> to argue against interface changes necessary to get the
>>>>>>>>>>>>>>>> rest
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>> the things done in the 0.2 version.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden
>>>>>>>>>>>>>>>> email]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> I like the lightweight proposal to add a SIP label.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> During Spark 2.0 development, Tom (Graves) and I suggested
>>>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>> wiki
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> track the list of major changes, but that never really
>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>> due to
>>>>>>>>>>>>>>>>> the overhead. Adding a SIP label on major JIRAs and then
>>>>>>>>>>>>>>>>> linking
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>> prominently on the Spark website makes a lot of sense.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia
>>>>>>>>>>>>>>>>> <[hidden email]>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For the improvement proposals, I think one major point
>>>>>>>>>>>>>>>>>> was
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>>> really visible to users who are not contributors, so we
>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>> more than
>>>>>>>>>>>>>>>>>> sending stuff to dev@. One very lightweight idea is to
>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>>>> type of
>>>>>>>>>>>>>>>>>> JIRA called a SIP and have a link to a filter that shows
>>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>>>>> JIRAs from
>>>>>>>>>>>>>>>>>> http://spark.apache.org. I also like the idea of SIP and
>>>>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>>> doc
>>>>>>>>>>>>>>>>>> templates (in fact many projects have them).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I called Cody last night and talked about some of the
>>>>>>>>>>>>>>>>>> topics
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> his
>>>>>>>>>>>>>>>>>> email.
>>>>>>>>>>>>>>>>>> It became clear to me Cody genuinely cares about the
>>>>>>>>>>>>>>>>>> project.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Some of the frustrations come from the success of the
>>>>>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>>>> itself
>>>>>>>>>>>>>>>>>> becoming very "hot", and it is difficult to get clarity
>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>> people
>>>>>>>>>>>>>>>>>> who
>>>>>>>>>>>>>>>>>> don't dedicate all their time to Spark. In fact, it is in
>>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>> ways
>>>>>>>>>>>>>>>>>> similar
>>>>>>>>>>>>>>>>>> to scaling an engineering team in a successful startup:
>>>>>>>>>>>>>>>>>> old
>>>>>>>>>>>>>>>>>> processes that
>>>>>>>>>>>>>>>>>> worked well might not work so well when it gets to a
>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>> size,
>>>>>>>>>>>>>>>>>> cultures
>>>>>>>>>>>>>>>>>> can get diluted, building culture vs building process,
>>>>>>>>>>>>>>>>>> etc.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I also really like to have a more visible process for
>>>>>>>>>>>>>>>>>> larger
>>>>>>>>>>>>>>>>>> changes,
>>>>>>>>>>>>>>>>>> especially major user facing API changes. Historically we
>>>>>>>>>>>>>>>>>> upload
>>>>>>>>>>>>>>>>>> design docs
>>>>>>>>>>>>>>>>>> for major changes, but it is not always consistent, and it is
>>>>>>>>>>>>>>>>>> difficult to ensure the quality of the docs, due to the
>>>>>>>>>>>>>>>>>> volunteering nature of the
>>>>>>>>>>>>>>>>>> organization.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Some of the more concrete ideas we discussed focus on
>>>>>>>>>>>>>>>>>> building a
>>>>>>>>>>>>>>>>>> culture
>>>>>>>>>>>>>>>>>> to improve clarity:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Process: Large changes should have design docs posted
>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>> JIRA.
>>>>>>>>>>>>>>>>>> One
>>>>>>>>>>>>>>>>>> thing
>>>>>>>>>>>>>>>>>> Cody and I didn't discuss but an idea that just came to
>>>>>>>>>>>>>>>>>> me
>>>>>>>>>>>>>>>>>> is we
>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>> create a design doc template for the project and ask
>>>>>>>>>>>>>>>>>> everybody
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> follow.
>>>>>>>>>>>>>>>>>> The design doc template should also explicitly list goals
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> non-goals, to
>>>>>>>>>>>>>>>>>> make design docs more consistent.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Process: Email dev@ to solicit feedback. We have done this with
>>>>>>>>>>>>>>>>>> some changes, but again very inconsistently. Just posting
>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>> JIRA
>>>>>>>>>>>>>>>>>> isn't
>>>>>>>>>>>>>>>>>> sufficient, because there are simply too many JIRAs and
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> signal
>>>>>>>>>>>>>>>>>> gets lost
>>>>>>>>>>>>>>>>>> in the noise. While this is generally impossible to
>>>>>>>>>>>>>>>>>> enforce
>>>>>>>>>>>>>>>>>> because
>>>>>>>>>>>>>>>>>> we can't
>>>>>>>>>>>>>>>>>> force all volunteers to conform to a process (or they
>>>>>>>>>>>>>>>>>> might
>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>> even
>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>> aware of this),  those who are more familiar with the
>>>>>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>> help by
>>>>>>>>>>>>>>>>>> emailing the dev@ when they see something that hasn't
>>>>>>>>>>>>>>>>>> been.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Culture: The design doc author(s) should be open to
>>>>>>>>>>>>>>>>>> feedback.
>>>>>>>>>>>>>>>>>> A
>>>>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>>> doc should serve as the base for discussion and is by no
>>>>>>>>>>>>>>>>>> means
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> final
>>>>>>>>>>>>>>>>>> design. Of course, this does not mean the author has to
>>>>>>>>>>>>>>>>>> accept
>>>>>>>>>>>>>>>>>> every
>>>>>>>>>>>>>>>>>> feedback. They should also be comfortable accepting /
>>>>>>>>>>>>>>>>>> rejecting
>>>>>>>>>>>>>>>>>> ideas on
>>>>>>>>>>>>>>>>>> technical grounds.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Process / Culture: For major ongoing projects, it can
>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> some monthly Google hangouts that are open to the world.
>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>> actually not
>>>>>>>>>>>>>>>>>> sure how well this will work, because of the volunteering
>>>>>>>>>>>>>>>>>> nature
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> we need
>>>>>>>>>>>>>>>>>> to adjust for timezones for people across the globe, but
>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>> seems
>>>>>>>>>>>>>>>>>> worth
>>>>>>>>>>>>>>>>>> trying.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Culture: Contributors (including committers) should be
>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>> direct
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> setting expectations, including whether they are working
>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>> issue, whether they will be working on a specific issue,
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> whether
>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>> issue or pr or jira should be rejected. Most people I
>>>>>>>>>>>>>>>>>> know
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>> community
>>>>>>>>>>>>>>>>>> are nice and don't enjoy telling other people no, but it
>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> often
>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>> annoying to a contributor to not know anything than
>>>>>>>>>>>>>>>>>> getting
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> no.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia
>>>>>>>>>>>>>>>>>> <[hidden email]>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Love the idea of a more visible "Spark Improvement
>>>>>>>>>>>>>>>>>>> Proposal"
>>>>>>>>>>>>>>>>>>> process that
>>>>>>>>>>>>>>>>>>> solicits user input on new APIs. For what it's worth, I
>>>>>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>> committers are trying to minimize their own work --
>>>>>>>>>>>>>>>>>>> every
>>>>>>>>>>>>>>>>>>> committer
>>>>>>>>>>>>>>>>>>> cares
>>>>>>>>>>>>>>>>>>> about making the software useful for users. However, it
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> always
>>>>>>>>>>>>>>>>>>> hard to
>>>>>>>>>>>>>>>>>>> get user input and so it helps to have this kind of
>>>>>>>>>>>>>>>>>>> process.
>>>>>>>>>>>>>>>>>>> I've
>>>>>>>>>>>>>>>>>>> certainly
>>>>>>>>>>>>>>>>>>> looked at the *IPs a lot in other software I use just to
>>>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> biggest
>>>>>>>>>>>>>>>>>>> things on the roadmap.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> When you're talking about "changing interfaces", are you
>>>>>>>>>>>>>>>>>>> talking
>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>> public or internal APIs? I do think many people hate
>>>>>>>>>>>>>>>>>>> changing
>>>>>>>>>>>>>>>>>>> public APIs
>>>>>>>>>>>>>>>>>>> and I actually think that's for the best of the project.
>>>>>>>>>>>>>>>>>>> That's
>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> technical
>>>>>>>>>>>>>>>>>>> debate, but basically, the worst thing when you're using
>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> piece
>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> software
>>>>>>>>>>>>>>>>>>> is that the developers constantly ask you to rewrite
>>>>>>>>>>>>>>>>>>> your
>>>>>>>>>>>>>>>>>>> app
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> update to a
>>>>>>>>>>>>>>>>>>> new version (and thus benefit from bug fixes, etc). Cue
>>>>>>>>>>>>>>>>>>> anyone
>>>>>>>>>>>>>>>>>>> who's used
>>>>>>>>>>>>>>>>>>> Protobuf, or Guava. The "let's get everyone to change
>>>>>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>> release" model works well within a single large company,
>>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>> doesn't work
>>>>>>>>>>>>>>>>>>> well for a community, which is why nearly all *very*
>>>>>>>>>>>>>>>>>>> widely
>>>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>>> programming
>>>>>>>>>>>>>>>>>>> interfaces (I'm talking things like Java standard
>>>>>>>>>>>>>>>>>>> library,
>>>>>>>>>>>>>>>>>>> Windows
>>>>>>>>>>>>>>>>>>> API, etc)
>>>>>>>>>>>>>>>>>>> almost *never* break backwards compatibility. All this
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> done
>>>>>>>>>>>>>>>>>>> within reason
>>>>>>>>>>>>>>>>>>> though, e.g. we do change things in major releases (2.x,
>>>>>>>>>>>>>>>>>>> 3.x,
>>>>>>>>>>>>>>>>>>> etc).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>> To unsubscribe e-mail: [hidden email]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Stavros Kontopoulos
>>>>>>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>>>>>> Lightbend, Inc.
>>>>>>>>>>>>>>> p:  +30 6977967274
>>>>>>>>>>>>>>> e: [hidden email]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
>>


Re: Spark Improvement Proposals

Cody Koeninger-2
Updated on GitHub:
https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

I believe I've touched on all feedback with the exception of naming,
and API vs Strategy.

Do we want a straw poll on naming?

Matei, are your concerns about api vs strategy addressed if we add an
API bullet point to the template?

On Mon, Oct 10, 2016 at 2:38 PM, Steve Loughran <[hidden email]> wrote:

> This is an interesting process proposal; I think it could work well.
>
> -It's got the flavour of the ASF incubator; maybe some of the processes there (a mentor, regular reporting in) could help, in particular to help stop the -1 at the end of the work.
> -It may also aid collaboration to have a medium-lived branch, enabling multiple people to submit PRs into the ASF codebase. This can reduce the cost of merging and enable Jenkins to keep on top of it. It also fits in well with the ASF "do in apache infra" community development process.
>
>
>> On 10 Oct 2016, at 20:26, Matei Zaharia <[hidden email]> wrote:
>>
>> Agreed with this. As I said before regarding who submits: it's not a normal ASF process to require contributions to only come from committers. Committers are of course the only people who can *commit* stuff. But the whole point of an open source project is that anyone can *contribute* -- indeed, that is how people become committers. For example, in every ASF project, anyone can open JIRAs, submit design docs, submit patches, review patches, and vote on releases. This particular process is very similar to posting a JIRA or a design doc.
>>
>> I also like consensus with a deadline (e.g. someone says "here is a new SEP, we want to accept it by date X so please comment before").
>>
>> In general, with this type of stuff, it's better to start with very lightweight processes and then expand them if needed. Adding lots of rules from the beginning makes it confusing and can reduce contributions. Although, as engineers, we believe that anything can be solved using mechanical rules, in practice software development is a social process that ultimately requires humans to tackle things on a case-by-case basis.
>>
>> Matei
>>
>>
>>> On Oct 10, 2016, at 12:19 PM, Cody Koeninger <[hidden email]> wrote:
>>>
>>> That seems reasonable to me.
>>>
>>> I do not want to see lazy consensus used on one of these proposals
>>> though, I want a clear outcome, i.e. call for a vote, wait at least 72
>>> hours, get three +1s and no vetoes.
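
The outcome rule described above (call for a vote, wait at least 72 hours, get three +1s and no vetoes) could be sketched roughly as follows. This is only an illustration: the function name `vote_passes` and the encoding of votes as the integers +1, 0 and -1 are assumptions made for the example, not part of any ASF tooling.

```python
from datetime import datetime, timedelta

MIN_VOTE_WINDOW = timedelta(hours=72)  # "wait at least 72 hours"
MIN_PLUS_ONES = 3                      # "get three +1s"


def vote_passes(opened_at, now, votes):
    """Return True only if the vote has a clear passing outcome.

    votes is an iterable of ints: +1 (approve), 0 (abstain), -1 (veto).
    This encoding is hypothetical and chosen for illustration only.
    """
    votes = list(votes)
    if now - opened_at < MIN_VOTE_WINDOW:
        return False  # the vote has not been open long enough
    if any(v == -1 for v in votes):
        return False  # a single veto blocks the proposal
    return sum(1 for v in votes if v == 1) >= MIN_PLUS_ONES
```

Unlike lazy consensus, silence never counts as approval here: with fewer than three explicit +1s the function returns False regardless of how long the vote has been open.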
>>>
>>>
>>>
>>> On Mon, Oct 10, 2016 at 2:15 PM, Ryan Blue <[hidden email]> wrote:
>>>> Proposal submission: I think we should keep this as open as possible. If
>>>> there is a problem with too many open proposals, then we should tackle that
>>>> as a fix rather than excluding participation. Perhaps it will end up that
>>>> way, but I think it's worth trying a more open model first.
>>>>
>>>> Majority vs consensus: My rationale is that I don't think we want to
>>>> consider a proposal approved if it had objections serious enough that
>>>> committers down-voted (or PMC depending on who gets a vote). If these
>>>> proposals are like PEPs, then they represent a significant amount of
>>>> community effort and I wouldn't want to move forward if up to half of the
>>>> community thinks it's an untenable idea.
>>>>
>>>> rb
>>>>
>>>> On Mon, Oct 10, 2016 at 12:07 PM, Cody Koeninger <[hidden email]> wrote:
>>>>>
>>>>> I think this is closer to a procedural issue than a code modification
>>>>> issue, hence why majority.  If everyone thinks consensus is better, I
>>>>> don't care.  Again, I don't feel strongly about the way we achieve
>>>>> clarity, just that we achieve clarity.
>>>>>
>>>>> On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue <[hidden email]> wrote:
>>>>>> Sorry, I missed that the proposal includes majority approval. Why
>>>>>> majority
>>>>>> instead of consensus? I think we want to build consensus around these
>>>>>> proposals and it makes sense to discuss until no one would veto.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue <[hidden email]> wrote:
>>>>>>>
>>>>>>> +1 to votes to approve proposals. I agree that proposals should have an
>>>>>>> official mechanism to be accepted, and a vote is an established means
>>>>>>> of
>>>>>>> doing that well. I like that it includes a period to review the
>>>>>>> proposal and
>>>>>>> I think proposals should have been discussed enough ahead of a vote to
>>>>>>> survive the possibility of a veto.
>>>>>>>
>>>>>>> I also like the names that are short and (mostly) unique, like SEP.
>>>>>>>
>>>>>>> Where I disagree is with the requirement that a committer must formally
>>>>>>> propose an enhancement. I don't see the value of restricting this: if
>>>>>>> someone has the will to write up a proposal then they should be
>>>>>>> encouraged
>>>>>>> to do so and start a discussion about it. Even if there is a political
>>>>>>> reality as Cody says, what is the value of codifying that in our
>>>>>>> process? I
>>>>>>> think restricting who can submit proposals would only undermine them by
>>>>>>> pushing contributors out. Maybe I'm missing something here?
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <[hidden email]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Yes, users suggesting SIPs is a good thing and is explicitly called
>>>>>>>> out in the linked document under the Who? section.  Formally proposing
>>>>>>>> them, not so much, because of the political realities.
>>>>>>>>
>>>>>>>> Yes, implementation strategy definitely affects goals.  There are all
>>>>>>>> kinds of examples of this, I'll pick one that's my fault so as to
>>>>>>>> avoid sounding like I'm blaming:
>>>>>>>>
>>>>>>>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>>>>>>>> upon by the community) goals was to make sure people could use the
>>>>>>>> DStream however they were already using Kafka at work.  The lack
>>>>>>>> of explicit agreement on that goal led to all kinds of fighting with
>>>>>>>> committers, that could have been avoided.  The lack of explicit
>>>>>>>> up-front strategy discussion led to the DStream not really working
>>>>>>>> with compacted topics.  I knew about compacted topics, but don't have
>>>>>>>> a use for them, so had a blind spot there.  If there was explicit
>>>>>>>> up-front discussion that my strategy was "assume that batches can be
>>>>>>>> defined on the driver solely by beginning and ending offsets", there's
>>>>>>>> a greater chance that a user would have seen that and said, "hey, what
>>>>>>>> about non-contiguous offsets in a compacted topic".
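
To make the blind spot above concrete: if a batch is defined solely by a beginning and an ending offset, it is tempting to assume that every offset in between exists. A minimal sketch in plain Python (no Kafka client involved; `OffsetRange`, `expected_count` and `records_in_range` are hypothetical names used only for illustration) shows how that assumption overcounts on a compacted topic, where some offsets have been removed:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    """A batch defined only by its offsets (hypothetical stand-in type)."""
    from_offset: int   # inclusive
    until_offset: int  # exclusive

    def expected_count(self):
        # Only correct if offsets are contiguous, i.e. no gaps in the log.
        return self.until_offset - self.from_offset


def records_in_range(log_offsets, rng):
    """Offsets actually present in the log that fall inside the range."""
    return [o for o in log_offsets
            if rng.from_offset <= o < rng.until_offset]


rng = OffsetRange(from_offset=0, until_offset=10)
contiguous_log = list(range(10))   # ordinary topic: offsets 0..9 all exist
compacted_log = [0, 3, 4, 7, 9]    # compacted topic: some offsets are gone

count_contiguous = len(records_in_range(contiguous_log, rng))
count_compacted = len(records_in_range(compacted_log, rng))
```

For the ordinary log the arithmetic count and the actual count agree; for the compacted log they do not, so any logic that relies on `until_offset - from_offset` as a record count needs to be stated as an explicit assumption of the strategy.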
>>>>>>>>
>>>>>>>> This kind of thing is only going to happen smoothly if we have a
>>>>>>>> lightweight user-visible process with clear outcomes.
>>>>>>>>
>>>>>>>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>>>>>>> <[hidden email]> wrote:
>>>>>>>>> I agree with most of what Cody said.
>>>>>>>>>
>>>>>>>>> Two things:
>>>>>>>>>
>>>>>>>>> First we can always have other people suggest SIPs but mark them as
>>>>>>>>> “unreviewed” and have committers basically move them forward. The
>>>>>>>>> problem is
>>>>>>>>> that writing a good document takes time. This way we can leverage
>>>>>>>>> non
>>>>>>>>> committers to do some of this work (it is just another way to
>>>>>>>>> contribute).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> As for strategy, in many cases implementation strategy can affect
>>>>>>>>> the
>>>>>>>>> goals.
>>>>>>>>> I will give a small example: In the current structured streaming
>>>>>>>>> strategy,
>>>>>>>>> we group by the time to achieve a sliding window. This is definitely
>>>>>>>>> an
>>>>>>>>> implementation decision and not a goal. However, I can think of
>>>>>>>>> several
>>>>>>>>> aggregation functions which have the time inside their calculation
>>>>>>>>> buffer.
>>>>>>>>> For example, let’s say we want to return a set of all distinct
>>>>>>>>> values.
>>>>>>>>> One
>>>>>>>>> way to implement this would be to make the set into a map and have
>>>>>>>>> the
>>>>>>>>> value
>>>>>>>>> contain the last time seen. Multiplying it across the groupby would
>>>>>>>>> cost a
>>>>>>>>> lot in performance. So adding such a strategy would have a great
>>>>>>>>> effect
>>>>>>>>> on
>>>>>>>>> the type of aggregations and their performance which does affect the
>>>>>>>>> goal.
>>>>>>>>> Without adding the strategy, it is easy for whoever goes to the
>>>>>>>>> design
>>>>>>>>> document to not think about these cases. Furthermore, it might be
>>>>>>>>> decided
>>>>>>>>> that these cases are rare enough so that the strategy is still good
>>>>>>>>> enough
>>>>>>>>> but how would we know it without user feedback?
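
One possible reading of the example above, sketched in plain Python rather than Spark code (the function names and the integer timestamps are assumptions made for illustration): the first function keeps a single map from value to last time seen, while the second shows how many sliding windows a single event falls into when windows are produced by grouping on time, which is the factor by which per-group state would be multiplied.

```python
def distinct_in_window(events, now, window):
    """Distinct values seen within the last `window` time units.

    events is an iterable of (timestamp, value) pairs. A single map from
    value to last-seen timestamp is kept, as in the example above.
    """
    last_seen = {}
    for ts, value in events:
        if value not in last_seen or ts > last_seen[value]:
            last_seen[value] = ts
    return {v for v, ts in last_seen.items() if now - ts < window}


def window_starts(ts, window, slide):
    """Start times of every sliding window [start, start + window)
    that contains ts, with windows starting at multiples of `slide`."""
    first = ((ts - window) // slide + 1) * slide
    last = (ts // slide) * slide
    return list(range(first, last + 1, slide))
```

With a window of 30 and a slide of 10, each event lands in three windows, so a group-by-window strategy keeps three copies of whatever state the aggregation needs for that event, whereas the map-based version keeps one entry per distinct value.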
>>>>>>>>>
>>>>>>>>> I believe this example is exactly what Cody was talking about. Since
>>>>>>>>> many
>>>>>>>>> times implementation strategies have a large effect on the goal, we
>>>>>>>>> should
>>>>>>>>> have it discussed when discussing the goals. In addition, while it
>>>>>>>>> is
>>>>>>>>> often
>>>>>>>>> easy to throw out completely infeasible goals, it is often much
>>>>>>>>> harder
>>>>>>>>> to
>>>>>>>>> figure out that the goals are unfeasible without fine tuning.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Assaf.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From: Cody Koeninger-2 [via Apache Spark Developers List]
>>>>>>>>> [mailto:ml-node+[hidden email]]
>>>>>>>>> Sent: Monday, October 10, 2016 2:25 AM
>>>>>>>>> To: Mendelson, Assaf
>>>>>>>>> Subject: Re: Spark Improvement Proposals
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Only committers should formally submit SIPs because in an apache
>>>>>>>>> project only committers have explicit political power.  If a user
>>>>>>>>> can't
>>>>>>>>> find a committer willing to sponsor an SIP idea, they have no way to
>>>>>>>>> get the idea passed in any case.  If I can't find a committer to
>>>>>>>>> sponsor this meta-SIP idea, I'm out of luck.
>>>>>>>>>
>>>>>>>>> I do not believe unrealistic goals can be found solely by
>>>>>>>>> inspection.
>>>>>>>>> We've managed to ignore unrealistic goals even after implementation!
>>>>>>>>> Focusing on APIs can allow people to think they've solved something,
>>>>>>>>> when there's really no way of implementing that API while meeting
>>>>>>>>> the
>>>>>>>>> goals.  Rapid iteration is clearly the best way to address this, but
>>>>>>>>> we've already talked about why that hasn't really worked.  If adding
>>>>>>>>> a
>>>>>>>>> non-binding API section to the template is important to you, I'm not
>>>>>>>>> against it, but I don't think it's sufficient.
>>>>>>>>>
>>>>>>>>> On your PRD vs design doc spectrum, I'm saying this is closer to a
>>>>>>>>> PRD.  Clear agreement on goals is the most important thing and
>>>>>>>>> that's
>>>>>>>>> why it's the thing I want binding agreement on.  But I cannot agree
>>>>>>>>> to
>>>>>>>>> goals unless I have enough minimal technical info to judge whether
>>>>>>>>> the
>>>>>>>>> goals are likely to actually be accomplished.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Well, I think there are a few things here that don't make sense.
>>>>>>>>>> First,
>>>>>>>>>> why
>>>>>>>>>> should only committers submit SIPs? Development in the project
>>>>>>>>>> should
>>>>>>>>>> be
>>>>>>>>>> open to all contributors, whether they're committers or not.
>>>>>>>>>> Second, I
>>>>>>>>>> think
>>>>>>>>>> unrealistic goals can be found just by inspecting the goals, and
>>>>>>>>>> I'm
>>>>>>>>>> not
>>>>>>>>>> super worried that we'll accept a lot of SIPs that are then
>>>>>>>>>> infeasible
>>>>>>>>>> --
>>>>>>>>>> we
>>>>>>>>>> can then submit new ones. But this depends on whether you want this
>>>>>>>>>> process
>>>>>>>>>> to be a "design doc lite", where people also agree on
>>>>>>>>>> implementation
>>>>>>>>>> strategy, or just a way to agree on goals. This is what I asked
>>>>>>>>>> earlier
>>>>>>>>>> about PRDs vs design docs (and I'm open to either one but I'd just
>>>>>>>>>> like
>>>>>>>>>> clarity). Finally, both as a user and designer of software, I
>>>>>>>>>> always
>>>>>>>>>> want
>>>>>>>>>> to
>>>>>>>>>> give feedback on APIs, so I'd really like a culture of having those
>>>>>>>>>> early.
>>>>>>>>>> People don't argue about prettiness when they discuss APIs, they
>>>>>>>>>> argue
>>>>>>>>>> about
>>>>>>>>>> the core concepts to expose in order to meet various goals, and
>>>>>>>>>> then
>>>>>>>>>> they're
>>>>>>>>>> stuck maintaining those for a long time.
>>>>>>>>>>
>>>>>>>>>> Matei
>>>>>>>>>>
>>>>>>>>>> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <[hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>> Users instead of people, sure.  Committers and contributors are (or
>>>>>>>>>> at
>>>>>>>>>> least
>>>>>>>>>> should be) a subset of users.
>>>>>>>>>>
>>>>>>>>>> Non goals, sure. I don't care what the name is, but we need to
>>>>>>>>>> clearly
>>>>>>>>>> say
>>>>>>>>>> e.g. 'no we are not maintaining compatibility with XYZ right now'.
>>>>>>>>>>
>>>>>>>>>> API, what I care most about is whether it allows me to accomplish
>>>>>>>>>> the
>>>>>>>>>> goals.
>>>>>>>>>> Arguing about how ugly or pretty it is can be saved for design/
>>>>>>>>>> implementation imho.
>>>>>>>>>>
>>>>>>>>>> Strategy, this is necessary because otherwise goals can be out of
>>>>>>>>>> line
>>>>>>>>>> with
>>>>>>>>>> reality.  Don't propose goals you don't have at least some idea of
>>>>>>>>>> how
>>>>>>>>>> to
>>>>>>>>>> implement.
>>>>>>>>>>
>>>>>>>>>> Rejected strategies, given that committers are the only ones I'm
>>>>>>>>>> saying
>>>>>>>>>> should formally submit SPARKLIs or SIPs, if they put junk in a
>>>>>>>>>> required
>>>>>>>>>> section then slap them down for it and tell them to fix it.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Oct 9, 2016 4:36 PM, "Matei Zaharia" <[hidden email]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Yup, this is the stuff that I found unclear. Thanks for clarifying
>>>>>>>>>>> here,
>>>>>>>>>>> but we should also clarify it in the writeup. In particular:
>>>>>>>>>>>
>>>>>>>>>>> - Goals needs to be about user-facing behavior ("people" is broad)
>>>>>>>>>>>
>>>>>>>>>>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will
>>>>>>>>>>> dig
>>>>>>>>>>> up
>>>>>>>>>>> one of these and say "Spark's developers have officially rejected
>>>>>>>>>>> X,
>>>>>>>>>>> which
>>>>>>>>>>> our awesome system has".
>>>>>>>>>>>
>>>>>>>>>>> - For user-facing stuff, I think you need a section on API.
>>>>>>>>>>> Virtually
>>>>>>>>>>> all
>>>>>>>>>>> other *IPs I've seen have that.
>>>>>>>>>>>
>>>>>>>>>>> - I'm still not sure why the strategy section is needed if the
>>>>>>>>>>> purpose is
>>>>>>>>>>> to define user-facing behavior -- unless this is the strategy for
>>>>>>>>>>> setting
>>>>>>>>>>> the goals or for defining the API. That sounds squarely like a
>>>>>>>>>>> design
>>>>>>>>>>> doc
>>>>>>>>>>> issue. In some sense, who cares whether the proposal is
>>>>>>>>>>> technically
>>>>>>>>>>> feasible
>>>>>>>>>>> right now? If it's infeasible, that will be discovered later
>>>>>>>>>>> during
>>>>>>>>>>> design
>>>>>>>>>>> and implementation. Same thing with rejected strategies -- listing
>>>>>>>>>>> some
>>>>>>>>>>> of
>>>>>>>>>>> those is definitely useful sometimes, but if you make this a
>>>>>>>>>>> *required*
>>>>>>>>>>> section, people are just going to fill it in with bogus stuff
>>>>>>>>>>> (I've
>>>>>>>>>>> seen
>>>>>>>>>>> this happen before).
>>>>>>>>>>>
>>>>>>>>>>> Matei
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> On Oct 9, 2016, at 2:14 PM, Cody Koeninger <[hidden email]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> So to focus the discussion on the specific strategy I'm
>>>>>>>>>>>> suggesting,
>>>>>>>>>>>> documented at
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>>>>>>>
>>>>>>>>>>>> "Goals: What must this allow people to do, that they can't
>>>>>>>>>>>> currently?"
>>>>>>>>>>>>
>>>>>>>>>>>> Is it unclear that this is focusing specifically on
>>>>>>>>>>>> people-visible
>>>>>>>>>>>> behavior?
>>>>>>>>>>>>
>>>>>>>>>>>> Rejected goals -  are important because otherwise people keep
>>>>>>>>>>>> trying
>>>>>>>>>>>> to argue about scope.  Of course you can change things later
>>>>>>>>>>>> with a
>>>>>>>>>>>> different SIP and different vote, the point is to focus.
>>>>>>>>>>>>
>>>>>>>>>>>> Use cases - are something that people are going to bring up in
>>>>>>>>>>>> discussion.  If they aren't clearly documented as a goal ("This
>>>>>>>>>>>> must
>>>>>>>>>>>> allow me to connect using SSL"), they should be added.
>>>>>>>>>>>>
>>>>>>>>>>>> Internal architecture - if the people who need specific behavior
>>>>>>>>>>>> are
>>>>>>>>>>>> implementers of other parts of the system, that's fine.
>>>>>>>>>>>>
>>>>>>>>>>>> Rejected strategies - If you have none of these, you have no
>>>>>>>>>>>> evidence
>>>>>>>>>>>> that the proponent didn't just go with the first thing they had
>>>>>>>>>>>> in
>>>>>>>>>>>> mind (or have already implemented), which is a big problem
>>>>>>>>>>>> currently.
>>>>>>>>>>>> Approval isn't binding as to specifics of implementation, so
>>>>>>>>>>>> these
>>>>>>>>>>>> aren't handcuffs.  The goals are the contract, the strategy is
>>>>>>>>>>>> evidence that contract can actually be met.
>>>>>>>>>>>>
>>>>>>>>>>>> Design docs - I'm not touching design docs.  The markdown file I
>>>>>>>>>>>> linked specifically says of the strategy section "This is not a
>>>>>>>>>>>> full
>>>>>>>>>>>> design document."  Is this unclear?  Design docs can be worked
>>>>>>>>>>>> on
>>>>>>>>>>>> obviously, but that's not what I'm concerned with here.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <[hidden email]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Hi Cody,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think this would be a lot more concrete if we had a more
>>>>>>>>>>>>> detailed
>>>>>>>>>>>>> template
>>>>>>>>>>>>> for SIPs. Right now, it's not super clear what's in scope --
>>>>>>>>>>>>> e.g.
>>>>>>>>>>>>> are
>>>>>>>>>>>>> they
>>>>>>>>>>>>> a way to solicit feedback on the user-facing behavior or on the
>>>>>>>>>>>>> internals?
>>>>>>>>>>>>> "Goals" can cover both things. I've been thinking of SIPs more
>>>>>>>>>>>>> as
>>>>>>>>>>>>> Product
>>>>>>>>>>>>> Requirements Docs (PRDs), which focus on *what* a code change
>>>>>>>>>>>>> should
>>>>>>>>>>>>> do
>>>>>>>>>>>>> as
>>>>>>>>>>>>> opposed to how.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In particular, here are some things that you may or may not
>>>>>>>>>>>>> consider
>>>>>>>>>>>>> in
>>>>>>>>>>>>> scope for SIPs:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Goals and non-goals: This is definitely in scope, and IMO
>>>>>>>>>>>>> should
>>>>>>>>>>>>> focus on
>>>>>>>>>>>>> user-visible behavior (e.g. "system supports SQL window
>>>>>>>>>>>>> functions"
>>>>>>>>>>>>> or
>>>>>>>>>>>>> "system continues working if one node fails"). BTW I wouldn't
>>>>>>>>>>>>> say
>>>>>>>>>>>>> "rejected
>>>>>>>>>>>>> goals" because some of them might become goals later, so we're
>>>>>>>>>>>>> not
>>>>>>>>>>>>> definitively rejecting them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Public API: Probably should be included in most SIPs unless
>>>>>>>>>>>>> it's
>>>>>>>>>>>>> too
>>>>>>>>>>>>> large
>>>>>>>>>>>>> to fully specify then (e.g. "let's add an ML library").
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Use cases: I usually find this very useful in PRDs to better
>>>>>>>>>>>>> communicate
>>>>>>>>>>>>> the goals.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Internal architecture: This is usually *not* a thing users
>>>>>>>>>>>>> can
>>>>>>>>>>>>> easily
>>>>>>>>>>>>> comment on and it sounds more like a design doc item. Of course
>>>>>>>>>>>>> it's
>>>>>>>>>>>>> important to show that the SIP is feasible to implement. One
>>>>>>>>>>>>> exception,
>>>>>>>>>>>>> however, is that I think we'll have some SIPs primarily on
>>>>>>>>>>>>> internals
>>>>>>>>>>>>> (e.g.
>>>>>>>>>>>>> if somebody wants to refactor Spark's query optimizer or
>>>>>>>>>>>>> something).
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Rejected strategies: I personally wouldn't put this, because
>>>>>>>>>>>>> what's
>>>>>>>>>>>>> the
>>>>>>>>>>>>> point of voting to reject a strategy before you've really begun
>>>>>>>>>>>>> designing
>>>>>>>>>>>>> and implementing something? What if you discover that the
>>>>>>>>>>>>> strategy
>>>>>>>>>>>>> is
>>>>>>>>>>>>> actually better when you start doing stuff?
>>>>>>>>>>>>>
>>>>>>>>>>>>> At a super high level, it depends on whether you want the SIPs
>>>>>>>>>>>>> to
>>>>>>>>>>>>> be
>>>>>>>>>>>>> PRDs
>>>>>>>>>>>>> for getting some quick feedback on the goals of a feature
>>>>>>>>>>>>> before
>>>>>>>>>>>>> it is
>>>>>>>>>>>>> designed, or something more like full-fledged design docs (just
>>>>>>>>>>>>> a
>>>>>>>>>>>>> more
>>>>>>>>>>>>> visible design doc for bigger changes). I looked at Kafka's
>>>>>>>>>>>>> KIPs,
>>>>>>>>>>>>> and
>>>>>>>>>>>>> they
>>>>>>>>>>>>> actually seem to be more like design docs. This can work too
>>>>>>>>>>>>> but
>>>>>>>>>>>>> it
>>>>>>>>>>>>> does
>>>>>>>>>>>>> require more work from the proposer and it can lead to the same
>>>>>>>>>>>>> problems you
>>>>>>>>>>>>> mentioned with people already having a design and
>>>>>>>>>>>>> implementation
>>>>>>>>>>>>> in
>>>>>>>>>>>>> mind.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Basically, the question is, are you trying to iterate faster on
>>>>>>>>>>>>> design
>>>>>>>>>>>>> by
>>>>>>>>>>>>> adding a step for user feedback earlier? Or are you just trying
>>>>>>>>>>>>> to
>>>>>>>>>>>>> make
>>>>>>>>>>>>> design docs for key features more visible (and their approval
>>>>>>>>>>>>> more
>>>>>>>>>>>>> formal)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> BTW note that in either case, I'd like to have a template for
>>>>>>>>>>>>> design
>>>>>>>>>>>>> docs
>>>>>>>>>>>>> too, which should also include goals. I think that would've
>>>>>>>>>>>>> avoided
>>>>>>>>>>>>> some of
>>>>>>>>>>>>> the issues you brought up.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <[hidden email]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's my specific proposal (meta-proposal?)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Spark Improvement Proposals (SIP)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Background:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The current problem is that design and implementation of large
>>>>>>>>>>>>> features
>>>>>>>>>>>>> are
>>>>>>>>>>>>> often done in private, before soliciting user feedback.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When feedback is solicited, it is often as to detailed design
>>>>>>>>>>>>> specifics, not
>>>>>>>>>>>>> focused on goals.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When implementation does take place after design, there is
>>>>>>>>>>>>> often
>>>>>>>>>>>>> disagreement as to what goals are or are not in scope.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This results in commits that don't fully meet user needs.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Goals:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Ensure user, contributor, and committer goals are clearly
>>>>>>>>>>>>> identified
>>>>>>>>>>>>> and
>>>>>>>>>>>>> agreed upon, before implementation takes place.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Ensure that a technically feasible strategy is chosen that is
>>>>>>>>>>>>> likely
>>>>>>>>>>>>> to
>>>>>>>>>>>>> meet the goals.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Rejected Goals:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - SIPs are not for detailed design.  Design by committee
>>>>>>>>>>>>> doesn't
>>>>>>>>>>>>> work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - SIPs are not for every change.  We don't need that much
>>>>>>>>>>>>> process.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Strategy:
>>>>>>>>>>>>>
>>>>>>>>>>>>> My suggestion is outlined as a Spark Improvement Proposal
>>>>>>>>>>>>> process
>>>>>>>>>>>>> documented
>>>>>>>>>>>>> at
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>>>>>>>>
>>>>>>>>>>>>> Specifics of Jira manipulation are an implementation detail we
>>>>>>>>>>>>> can
>>>>>>>>>>>>> figure
>>>>>>>>>>>>> out.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm suggesting voting; the need here is for a _clear_ outcome.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Rejected Strategies:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Having someone who understands the problem implement it first
>>>>>>>>>>>>> works,
>>>>>>>>>>>>> but
>>>>>>>>>>>>> only if significant iteration after user feedback is allowed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Historically this has been problematic due to pressure to limit
>>>>>>>>>>>>> public
>>>>>>>>>>>>> api
>>>>>>>>>>>>> changes.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <[hidden email]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Alright, looks like there is quite a bit of support. We should
>>>>>>>>>>>>>> wait
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> hear from more people too.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To push this forward, Cody and I will be working together in
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> next
>>>>>>>>>>>>>> couple of weeks to come up with a concrete, detailed proposal
>>>>>>>>>>>>>> on
>>>>>>>>>>>>>> what
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>> entails, and then we can discuss the specific proposal as
>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <[hidden
>>>>>>>>>>>>>> email]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yeah, in case it wasn't clear, I was talking about SIPs for
>>>>>>>>>>>>>>> major
>>>>>>>>>>>>>>> user-facing or cross-cutting changes, not minor feature adds.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos
>>>>>>>>>>>>>>> <[hidden email]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +1 to the SIP label as long as it does not slow down things
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>> targets optimizing efforts, coordination etc. For example
>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>> small
>>>>>>>>>>>>>>>> features should not need to go through this process
>>>>>>>>>>>>>>>> (assuming
>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>> touch public interfaces)  or re-factorings and hope it will
>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>> kept
>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>> way. So as a guideline doc should be provided, like in the
>>>>>>>>>>>>>>>> KIP
>>>>>>>>>>>>>>>> case.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> IMHO so far aside from tagging things and linking them
>>>>>>>>>>>>>>>> elsewhere
>>>>>>>>>>>>>>>> simply
>>>>>>>>>>>>>>>> having design docs and prototype implementations in PRs is
>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>> that has not worked so far. What is really a pain in many
>>>>>>>>>>>>>>>> projects
>>>>>>>>>>>>>>>> out there
>>>>>>>>>>>>>>>> is discontinuity in progress of PRs, missing features, slow
>>>>>>>>>>>>>>>> reviews
>>>>>>>>>>>>>>>> which is
>>>>>>>>>>>>>>>> understandable to some extent... it is not only about Spark
>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>> things can
>>>>>>>>>>>>>>>> be improved for sure for this project in particular as
>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>> stated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <[hidden
>>>>>>>>>>>>>>>> email]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +1 to adding an SIP label and linking it from the website.
>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - template that focuses it towards soliciting user goals /
>>>>>>>>>>>>>>>>> non
>>>>>>>>>>>>>>>>> goals
>>>>>>>>>>>>>>>>> - clear resolution as to which strategy was chosen to
>>>>>>>>>>>>>>>>> pursue.
>>>>>>>>>>>>>>>>> I'd
>>>>>>>>>>>>>>>>> recommend a vote.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Matei asked me to clarify what I meant by changing
>>>>>>>>>>>>>>>>> interfaces,
>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>> it's directly relevant to the SIP idea so I'll clarify
>>>>>>>>>>>>>>>>> here,
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> split
>>>>>>>>>>>>>>>>> a thread for the other discussion per Nicholas' request.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I meant changing public user interfaces.  I think the first
>>>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> unlikely to be right, because it's done at a time when you
>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> least information.  As a user, I find it considerably more
>>>>>>>>>>>>>>>>> frustrating
>>>>>>>>>>>>>>>>> to be unable to use a tool to get my job done, than I do
>>>>>>>>>>>>>>>>> having to
>>>>>>>>>>>>>>>>> make minor changes to my code in order to take advantage of
>>>>>>>>>>>>>>>>> features.
>>>>>>>>>>>>>>>>> I've seen committers be seriously reluctant to allow
>>>>>>>>>>>>>>>>> changes
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> @experimental code that are needed in order for it to
>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>> right.  You need to be able to iterate, and if people on
>>>>>>>>>>>>>>>>> both
>>>>>>>>>>>>>>>>> sides
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> the fence aren't going to respect that some newer apis are
>>>>>>>>>>>>>>>>> subject
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> change, then why even mark them as such?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ideally a finished SIP should give me a checklist of things
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>> implementation must do, and things that it doesn't need to
>>>>>>>>>>>>>>>>> do.
>>>>>>>>>>>>>>>>> Contributors/committers should be seriously discouraged
>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>> putting
>>>>>>>>>>>>>>>>> out a version 0.1 that doesn't have at least a prototype
>>>>>>>>>>>>>>>>> implementation of all those things, especially if they're
>>>>>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>>> to argue against interface changes necessary to get the
>>>>>>>>>>>>>>>>> rest
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> the things done in the 0.2 version.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden
>>>>>>>>>>>>>>>>> email]>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> I like the lightweight proposal to add a SIP label.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> During Spark 2.0 development, Tom (Graves) and I suggested
>>>>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>>> wiki
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> track the list of major changes, but that never really
>>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>> due to
>>>>>>>>>>>>>>>>>> the overhead. Adding a SIP label on major JIRAs and then
>>>>>>>>>>>>>>>>>> linking
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>>> prominently on the Spark website makes a lot of sense.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia
>>>>>>>>>>>>>>>>>> <[hidden email]>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For the improvement proposals, I think one major point
>>>>>>>>>>>>>>>>>>> was
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>>>> really visible to users who are not contributors, so we
>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>>> more than
>>>>>>>>>>>>>>>>>>> sending stuff to dev@. One very lightweight idea is to
>>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>>>>> type of
>>>>>>>>>>>>>>>>>>> JIRA called a SIP and have a link to a filter that shows
>>>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>>>>>> JIRAs from
>>>>>>>>>>>>>>>>>>> http://spark.apache.org. I also like the idea of SIP and
>>>>>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>>>> doc
>>>>>>>>>>>>>>>>>>> templates (in fact many projects have them).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I called Cody last night and talked about some of the
>>>>>>>>>>>>>>>>>>> topics
>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> his
>>>>>>>>>>>>>>>>>>> email.
>>>>>>>>>>>>>>>>>>> It became clear to me Cody genuinely cares about the
>>>>>>>>>>>>>>>>>>> project.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Some of the frustrations come from the success of the
>>>>>>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>>>>> itself
>>>>>>>>>>>>>>>>>>> becoming very "hot", and it is difficult to get clarity
>>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>> people
>>>>>>>>>>>>>>>>>>> who
>>>>>>>>>>>>>>>>>>> don't dedicate all their time to Spark. In fact, it is in
>>>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>> ways
>>>>>>>>>>>>>>>>>>> similar
>>>>>>>>>>>>>>>>>>> to scaling an engineering team in a successful startup:
>>>>>>>>>>>>>>>>>>> old
>>>>>>>>>>>>>>>>>>> processes that
>>>>>>>>>>>>>>>>>>> worked well might not work so well when it gets to a
>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>> size,
>>>>>>>>>>>>>>>>>>> cultures
>>>>>>>>>>>>>>>>>>> can get diluted, building culture vs building process,
>>>>>>>>>>>>>>>>>>> etc.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'd also really like to have a more visible process for
>>>>>>>>>>>>>>>>>>> larger
>>>>>>>>>>>>>>>>>>> changes,
>>>>>>>>>>>>>>>>>>> especially major user facing API changes. Historically we
>>>>>>>>>>>>>>>>>>> upload
>>>>>>>>>>>>>>>>>>> design docs
>>>>>>>>>>>>>>>>>>> for major changes, but it is not always consistent and
>>>>>>>>>>>>>>>>>>> difficult
>>>>>>>>>>>>>>>>>>> to ensure the
>>>>>>>>>>>>>>>>>>> quality
>>>>>>>>>>>>>>>>>>> of the docs, due to the volunteering nature of the
>>>>>>>>>>>>>>>>>>> organization.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Some of the more concrete ideas we discussed focus on
>>>>>>>>>>>>>>>>>>> building a
>>>>>>>>>>>>>>>>>>> culture
>>>>>>>>>>>>>>>>>>> to improve clarity:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Process: Large changes should have design docs posted on JIRA. One
>>>>>>>>>>>>>>>>>>> thing Cody and I didn't discuss, but an idea that just came to me, is
>>>>>>>>>>>>>>>>>>> that we should create a design doc template for the project and ask
>>>>>>>>>>>>>>>>>>> everybody to follow it. The design doc template should also explicitly
>>>>>>>>>>>>>>>>>>> list goals and non-goals, to make design docs more consistent.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Process: Email dev@ to solicit feedback. We have done this with some
>>>>>>>>>>>>>>>>>>> changes, but again very inconsistently. Just posting something on JIRA
>>>>>>>>>>>>>>>>>>> isn't sufficient, because there are simply too many JIRAs and the signal
>>>>>>>>>>>>>>>>>>> gets lost in the noise. While this is generally impossible to enforce
>>>>>>>>>>>>>>>>>>> because we can't force all volunteers to conform to a process (or they
>>>>>>>>>>>>>>>>>>> might not even be aware of this), those who are more familiar with the
>>>>>>>>>>>>>>>>>>> project can help by emailing dev@ when they see something that hasn't
>>>>>>>>>>>>>>>>>>> been sent there.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Culture: The design doc author(s) should be open to feedback. A design
>>>>>>>>>>>>>>>>>>> doc should serve as the base for discussion and is by no means the final
>>>>>>>>>>>>>>>>>>> design. Of course, this does not mean the author has to accept every
>>>>>>>>>>>>>>>>>>> piece of feedback. They should also be comfortable accepting / rejecting
>>>>>>>>>>>>>>>>>>> ideas on technical grounds.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Process / Culture: For major ongoing projects, it can be useful to
>>>>>>>>>>>>>>>>>>> have some monthly Google hangouts that are open to the world. I am
>>>>>>>>>>>>>>>>>>> actually not sure how well this will work, because of the volunteer
>>>>>>>>>>>>>>>>>>> nature of the project and the need to adjust for timezones for people
>>>>>>>>>>>>>>>>>>> across the globe, but it seems worth trying.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Culture: Contributors (including committers) should be more direct in
>>>>>>>>>>>>>>>>>>> setting expectations, including whether they are working on a specific
>>>>>>>>>>>>>>>>>>> issue, whether they will be working on a specific issue, and whether an
>>>>>>>>>>>>>>>>>>> issue, PR, or JIRA should be rejected. Most people I know in this
>>>>>>>>>>>>>>>>>>> community are nice and don't enjoy telling other people no, but it is
>>>>>>>>>>>>>>>>>>> often more annoying to a contributor to not know anything than to get a
>>>>>>>>>>>>>>>>>>> no.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia <[hidden email]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Love the idea of a more visible "Spark Improvement Proposal" process
>>>>>>>>>>>>>>>>>>>> that solicits user input on new APIs. For what it's worth, I don't
>>>>>>>>>>>>>>>>>>>> think committers are trying to minimize their own work -- every
>>>>>>>>>>>>>>>>>>>> committer cares about making the software useful for users. However,
>>>>>>>>>>>>>>>>>>>> it is always hard to get user input and so it helps to have this kind
>>>>>>>>>>>>>>>>>>>> of process. I've certainly looked at the *IPs a lot in other software
>>>>>>>>>>>>>>>>>>>> I use just to see the biggest things on the roadmap.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> When you're talking about "changing interfaces", are you talking about
>>>>>>>>>>>>>>>>>>>> public or internal APIs? I do think many people hate changing public
>>>>>>>>>>>>>>>>>>>> APIs and I actually think that's for the best of the project. That's a
>>>>>>>>>>>>>>>>>>>> technical debate, but basically, the worst thing when you're using a
>>>>>>>>>>>>>>>>>>>> piece of software is that the developers constantly ask you to rewrite
>>>>>>>>>>>>>>>>>>>> your app to update to a new version (and thus benefit from bug fixes,
>>>>>>>>>>>>>>>>>>>> etc). Cue anyone who's used Protobuf, or Guava. The "let's get everyone
>>>>>>>>>>>>>>>>>>>> to change their code this release" model works well within a single
>>>>>>>>>>>>>>>>>>>> large company, but doesn't work well for a community, which is why
>>>>>>>>>>>>>>>>>>>> nearly all *very* widely used programming interfaces (I'm talking
>>>>>>>>>>>>>>>>>>>> things like Java standard library, Windows API, etc) almost *never*
>>>>>>>>>>>>>>>>>>>> break backwards compatibility. All this is done within reason though,
>>>>>>>>>>>>>>>>>>>> e.g. we do change things in major releases (2.x, 3.x, etc).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>>> To unsubscribe e-mail: [hidden email]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Stavros Kontopoulos
>>>>>>>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>>>>>>> Lightbend, Inc.
>>>>>>>>>>>>>>>> p:  +30 6977967274
>>>>>>>>>>>>>>>> e: [hidden email]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>
>>>
>



Re: Spark Improvement Proposals

Mark Hamstra
In reply to this post by Cody Koeninger-2
If I'm correctly understanding the kind of voting that you are talking about, then to be accurate, it is only the PMC members that have a vote, not all committers: https://www.apache.org/foundation/how-it-works.html#pmc-members

On Mon, Oct 10, 2016 at 12:02 PM, Cody Koeninger <[hidden email]> wrote:
I think the main value is in being honest about what's going on.  No
one other than committers can cast a meaningful vote, that's the
reality.  Beyond that, if people think it's more open to allow formal
proposals from anyone, I'm not necessarily against it, but my main
question would be this:

If anyone can submit a proposal, are committers actually going to
clearly reject and close proposals that don't meet the requirements?

Right now we have a serious problem with lack of clarity regarding
contributions, and that cannot spill over into goal-setting.

On Mon, Oct 10, 2016 at 1:54 PM, Ryan Blue <[hidden email]> wrote:
> +1 to votes to approve proposals. I agree that proposals should have an
> official mechanism to be accepted, and a vote is an established means of
> doing that well. I like that it includes a period to review the proposal and
> I think proposals should have been discussed enough ahead of a vote to
> survive the possibility of a veto.
>
> I also like the names that are short and (mostly) unique, like SEP.
>
> Where I disagree is with the requirement that a committer must formally
> propose an enhancement. I don't see the value of restricting this: if
> someone has the will to write up a proposal then they should be encouraged
> to do so and start a discussion about it. Even if there is a political
> reality as Cody says, what is the value of codifying that in our process? I
> think restricting who can submit proposals would only undermine them by
> pushing contributors out. Maybe I'm missing something here?
>
> rb
>
>
>
> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <[hidden email]> wrote:
>>
>> Yes, users suggesting SIPs is a good thing and is explicitly called
>> out in the linked document under the Who? section.  Formally proposing
>> them, not so much, because of the political realities.
>>
>> Yes, implementation strategy definitely affects goals.  There are all
>> kinds of examples of this, I'll pick one that's my fault so as to
>> avoid sounding like I'm blaming:
>>
>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>> upon by the community) goals was to make sure people could use the
>> DStream with however they were already using Kafka at work.  The lack
>> of explicit agreement on that goal led to all kinds of fighting with
>> committers, that could have been avoided.  The lack of explicit
>> up-front strategy discussion led to the DStream not really working
>> with compacted topics.  I knew about compacted topics, but don't have
>> a use for them, so had a blind spot there.  If there was explicit
>> up-front discussion that my strategy was "assume that batches can be
>> defined on the driver solely by beginning and ending offsets", there's
>> a greater chance that a user would have seen that and said, "hey, what
>> about non-contiguous offsets in a compacted topic".
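
The assumption described above can be made concrete with a small sketch. This is not the Spark or Kafka API; `OffsetRange`, `expected_count`, and `read_batch` are hypothetical names, and the log is modeled as a plain list of the offsets still present. It only illustrates why "a batch is fully defined by a beginning and an ending offset" holds for a contiguous log and breaks once compaction has removed records.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical batch definition: the half-open range [from_offset, until_offset)."""
    from_offset: int
    until_offset: int

    def expected_count(self) -> int:
        # Correct only if every offset in the range still exists in the log.
        return self.until_offset - self.from_offset


def read_batch(log_offsets: List[int], r: OffsetRange) -> List[int]:
    """Return the offsets present in the log that fall inside the range."""
    return [o for o in log_offsets if r.from_offset <= o < r.until_offset]


contiguous_log = list(range(0, 10))   # offsets 0..9, none missing
compacted_log = [0, 1, 4, 7, 9]       # compaction removed superseded records
batch = OffsetRange(0, 10)

contiguous = read_batch(contiguous_log, batch)
compacted = read_batch(compacted_log, batch)

# The driver-side count matches what is read only when offsets are contiguous.
print(len(contiguous) == batch.expected_count())
print(len(compacted) == batch.expected_count())
```

Any logic that plans or validates a batch from `until_offset - from_offset` alone would disagree with what is actually read from the compacted log, which is the kind of gap an up-front strategy statement lets a user spot.
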
>>
>> This kind of thing is only going to happen smoothly if we have a
>> lightweight user-visible process with clear outcomes.
>>
>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>> <[hidden email]> wrote:
>> > I agree with most of what Cody said.
>> >
>> > Two things:
>> >
>> > First we can always have other people suggest SIPs but mark them as
>> > “unreviewed” and have committers basically move them forward. The
>> > problem is that writing a good document takes time. This way we can
>> > leverage non-committers to do some of this work (it is just another way
>> > to contribute).
>> >
>> >
>> >
>> > As for strategy, in many cases implementation strategy can affect the
>> > goals. I will give a small example: In the current structured streaming
>> > strategy, we group by the time to achieve a sliding window. This is
>> > definitely an implementation decision and not a goal. However, I can
>> > think of several aggregation functions which have the time inside their
>> > calculation buffer. For example, let’s say we want to return a set of
>> > all distinct values. One way to implement this would be to make the set
>> > into a map and have the value contain the last time seen. Multiplying it
>> > across the groupby would cost a lot in performance. So adding such a
>> > strategy would have a great effect on the type of aggregations and their
>> > performance which does affect the goal. Without adding the strategy, it
>> > is easy for whoever goes to the design document to not think about these
>> > cases. Furthermore, it might be decided that these cases are rare enough
>> > so that the strategy is still good enough but how would we know it
>> > without user feedback?
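
As a rough illustration of the "map with last time seen" buffer mentioned above, here is a minimal sketch. It is not Spark's aggregation API; the class `DistinctInWindow` and its methods are hypothetical, and event times are plain integers. It shows one buffer that tracks time itself, as opposed to keeping a separate copy of the state for every window group.

```python
from typing import Dict, Hashable, Set


class DistinctInWindow:
    """Hypothetical aggregation buffer: distinct values seen in the last `window` time units."""

    def __init__(self, window: int) -> None:
        self.window = window
        # Maps each value to the latest event time at which it was seen.
        self.last_seen: Dict[Hashable, int] = {}

    def update(self, value: Hashable, event_time: int) -> None:
        prev = self.last_seen.get(value)
        if prev is None or event_time > prev:
            self.last_seen[value] = event_time

    def result(self, now: int) -> Set[Hashable]:
        # Evict values whose last sighting fell out of the window, then report the rest.
        cutoff = now - self.window
        self.last_seen = {v: t for v, t in self.last_seen.items() if t > cutoff}
        return set(self.last_seen)


agg = DistinctInWindow(window=10)
for value, t in [("a", 1), ("b", 5), ("a", 12), ("c", 14)]:
    agg.update(value, t)

distinct_now = agg.result(now=16)
print(sorted(distinct_now))
```

With a group-by-time strategy, the same value would instead be stored once per overlapping window, which is the multiplication of state the message is concerned about.
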
>> >
>> > I believe this example is exactly what Cody was talking about. Since
>> > many times implementation strategies have a large effect on the goal, we
>> > should have it discussed when discussing the goals. In addition, while
>> > it is often easy to throw out completely infeasible goals, it is often
>> > much harder to figure out that the goals are infeasible without fine
>> > tuning.
>> >
>> >
>> >
>> >
>> >
>> > Assaf.
>> >
>> >
>> >
>> > From: Cody Koeninger-2 [via Apache Spark Developers List]
>> > [mailto:[hidden email]]
>> > Sent: Monday, October 10, 2016 2:25 AM
>> > To: Mendelson, Assaf
>> > Subject: Re: Spark Improvement Proposals
>> >
>> >
>> >
>> > Only committers should formally submit SIPs because in an Apache
>> > project only committers have explicit political power.  If a user can't
>> > find a committer willing to sponsor an SIP idea, they have no way to
>> > get the idea passed in any case.  If I can't find a committer to
>> > sponsor this meta-SIP idea, I'm out of luck.
>> >
>> > I do not believe unrealistic goals can be found solely by inspection.
>> > We've managed to ignore unrealistic goals even after implementation!
>> > Focusing on APIs can allow people to think they've solved something,
>> > when there's really no way of implementing that API while meeting the
>> > goals.  Rapid iteration is clearly the best way to address this, but
>> > we've already talked about why that hasn't really worked.  If adding a
>> > non-binding API section to the template is important to you, I'm not
>> > against it, but I don't think it's sufficient.
>> >
>> > On your PRD vs design doc spectrum, I'm saying this is closer to a
>> > PRD.  Clear agreement on goals is the most important thing and that's
>> > why it's the thing I want binding agreement on.  But I cannot agree to
>> > goals unless I have enough minimal technical info to judge whether the
>> > goals are likely to actually be accomplished.
>> >
>> >
>> >
>> > On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]> wrote:
>> >
>> >
>> >> Well, I think there are a few things here that don't make sense. First,
>> >> why should only committers submit SIPs? Development in the project
>> >> should be open to all contributors, whether they're committers or not.
>> >> Second, I think unrealistic goals can be found just by inspecting the
>> >> goals, and I'm not super worried that we'll accept a lot of SIPs that
>> >> are then infeasible -- we can then submit new ones. But this depends on
>> >> whether you want this process to be a "design doc lite", where people
>> >> also agree on implementation strategy, or just a way to agree on goals.
>> >> This is what I asked earlier about PRDs vs design docs (and I'm open to
>> >> either one but I'd just like clarity). Finally, both as a user and
>> >> designer of software, I always want to give feedback on APIs, so I'd
>> >> really like a culture of having those early. People don't argue about
>> >> prettiness when they discuss APIs, they argue about the core concepts to
>> >> expose in order to meet various goals, and then they're stuck
>> >> maintaining those for a long time.
>> >>
>> >> Matei
>> >>
>> >> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <[hidden email]> wrote:
>> >>
>> >> Users instead of people, sure.  Committers and contributors are (or at
>> >> least should be) a subset of users.
>> >>
>> >> Non goals, sure. I don't care what the name is, but we need to clearly
>> >> say e.g. 'no we are not maintaining compatibility with XYZ right now'.
>> >>
>> >> API, what I care most about is whether it allows me to accomplish the
>> >> goals. Arguing about how ugly or pretty it is can be saved for design/
>> >> implementation imho.
>> >>
>> >> Strategy, this is necessary because otherwise goals can be out of line
>> >> with reality.  Don't propose goals you don't have at least some idea of
>> >> how to implement.
>> >>
>> >> Rejected strategies, given that committers are the only ones I'm saying
>> >> should formally submit SPARKLIs or SIPs, if they put junk in a required
>> >> section then slap them down for it and tell them to fix it.
>> >>
>> >>
>> >> On Oct 9, 2016 4:36 PM, "Matei Zaharia" <[hidden email]> wrote:
>> >>>
>> >>> Yup, this is the stuff that I found unclear. Thanks for clarifying
>> >>> here, but we should also clarify it in the writeup. In particular:
>> >>>
>> >>> - Goals needs to be about user-facing behavior ("people" is broad)
>> >>>
>> >>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig
>> >>> up one of these and say "Spark's developers have officially rejected X,
>> >>> which our awesome system has".
>> >>>
>> >>> - For user-facing stuff, I think you need a section on API. Virtually
>> >>> all other *IPs I've seen have that.
>> >>>
>> >>> - I'm still not sure why the strategy section is needed if the purpose
>> >>> is to define user-facing behavior -- unless this is the strategy for
>> >>> setting the goals or for defining the API. That sounds squarely like a
>> >>> design doc issue. In some sense, who cares whether the proposal is
>> >>> technically feasible right now? If it's infeasible, that will be
>> >>> discovered later during design and implementation. Same thing with
>> >>> rejected strategies -- listing some of those is definitely useful
>> >>> sometimes, but if you make this a *required* section, people are just
>> >>> going to fill it in with bogus stuff (I've seen this happen before).
>> >>>
>> >>> Matei
>> >>>
>> >>> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <[hidden email]> wrote:
>> >>> >
>> >>> > So to focus the discussion on the specific strategy I'm suggesting,
>> >>> > documented at
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >>> >
>> >>> > "Goals: What must this allow people to do, that they can't
>> >>> > currently?"
>> >>> >
>> >>> > Is it unclear that this is focusing specifically on people-visible
>> >>> > behavior?
>> >>> >
>> >>> > Rejected goals -  are important because otherwise people keep trying
>> >>> > to argue about scope.  Of course you can change things later with a
>> >>> > different SIP and different vote, the point is to focus.
>> >>> >
>> >>> > Use cases - are something that people are going to bring up in
>> >>> > discussion.  If they aren't clearly documented as a goal ("This must
>> >>> > allow me to connect using SSL"), they should be added.
>> >>> >
>> >>> > Internal architecture - if the people who need specific behavior are
>> >>> > implementers of other parts of the system, that's fine.
>> >>> >
>> >>> > Rejected strategies - If you have none of these, you have no
>> >>> > evidence that the proponent didn't just go with the first thing they
>> >>> > had in mind (or have already implemented), which is a big problem
>> >>> > currently. Approval isn't binding as to specifics of implementation,
>> >>> > so these aren't handcuffs.  The goals are the contract, the strategy
>> >>> > is evidence that contract can actually be met.
>> >>> >
>> >>> > Design docs - I'm not touching design docs.  The markdown file I
>> >>> > linked specifically says of the strategy section "This is not a full
>> >>> > design document."  Is this unclear?  Design docs can be worked on
>> >>> > obviously, but that's not what I'm concerned with here.
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <[hidden email]>
>> >>> > wrote:
>> >>> >> Hi Cody,
>> >>> >>
>> >>> >> I think this would be a lot more concrete if we had a more detailed
>> >>> >> template for SIPs. Right now, it's not super clear what's in scope
>> >>> >> -- e.g. are they a way to solicit feedback on the user-facing
>> >>> >> behavior or on the internals? "Goals" can cover both things. I've
>> >>> >> been thinking of SIPs more as Product Requirements Docs (PRDs),
>> >>> >> which focus on *what* a code change should do as opposed to how.
>> >>> >>
>> >>> >> In particular, here are some things that you may or may not
>> >>> >> consider in scope for SIPs:
>> >>> >>
>> >>> >> - Goals and non-goals: This is definitely in scope, and IMO should
>> >>> >> focus on user-visible behavior (e.g. "system supports SQL window
>> >>> >> functions" or "system continues working if one node fails"). BTW I
>> >>> >> wouldn't say "rejected goals" because some of them might become
>> >>> >> goals later, so we're not definitively rejecting them.
>> >>> >>
>> >>> >> - Public API: Probably should be included in most SIPs unless it's
>> >>> >> too large to fully specify then (e.g. "let's add an ML library").
>> >>> >>
>> >>> >> - Use cases: I usually find this very useful in PRDs to better
>> >>> >> communicate the goals.
>> >>> >>
>> >>> >> - Internal architecture: This is usually *not* a thing users can
>> >>> >> easily comment on and it sounds more like a design doc item. Of
>> >>> >> course it's important to show that the SIP is feasible to
>> >>> >> implement. One exception, however, is that I think we'll have some
>> >>> >> SIPs primarily on internals (e.g. if somebody wants to refactor
>> >>> >> Spark's query optimizer or something).
>> >>> >>
>> >>> >> - Rejected strategies: I personally wouldn't put this, because
>> >>> >> what's the point of voting to reject a strategy before you've
>> >>> >> really begun designing and implementing something? What if you
>> >>> >> discover that the strategy is actually better when you start doing
>> >>> >> stuff?
>> >>> >>
>> >>> >> At a super high level, it depends on whether you want the SIPs to
>> >>> >> be PRDs for getting some quick feedback on the goals of a feature
>> >>> >> before it is designed, or something more like full-fledged design
>> >>> >> docs (just a more visible design doc for bigger changes). I looked
>> >>> >> at Kafka's KIPs, and they actually seem to be more like design
>> >>> >> docs. This can work too but it does require more work from the
>> >>> >> proposer and it can lead to the same problems you mentioned with
>> >>> >> people already having a design and implementation in mind.
>> >>> >>
>> >>> >> Basically, the question is, are you trying to iterate faster on
>> >>> >> design by adding a step for user feedback earlier? Or are you just
>> >>> >> trying to make design docs for key features more visible (and their
>> >>> >> approval more formal)?
>> >>> >>
>> >>> >> BTW note that in either case, I'd like to have a template for
>> >>> >> design docs too, which should also include goals. I think that
>> >>> >> would've avoided some of the issues you brought up.
>> >>> >>
>> >>> >> Matei
>> >>> >>
>> >>> >> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <[hidden email]> wrote:
>> >>> >>
>> >>> >> Here's my specific proposal (meta-proposal?)
>> >>> >>
>> >>> >> Spark Improvement Proposals (SIP)
>> >>> >>
>> >>> >>
>> >>> >> Background:
>> >>> >>
>> >>> >> The current problem is that design and implementation of large
>> >>> >> features are often done in private, before soliciting user
>> >>> >> feedback.
>> >>> >>
>> >>> >> When feedback is solicited, it is often as to detailed design
>> >>> >> specifics, not focused on goals.
>> >>> >>
>> >>> >> When implementation does take place after design, there is often
>> >>> >> disagreement as to what goals are or are not in scope.
>> >>> >>
>> >>> >> This results in commits that don't fully meet user needs.
>> >>> >>
>> >>> >>
>> >>> >> Goals:
>> >>> >>
>> >>> >> - Ensure user, contributor, and committer goals are clearly
>> >>> >> identified and agreed upon, before implementation takes place.
>> >>> >>
>> >>> >> - Ensure that a technically feasible strategy is chosen that is
>> >>> >> likely to meet the goals.
>> >>> >>
>> >>> >>
>> >>> >> Rejected Goals:
>> >>> >>
>> >>> >> - SIPs are not for detailed design.  Design by committee doesn't
>> >>> >> work.
>> >>> >>
>> >>> >> - SIPs are not for every change.  We don't need that much process.
>> >>> >>
>> >>> >>
>> >>> >> Strategy:
>> >>> >>
>> >>> >> My suggestion is outlined as a Spark Improvement Proposal process
>> >>> >> documented at
>> >>> >>
>> >>> >> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >>> >>
>> >>> >> Specifics of Jira manipulation are an implementation detail we can
>> >>> >> figure out.
>> >>> >>
>> >>> >> I'm suggesting voting; the need here is for a _clear_ outcome.
>> >>> >>
>> >>> >>
>> >>> >> Rejected Strategies:
>> >>> >>
>> >>> >> Having someone who understands the problem implement it first
>> >>> >> works, but only if significant iteration after user feedback is
>> >>> >> allowed.
>> >>> >>
>> >>> >> Historically this has been problematic due to pressure to limit
>> >>> >> public API changes.
>> >>> >>
>> >>> >>
>> >>> >> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <[hidden email]> wrote:
>> >>> >>>
>> >>> >>> Alright, looks like there is quite a bit of support. We should
>> >>> >>> wait to hear from more people too.
>> >>> >>>
>> >>> >>> To push this forward, Cody and I will be working together in the
>> >>> >>> next couple of weeks to come up with a concrete, detailed proposal
>> >>> >>> on what this entails, and then we can discuss the specific proposal
>> >>> >>> as well.
>> >>> >>>
>> >>> >>>
>> >>> >>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <[hidden email]>
>> >>> >>> wrote:
>> >>> >>>>
>> >>> >>>> Yeah, in case it wasn't clear, I was talking about SIPs for major
>> >>> >>>> user-facing or cross-cutting changes, not minor feature adds.
>> >>> >>>>
>> >>> >>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos
>> >>> >>>> <[hidden email]> wrote:
>> >>> >>>>>
>> >>> >>>>> +1 to the SIP label as long as it does not slow down things and
>> >>> >>>>> it targets optimizing efforts, coordination etc. For example,
>> >>> >>>>> really small features (assuming they don't touch public
>> >>> >>>>> interfaces) or refactorings should not need to go through this
>> >>> >>>>> process, and I hope it will be kept this way. So a guideline doc
>> >>> >>>>> should be provided, like in the KIP case.
>> >>> >>>>>
>> >>> >>>>> IMHO, aside from tagging things and linking them elsewhere,
>> >>> >>>>> simply having design docs and prototype implementations in PRs
>> >>> >>>>> is not something that has failed to work so far. What is really
>> >>> >>>>> a pain in many projects out there is discontinuity in the
>> >>> >>>>> progress of PRs, missing features, and slow reviews, which is
>> >>> >>>>> understandable to some extent... It is not only about Spark, but
>> >>> >>>>> things can surely be improved for this project in particular, as
>> >>> >>>>> already stated.
>> >>> >>>>>
>> >>> >>>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <[hidden email]>
>> >>> >>>>> wrote:
>> >>> >>>>>>
>> >>> >>>>>> +1 to adding an SIP label and linking it from the website.  I
>> >>> >>>>>> think it needs
>> >>> >>>>>>
>> >>> >>>>>> - template that focuses it towards soliciting user goals / non
>> >>> >>>>>> goals
>> >>> >>>>>> - clear resolution as to which strategy was chosen to pursue.
>> >>> >>>>>> I'd recommend a vote.
>> >>> >>>>>>
>> >>> >>>>>> Matei asked me to clarify what I meant by changing interfaces, I
>> >>> >>>>>> think it's directly relevant to the SIP idea so I'll clarify
>> >>> >>>>>> here, and split a thread for the other discussion per Nicholas'
>> >>> >>>>>> request.
>> >>> >>>>>>
>> >>> >>>>>> I meant changing public user interfaces.  I think the first
>> >>> >>>>>> design is unlikely to be right, because it's done at a time when
>> >>> >>>>>> you have the least information.  As a user, I find it
>> >>> >>>>>> considerably more frustrating to be unable to use a tool to get
>> >>> >>>>>> my job done, than I do having to make minor changes to my code in
>> >>> >>>>>> order to take advantage of features. I've seen committers be
>> >>> >>>>>> seriously reluctant to allow changes to @experimental code that
>> >>> >>>>>> are needed in order for it to really work right.  You need to be
>> >>> >>>>>> able to iterate, and if people on both sides of the fence aren't
>> >>> >>>>>> going to respect that some newer apis are subject to change, then
>> >>> >>>>>> why even mark them as such?
>> >>> >>>>>>
>> >>> >>>>>> Ideally a finished SIP should give me a checklist of things
>> >>> >>>>>> that
>> >>> >>>>>> an
>> >>> >>>>>> implementation must do, and things that it doesn't need to do.
>> >>> >>>>>> Contributors/committers should be seriously discouraged from
>> >>> >>>>>> putting
>> >>> >>>>>> out a version 0.1 that doesn't have at least a prototype
>> >>> >>>>>> implementation of all those things, especially if they're then
>> >>> >>>>>> going
>> >>> >>>>>> to argue against interface changes necessary to get the
>> >>> >>>>>> rest
>> >>> >>>>>> of
>> >>> >>>>>> the things done in the 0.2 version.
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden email]>
>> >>> >>>>>> wrote:
>> >>> >>>>>>> I like the lightweight proposal to add a SIP label.
>> >>> >>>>>>>
>> >>> >>>>>>> During Spark 2.0 development, Tom (Graves) and I suggested
>> >>> >>>>>>> using
>> >>> >>>>>>> wiki
>> >>> >>>>>>> to
>> >>> >>>>>>> track the list of major changes, but that never really
>> >>> >>>>>>> materialized
>> >>> >>>>>>> due to
>> >>> >>>>>>> the overhead. Adding a SIP label on major JIRAs and then link
>> >>> >>>>>>> to
>> >>> >>>>>>> them
>> >>> >>>>>>> prominently on the Spark website makes a lot of sense.
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia
>> >>> >>>>>>> <[hidden email]>
>> >>> >>>>>>> wrote:
>> >>> >>>>>>>>
>> >>> >>>>>>>> For the improvement proposals, I think one major point was to
>> >>> >>>>>>>> make
>> >>> >>>>>>>> them
>> >>> >>>>>>>> really visible to users who are not contributors, so we
>> >>> >>>>>>>> should
>> >>> >>>>>>>> do
>> >>> >>>>>>>> more than
>> >>> >>>>>>>> sending stuff to dev@. One very lightweight idea is to have a
>> >>> >>>>>>>> new
>> >>> >>>>>>>> type of
>> >>> >>>>>>>> JIRA called a SIP and have a link to a filter that shows all
>> >>> >>>>>>>> such
>> >>> >>>>>>>> JIRAs from
>> >>> >>>>>>>> http://spark.apache.org. I also like the idea of SIP and
>> >>> >>>>>>>> design
>> >>> >>>>>>>> doc
>> >>> >>>>>>>> templates (in fact many projects have them).
>> >>> >>>>>>>>
>> >>> >>>>>>>> Matei
>> >>> >>>>>>>>
>> >>> >>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]>
>> >>> >>>>>>>> wrote:
>> >>> >>>>>>>>
>> >>> >>>>>>>> I called Cody last night and talked about some of the topics
>> >>> >>>>>>>> in
>> >>> >>>>>>>> his
>> >>> >>>>>>>> email.
>> >>> >>>>>>>> It became clear to me Cody genuinely cares about the project.
>> >>> >>>>>>>>
>> >>> >>>>>>>> Some of the frustrations come from the success of the project
>> >>> >>>>>>>> itself
>> >>> >>>>>>>> becoming very "hot", and it is difficult to get clarity from
>> >>> >>>>>>>> people
>> >>> >>>>>>>> who
>> >>> >>>>>>>> don't dedicate all their time to Spark. In fact, it is in
>> >>> >>>>>>>> some
>> >>> >>>>>>>> ways
>> >>> >>>>>>>> similar
>> >>> >>>>>>>> to scaling an engineering team in a successful startup: old
>> >>> >>>>>>>> processes that
>> >>> >>>>>>>> worked well might not work so well when it gets to a certain
>> >>> >>>>>>>> size,
>> >>> >>>>>>>> cultures
>> >>> >>>>>>>> can get diluted, building culture vs building process, etc.
>> >>> >>>>>>>>
>> >>> >>>>>>>> I also really like to have a more visible process for larger
>> >>> >>>>>>>> changes,
>> >>> >>>>>>>> especially major user facing API changes. Historically we
>> >>> >>>>>>>> upload
>> >>> >>>>>>>> design docs
>> >>> >>>>>>>> for major changes, but it is not always consistent, and it is
>> >>> >>>>>>>> difficult to ensure the quality of the docs, due to the
>> >>> >>>>>>>> volunteer nature of the organization.
>> >>> >>>>>>>>
>> >>> >>>>>>>> Some of the more concrete ideas we discussed focus on
>> >>> >>>>>>>> building a
>> >>> >>>>>>>> culture
>> >>> >>>>>>>> to improve clarity:
>> >>> >>>>>>>>
>> >>> >>>>>>>> - Process: Large changes should have design docs posted on
>> >>> >>>>>>>> JIRA.
>> >>> >>>>>>>> One
>> >>> >>>>>>>> thing
>> >>> >>>>>>>> Cody and I didn't discuss but an idea that just came to me is
>> >>> >>>>>>>> we
>> >>> >>>>>>>> should
>> >>> >>>>>>>> create a design doc template for the project and ask
>> >>> >>>>>>>> everybody
>> >>> >>>>>>>> to
>> >>> >>>>>>>> follow.
>> >>> >>>>>>>> The design doc template should also explicitly list goals and
>> >>> >>>>>>>> non-goals, to
>> >>> >>>>>>>> make design doc more consistent.
>> >>> >>>>>>>>
>> >>> >>>>>>>> - Process: Email dev@ to solicit feedback. We have done this
>> >>> >>>>>>>> with
>> >>> >>>>>>>> some
>> >>> >>>>>>>> changes, but again very inconsistent. Just posting something
>> >>> >>>>>>>> on
>> >>> >>>>>>>> JIRA
>> >>> >>>>>>>> isn't
>> >>> >>>>>>>> sufficient, because there are simply too many JIRAs and the
>> >>> >>>>>>>> signal
>> >>> >>>>>>>> gets lost
>> >>> >>>>>>>> in the noise. While this is generally impossible to enforce
>> >>> >>>>>>>> because
>> >>> >>>>>>>> we can't
>> >>> >>>>>>>> force all volunteers to conform to a process (or they might
>> >>> >>>>>>>> not
>> >>> >>>>>>>> even
>> >>> >>>>>>>> be
>> >>> >>>>>>>> aware of this),  those who are more familiar with the project
>> >>> >>>>>>>> can
>> >>> >>>>>>>> help by
>> >>> >>>>>>>> emailing the dev@ when they see something that hasn't been.
>> >>> >>>>>>>>
>> >>> >>>>>>>> - Culture: The design doc author(s) should be open to
>> >>> >>>>>>>> feedback.
>> >>> >>>>>>>> A
>> >>> >>>>>>>> design
>> >>> >>>>>>>> doc should serve as the base for discussion and is by no
>> >>> >>>>>>>> means
>> >>> >>>>>>>> the
>> >>> >>>>>>>> final
>> >>> >>>>>>>> design. Of course, this does not mean the author has to
>> >>> >>>>>>>> accept
>> >>> >>>>>>>> every
>> >>> >>>>>>>> feedback. They should also be comfortable accepting /
>> >>> >>>>>>>> rejecting
>> >>> >>>>>>>> ideas on
>> >>> >>>>>>>> technical grounds.
>> >>> >>>>>>>>
>> >>> >>>>>>>> - Process / Culture: For major ongoing projects, it can be
>> >>> >>>>>>>> useful
>> >>> >>>>>>>> to
>> >>> >>>>>>>> have
>> >>> >>>>>>>> some monthly Google hangouts that are open to the world. I am
>> >>> >>>>>>>> actually not
>> >>> >>>>>>>> sure how well this will work, because of the volunteering
>> >>> >>>>>>>> nature
>> >>> >>>>>>>> and
>> >>> >>>>>>>> we need
>> >>> >>>>>>>> to adjust for timezones for people across the globe, but it
>> >>> >>>>>>>> seems
>> >>> >>>>>>>> worth
>> >>> >>>>>>>> trying.
>> >>> >>>>>>>>
>> >>> >>>>>>>> - Culture: Contributors (including committers) should be more
>> >>> >>>>>>>> direct
>> >>> >>>>>>>> in
>> >>> >>>>>>>> setting expectations, including whether they are working on a
>> >>> >>>>>>>> specific
>> >>> >>>>>>>> issue, whether they will be working on a specific issue, and
>> >>> >>>>>>>> whether
>> >>> >>>>>>>> an
>> >>> >>>>>>>> issue or pr or jira should be rejected. Most people I know in
>> >>> >>>>>>>> this
>> >>> >>>>>>>> community
>> >>> >>>>>>>> are nice and don't enjoy telling other people no, but it is
>> >>> >>>>>>>> often
>> >>> >>>>>>>> more
>> >>> >>>>>>>> annoying to a contributor to not know anything than getting a
>> >>> >>>>>>>> no.
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia
>> >>> >>>>>>>> <[hidden email]>
>> >>> >>>>>>>> wrote:
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> Love the idea of a more visible "Spark Improvement Proposal"
>> >>> >>>>>>>>> process that
>> >>> >>>>>>>>> solicits user input on new APIs. For what it's worth, I
>> >>> >>>>>>>>> don't
>> >>> >>>>>>>>> think
>> >>> >>>>>>>>> committers are trying to minimize their own work -- every
>> >>> >>>>>>>>> committer
>> >>> >>>>>>>>> cares
>> >>> >>>>>>>>> about making the software useful for users. However, it is
>> >>> >>>>>>>>> always
>> >>> >>>>>>>>> hard to
>> >>> >>>>>>>>> get user input and so it helps to have this kind of process.
>> >>> >>>>>>>>> I've
>> >>> >>>>>>>>> certainly
>> >>> >>>>>>>>> looked at the *IPs a lot in other software I use just to see
>> >>> >>>>>>>>> the
>> >>> >>>>>>>>> biggest
>> >>> >>>>>>>>> things on the roadmap.
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> When you're talking about "changing interfaces", are you
>> >>> >>>>>>>>> talking
>> >>> >>>>>>>>> about
>> >>> >>>>>>>>> public or internal APIs? I do think many people hate
>> >>> >>>>>>>>> changing
>> >>> >>>>>>>>> public APIs
>> >>> >>>>>>>>> and I actually think that's for the best of the project.
>> >>> >>>>>>>>> That's
>> >>> >>>>>>>>> a
>> >>> >>>>>>>>> technical
>> >>> >>>>>>>>> debate, but basically, the worst thing when you're using a
>> >>> >>>>>>>>> piece
>> >>> >>>>>>>>> of
>> >>> >>>>>>>>> software
>> >>> >>>>>>>>> is that the developers constantly ask you to rewrite your
>> >>> >>>>>>>>> app
>> >>> >>>>>>>>> to
>> >>> >>>>>>>>> update to a
>> >>> >>>>>>>>> new version (and thus benefit from bug fixes, etc). Cue
>> >>> >>>>>>>>> anyone
>> >>> >>>>>>>>> who's used
>> >>> >>>>>>>>> Protobuf, or Guava. The "let's get everyone to change their
>> >>> >>>>>>>>> code
>> >>> >>>>>>>>> this
>> >>> >>>>>>>>> release" model works well within a single large company, but
>> >>> >>>>>>>>> doesn't work
>> >>> >>>>>>>>> well for a community, which is why nearly all *very* widely
>> >>> >>>>>>>>> used
>> >>> >>>>>>>>> programming
>> >>> >>>>>>>>> interfaces (I'm talking things like Java standard library,
>> >>> >>>>>>>>> Windows
>> >>> >>>>>>>>> API, etc)
>> >>> >>>>>>>>> almost *never* break backwards compatibility. All this is
>> >>> >>>>>>>>> done
>> >>> >>>>>>>>> within reason
>> >>> >>>>>>>>> though, e.g. we do change things in major releases (2.x,
>> >>> >>>>>>>>> 3.x,
>> >>> >>>>>>>>> etc).
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> ---------------------------------------------------------------------
>> >>> >>>>>> To unsubscribe e-mail: [hidden email]
>> >>> >>>>>>
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> --
>> >>> >>>>> Stavros Kontopoulos
>> >>> >>>>> Senior Software Engineer
>> >>> >>>>> Lightbend, Inc.
>> >>> >>>>> p:  +30 6977967274
>> >>> >>>>> e: [hidden email]
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>
>> >>> >>>
>> >>> >>
>> >>> >>
>> >>>
>> >>
>> >
>> >
>>
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix




Re: Spark Improvement Proposals

Mark Hamstra
In reply to this post by Nicholas Chammas
I'm not a fan of the SEP acronym.  Besides its prior established meaning of "Somebody else's problem", there are other inappropriate or offensive connotations, such as this Australian slang that often gets shortened to just "sep":  http://www.urbandictionary.com/define.php?term=Seppo 

On Sun, Oct 9, 2016 at 4:00 PM, Nicholas Chammas <[hidden email]> wrote:
On Sun, Oct 9, 2016 at 5:19 PM Cody Koeninger <[hidden email]> wrote:
Regarding name, if the SIP overlap is a concern, we can pick a different name.

My tongue in cheek suggestion would be

Spark Lightweight Improvement process (SPARKLI)

If others share my minor concern about the SIP name, I propose Spark Enhancement Proposal (SEP), taking inspiration from the Python Enhancement Proposal name.

So if we're going to number proposals like other projects do, they'd be numbered SEP-1, SEP-2, etc. This avoids the naming conflict with Scala SIPs.

Another way to avoid a conflict is to stick with "Spark Improvement Proposal" but use SPIP as the acronym. So SPIP-1, SPIP-2, etc.

Anyway, it's not a big deal. I just wanted to raise this point.

Nick


Re: Spark Improvement Proposals

Mark Hamstra
In reply to this post by Mark Hamstra
There is a larger issue to keep in mind, and that is that what you are proposing is a procedure that, as far as I am aware, hasn't previously been adopted in an Apache project, and thus is not an easy or exact fit with established practices that have been blessed as "The Apache Way".  As such, we need to be careful, because we have run into some trouble in the past with some inside the ASF but essentially outside the Spark community who didn't like the way we were doing things.

On Mon, Oct 10, 2016 at 3:53 PM, Cody Koeninger <[hidden email]> wrote:

Apache documents say lots of confusing stuff, including that committers are in practice given a vote.

https://www.apache.org/foundation/voting.html

I don't care either way, if someone wants me to sub committer for PMC in the voting section, fine, we just need a clear outcome.


On Oct 10, 2016 17:36, "Mark Hamstra" <[hidden email]> wrote:
If I'm correctly understanding the kind of voting that you are talking about, then to be accurate, it is only the PMC members that have a vote, not all committers: https://www.apache.org/foundation/how-it-works.html#pmc-members

On Mon, Oct 10, 2016 at 12:02 PM, Cody Koeninger <[hidden email]> wrote:
I think the main value is in being honest about what's going on.  No
one other than committers can cast a meaningful vote, that's the
reality.  Beyond that, if people think it's more open to allow formal
proposals from anyone, I'm not necessarily against it, but my main
question would be this:

If anyone can submit a proposal, are committers actually going to
clearly reject and close proposals that don't meet the requirements?

Right now we have a serious problem with lack of clarity regarding
contributions, and that cannot spill over into goal-setting.

On Mon, Oct 10, 2016 at 1:54 PM, Ryan Blue <[hidden email]> wrote:
> +1 to votes to approve proposals. I agree that proposals should have an
> official mechanism to be accepted, and a vote is an established means of
> doing that well. I like that it includes a period to review the proposal and
> I think proposals should have been discussed enough ahead of a vote to
> survive the possibility of a veto.
>
> I also like the names that are short and (mostly) unique, like SEP.
>
> Where I disagree is with the requirement that a committer must formally
> propose an enhancement. I don't see the value of restricting this: if
> someone has the will to write up a proposal then they should be encouraged
> to do so and start a discussion about it. Even if there is a political
> reality as Cody says, what is the value of codifying that in our process? I
> think restricting who can submit proposals would only undermine them by
> pushing contributors out. Maybe I'm missing something here?
>
> rb
>
>
>
> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <[hidden email]> wrote:
>>
>> Yes, users suggesting SIPs is a good thing and is explicitly called
>> out in the linked document under the Who? section.  Formally proposing
>> them, not so much, because of the political realities.
>>
>> Yes, implementation strategy definitely affects goals.  There are all
>> kinds of examples of this, I'll pick one that's my fault so as to
>> avoid sounding like I'm blaming:
>>
>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>> upon by the community) goals was to make sure people could use the
>> DStream with however they were already using Kafka at work.  The lack
>> of explicit agreement on that goal led to all kinds of fighting with
>> committers, that could have been avoided.  The lack of explicit
>> up-front strategy discussion led to the DStream not really working
>> with compacted topics.  I knew about compacted topics, but don't have
>> a use for them, so had a blind spot there.  If there was explicit
>> up-front discussion that my strategy was "assume that batches can be
>> defined on the driver solely by beginning and ending offsets", there's
>> a greater chance that a user would have seen that and said, "hey, what
>> about non-contiguous offsets in a compacted topic".
>>
>> This kind of thing is only going to happen smoothly if we have a
>> lightweight user-visible process with clear outcomes.
>>
>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>> <[hidden email]> wrote:
>> > I agree with most of what Cody said.
>> >
>> > Two things:
>> >
>> > First we can always have other people suggest SIPs but mark them as
>> > “unreviewed” and have committers basically move them forward. The
>> > problem is
>> > that writing a good document takes time. This way we can leverage non
>> > committers to do some of this work (it is just another way to
>> > contribute).
>> >
>> >
>> >
>> > As for strategy, in many cases implementation strategy can affect the
>> > goals.
>> > I will give  a small example: In the current structured streaming
>> > strategy,
>> > we group by the time to achieve a sliding window. This is definitely an
>> > implementation decision and not a goal. However, I can think of several
>> > aggregation functions which have the time inside their calculation
>> > buffer.
>> > For example, let’s say we want to return a set of all distinct values.
>> > One
>> > way to implement this would be to make the set into a map and have the
>> > value
>> > contain the last time seen. Multiplying it across the groupby would cost
>> > a
>> > lot in performance. So adding such a strategy would have a great effect
>> > on
>> > the type of aggregations and their performance which does affect the
>> > goal.
>> > Without adding the strategy, it is easy for whoever goes to the design
>> > document to not think about these cases. Furthermore, it might be
>> > decided
>> > that these cases are rare enough so that the strategy is still good
>> > enough
>> > but how would we know it without user feedback?
>> >
>> > I believe this example is exactly what Cody was talking about. Since
>> > many
>> > times implementation strategies have a large effect on the goal, we
>> > should
>> > have it discussed when discussing the goals. In addition, while it is
>> > often
>> > easy to throw out completely infeasible goals, it is often much harder
>> > to
>> > figure out that the goals are unfeasible without fine tuning.
>> >
>> >
>> >
>> >
>> >
>> > Assaf.
>> >
>> >
>> >
>> > From: Cody Koeninger-2 [via Apache Spark Developers List]
>> > [mailto:[hidden email]]
>> > Sent: Monday, October 10, 2016 2:25 AM
>> > To: Mendelson, Assaf
>> > Subject: Re: Spark Improvement Proposals
>> >
>> >
>> >
>> > Only committers should formally submit SIPs because in an apache
>> > project only committers have explicit political power.  If a user can't
>> > find a committer willing to sponsor an SIP idea, they have no way to
>> > get the idea passed in any case.  If I can't find a committer to
>> > sponsor this meta-SIP idea, I'm out of luck.
>> >
>> > I do not believe unrealistic goals can be found solely by inspection.
>> > We've managed to ignore unrealistic goals even after implementation!
>> > Focusing on APIs can allow people to think they've solved something,
>> > when there's really no way of implementing that API while meeting the
>> > goals.  Rapid iteration is clearly the best way to address this, but
>> > we've already talked about why that hasn't really worked.  If adding a
>> > non-binding API section to the template is important to you, I'm not
>> > against it, but I don't think it's sufficient.
>> >
>> > On your PRD vs design doc spectrum, I'm saying this is closer to a
>> > PRD.  Clear agreement on goals is the most important thing and that's
>> > why it's the thing I want binding agreement on.  But I cannot agree to
>> > goals unless I have enough minimal technical info to judge whether the
>> > goals are likely to actually be accomplished.
>> >
>> >
>> >
>> > On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]> wrote:
>> >
>> >
>> >> Well, I think there are a few things here that don't make sense. First,
>> >> why
>> >> should only committers submit SIPs? Development in the project should
>> >> be
>> >> open to all contributors, whether they're committers or not. Second, I
>> >> think
>> >> unrealistic goals can be found just by inspecting the goals, and I'm
>> >> not
>> >> super worried that we'll accept a lot of SIPs that are then infeasible
>> >> --
>> >> we
>> >> can then submit new ones. But this depends on whether you want this
>> >> process
>> >> to be a "design doc lite", where people also agree on implementation
>> >> strategy, or just a way to agree on goals. This is what I asked earlier
>> >> about PRDs vs design docs (and I'm open to either one but I'd just like
>> >> clarity). Finally, both as a user and designer of software, I always
>> >> want
>> >> to
>> >> give feedback on APIs, so I'd really like a culture of having those
>> >> early.
>> >> People don't argue about prettiness when they discuss APIs, they argue
>> >> about
>> >> the core concepts to expose in order to meet various goals, and then
>> >> they're
>> >> stuck maintaining those for a long time.
>> >>
>> >> Matei
>> >>
>> >> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <[hidden email]> wrote:
>> >>
>> >> Users instead of people, sure.  Committers and contributors are (or at
>> >> least
>> >> should be) a subset of users.
>> >>
>> >> Non goals, sure. I don't care what the name is, but we need to clearly
>> >> say
>> >> e.g. 'no we are not maintaining compatibility with XYZ right now'.
>> >>
>> >> API, what I care most about is whether it allows me to accomplish the
>> >> goals.
>> >> Arguing about how ugly or pretty it is can be saved for design/
>> >> implementation imho.
>> >>
>> >> Strategy, this is necessary because otherwise goals can be out of line
>> >> with
>> >> reality.  Don't propose goals you don't have at least some idea of how
>> >> to
>> >> implement.
>> >>
>> >> Rejected strategies, given that committers are the only ones I'm saying
>> >> should formally submit SPARKLIs or SIPs, if they put junk in a required
>> >> section then slap them down for it and tell them to fix it.
>> >>
>> >>
>> >> On Oct 9, 2016 4:36 PM, "Matei Zaharia" <[hidden email]> wrote:
>> >>>
>> >>> Yup, this is the stuff that I found unclear. Thanks for clarifying
>> >>> here,
>> >>> but we should also clarify it in the writeup. In particular:
>> >>>
>> >>> - Goals needs to be about user-facing behavior ("people" is broad)
>> >>>
>> >>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig
>> >>> up
>> >>> one of these and say "Spark's developers have officially rejected X,
>> >>> which
>> >>> our awesome system has".
>> >>>
>> >>> - For user-facing stuff, I think you need a section on API. Virtually
>> >>> all
>> >>> other *IPs I've seen have that.
>> >>>
>> >>> - I'm still not sure why the strategy section is needed if the purpose
>> >>> is
>> >>> to define user-facing behavior -- unless this is the strategy for
>> >>> setting
>> >>> the goals or for defining the API. That sounds squarely like a design
>> >>> doc
>> >>> issue. In some sense, who cares whether the proposal is technically
>> >>> feasible
>> >>> right now? If it's infeasible, that will be discovered later during
>> >>> design
>> >>> and implementation. Same thing with rejected strategies -- listing
>> >>> some
>> >>> of
>> >>> those is definitely useful sometimes, but if you make this a
>> >>> *required*
>> >>> section, people are just going to fill it in with bogus stuff (I've
>> >>> seen
>> >>> this happen before).
>> >>>
>> >>> Matei
>> >>>
>> >
>> >>> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <[hidden email]> wrote:
>> >>> >
>> >>> > So to focus the discussion on the specific strategy I'm suggesting,
>> >>> > documented at
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >>> >
>> >>> > "Goals: What must this allow people to do, that they can't
>> >>> > currently?"
>> >>> >
>> >>> > Is it unclear that this is focusing specifically on people-visible
>> >>> > behavior?
>> >>> >
>> >>> > Rejected goals -  are important because otherwise people keep trying
>> >>> > to argue about scope.  Of course you can change things later with a
>> >>> > different SIP and different vote, the point is to focus.
>> >>> >
>> >>> > Use cases - are something that people are going to bring up in
>> >>> > discussion.  If they aren't clearly documented as a goal ("This must
>> >>> > allow me to connect using SSL"), they should be added.
>> >>> >
>> >>> > Internal architecture - if the people who need specific behavior are
>> >>> > implementers of other parts of the system, that's fine.
>> >>> >
>> >>> > Rejected strategies - If you have none of these, you have no
>> >>> > evidence
>> >>> > that the proponent didn't just go with the first thing they had in
>> >>> > mind (or have already implemented), which is a big problem
>> >>> > currently.
>> >>> > Approval isn't binding as to specifics of implementation, so these
>> >>> > aren't handcuffs.  The goals are the contract, the strategy is
>> >>> > evidence that contract can actually be met.
>> >>> >
>> >>> > Design docs - I'm not touching design docs.  The markdown file I
>> >>> > linked specifically says of the strategy section "This is not a full
>> >>> > design document."  Is this unclear?  Design docs can be worked on
>> >>> > obviously, but that's not what I'm concerned with here.
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <[hidden email]>
>> >>> > wrote:
>> >>> >> Hi Cody,
>> >>> >>
>> >>> >> I think this would be a lot more concrete if we had a more detailed
>> >>> >> template
>> >>> >> for SIPs. Right now, it's not super clear what's in scope -- e.g.
>> >>> >> are
>> >>> >> they
>> >>> >> a way to solicit feedback on the user-facing behavior or on the
>> >>> >> internals?
>> >>> >> "Goals" can cover both things. I've been thinking of SIPs more as
>> >>> >> Product
>> >>> >> Requirements Docs (PRDs), which focus on *what* a code change
>> >>> >> should
>> >>> >> do
>> >>> >> as
>> >>> >> opposed to how.
>> >>> >>
>> >>> >> In particular, here are some things that you may or may not
>> >>> >> consider
>> >>> >> in
>> >>> >> scope for SIPs:
>> >>> >>
>> >>> >> - Goals and non-goals: This is definitely in scope, and IMO should
>> >>> >> focus on
>> >>> >> user-visible behavior (e.g. "system supports SQL window functions"
>> >>> >> or
>> >>> >> "system continues working if one node fails"). BTW I wouldn't say
>> >>> >> "rejected
>> >>> >> goals" because some of them might become goals later, so we're not
>> >>> >> definitively rejecting them.
>> >>> >>
>> >>> >> - Public API: Probably should be included in most SIPs unless it's
>> >>> >> too
>> >>> >> large
>> >>> >> to fully specify then (e.g. "let's add an ML library").
>> >>> >>
>> >>> >> - Use cases: I usually find this very useful in PRDs to better
>> >>> >> communicate
>> >>> >> the goals.
>> >>> >>
>> >>> >> - Internal architecture: This is usually *not* a thing users can
>> >>> >> easily
>> >>> >> comment on and it sounds more like a design doc item. Of course
>> >>> >> it's
>> >>> >> important to show that the SIP is feasible to implement. One
>> >>> >> exception,
>> >>> >> however, is that I think we'll have some SIPs primarily on
>> >>> >> internals
>> >>> >> (e.g.
>> >>> >> if somebody wants to refactor Spark's query optimizer or
>> >>> >> something).
>> >>> >>
>> >>> >> - Rejected strategies: I personally wouldn't put this, because
>> >>> >> what's
>> >>> >> the
>> >>> >> point of voting to reject a strategy before you've really begun
>> >>> >> designing
>> >>> >> and implementing something? What if you discover that the strategy
>> >>> >> is
>> >>> >> actually better when you start doing stuff?
>> >>> >>
>> >>> >> At a super high level, it depends on whether you want the SIPs to
>> >>> >> be
>> >>> >> PRDs
>> >>> >> for getting some quick feedback on the goals of a feature before it
>> >>> >> is
>> >>> >> designed, or something more like full-fledged design docs (just a
>> >>> >> more
>> >>> >> visible design doc for bigger changes). I looked at Kafka's KIPs,
>> >>> >> and
>> >>> >> they
>> >>> >> actually seem to be more like design docs. This can work too but it
>> >>> >> does
>> >>> >> require more work from the proposer and it can lead to the same
>> >>> >> problems you
>> >>> >> mentioned with people already having a design and implementation in
>> >>> >> mind.
>> >>> >>
>> >>> >> Basically, the question is, are you trying to iterate faster on
>> >>> >> design
>> >>> >> by
>> >>> >> adding a step for user feedback earlier? Or are you just trying to
>> >>> >> make
>> >>> >> design docs for key features more visible (and their approval more
>> >>> >> formal)?
>> >>> >>
>> >>> >> BTW note that in either case, I'd like to have a template for
>> >>> >> design
>> >>> >> docs
>> >>> >> too, which should also include goals. I think that would've avoided
>> >>> >> some of
>> >>> >> the issues you brought up.
>> >>> >>
>> >>> >> Matei
>> >>> >>
>> >>> >> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <[hidden email]> wrote:
>> >>> >>
>> >>> >> Here's my specific proposal (meta-proposal?)
>> >>> >>
>> >>> >> Spark Improvement Proposals (SIP)
>> >>> >>
>> >>> >>
>> >>> >> Background:
>> >>> >>
>> >>> >> The current problem is that design and implementation of large
>> >>> >> features
>> >>> >> are
>> >>> >> often done in private, before soliciting user feedback.
>> >>> >>
>> >>> >> When feedback is solicited, it is often as to detailed design
>> >>> >> specifics, not
>> >>> >> focused on goals.
>> >>> >>
>> >>> >> When implementation does take place after design, there is often
>> >>> >> disagreement as to what goals are or are not in scope.
>> >>> >>
>> >>> >> This results in commits that don't fully meet user needs.
>> >>> >>
>> >>> >>
>> >>> >> Goals:
>> >>> >>
>> >>> >> - Ensure user, contributor, and committer goals are clearly
>> >>> >> identified
>> >>> >> and
>> >>> >> agreed upon, before implementation takes place.
>> >>> >>
>> >>> >> - Ensure that a technically feasible strategy is chosen that is
>> >>> >> likely
>> >>> >> to
>> >>> >> meet the goals.
>> >>> >>
>> >>> >>
>> >>> >> Rejected Goals:
>> >>> >>
>> >>> >> - SIPs are not for detailed design.  Design by committee doesn't
>> >>> >> work.
>> >>> >>
>> >>> >> - SIPs are not for every change.  We don't need that much process.
>> >>> >>
>> >>> >>
>> >>> >> Strategy:
>> >>> >>
>> >>> >> My suggestion is outlined as a Spark Improvement Proposal process
>> >>> >> documented
>> >>> >> at
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >>> >>
>> >>> >> Specifics of Jira manipulation are an implementation detail we can
>> >>> >> figure
>> >>> >> out.
>> >>> >>
>> >>> >> I'm suggesting voting; the need here is for a _clear_ outcome.
>> >>> >>
>> >>> >>
>> >>> >> Rejected Strategies:
>> >>> >>
>> >>> >> Having someone who understands the problem implement it first
>> >>> >> works,
>> >>> >> but
>> >>> >> only if significant iteration after user feedback is allowed.
>> >>> >>
>> >>> >> Historically this has been problematic due to pressure to limit
>> >>> >> public
>> >>> >> api
>> >>> >> changes.
>> >>> >>
>> >>> >>
>> >>> >> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <[hidden email]>
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> Alright, looks like there is quite a bit of support. We should
>> >>> >>> wait
>> >>> >>> to
>> >>> >>> hear from more people too.
>> >>> >>>
>> >>> >>> To push this forward, Cody and I will be working together in the
>> >>> >>> next
>> >>> >>> couple of weeks to come up with a concrete, detailed proposal on
>> >>> >>> what
>> >>> >>> this
>> >>> >>> entails, and then we can discuss the specific proposal as
>> >>> >>> well.
>> >>> >>>
>> >>> >>>
>> >>> >>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <[hidden email]>
>> >>> >>> wrote:
>> >>> >>>>
>> >>> >>>> Yeah, in case it wasn't clear, I was talking about SIPs for major
>> >>> >>>> user-facing or cross-cutting changes, not minor feature adds.
>> >>> >>>>
>> >>> >>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos
>> >>> >>>> <[hidden email]> wrote:
>> >>> >>>>>
>> >>> >>>>> +1 to the SIP label as long as it does not slow down things and
>> >>> >>>>> it
>> >>> >>>>> targets optimizing efforts, coordination etc. For example really
>> >>> >>>>> small
>> >>> >>>>> features should not need to go through this process (assuming
>> >>> >>>>> they
>> >>> >>>>> don't
>> >>> >>>>> touch public interfaces)  or re-factorings and hope it will be
>> >>> >>>>> kept
>> >>> >>>>> this
>> >>> >>>>> way. So as a guideline doc should be provided, like in the KIP
>> >>> >>>>> case.
>> >>> >>>>>
>> >>> >>>>> IMHO so far aside from tagging things and linking them elsewhere
>> >>> >>>>> simply
>> >>> >>>>> having design docs and prototype implementations in PRs is not
>> >>> >>>>> something
>> >>> >>>>> that has not worked so far. What is really a pain in many
>> >>> >>>>> projects
>> >>> >>>>> out there
>> >>> >>>>> is discontinuity in progress of PRs, missing features, slow
>> >>> >>>>> reviews
>> >>> >>>>> which is
>> >>> >>>>> understandable to some extent... it is not only about Spark but
>> >>> >>>>> things can
>> >>> >>>>> be improved for sure for this project in particular as already
>> >>> >>>>> stated.
>> >>> >>>>>
>> >>> >>>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <[hidden email]>
>> >>> >>>>> wrote:
>> >>> >>>>>>
>> >>> >>>>>> +1 to adding an SIP label and linking it from the website.  I
>> >>> >>>>>> think
>> >>> >>>>>> it
>> >>> >>>>>> needs
>> >>> >>>>>>
>> >>> >>>>>> - template that focuses it towards soliciting user goals / non
>> >>> >>>>>> goals
>> >>> >>>>>> - clear resolution as to which strategy was chosen to pursue.
>> >>> >>>>>> I'd
>> >>> >>>>>> recommend a vote.
>> >>> >>>>>>
>> >>> >>>>>> Matei asked me to clarify what I meant by changing interfaces,
>> >>> >>>>>> I
>> >>> >>>>>> think
>> >>> >>>>>> it's directly relevant to the SIP idea so I'll clarify here,
>> >>> >>>>>> and
>> >>> >>>>>> split
>> >>> >>>>>> a thread for the other discussion per Nicholas' request.
>> >>> >>>>>>
>> >>> >>>>>> I meant changing public user interfaces.  I think the first
>> >>> >>>>>> design
>> >>> >>>>>> is
>> >>> >>>>>> unlikely to be right, because it's done at a time when you have
>> >>> >>>>>> the
>> >>> >>>>>> least information.  As a user, I find it considerably more
>> >>> >>>>>> frustrating
>> >>> >>>>>> to be unable to use a tool to get my job done, than I do having
>> >>> >>>>>> to
>> >>> >>>>>> make minor changes to my code in order to take advantage of
>> >>> >>>>>> features.
>> >>> >>>>>> I've seen committers be seriously reluctant to allow changes to
>> >>> >>>>>> @experimental code that are needed in order for it to really
>> >>> >>>>>> work
>> >>> >>>>>> right.  You need to be able to iterate, and if people on both
>> >>> >>>>>> sides
>> >>> >>>>>> of
>> >>> >>>>>> the fence aren't going to respect that some newer apis are
>> >>> >>>>>> subject
>> >>> >>>>>> to
>> >>> >>>>>> change, then why even mark them as such?
>> >>> >>>>>>
>> >>> >>>>>> Ideally a finished SIP should give me a checklist of things
>> >>> >>>>>> that
>> >>> >>>>>> an
>> >>> >>>>>> implementation must do, and things that it doesn't need to do.
>> >>> >>>>>> Contributors/committers should be seriously discouraged from
>> >>> >>>>>> putting
>> >>> >>>>>> out a version 0.1 that doesn't have at least a prototype
>> >>> >>>>>> implementation of all those things, especially if they're then
>> >>> >>>>>> going
>> >>> >>>>>> to argue against interface changes necessary to get the
>> >>> >>>>>> rest
>> >>> >>>>>> of
>> >>> >>>>>> the things done in the 0.2 version.
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden email]>
>> >>> >>>>>> wrote:
>> >>> >>>>>>> I like the lightweight proposal to add a SIP label.
>> >>> >>>>>>>
>> >>> >>>>>>> During Spark 2.0 development, Tom (Graves) and I suggested
>> >>> >>>>>>> using
>> >>> >>>>>>> wiki
>> >>> >>>>>>> to
>> >>> >>>>>>> track the list of major changes, but that never really
>> >>> >>>>>>> materialized
>> >>> >>>>>>> due to
>> >>> >>>>>>> the overhead. Adding a SIP label on major JIRAs and then link
>> >>> >>>>>>> to
>> >>> >>>>>>> them
>> >>> >>>>>>> prominently on the Spark website makes a lot of sense.
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia
>> >>> >>>>>>> <[hidden email]>
>> >>> >>>>>>> wrote:
>> >>> >>>>>>>>
>> >>> >>>>>>>> For the improvement proposals, I think one major point was to
>> >>> >>>>>>>> make
>> >>> >>>>>>>> them
>> >>> >>>>>>>> really visible to users who are not contributors, so we
>> >>> >>>>>>>> should
>> >>> >>>>>>>> do
>> >>> >>>>>>>> more than
>> >>> >>>>>>>> sending stuff to dev@. One very lightweight idea is to have a
>> >>> >>>>>>>> new
>> >>> >>>>>>>> type of
>> >>> >>>>>>>> JIRA called a SIP and have a link to a filter that shows all
>> >>> >>>>>>>> such
>> >>> >>>>>>>> JIRAs from
>> >>> >>>>>>>> http://spark.apache.org. I also like the idea of SIP and
>> >>> >>>>>>>> design
>> >>> >>>>>>>> doc
>> >>> >>>>>>>> templates (in fact many projects have them).
>> >>> >>>>>>>>
>> >>> >>>>>>>> Matei
>> >>> >>>>>>>>
>> >>> >>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]>
>> >>> >>>>>>>> wrote:
>> >>> >>>>>>>>
>> >>> >>>>>>>> I called Cody last night and talked about some of the topics
>> >>> >>>>>>>> in
>> >>> >>>>>>>> his
>> >>> >>>>>>>> email.
>> >>> >>>>>>>> It became clear to me Cody genuinely cares about the project.
>> >>> >>>>>>>>
>> >>> >>>>>>>> Some of the frustrations come from the success of the project
>> >>> >>>>>>>> itself
>> >>> >>>>>>>> becoming very "hot", and it is difficult to get clarity from
>> >>> >>>>>>>> people
>> >>> >>>>>>>> who
>> >>> >>>>>>>> don't dedicate all their time to Spark. In fact, it is in
>> >>> >>>>>>>> some
>> >>> >>>>>>>> ways
>> >>> >>>>>>>> similar
>> >>> >>>>>>>> to scaling an engineering team in a successful startup: old
>> >>> >>>>>>>> processes that
>> >>> >>>>>>>> worked well might not work so well when it gets to a certain
>> >>> >>>>>>>> size,
>> >>> >>>>>>>> cultures
>> >>> >>>>>>>> can get diluted, building culture vs building process, etc.
>> >>> >>>>>>>>
>> >>> >>>>>>>> I also really like to have a more visible process for larger
>> >>> >>>>>>>> changes,
>> >>> >>>>>>>> especially major user facing API changes. Historically we
>> >>> >>>>>>>> upload
>> >>> >>>>>>>> design docs
>> >>> >>>>>>>> for major changes, but it is not always consistent and it is
>> >>> >>>>>>>> difficult to ensure the quality of the docs, due to the
>> >>> >>>>>>>> volunteering nature of the organization.
>> >>> >>>>>>>>
>> >>> >>>>>>>> Some of the more concrete ideas we discussed focus on
>> >>> >>>>>>>> building a
>> >>> >>>>>>>> culture
>> >>> >>>>>>>> to improve clarity:
>> >>> >>>>>>>>
>> >>> >>>>>>>> - Process: Large changes should have design docs posted on
>> >>> >>>>>>>> JIRA.
>> >>> >>>>>>>> One
>> >>> >>>>>>>> thing
>> >>> >>>>>>>> Cody and I didn't discuss but an idea that just came to me is
>> >>> >>>>>>>> we
>> >>> >>>>>>>> should
>> >>> >>>>>>>> create a design doc template for the project and ask
>> >>> >>>>>>>> everybody
>> >>> >>>>>>>> to
>> >>> >>>>>>>> follow.
>> >>> >>>>>>>> The design doc template should also explicitly list goals and
>> >>> >>>>>>>> non-goals, to
>> >>> >>>>>>>> make design doc more consistent.
>> >>> >>>>>>>>
>> >>> >>>>>>>> - Process: Email dev@ to solicit feedback. We have done this
>> >>> >>>>>>>> with
>> >>> >>>>>>>> some
>> >>> >>>>>>>> changes, but again very inconsistent. Just posting something
>> >>> >>>>>>>> on
>> >>> >>>>>>>> JIRA
>> >>> >>>>>>>> isn't
>> >>> >>>>>>>> sufficient, because there are simply too many JIRAs and the
>> >>> >>>>>>>> signal
>> >>> >>>>>>>> gets lost
>> >>> >>>>>>>> in the noise. While this is generally impossible to enforce
>> >>> >>>>>>>> because
>> >>> >>>>>>>> we can't
>> >>> >>>>>>>> force all volunteers to conform to a process (or they might
>> >>> >>>>>>>> not
>> >>> >>>>>>>> even
>> >>> >>>>>>>> be
>> >>> >>>>>>>> aware of this),  those who are more familiar with the project
>> >>> >>>>>>>> can
>> >>> >>>>>>>> help by
>> >>> >>>>>>>> emailing the dev@ when they see something that hasn't been.
>> >>> >>>>>>>>
>> >>> >>>>>>>> - Culture: The design doc author(s) should be open to
>> >>> >>>>>>>> feedback.
>> >>> >>>>>>>> A
>> >>> >>>>>>>> design
>> >>> >>>>>>>> doc should serve as the base for discussion and is by no
>> >>> >>>>>>>> means
>> >>> >>>>>>>> the
>> >>> >>>>>>>> final
>> >>> >>>>>>>> design. Of course, this does not mean the author has to
>> >>> >>>>>>>> accept
>> >>> >>>>>>>> all
>> >>> >>>>>>>> feedback. They should also be comfortable accepting /
>> >>> >>>>>>>> rejecting
>> >>> >>>>>>>> ideas on
>> >>> >>>>>>>> technical grounds.
>> >>> >>>>>>>>
>> >>> >>>>>>>> - Process / Culture: For major ongoing projects, it can be
>> >>> >>>>>>>> useful
>> >>> >>>>>>>> to
>> >>> >>>>>>>> have
>> >>> >>>>>>>> some monthly Google hangouts that are open to the world. I am
>> >>> >>>>>>>> actually not
>> >>> >>>>>>>> sure how well this will work, because of the volunteering
>> >>> >>>>>>>> nature
>> >>> >>>>>>>> and
>> >>> >>>>>>>> we need
>> >>> >>>>>>>> to adjust for timezones for people across the globe, but it
>> >>> >>>>>>>> seems
>> >>> >>>>>>>> worth
>> >>> >>>>>>>> trying.
>> >>> >>>>>>>>
>> >>> >>>>>>>> - Culture: Contributors (including committers) should be more
>> >>> >>>>>>>> direct
>> >>> >>>>>>>> in
>> >>> >>>>>>>> setting expectations, including whether they are working on a
>> >>> >>>>>>>> specific
>> >>> >>>>>>>> issue, whether they will be working on a specific issue, and
>> >>> >>>>>>>> whether
>> >>> >>>>>>>> an
>> >>> >>>>>>>> issue or pr or jira should be rejected. Most people I know in
>> >>> >>>>>>>> this
>> >>> >>>>>>>> community
>> >>> >>>>>>>> are nice and don't enjoy telling other people no, but it is
>> >>> >>>>>>>> often
>> >>> >>>>>>>> more
>> >>> >>>>>>>> annoying to a contributor to not know anything than getting a
>> >>> >>>>>>>> no.
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia
>> >>> >>>>>>>> <[hidden email]>
>> >>> >>>>>>>> wrote:
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> Love the idea of a more visible "Spark Improvement Proposal"
>> >>> >>>>>>>>> process that
>> >>> >>>>>>>>> solicits user input on new APIs. For what it's worth, I
>> >>> >>>>>>>>> don't
>> >>> >>>>>>>>> think
>> >>> >>>>>>>>> committers are trying to minimize their own work -- every
>> >>> >>>>>>>>> committer
>> >>> >>>>>>>>> cares
>> >>> >>>>>>>>> about making the software useful for users. However, it is
>> >>> >>>>>>>>> always
>> >>> >>>>>>>>> hard to
>> >>> >>>>>>>>> get user input and so it helps to have this kind of process.
>> >>> >>>>>>>>> I've
>> >>> >>>>>>>>> certainly
>> >>> >>>>>>>>> looked at the *IPs a lot in other software I use just to see
>> >>> >>>>>>>>> the
>> >>> >>>>>>>>> biggest
>> >>> >>>>>>>>> things on the roadmap.
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> When you're talking about "changing interfaces", are you
>> >>> >>>>>>>>> talking
>> >>> >>>>>>>>> about
>> >>> >>>>>>>>> public or internal APIs? I do think many people hate
>> >>> >>>>>>>>> changing
>> >>> >>>>>>>>> public APIs
>> >>> >>>>>>>>> and I actually think that's for the best of the project.
>> >>> >>>>>>>>> That's
>> >>> >>>>>>>>> a
>> >>> >>>>>>>>> technical
>> >>> >>>>>>>>> debate, but basically, the worst thing when you're using a
>> >>> >>>>>>>>> piece
>> >>> >>>>>>>>> of
>> >>> >>>>>>>>> software
>> >>> >>>>>>>>> is that the developers constantly ask you to rewrite your
>> >>> >>>>>>>>> app
>> >>> >>>>>>>>> to
>> >>> >>>>>>>>> update to a
>> >>> >>>>>>>>> new version (and thus benefit from bug fixes, etc). Cue
>> >>> >>>>>>>>> anyone
>> >>> >>>>>>>>> who's used
>> >>> >>>>>>>>> Protobuf, or Guava. The "let's get everyone to change their
>> >>> >>>>>>>>> code
>> >>> >>>>>>>>> this
>> >>> >>>>>>>>> release" model works well within a single large company, but
>> >>> >>>>>>>>> doesn't work
>> >>> >>>>>>>>> well for a community, which is why nearly all *very* widely
>> >>> >>>>>>>>> used
>> >>> >>>>>>>>> programming
>> >>> >>>>>>>>> interfaces (I'm talking things like Java standard library,
>> >>> >>>>>>>>> Windows
>> >>> >>>>>>>>> API, etc)
>> >>> >>>>>>>>> almost *never* break backwards compatibility. All this is
>> >>> >>>>>>>>> done
>> >>> >>>>>>>>> within reason
>> >>> >>>>>>>>> though, e.g. we do change things in major releases (2.x,
>> >>> >>>>>>>>> 3.x,
>> >>> >>>>>>>>> etc).
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> ---------------------------------------------------------------------
>> >>> >>>>>> To unsubscribe e-mail: [hidden email]
>> >>> >>>>>>
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> --
>> >>> >>>>> Stavros Kontopoulos
>> >>> >>>>> Senior Software Engineer
>> >>> >>>>> Lightbend, Inc.
>> >>> >>>>> p: +30 6977967274
>> >>> >>>>> e: [hidden email]
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>
>> >>> >>>
>> >>> >>
>> >>> >>
>> >>>
>> >>
>> >
>> >
>> >
>> >
>>
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix





Re: Spark Improvement Proposals

Cody Koeninger-2
If someone wants to tell me that it's OK and "The Apache Way" for
Kafka and Flink to have a proposal process that ends in a lazy
majority, but it's not OK for Spark to have a proposal process that
ends in a non-lazy consensus...

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#KafkaImprovementProposals-Process

In practice any PMC member can stop a proposal they don't like, so I'm
not sure how much it matters.



On Mon, Oct 10, 2016 at 5:59 PM, Mark Hamstra <[hidden email]> wrote:

> There is a larger issue to keep in mind, and that is that what you are
> proposing is a procedure that, as far as I am aware, hasn't previously been
> adopted in an Apache project, and thus is not an easy or exact fit with
> established practices that have been blessed as "The Apache Way".  As such,
> we need to be careful, because we have run into some trouble in the past
> with some inside the ASF but essentially outside the Spark community who
> didn't like the way we were doing things.
>
> On Mon, Oct 10, 2016 at 3:53 PM, Cody Koeninger <[hidden email]> wrote:
>>
>> Apache documents say lots of confusing stuff, including that committers are
>> in practice given a vote.
>>
>> https://www.apache.org/foundation/voting.html
>>
>> I don't care either way, if someone wants me to sub committer for PMC in
>> the voting section, fine, we just need a clear outcome.
>>
>>
>> On Oct 10, 2016 17:36, "Mark Hamstra" <[hidden email]> wrote:
>>>
>>> If I'm correctly understanding the kind of voting that you are talking
>>> about, then to be accurate, it is only the PMC members that have a vote, not
>>> all committers:
>>> https://www.apache.org/foundation/how-it-works.html#pmc-members
>>>
>>> On Mon, Oct 10, 2016 at 12:02 PM, Cody Koeninger <[hidden email]>
>>> wrote:
>>>>
>>>> I think the main value is in being honest about what's going on.  No
>>>> one other than committers can cast a meaningful vote, that's the
>>>> reality.  Beyond that, if people think it's more open to allow formal
>>>> proposals from anyone, I'm not necessarily against it, but my main
>>>> question would be this:
>>>>
>>>> If anyone can submit a proposal, are committers actually going to
>>>> clearly reject and close proposals that don't meet the requirements?
>>>>
>>>> Right now we have a serious problem with lack of clarity regarding
>>>> contributions, and that cannot spill over into goal-setting.
>>>>
>>>> On Mon, Oct 10, 2016 at 1:54 PM, Ryan Blue <[hidden email]> wrote:
>>>> > +1 to votes to approve proposals. I agree that proposals should have
>>>> > an
>>>> > official mechanism to be accepted, and a vote is an established means
>>>> > of
>>>> > doing that well. I like that it includes a period to review the
>>>> > proposal and
>>>> > I think proposals should have been discussed enough ahead of a vote to
>>>> > survive the possibility of a veto.
>>>> >
>>>> > I also like the names that are short and (mostly) unique, like SEP.
>>>> >
>>>> > Where I disagree is with the requirement that a committer must
>>>> > formally
>>>> > propose an enhancement. I don't see the value of restricting this: if
>>>> > someone has the will to write up a proposal then they should be
>>>> > encouraged
>>>> > to do so and start a discussion about it. Even if there is a political
>>>> > reality as Cody says, what is the value of codifying that in our
>>>> > process? I
>>>> > think restricting who can submit proposals would only undermine them
>>>> > by
>>>> > pushing contributors out. Maybe I'm missing something here?
>>>> >
>>>> > rb
>>>> >
>>>> >
>>>> >
>>>> > On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <[hidden email]>
>>>> > wrote:
>>>> >>
>>>> >> Yes, users suggesting SIPs is a good thing and is explicitly called
>>>> >> out in the linked document under the Who? section.  Formally
>>>> >> proposing
>>>> >> them, not so much, because of the political realities.
>>>> >>
>>>> >> Yes, implementation strategy definitely affects goals.  There are all
>>>> >> kinds of examples of this, I'll pick one that's my fault so as to
>>>> >> avoid sounding like I'm blaming:
>>>> >>
>>>> >> When I implemented the Kafka DStream, one of my (not explicitly
>>>> >> agreed
>>>> >> upon by the community) goals was to make sure people could use the
>>>> >> DStream however they were already using Kafka at work.  The lack
>>>> >> of explicit agreement on that goal led to all kinds of fighting with
>>>> >> committers, that could have been avoided.  The lack of explicit
>>>> >> up-front strategy discussion led to the DStream not really working
>>>> >> with compacted topics.  I knew about compacted topics, but don't have
>>>> >> a use for them, so had a blind spot there.  If there was explicit
>>>> >> up-front discussion that my strategy was "assume that batches can be
>>>> >> defined on the driver solely by beginning and ending offsets",
>>>> >> there's
>>>> >> a greater chance that a user would have seen that and said, "hey,
>>>> >> what
>>>> >> about non-contiguous offsets in a compacted topic".
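The compacted-topic blind spot described above can be illustrated with a toy model. This is only a sketch: it does not use any Kafka or Spark API, and the names `expected_count`, `read_range`, `full_log`, and `compacted_log` are hypothetical, invented for illustration. It assumes a partition can be modeled as a list of (offset, value) pairs and that a batch is defined by a half-open offset range.

```python
# Toy model only: a partition is a list of (offset, value) pairs.
# All names here are hypothetical and not part of Kafka or Spark.

def expected_count(from_offset, until_offset):
    """Record count implied by an offset range if offsets are contiguous."""
    return until_offset - from_offset


def read_range(log, from_offset, until_offset):
    """Records actually present in the half-open range [from_offset, until_offset)."""
    return [(o, v) for (o, v) in log if from_offset <= o < until_offset]


# Uncompacted log: every offset from 0 to 5 is present.
full_log = [(o, f"v{o}") for o in range(6)]

# Compacted log: older records for repeated keys were removed, leaving gaps.
compacted_log = [(0, "v0"), (3, "v3"), (5, "v5")]

full = read_range(full_log, 0, 6)
compacted = read_range(compacted_log, 0, 6)

# The range-based estimate matches the uncompacted log but not the compacted one.
print(expected_count(0, 6), len(full), len(compacted))
```

Any logic that derives a batch's size or completeness from the ending offset minus the beginning offset holds for the first log and fails for the second, which is the kind of gap an up-front strategy statement could have exposed to users.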
>>>> >>
>>>> >> This kind of thing is only going to happen smoothly if we have a
>>>> >> lightweight user-visible process with clear outcomes.
>>>> >>
>>>> >> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>>> >> <[hidden email]> wrote:
>>>> >> > I agree with most of what Cody said.
>>>> >> >
>>>> >> > Two things:
>>>> >> >
>>>> >> > First we can always have other people suggest SIPs but mark them as
>>>> >> > “unreviewed” and have committers basically move them forward. The
>>>> >> > problem is
>>>> >> > that writing a good document takes time. This way we can leverage
>>>> >> > non
>>>> >> > committers to do some of this work (it is just another way to
>>>> >> > contribute).
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > As for strategy, in many cases implementation strategy can affect
>>>> >> > the
>>>> >> > goals.
>>>> >> > I will give  a small example: In the current structured streaming
>>>> >> > strategy,
>>>> >> > we group by the time to achieve a sliding window. This is
>>>> >> > definitely an
>>>> >> > implementation decision and not a goal. However, I can think of
>>>> >> > several
>>>> >> > aggregation functions which have the time inside their calculation
>>>> >> > buffer.
>>>> >> > For example, let’s say we want to return a set of all distinct
>>>> >> > values.
>>>> >> > One
>>>> >> > way to implement this would be to make the set into a map and have
>>>> >> > the
>>>> >> > value
>>>> >> > contain the last time seen. Multiplying it across the groupby would
>>>> >> > cost
>>>> >> > a
>>>> >> > lot in performance. So adding such a strategy would have a great
>>>> >> > effect
>>>> >> > on
>>>> >> > the type of aggregations and their performance which does affect
>>>> >> > the
>>>> >> > goal.
>>>> >> > Without adding the strategy, it is easy for whoever goes to the
>>>> >> > design
>>>> >> > document to not think about these cases. Furthermore, it might be
>>>> >> > decided
>>>> >> > that these cases are rare enough so that the strategy is still good
>>>> >> > enough
>>>> >> > but how would we know it without user feedback?
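The trade-off in this example can be sketched in plain Python. This is an illustration under stated assumptions, not how structured streaming is implemented: `WINDOW`, `SLIDE`, `windows_for`, `distinct_by_window`, and `distinct_with_last_seen` are hypothetical names, timestamps are integers, and windows with negative start times are ignored. The first function keeps one set per sliding window (the group-by-time approach); the second keeps a single map from value to last-seen time and can only answer for the most recent window.

```python
from collections import defaultdict

WINDOW = 10  # window length, in arbitrary time units (assumed)
SLIDE = 5    # slide interval (assumed)


def windows_for(ts):
    """Start times of all windows [start, start + WINDOW) that contain ts."""
    last_start = (ts // SLIDE) * SLIDE
    first_start = last_start - WINDOW + SLIDE
    # Windows that would start before time 0 are dropped in this sketch.
    return [s for s in range(first_start, last_start + 1, SLIDE) if s >= 0]


def distinct_by_window(events):
    """Group-by-time approach: each event is copied into every window it falls in."""
    per_window = defaultdict(set)
    for ts, value in events:
        for start in windows_for(ts):
            per_window[start].add(value)
    return per_window


def distinct_with_last_seen(events, now):
    """Single map of value -> last-seen time, filtered for the window ending at now.

    Assumes every event has a timestamp earlier than now.
    """
    last_seen = {}
    for ts, value in events:
        last_seen[value] = max(ts, last_seen.get(value, ts))
    return {v for v, ts in last_seen.items() if now - WINDOW <= ts < now}


events = [(1, "a"), (3, "b"), (7, "a"), (12, "c")]
per_window = distinct_by_window(events)
latest = distinct_with_last_seen(events, now=15)

# Both approaches agree on the window [5, 15), but they hold different amounts of state.
state_grouped = sum(len(values) for values in per_window.values())
print(sorted(per_window[5]), sorted(latest), state_grouped)
```

With overlapping windows the grouped approach stores each value once per window it appears in, while the last-seen map stores it once overall. Whether that difference matters depends on the workload, which is the point about needing user feedback on strategy.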
>>>> >> >
>>>> >> > I believe this example is exactly what Cody was talking about.
>>>> >> > Since
>>>> >> > many
>>>> >> > times implementation strategies have a large effect on the goal, we
>>>> >> > should
>>>> >> > have it discussed when discussing the goals. In addition, while it
>>>> >> > is
>>>> >> > often
>>>> >> > easy to throw out completely infeasible goals, it is often much
>>>> >> > harder
>>>> >> > to
>>>> >> > figure out that the goals are unfeasible without fine tuning.
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > Assaf.
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > From: Cody Koeninger-2 [via Apache Spark Developers List]
>>>> >> > [mailto:ml-node+[hidden email]]
>>>> >> > Sent: Monday, October 10, 2016 2:25 AM
>>>> >> > To: Mendelson, Assaf
>>>> >> > Subject: Re: Spark Improvement Proposals
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > Only committers should formally submit SIPs because in an apache
>>>> >> > project only committers have explicit political power.  If a user
>>>> >> > can't
>>>> >> > find a committer willing to sponsor an SIP idea, they have no way to
>>>> >> > get the idea passed in any case.  If I can't find a committer to
>>>> >> > sponsor this meta-SIP idea, I'm out of luck.
>>>> >> >
>>>> >> > I do not believe unrealistic goals can be found solely by
>>>> >> > inspection.
>>>> >> > We've managed to ignore unrealistic goals even after
>>>> >> > implementation!
>>>> >> > Focusing on APIs can allow people to think they've solved
>>>> >> > something,
>>>> >> > when there's really no way of implementing that API while meeting
>>>> >> > the
>>>> >> > goals.  Rapid iteration is clearly the best way to address this,
>>>> >> > but
>>>> >> > we've already talked about why that hasn't really worked.  If
>>>> >> > adding a
>>>> >> > non-binding API section to the template is important to you, I'm
>>>> >> > not
>>>> >> > against it, but I don't think it's sufficient.
>>>> >> >
>>>> >> > On your PRD vs design doc spectrum, I'm saying this is closer to a
>>>> >> > PRD.  Clear agreement on goals is the most important thing and
>>>> >> > that's
>>>> >> > why it's the thing I want binding agreement on.  But I cannot agree
>>>> >> > to
>>>> >> > goals unless I have enough minimal technical info to judge whether
>>>> >> > the
>>>> >> > goals are likely to actually be accomplished.
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]>
>>>> >> > wrote:
>>>> >> >
>>>> >> >
>>>> >> >> Well, I think there are a few things here that don't make sense.
>>>> >> >> First,
>>>> >> >> why
>>>> >> >> should only committers submit SIPs? Development in the project
>>>> >> >> should
>>>> >> >> be
>>>> >> >> open to all contributors, whether they're committers or not.
>>>> >> >> Second, I
>>>> >> >> think
>>>> >> >> unrealistic goals can be found just by inspecting the goals, and
>>>> >> >> I'm
>>>> >> >> not
>>>> >> >> super worried that we'll accept a lot of SIPs that are then
>>>> >> >> infeasible
>>>> >> >> --
>>>> >> >> we
>>>> >> >> can then submit new ones. But this depends on whether you want
>>>> >> >> this
>>>> >> >> process
>>>> >> >> to be a "design doc lite", where people also agree on
>>>> >> >> implementation
>>>> >> >> strategy, or just a way to agree on goals. This is what I asked
>>>> >> >> earlier
>>>> >> >> about PRDs vs design docs (and I'm open to either one but I'd just
>>>> >> >> like
>>>> >> >> clarity). Finally, both as a user and designer of software, I
>>>> >> >> always
>>>> >> >> want
>>>> >> >> to
>>>> >> >> give feedback on APIs, so I'd really like a culture of having
>>>> >> >> those
>>>> >> >> early.
>>>> >> >> People don't argue about prettiness when they discuss APIs, they
>>>> >> >> argue
>>>> >> >> about
>>>> >> >> the core concepts to expose in order to meet various goals, and
>>>> >> >> then
>>>> >> >> they're
>>>> >> >> stuck maintaining those for a long time.
>>>> >> >>
>>>> >> >> Matei
>>>> >> >>
>>>> >> >> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <[hidden email]> wrote:
>>>> >> >>
>>>> >> >> Users instead of people, sure.  Committers and contributors are (or
>>>> >> >> at
>>>> >> >> least
>>>> >> >> should be) a subset of users.
>>>> >> >>
>>>> >> >> Non goals, sure. I don't care what the name is, but we need to
>>>> >> >> clearly
>>>> >> >> say
>>>> >> >> e.g. 'no we are not maintaining compatibility with XYZ right now'.
>>>> >> >>
>>>> >> >> API, what I care most about is whether it allows me to accomplish
>>>> >> >> the
>>>> >> >> goals.
>>>> >> >> Arguing about how ugly or pretty it is can be saved for design/
>>>> >> >> implementation imho.
>>>> >> >>
>>>> >> >> Strategy, this is necessary because otherwise goals can be out of
>>>> >> >> line
>>>> >> >> with
>>>> >> >> reality.  Don't propose goals you don't have at least some idea of
>>>> >> >> how
>>>> >> >> to
>>>> >> >> implement.
>>>> >> >>
>>>> >> >> Rejected strategies, given that committers are the only ones I'm
>>>> >> >> saying
>>>> >> >> should formally submit SPARKLIs or SIPs, if they put junk in a
>>>> >> >> required
>>>> >> >> section then slap them down for it and tell them to fix it.
>>>> >> >>
>>>> >> >>
>>>> >> >> On Oct 9, 2016 4:36 PM, "Matei Zaharia" <[hidden email]> wrote:
>>>> >> >>>
>>>> >> >>> Yup, this is the stuff that I found unclear. Thanks for
>>>> >> >>> clarifying
>>>> >> >>> here,
>>>> >> >>> but we should also clarify it in the writeup. In particular:
>>>> >> >>>
>>>> >> >>> - Goals needs to be about user-facing behavior ("people" is
>>>> >> >>> broad)
>>>> >> >>>
>>>> >> >>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will
>>>> >> >>> dig
>>>> >> >>> up
>>>> >> >>> one of these and say "Spark's developers have officially rejected
>>>> >> >>> X,
>>>> >> >>> which
>>>> >> >>> our awesome system has".
>>>> >> >>>
>>>> >> >>> - For user-facing stuff, I think you need a section on API.
>>>> >> >>> Virtually
>>>> >> >>> all
>>>> >> >>> other *IPs I've seen have that.
>>>> >> >>>
>>>> >> >>> - I'm still not sure why the strategy section is needed if the
>>>> >> >>> purpose
>>>> >> >>> is
>>>> >> >>> to define user-facing behavior -- unless this is the strategy for
>>>> >> >>> setting
>>>> >> >>> the goals or for defining the API. That sounds squarely like a
>>>> >> >>> design
>>>> >> >>> doc
>>>> >> >>> issue. In some sense, who cares whether the proposal is
>>>> >> >>> technically
>>>> >> >>> feasible
>>>> >> >>> right now? If it's infeasible, that will be discovered later
>>>> >> >>> during
>>>> >> >>> design
>>>> >> >>> and implementation. Same thing with rejected strategies --
>>>> >> >>> listing
>>>> >> >>> some
>>>> >> >>> of
>>>> >> >>> those is definitely useful sometimes, but if you make this a
>>>> >> >>> *required*
>>>> >> >>> section, people are just going to fill it in with bogus stuff
>>>> >> >>> (I've
>>>> >> >>> seen
>>>> >> >>> this happen before).
>>>> >> >>>
>>>> >> >>> Matei
>>>> >> >>>
>>>> >> >
>>>> >> >>> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <[hidden email]>
>>>> >> >>> > wrote:
>>>> >> >>> >
>>>> >> >>> > So to focus the discussion on the specific strategy I'm
>>>> >> >>> > suggesting,
>>>> >> >>> > documented at
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>> >> >>> >
>>>> >> >>> > "Goals: What must this allow people to do, that they can't
>>>> >> >>> > currently?"
>>>> >> >>> >
>>>> >> >>> > Is it unclear that this is focusing specifically on
>>>> >> >>> > people-visible
>>>> >> >>> > behavior?
>>>> >> >>> >
>>>> >> >>> > Rejected goals -  are important because otherwise people keep
>>>> >> >>> > trying
>>>> >> >>> > to argue about scope.  Of course you can change things later
>>>> >> >>> > with a
>>>> >> >>> > different SIP and different vote, the point is to focus.
>>>> >> >>> >
>>>> >> >>> > Use cases - are something that people are going to bring up in
>>>> >> >>> > discussion.  If they aren't clearly documented as a goal ("This
>>>> >> >>> > must
>>>> >> >>> > allow me to connect using SSL"), they should be added.
>>>> >> >>> >
>>>> >> >>> > Internal architecture - if the people who need specific
>>>> >> >>> > behavior are
>>>> >> >>> > implementers of other parts of the system, that's fine.
>>>> >> >>> >
>>>> >> >>> > Rejected strategies - If you have none of these, you have no
>>>> >> >>> > evidence
>>>> >> >>> > that the proponent didn't just go with the first thing they had
>>>> >> >>> > in
>>>> >> >>> > mind (or have already implemented), which is a big problem
>>>> >> >>> > currently.
>>>> >> >>> > Approval isn't binding as to specifics of implementation, so
>>>> >> >>> > these
>>>> >> >>> > aren't handcuffs.  The goals are the contract, the strategy is
>>>> >> >>> > evidence that contract can actually be met.
>>>> >> >>> >
>>>> >> >>> > Design docs - I'm not touching design docs.  The markdown file
>>>> >> >>> > I
>>>> >> >>> > linked specifically says of the strategy section "This is not a
>>>> >> >>> > full
>>>> >> >>> > design document."  Is this unclear?  Design docs can be worked
>>>> >> >>> > on
>>>> >> >>> > obviously, but that's not what I'm concerned with here.
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <[hidden email]>
>>>> >> >>> > wrote:
>>>> >> >>> >> Hi Cody,
>>>> >> >>> >>
>>>> >> >>> >> I think this would be a lot more concrete if we had a more
>>>> >> >>> >> detailed
>>>> >> >>> >> template
>>>> >> >>> >> for SIPs. Right now, it's not super clear what's in scope --
>>>> >> >>> >> e.g.
>>>> >> >>> >> are
>>>> >> >>> >> they
>>>> >> >>> >> a way to solicit feedback on the user-facing behavior or on
>>>> >> >>> >> the
>>>> >> >>> >> internals?
>>>> >> >>> >> "Goals" can cover both things. I've been thinking of SIPs more
>>>> >> >>> >> as
>>>> >> >>> >> Product
>>>> >> >>> >> Requirements Docs (PRDs), which focus on *what* a code change
>>>> >> >>> >> should
>>>> >> >>> >> do
>>>> >> >>> >> as
>>>> >> >>> >> opposed to how.
>>>> >> >>> >>
>>>> >> >>> >> In particular, here are some things that you may or may not
>>>> >> >>> >> consider
>>>> >> >>> >> in
>>>> >> >>> >> scope for SIPs:
>>>> >> >>> >>
>>>> >> >>> >> - Goals and non-goals: This is definitely in scope, and IMO
>>>> >> >>> >> should
>>>> >> >>> >> focus on
>>>> >> >>> >> user-visible behavior (e.g. "system supports SQL window
>>>> >> >>> >> functions"
>>>> >> >>> >> or
>>>> >> >>> >> "system continues working if one node fails"). BTW I wouldn't
>>>> >> >>> >> say
>>>> >> >>> >> "rejected
>>>> >> >>> >> goals" because some of them might become goals later, so we're
>>>> >> >>> >> not
>>>> >> >>> >> definitively rejecting them.
>>>> >> >>> >>
>>>> >> >>> >> - Public API: Probably should be included in most SIPs unless
>>>> >> >>> >> it's
>>>> >> >>> >> too
>>>> >> >>> >> large
>>>> >> >>> >> to fully specify then (e.g. "let's add an ML library").
>>>> >> >>> >>
>>>> >> >>> >> - Use cases: I usually find this very useful in PRDs to better
>>>> >> >>> >> communicate
>>>> >> >>> >> the goals.
>>>> >> >>> >>
>>>> >> >>> >> - Internal architecture: This is usually *not* a thing users
>>>> >> >>> >> can
>>>> >> >>> >> easily
>>>> >> >>> >> comment on and it sounds more like a design doc item. Of
>>>> >> >>> >> course
>>>> >> >>> >> it's
>>>> >> >>> >> important to show that the SIP is feasible to implement. One
>>>> >> >>> >> exception,
>>>> >> >>> >> however, is that I think we'll have some SIPs primarily on
>>>> >> >>> >> internals
>>>> >> >>> >> (e.g.
>>>> >> >>> >> if somebody wants to refactor Spark's query optimizer or
>>>> >> >>> >> something).
>>>> >> >>> >>
>>>> >> >>> >> - Rejected strategies: I personally wouldn't put this, because
>>>> >> >>> >> what's
>>>> >> >>> >> the
>>>> >> >>> >> point of voting to reject a strategy before you've really
>>>> >> >>> >> begun
>>>> >> >>> >> designing
>>>> >> >>> >> and implementing something? What if you discover that the
>>>> >> >>> >> strategy
>>>> >> >>> >> is
>>>> >> >>> >> actually better when you start doing stuff?
>>>> >> >>> >>
>>>> >> >>> >> At a super high level, it depends on whether you want the SIPs
>>>> >> >>> >> to
>>>> >> >>> >> be
>>>> >> >>> >> PRDs
>>>> >> >>> >> for getting some quick feedback on the goals of a feature
>>>> >> >>> >> before it
>>>> >> >>> >> is
>>>> >> >>> >> designed, or something more like full-fledged design docs
>>>> >> >>> >> (just a
>>>> >> >>> >> more
>>>> >> >>> >> visible design doc for bigger changes). I looked at Kafka's
>>>> >> >>> >> KIPs,
>>>> >> >>> >> and
>>>> >> >>> >> they
>>>> >> >>> >> actually seem to be more like design docs. This can work too
>>>> >> >>> >> but it
>>>> >> >>> >> does
>>>> >> >>> >> require more work from the proposer and it can lead to the
>>>> >> >>> >> same
>>>> >> >>> >> problems you
>>>> >> >>> >> mentioned with people already having a design and
>>>> >> >>> >> implementation in
>>>> >> >>> >> mind.
>>>> >> >>> >>
>>>> >> >>> >> Basically, the question is, are you trying to iterate faster
>>>> >> >>> >> on
>>>> >> >>> >> design
>>>> >> >>> >> by
>>>> >> >>> >> adding a step for user feedback earlier? Or are you just
>>>> >> >>> >> trying to
>>>> >> >>> >> make
>>>> >> >>> >> design docs for key features more visible (and their approval
>>>> >> >>> >> more
>>>> >> >>> >> formal)?
>>>> >> >>> >>
>>>> >> >>> >> BTW note that in either case, I'd like to have a template for
>>>> >> >>> >> design
>>>> >> >>> >> docs
>>>> >> >>> >> too, which should also include goals. I think that would've
>>>> >> >>> >> avoided
>>>> >> >>> >> some of
>>>> >> >>> >> the issues you brought up.
>>>> >> >>> >>
>>>> >> >>> >> Matei
>>>> >> >>> >>
>>>> >> >>> >> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <[hidden email]>
>>>> >> >>> >> wrote:
>>>> >> >>> >>
>>>> >> >>> >> Here's my specific proposal (meta-proposal?)
>>>> >> >>> >>
>>>> >> >>> >> Spark Improvement Proposals (SIP)
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Background:
>>>> >> >>> >>
>>>> >> >>> >> The current problem is that design and implementation of large
>>>> >> >>> >> features
>>>> >> >>> >> are
>>>> >> >>> >> often done in private, before soliciting user feedback.
>>>> >> >>> >>
>>>> >> >>> >> When feedback is solicited, it is often as to detailed design
>>>> >> >>> >> specifics, not
>>>> >> >>> >> focused on goals.
>>>> >> >>> >>
>>>> >> >>> >> When implementation does take place after design, there is
>>>> >> >>> >> often
>>>> >> >>> >> disagreement as to what goals are or are not in scope.
>>>> >> >>> >>
>>>> >> >>> >> This results in commits that don't fully meet user needs.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Goals:
>>>> >> >>> >>
>>>> >> >>> >> - Ensure user, contributor, and committer goals are clearly
>>>> >> >>> >> identified
>>>> >> >>> >> and
>>>> >> >>> >> agreed upon, before implementation takes place.
>>>> >> >>> >>
>>>> >> >>> >> - Ensure that a technically feasible strategy is chosen that
>>>> >> >>> >> is
>>>> >> >>> >> likely
>>>> >> >>> >> to
>>>> >> >>> >> meet the goals.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Rejected Goals:
>>>> >> >>> >>
>>>> >> >>> >> - SIPs are not for detailed design.  Design by committee
>>>> >> >>> >> doesn't
>>>> >> >>> >> work.
>>>> >> >>> >>
>>>> >> >>> >> - SIPs are not for every change.  We don't need that much
>>>> >> >>> >> process.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Strategy:
>>>> >> >>> >>
>>>> >> >>> >> My suggestion is outlined as a Spark Improvement Proposal
>>>> >> >>> >> process
>>>> >> >>> >> documented
>>>> >> >>> >> at
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>> >> >>> >>
>>>> >> >>> >> Specifics of Jira manipulation are an implementation detail we
>>>> >> >>> >> can
>>>> >> >>> >> figure
>>>> >> >>> >> out.
>>>> >> >>> >>
>>>> >> >>> >> I'm suggesting voting; the need here is for a _clear_ outcome.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Rejected Strategies:
>>>> >> >>> >>
>>>> >> >>> >> Having someone who understands the problem implement it first
>>>> >> >>> >> works,
>>>> >> >>> >> but
>>>> >> >>> >> only if significant iteration after user feedback is allowed.
>>>> >> >>> >>
>>>> >> >>> >> Historically this has been problematic due to pressure to
>>>> >> >>> >> limit
>>>> >> >>> >> public
>>>> >> >>> >> api
>>>> >> >>> >> changes.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <[hidden email]>
>>>> >> >>> >> wrote:
>>>> >> >>> >>>
>>>> >> >>> >>> Alright, looks like there is quite a bit of support. We
>>>> >> >>> >>> should
>>>> >> >>> >>> wait
>>>> >> >>> >>> to
>>>> >> >>> >>> hear from more people too.
>>>> >> >>> >>>
>>>> >> >>> >>> To push this forward, Cody and I will be working together in
>>>> >> >>> >>> the
>>>> >> >>> >>> next
>>>> >> >>> >>> couple of weeks to come up with a concrete, detailed proposal
>>>> >> >>> >>> on
>>>> >> >>> >>> what
>>>> >> >>> >>> this
>>>> >> >>> >>> entails, and then we can discuss the specific proposal
>>>> >> >>> >>> as
>>>> >> >>> >>> well.
>>>> >> >>> >>>
>>>> >> >>> >>>
>>>> >> >>> >>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <[hidden
>>>> >> >>> >>> email]>
>>>> >> >>> >>> wrote:
>>>> >> >>> >>>>
>>>> >> >>> >>>> Yeah, in case it wasn't clear, I was talking about SIPs for
>>>> >> >>> >>>> major
>>>> >> >>> >>>> user-facing or cross-cutting changes, not minor feature
>>>> >> >>> >>>> adds.
>>>> >> >>> >>>>
>>>> >> >>> >>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos
>>>> >> >>> >>>> <[hidden email]> wrote:
>>>> >> >>> >>>>>
>>>> >> >>> >>>>> +1 to the SIP label as long as it does not slow down things
>>>> >> >>> >>>>> and
>>>> >> >>> >>>>> it
>>>> >> >>> >>>>> targets optimizing efforts, coordination etc. For example
>>>> >> >>> >>>>> really
>>>> >> >>> >>>>> small
>>>> >> >>> >>>>> features should not need to go through this process
>>>> >> >>> >>>>> (assuming
>>>> >> >>> >>>>> they
>>>> >> >>> >>>>> don't
>>>> >> >>> >>>>> touch public interfaces)  or re-factorings and hope it will
>>>> >> >>> >>>>> be
>>>> >> >>> >>>>> kept
>>>> >> >>> >>>>> this
>>>> >> >>> >>>>> way. So as a guideline doc should be provided, like in the
>>>> >> >>> >>>>> KIP
>>>> >> >>> >>>>> case.
>>>> >> >>> >>>>>
>>>> >> >>> >>>>> IMHO so far aside from tagging things and linking them
>>>> >> >>> >>>>> elsewhere
>>>> >> >>> >>>>> simply
>>>> >> >>> >>>>> having design docs and prototype implementations in PRs is
>>>> >> >>> >>>>> not
>>>> >> >>> >>>>> something
>>>> >> >>> >>>>> that has not worked so far. What is really a pain in many
>>>> >> >>> >>>>> projects
>>>> >> >>> >>>>> out there
>>>> >> >>> >>>>> is discontinuity in progress of PRs, missing features, slow
>>>> >> >>> >>>>> reviews
>>>> >> >>> >>>>> which is
>>>> >> >>> >>>>> understandable to some extent... it is not only about Spark
>>>> >> >>> >>>>> but
>>>> >> >>> >>>>> things can
>>>> >> >>> >>>>> be improved for sure for this project in particular as
>>>> >> >>> >>>>> already
>>>> >> >>> >>>>> stated.
>>>> >> >>> >>>>>
>>>> >> >>> >>>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <[hidden
>>>> >> >>> >>>>> email]>
>>>> >> >>> >>>>> wrote:
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> +1 to adding an SIP label and linking it from the website.
>>>> >> >>> >>>>>> I
>>>> >> >>> >>>>>> think
>>>> >> >>> >>>>>> it
>>>> >> >>> >>>>>> needs
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> - template that focuses it towards soliciting user goals /
>>>> >> >>> >>>>>> non
>>>> >> >>> >>>>>> goals
>>>> >> >>> >>>>>> - clear resolution as to which strategy was chosen to
>>>> >> >>> >>>>>> pursue.
>>>> >> >>> >>>>>> I'd
>>>> >> >>> >>>>>> recommend a vote.
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> Matei asked me to clarify what I meant by changing
>>>> >> >>> >>>>>> interfaces,
>>>> >> >>> >>>>>> I
>>>> >> >>> >>>>>> think
>>>> >> >>> >>>>>> it's directly relevant to the SIP idea so I'll clarify
>>>> >> >>> >>>>>> here,
>>>> >> >>> >>>>>> and
>>>> >> >>> >>>>>> split
>>>> >> >>> >>>>>> a thread for the other discussion per Nicholas' request.
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> I meant changing public user interfaces.  I think the
>>>> >> >>> >>>>>> first
>>>> >> >>> >>>>>> design
>>>> >> >>> >>>>>> is
>>>> >> >>> >>>>>> unlikely to be right, because it's done at a time when you
>>>> >> >>> >>>>>> have
>>>> >> >>> >>>>>> the
>>>> >> >>> >>>>>> least information.  As a user, I find it considerably more
>>>> >> >>> >>>>>> frustrating
>>>> >> >>> >>>>>> to be unable to use a tool to get my job done, than I do
>>>> >> >>> >>>>>> having
>>>> >> >>> >>>>>> to
>>>> >> >>> >>>>>> make minor changes to my code in order to take advantage
>>>> >> >>> >>>>>> of
>>>> >> >>> >>>>>> features.
>>>> >> >>> >>>>>> I've seen committers be seriously reluctant to allow
>>>> >> >>> >>>>>> changes to
>>>> >> >>> >>>>>> @experimental code that are needed in order for it to
>>>> >> >>> >>>>>> really
>>>> >> >>> >>>>>> work
>>>> >> >>> >>>>>> right.  You need to be able to iterate, and if people on
>>>> >> >>> >>>>>> both
>>>> >> >>> >>>>>> sides
>>>> >> >>> >>>>>> of
>>>> >> >>> >>>>>> the fence aren't going to respect that some newer apis are
>>>> >> >>> >>>>>> subject
>>>> >> >>> >>>>>> to
>>>> >> >>> >>>>>> change, then why even mark them as such?
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> Ideally a finished SIP should give me a checklist of
>>>> >> >>> >>>>>> things
>>>> >> >>> >>>>>> that
>>>> >> >>> >>>>>> an
>>>> >> >>> >>>>>> implementation must do, and things that it doesn't need to
>>>> >> >>> >>>>>> do.
>>>> >> >>> >>>>>> Contributors/committers should be seriously discouraged
>>>> >> >>> >>>>>> from
>>>> >> >>> >>>>>> putting
>>>> >> >>> >>>>>> out a version 0.1 that doesn't have at least a prototype
>>>> >> >>> >>>>>> implementation of all those things, especially if they're
>>>> >> >>> >>>>>> then
>>>> >> >>> >>>>>> going
>>>> >> >>> >>>>>> to argue against interface changes necessary to get the
>>>> >> >>> >>>>>> rest
>>>> >> >>> >>>>>> of
>>>> >> >>> >>>>>> the things done in the 0.2 version.
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden
>>>> >> >>> >>>>>> email]>
>>>> >> >>> >>>>>> wrote:
>>>> >> >>> >>>>>>> I like the lightweight proposal to add a SIP label.
>>>> >> >>> >>>>>>>
>>>> >> >>> >>>>>>> During Spark 2.0 development, Tom (Graves) and I
>>>> >> >>> >>>>>>> suggested
>>>> >> >>> >>>>>>> using
>>>> >> >>> >>>>>>> wiki
>>>> >> >>> >>>>>>> to
>>>> >> >>> >>>>>>> track the list of major changes, but that never really
>>>> >> >>> >>>>>>> materialized
>>>> >> >>> >>>>>>> due to
>>>> >> >>> >>>>>>> the overhead. Adding a SIP label on major JIRAs and then
>>>> >> >>> >>>>>>> link
>>>> >> >>> >>>>>>> to
>>>> >> >>> >>>>>>> them
>>>> >> >>> >>>>>>> prominently on the Spark website makes a lot of sense.
>>>> >> >>> >>>>>>>
>>>> >> >>> >>>>>>>
>>>> >> >>> >>>>>>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia
>>>> >> >>> >>>>>>> <[hidden email]>
>>>> >> >>> >>>>>>> wrote:
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> For the improvement proposals, I think one major point
>>>> >> >>> >>>>>>>> was to
>>>> >> >>> >>>>>>>> make
>>>> >> >>> >>>>>>>> them
>>>> >> >>> >>>>>>>> really visible to users who are not contributors, so we
>>>> >> >>> >>>>>>>> should
>>>> >> >>> >>>>>>>> do
>>>> >> >>> >>>>>>>> more than
>>>> >> >>> >>>>>>>> sending stuff to dev@. One very lightweight idea is to
>>>> >> >>> >>>>>>>> have a
>>>> >> >>> >>>>>>>> new
>>>> >> >>> >>>>>>>> type of
>>>> >> >>> >>>>>>>> JIRA called a SIP and have a link to a filter that shows
>>>> >> >>> >>>>>>>> all
>>>> >> >>> >>>>>>>> such
>>>> >> >>> >>>>>>>> JIRAs from
>>>> >> >>> >>>>>>>> http://spark.apache.org. I also like the idea of SIP and
>>>> >> >>> >>>>>>>> design
>>>> >> >>> >>>>>>>> doc
>>>> >> >>> >>>>>>>> templates (in fact many projects have them).
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> Matei
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden
>>>> >> >>> >>>>>>>> email]>
>>>> >> >>> >>>>>>>> wrote:
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> I called Cody last night and talked about some of the
>>>> >> >>> >>>>>>>> topics
>>>> >> >>> >>>>>>>> in
>>>> >> >>> >>>>>>>> his
>>>> >> >>> >>>>>>>> email.
>>>> >> >>> >>>>>>>> It became clear to me Cody genuinely cares about the
>>>> >> >>> >>>>>>>> project.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> Some of the frustrations come from the success of the
>>>> >> >>> >>>>>>>> project
>>>> >> >>> >>>>>>>> itself
>>>> >> >>> >>>>>>>> becoming very "hot", and it is difficult to get clarity
>>>> >> >>> >>>>>>>> from
>>>> >> >>> >>>>>>>> people
>>>> >> >>> >>>>>>>> who
>>>> >> >>> >>>>>>>> don't dedicate all their time to Spark. In fact, it is
>>>> >> >>> >>>>>>>> in
>>>> >> >>> >>>>>>>> some
>>>> >> >>> >>>>>>>> ways
>>>> >> >>> >>>>>>>> similar
>>>> >> >>> >>>>>>>> to scaling an engineering team in a successful startup:
>>>> >> >>> >>>>>>>> old
>>>> >> >>> >>>>>>>> processes that
>>>> >> >>> >>>>>>>> worked well might not work so well when it gets to a
>>>> >> >>> >>>>>>>> certain
>>>> >> >>> >>>>>>>> size,
>>>> >> >>> >>>>>>>> cultures
>>>> >> >>> >>>>>>>> can get diluted, building culture vs building process,
>>>> >> >>> >>>>>>>> etc.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> I also really like to have a more visible process for
>>>> >> >>> >>>>>>>> larger
>>>> >> >>> >>>>>>>> changes,
>>>> >> >>> >>>>>>>> especially major user facing API changes. Historically
>>>> >> >>> >>>>>>>> we
>>>> >> >>> >>>>>>>> upload
>>>> >> >>> >>>>>>>> design docs
>>>> >> >>> >>>>>>>> for major changes, but it is not always consistent and
>>>> >> >>> >>>>>>>> difficult
>>>> >> >>> >>>>>>>> to ensure the
>>>> >> >>> >>>>>>>> quality
>>>> >> >>> >>>>>>>> of the docs, due to the volunteering nature of the
>>>> >> >>> >>>>>>>> organization.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> Some of the more concrete ideas we discussed focus on
>>>> >> >>> >>>>>>>> building a
>>>> >> >>> >>>>>>>> culture
>>>> >> >>> >>>>>>>> to improve clarity:
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Process: Large changes should have design docs posted
>>>> >> >>> >>>>>>>> on
>>>> >> >>> >>>>>>>> JIRA.
>>>> >> >>> >>>>>>>> One
>>>> >> >>> >>>>>>>> thing
>>>> >> >>> >>>>>>>> Cody and I didn't discuss but an idea that just came to
>>>> >> >>> >>>>>>>> me is
>>>> >> >>> >>>>>>>> we
>>>> >> >>> >>>>>>>> should
>>>> >> >>> >>>>>>>> create a design doc template for the project and ask
>>>> >> >>> >>>>>>>> everybody
>>>> >> >>> >>>>>>>> to
>>>> >> >>> >>>>>>>> follow.
>>>> >> >>> >>>>>>>> The design doc template should also explicitly list
>>>> >> >>> >>>>>>>> goals and
>>>> >> >>> >>>>>>>> non-goals, to
>>>> >> >>> >>>>>>>> make design doc more consistent.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Process: Email dev@ to solicit feedback. We have done
>>>> >> >>> >>>>>>>> this
>>>> >> >>> >>>>>>>> with
>>>> >> >>> >>>>>>>> some
>>>> >> >>> >>>>>>>> changes, but again very inconsistent. Just posting
>>>> >> >>> >>>>>>>> something
>>>> >> >>> >>>>>>>> on
>>>> >> >>> >>>>>>>> JIRA
>>>> >> >>> >>>>>>>> isn't
>>>> >> >>> >>>>>>>> sufficient, because there are simply too many JIRAs and
>>>> >> >>> >>>>>>>> the
>>>> >> >>> >>>>>>>> signal
>>>> >> >>> >>>>>>>> gets lost
>>>> >> >>> >>>>>>>> in the noise. While this is generally impossible to
>>>> >> >>> >>>>>>>> enforce
>>>> >> >>> >>>>>>>> because
>>>> >> >>> >>>>>>>> we can't
>>>> >> >>> >>>>>>>> force all volunteers to conform to a process (or they
>>>> >> >>> >>>>>>>> might
>>>> >> >>> >>>>>>>> not
>>>> >> >>> >>>>>>>> even
>>>> >> >>> >>>>>>>> be
>>>> >> >>> >>>>>>>> aware of this),  those who are more familiar with the
>>>> >> >>> >>>>>>>> project
>>>> >> >>> >>>>>>>> can
>>>> >> >>> >>>>>>>> help by
>>>> >> >>> >>>>>>>> emailing the dev@ when they see something that hasn't
>>>> >> >>> >>>>>>>> been.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Culture: The design doc author(s) should be open to
>>>> >> >>> >>>>>>>> feedback.
>>>> >> >>> >>>>>>>> A
>>>> >> >>> >>>>>>>> design
>>>> >> >>> >>>>>>>> doc should serve as the base for discussion and is by no
>>>> >> >>> >>>>>>>> means
>>>> >> >>> >>>>>>>> the
>>>> >> >>> >>>>>>>> final
>>>> >> >>> >>>>>>>> design. Of course, this does not mean the author has to
>>>> >> >>> >>>>>>>> accept
>>>> >> >>> >>>>>>>> every
>>>> >> >>> >>>>>>>> feedback. They should also be comfortable accepting /
>>>> >> >>> >>>>>>>> rejecting
>>>> >> >>> >>>>>>>> ideas on
>>>> >> >>> >>>>>>>> technical grounds.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Process / Culture: For major ongoing projects, it can
>>>> >> >>> >>>>>>>> be
>>>> >> >>> >>>>>>>> useful
>>>> >> >>> >>>>>>>> to
>>>> >> >>> >>>>>>>> have
>>>> >> >>> >>>>>>>> some monthly Google hangouts that are open to the world.
>>>> >> >>> >>>>>>>> I am
>>>> >> >>> >>>>>>>> actually not
>>>> >> >>> >>>>>>>> sure how well this will work, because of the
>>>> >> >>> >>>>>>>> volunteering
>>>> >> >>> >>>>>>>> nature
>>>> >> >>> >>>>>>>> and
>>>> >> >>> >>>>>>>> we need
>>>> >> >>> >>>>>>>> to adjust for timezones for people across the globe, but
>>>> >> >>> >>>>>>>> it
>>>> >> >>> >>>>>>>> seems
>>>> >> >>> >>>>>>>> worth
>>>> >> >>> >>>>>>>> trying.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Culture: Contributors (including committers) should be
>>>> >> >>> >>>>>>>> more
>>>> >> >>> >>>>>>>> direct
>>>> >> >>> >>>>>>>> in
>>>> >> >>> >>>>>>>> setting expectations, including whether they are working
>>>> >> >>> >>>>>>>> on a
>>>> >> >>> >>>>>>>> specific
>>>> >> >>> >>>>>>>> issue, whether they will be working on a specific issue,
>>>> >> >>> >>>>>>>> and
>>>> >> >>> >>>>>>>> whether
>>>> >> >>> >>>>>>>> an
>>>> >> >>> >>>>>>>> issue or pr or jira should be rejected. Most people I
>>>> >> >>> >>>>>>>> know in
>>>> >> >>> >>>>>>>> this
>>>> >> >>> >>>>>>>> community
>>>> >> >>> >>>>>>>> are nice and don't enjoy telling other people no, but it
>>>> >> >>> >>>>>>>> is
>>>> >> >>> >>>>>>>> often
>>>> >> >>> >>>>>>>> more
>>>> >> >>> >>>>>>>> annoying to a contributor to not know anything than
>>>> >> >>> >>>>>>>> getting a
>>>> >> >>> >>>>>>>> no.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia
>>>> >> >>> >>>>>>>> <[hidden email]>
>>>> >> >>> >>>>>>>> wrote:
>>>> >> >>> >>>>>>>>>
>>>> >> >>> >>>>>>>>>
>>>> >> >>> >>>>>>>>> Love the idea of a more visible "Spark Improvement
>>>> >> >>> >>>>>>>>> Proposal"
>>>> >> >>> >>>>>>>>> process that
>>>> >> >>> >>>>>>>>> solicits user input on new APIs. For what it's worth, I
>>>> >> >>> >>>>>>>>> don't
>>>> >> >>> >>>>>>>>> think
>>>> >> >>> >>>>>>>>> committers are trying to minimize their own work --
>>>> >> >>> >>>>>>>>> every
>>>> >> >>> >>>>>>>>> committer
>>>> >> >>> >>>>>>>>> cares
>>>> >> >>> >>>>>>>>> about making the software useful for users. However, it
>>>> >> >>> >>>>>>>>> is
>>>> >> >>> >>>>>>>>> always
>>>> >> >>> >>>>>>>>> hard to
>>>> >> >>> >>>>>>>>> get user input and so it helps to have this kind of
>>>> >> >>> >>>>>>>>> process.
>>>> >> >>> >>>>>>>>> I've
>>>> >> >>> >>>>>>>>> certainly
>>>> >> >>> >>>>>>>>> looked at the *IPs a lot in other software I use just
>>>> >> >>> >>>>>>>>> to see
>>>> >> >>> >>>>>>>>> the
>>>> >> >>> >>>>>>>>> biggest
>>>> >> >>> >>>>>>>>> things on the roadmap.
>>>> >> >>> >>>>>>>>>
>>>> >> >>> >>>>>>>>> When you're talking about "changing interfaces", are
>>>> >> >>> >>>>>>>>> you
>>>> >> >>> >>>>>>>>> talking
>>>> >> >>> >>>>>>>>> about
>>>> >> >>> >>>>>>>>> public or internal APIs? I do think many people hate
>>>> >> >>> >>>>>>>>> changing
>>>> >> >>> >>>>>>>>> public APIs
>>>> >> >>> >>>>>>>>> and I actually think that's for the best of the
>>>> >> >>> >>>>>>>>> project.
>>>> >> >>> >>>>>>>>> That's
>>>> >> >>> >>>>>>>>> a
>>>> >> >>> >>>>>>>>> technical
>>>> >> >>> >>>>>>>>> debate, but basically, the worst thing when you're
>>>> >> >>> >>>>>>>>> using a
>>>> >> >>> >>>>>>>>> piece
>>>> >> >>> >>>>>>>>> of
>>>> >> >>> >>>>>>>>> software
>>>> >> >>> >>>>>>>>> is that the developers constantly ask you to rewrite
>>>> >> >>> >>>>>>>>> your
>>>> >> >>> >>>>>>>>> app
>>>> >> >>> >>>>>>>>> to
>>>> >> >>> >>>>>>>>> update to a
>>>> >> >>> >>>>>>>>> new version (and thus benefit from bug fixes, etc). Cue
>>>> >> >>> >>>>>>>>> anyone
>>>> >> >>> >>>>>>>>> who's used
>>>> >> >>> >>>>>>>>> Protobuf, or Guava. The "let's get everyone to change
>>>> >> >>> >>>>>>>>> their
>>>> >> >>> >>>>>>>>> code
>>>> >> >>> >>>>>>>>> this
>>>> >> >>> >>>>>>>>> release" model works well within a single large
>>>> >> >>> >>>>>>>>> company, but
>>>> >> >>> >>>>>>>>> doesn't work
>>>> >> >>> >>>>>>>>> well for a community, which is why nearly all *very*
>>>> >> >>> >>>>>>>>> widely
>>>> >> >>> >>>>>>>>> used
>>>> >> >>> >>>>>>>>> programming
>>>> >> >>> >>>>>>>>> interfaces (I'm talking things like Java standard
>>>> >> >>> >>>>>>>>> library,
>>>> >> >>> >>>>>>>>> Windows
>>>> >> >>> >>>>>>>>> API, etc)
>>>> >> >>> >>>>>>>>> almost *never* break backwards compatibility. All this
>>>> >> >>> >>>>>>>>> is
>>>> >> >>> >>>>>>>>> done
>>>> >> >>> >>>>>>>>> within reason
>>>> >> >>> >>>>>>>>> though, e.g. we do change things in major releases
>>>> >> >>> >>>>>>>>> (2.x,
>>>> >> >>> >>>>>>>>> 3.x,
>>>> >> >>> >>>>>>>>> etc).
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> ---------------------------------------------------------------------
>>>> >> >>> >>>>>> To unsubscribe e-mail: [hidden email]
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>
>>>> >> >>> >>>>>
>>>> >> >>> >>>>>
>>>> >> >>> >>>>> --
>>>> >> >>> >>>>> Stavros Kontopoulos
>>>> >> >>> >>>>> Senior Software Engineer
>>>> >> >>> >>>>> Lightbend, Inc.
>>>> >> >>> >>>>> p:  +30 6977967274
>>>> >> >>> >>>>> e: [hidden email]
>>>> >> >>> >>>>>
>>>> >> >>> >>>>>
>>>> >> >>> >>>>
>>>> >> >>> >>>
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>>
>>>> >> >>
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Ryan Blue
>>>> > Software Engineer
>>>> > Netflix
>>>>
>>>>
>>>
>




Re: Spark Improvement Proposals

Ryan Blue
I don't think we will have trouble with whatever rule is adopted for accepting proposals. Considering committers' votes binding (if that is what we choose) is an established practice as long as it isn't for specific votes, like a release vote. From the Apache docs: "Who is permitted to vote is, to some extent, a community-specific thing." [1] And, I also don't see why it would be a problem to choose consensus, as long as we have an open discussion and vote about these rules.

rb

On Mon, Oct 10, 2016 at 4:15 PM, Cody Koeninger <[hidden email]> wrote:
If someone wants to tell me that it's OK and "The Apache Way" for
Kafka and Flink to have a proposal process that ends in a lazy
majority, but it's not OK for Spark to have a proposal process that
ends in a non-lazy consensus...

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#KafkaImprovementProposals-Process

In practice any PMC member can stop a proposal they don't like, so I'm
not sure how much it matters.



On Mon, Oct 10, 2016 at 5:59 PM, Mark Hamstra <[hidden email]> wrote:
> There is a larger issue to keep in mind, and that is that what you are
> proposing is a procedure that, as far as I am aware, hasn't previously been
> adopted in an Apache project, and thus is not an easy or exact fit with
> established practices that have been blessed as "The Apache Way".  As such,
> we need to be careful, because we have run into some trouble in the past
> with some inside the ASF but essentially outside the Spark community who
> didn't like the way we were doing things.
>
> On Mon, Oct 10, 2016 at 3:53 PM, Cody Koeninger <[hidden email]> wrote:
>>
>> Apache documents say lots of confusing stuff, including that committers are
>> in practice given a vote.
>>
>> https://www.apache.org/foundation/voting.html
>>
>> I don't care either way, if someone wants me to sub committer for PMC in
>> the voting section, fine, we just need a clear outcome.
>>
>>
>> On Oct 10, 2016 17:36, "Mark Hamstra" <[hidden email]> wrote:
>>>
>>> If I'm correctly understanding the kind of voting that you are talking
>>> about, then to be accurate, it is only the PMC members that have a vote, not
>>> all committers:
>>> https://www.apache.org/foundation/how-it-works.html#pmc-members
>>>
>>> On Mon, Oct 10, 2016 at 12:02 PM, Cody Koeninger <[hidden email]>
>>> wrote:
>>>>
>>>> I think the main value is in being honest about what's going on.  No
>>>> one other than committers can cast a meaningful vote, that's the
>>>> reality.  Beyond that, if people think it's more open to allow formal
>>>> proposals from anyone, I'm not necessarily against it, but my main
>>>> question would be this:
>>>>
>>>> If anyone can submit a proposal, are committers actually going to
>>>> clearly reject and close proposals that don't meet the requirements?
>>>>
>>>> Right now we have a serious problem with lack of clarity regarding
>>>> contributions, and that cannot spill over into goal-setting.
>>>>
>>>> On Mon, Oct 10, 2016 at 1:54 PM, Ryan Blue <[hidden email]> wrote:
>>>> > +1 to votes to approve proposals. I agree that proposals should have
>>>> > an
>>>> > official mechanism to be accepted, and a vote is an established means
>>>> > of
>>>> > doing that well. I like that it includes a period to review the
>>>> > proposal and
>>>> > I think proposals should have been discussed enough ahead of a vote to
>>>> > survive the possibility of a veto.
>>>> >
>>>> > I also like the names that are short and (mostly) unique, like SEP.
>>>> >
>>>> > Where I disagree is with the requirement that a committer must
>>>> > formally
>>>> > propose an enhancement. I don't see the value of restricting this: if
>>>> > someone has the will to write up a proposal then they should be
>>>> > encouraged
>>>> > to do so and start a discussion about it. Even if there is a political
>>>> > reality as Cody says, what is the value of codifying that in our
>>>> > process? I
>>>> > think restricting who can submit proposals would only undermine them
>>>> > by
>>>> > pushing contributors out. Maybe I'm missing something here?
>>>> >
>>>> > rb
>>>> >
>>>> >
>>>> >
>>>> > On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <[hidden email]>
>>>> > wrote:
>>>> >>
>>>> >> Yes, users suggesting SIPs is a good thing and is explicitly called
>>>> >> out in the linked document under the Who? section.  Formally
>>>> >> proposing
>>>> >> them, not so much, because of the political realities.
>>>> >>
>>>> >> Yes, implementation strategy definitely affects goals.  There are all
>>>> >> kinds of examples of this, I'll pick one that's my fault so as to
>>>> >> avoid sounding like I'm blaming:
>>>> >>
>>>> >> When I implemented the Kafka DStream, one of my (not explicitly
>>>> >> agreed
>>>> >> upon by the community) goals was to make sure people could use the
>>>> >> DStream however they were already using Kafka at work.  The lack
>>>> >> of explicit agreement on that goal led to all kinds of fighting with
>>>> >> committers, that could have been avoided.  The lack of explicit
>>>> >> up-front strategy discussion led to the DStream not really working
>>>> >> with compacted topics.  I knew about compacted topics, but don't have
>>>> >> a use for them, so had a blind spot there.  If there was explicit
>>>> >> up-front discussion that my strategy was "assume that batches can be
>>>> >> defined on the driver solely by beginning and ending offsets",
>>>> >> there's
>>>> >> a greater chance that a user would have seen that and said, "hey,
>>>> >> what
>>>> >> about non-contiguous offsets in a compacted topic".
>>>> >>
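
The offset-range assumption described above can be illustrated with a minimal sketch. This is plain Python, not Spark or Kafka API code: `OffsetRange`, `read_batch`, and the sample logs are hypothetical names invented for illustration, and a log is modeled simply as a list of (offset, value) pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical driver-side batch definition: offsets in [from_offset, until_offset)."""
    from_offset: int
    until_offset: int

    def expected_count(self) -> int:
        # Only valid if every offset in the range actually exists in the log.
        return self.until_offset - self.from_offset


def read_batch(log, rng):
    """Return the (offset, value) records whose offsets fall inside the range."""
    return [(off, val) for off, val in log if rng.from_offset <= off < rng.until_offset]


# Uncompacted log: offsets are contiguous.
plain_log = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
# Compacted log (assumed example): offsets 1 and 2 were removed by compaction.
compacted_log = [(0, "a"), (3, "d")]

rng = OffsetRange(0, 4)
plain = read_batch(plain_log, rng)
compacted = read_batch(compacted_log, rng)
```

On the compacted log the range still spans four offsets, but only two records come back, so any logic that treats `until_offset - from_offset` as the record count would no longer hold.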
>>>> >> This kind of thing is only going to happen smoothly if we have a
>>>> >> lightweight user-visible process with clear outcomes.
>>>> >>
>>>> >> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>>> >> <[hidden email]> wrote:
>>>> >> > I agree with most of what Cody said.
>>>> >> >
>>>> >> > Two things:
>>>> >> >
>>>> >> > First we can always have other people suggest SIPs but mark them as
>>>> >> > “unreviewed” and have committers basically move them forward. The
>>>> >> > problem is
>>>> >> > that writing a good document takes time. This way we can leverage
>>>> >> > non
>>>> >> > committers to do some of this work (it is just another way to
>>>> >> > contribute).
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > As for strategy, in many cases implementation strategy can affect
>>>> >> > the goals. I will give a small example: in the current structured
>>>> >> > streaming strategy, we group by the time to achieve a sliding
>>>> >> > window. This is definitely an implementation decision and not a
>>>> >> > goal. However, I can think of several aggregation functions which
>>>> >> > have the time inside their calculation buffer. For example, let’s
>>>> >> > say we want to return a set of all distinct values. One way to
>>>> >> > implement this would be to make the set into a map and have the
>>>> >> > value contain the last time seen. Multiplying it across the groupby
>>>> >> > would cost a lot in performance. So adding such a strategy would
>>>> >> > have a great effect on the type of aggregations and their
>>>> >> > performance, which does affect the goal. Without adding the
>>>> >> > strategy, it is easy for whoever goes to the design document to not
>>>> >> > think about these cases. Furthermore, it might be decided that these
>>>> >> > cases are rare enough that the strategy is still good enough, but
>>>> >> > how would we know it without user feedback?
>>>> >> >
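The "set turned into a map of last-seen times" idea can be sketched in plain Python. This is only an illustration of the aggregation buffer described above, not Spark's structured streaming API; the class `DistinctWithLastSeen`, its methods, and the sample events are all hypothetical.

```python
from typing import Dict, Hashable, Set


class DistinctWithLastSeen:
    """Hypothetical aggregation buffer: maps each value to the last time it was seen."""

    def __init__(self, window_seconds: int) -> None:
        self.window_seconds = window_seconds
        self.last_seen: Dict[Hashable, int] = {}

    def update(self, value: Hashable, event_time: int) -> None:
        # Keep only the most recent event time for each value.
        previous = self.last_seen.get(value)
        if previous is None or event_time > previous:
            self.last_seen[value] = event_time

    def distinct(self, now: int) -> Set[Hashable]:
        # A value is in the window if it was seen within the last window_seconds.
        cutoff = now - self.window_seconds
        return {v for v, t in self.last_seen.items() if t > cutoff}


buf = DistinctWithLastSeen(window_seconds=10)
for value, t in [("a", 1), ("b", 5), ("a", 12), ("c", 14)]:
    buf.update(value, t)

result_at_15 = buf.distinct(now=15)  # "b" was last seen at 5, so it has aged out
result_at_23 = buf.distinct(now=23)  # only "c", last seen at 14, remains
```

Here a single buffer answers the query for any point in time. If the engine instead groups rows by window, one such buffer would be kept per overlapping window, which is the multiplication cost the message refers to.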
>>>> >> > I believe this example is exactly what Cody was talking about.
>>>> >> > Since implementation strategies often have a large effect on the
>>>> >> > goal, we should discuss them when discussing the goals. In addition,
>>>> >> > while it is often easy to throw out completely infeasible goals, it
>>>> >> > is often much harder to figure out that the goals are infeasible
>>>> >> > without fine tuning.
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > Assaf.
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > From: Cody Koeninger-2 [via Apache Spark Developers List]
>>>> >> > [mailto:[hidden email][hidden email]]
>>>> >> > Sent: Monday, October 10, 2016 2:25 AM
>>>> >> > To: Mendelson, Assaf
>>>> >> > Subject: Re: Spark Improvement Proposals
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > Only committers should formally submit SIPs because in an Apache
>>>> >> > project only committers have explicit political power.  If a user
>>>> >> > can't find a committer willing to sponsor an SIP idea, they have no
>>>> >> > way to get the idea passed in any case.  If I can't find a committer
>>>> >> > to sponsor this meta-SIP idea, I'm out of luck.
>>>> >> >
>>>> >> > I do not believe unrealistic goals can be found solely by
>>>> >> > inspection.  We've managed to ignore unrealistic goals even after
>>>> >> > implementation!  Focusing on APIs can allow people to think they've
>>>> >> > solved something, when there's really no way of implementing that
>>>> >> > API while meeting the goals.  Rapid iteration is clearly the best
>>>> >> > way to address this, but we've already talked about why that hasn't
>>>> >> > really worked.  If adding a non-binding API section to the template
>>>> >> > is important to you, I'm not against it, but I don't think it's
>>>> >> > sufficient.
>>>> >> >
>>>> >> > On your PRD vs design doc spectrum, I'm saying this is closer to a
>>>> >> > PRD.  Clear agreement on goals is the most important thing and
>>>> >> > that's why it's the thing I want binding agreement on.  But I cannot
>>>> >> > agree to goals unless I have enough minimal technical info to judge
>>>> >> > whether the goals are likely to actually be accomplished.
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]>
>>>> >> > wrote:
>>>> >> >
>>>> >> >
>>>> >> >> Well, I think there are a few things here that don't make sense.
>>>> >> >> First, why should only committers submit SIPs? Development in the
>>>> >> >> project should be open to all contributors, whether they're
>>>> >> >> committers or not. Second, I think unrealistic goals can be found
>>>> >> >> just by inspecting the goals, and I'm not super worried that we'll
>>>> >> >> accept a lot of SIPs that are then infeasible -- we can then submit
>>>> >> >> new ones. But this depends on whether you want this process to be a
>>>> >> >> "design doc lite", where people also agree on implementation
>>>> >> >> strategy, or just a way to agree on goals. This is what I asked
>>>> >> >> earlier about PRDs vs design docs (and I'm open to either one but
>>>> >> >> I'd just like clarity). Finally, both as a user and designer of
>>>> >> >> software, I always want to give feedback on APIs, so I'd really like
>>>> >> >> a culture of having those early. People don't argue about prettiness
>>>> >> >> when they discuss APIs, they argue about the core concepts to expose
>>>> >> >> in order to meet various goals, and then they're stuck maintaining
>>>> >> >> those for a long time.
>>>> >> >>
>>>> >> >> Matei
>>>> >> >>
>>>> >> >> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <[hidden email]> wrote:
>>>> >> >>
>>>> >> >> Users instead of people, sure.  Committers and contributors are (or
>>>> >> >> at least should be) a subset of users.
>>>> >> >>
>>>> >> >> Non goals, sure. I don't care what the name is, but we need to
>>>> >> >> clearly say e.g. 'no we are not maintaining compatibility with XYZ
>>>> >> >> right now'.
>>>> >> >>
>>>> >> >> API, what I care most about is whether it allows me to accomplish
>>>> >> >> the goals. Arguing about how ugly or pretty it is can be saved for
>>>> >> >> design/implementation imho.
>>>> >> >>
>>>> >> >> Strategy, this is necessary because otherwise goals can be out of
>>>> >> >> line with reality.  Don't propose goals you don't have at least some
>>>> >> >> idea of how to implement.
>>>> >> >>
>>>> >> >> Rejected strategies, given that committers are the only ones I'm
>>>> >> >> saying should formally submit SPARKLIs or SIPs, if they put junk in
>>>> >> >> a required section then slap them down for it and tell them to fix
>>>> >> >> it.
>>>> >> >>
>>>> >> >>
>>>> >> >> On Oct 9, 2016 4:36 PM, "Matei Zaharia" <[hidden email]> wrote:
>>>> >> >>>
>>>> >> >>> Yup, this is the stuff that I found unclear. Thanks for clarifying
>>>> >> >>> here, but we should also clarify it in the writeup. In particular:
>>>> >> >>>
>>>> >> >>> - Goals needs to be about user-facing behavior ("people" is broad)
>>>> >> >>>
>>>> >> >>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig
>>>> >> >>> up one of these and say "Spark's developers have officially rejected
>>>> >> >>> X, which our awesome system has".
>>>> >> >>>
>>>> >> >>> - For user-facing stuff, I think you need a section on API.
>>>> >> >>> Virtually all other *IPs I've seen have that.
>>>> >> >>>
>>>> >> >>> - I'm still not sure why the strategy section is needed if the
>>>> >> >>> purpose is to define user-facing behavior -- unless this is the
>>>> >> >>> strategy for setting the goals or for defining the API. That sounds
>>>> >> >>> squarely like a design doc issue. In some sense, who cares whether
>>>> >> >>> the proposal is technically feasible right now? If it's infeasible,
>>>> >> >>> that will be discovered later during design and implementation. Same
>>>> >> >>> thing with rejected strategies -- listing some of those is
>>>> >> >>> definitely useful sometimes, but if you make this a *required*
>>>> >> >>> section, people are just going to fill it in with bogus stuff (I've
>>>> >> >>> seen this happen before).
>>>> >> >>>
>>>> >> >>> Matei
>>>> >> >>>
>>>> >> >
>>>> >> >>> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <[hidden email]>
>>>> >> >>> > wrote:
>>>> >> >>> >
>>>> >> >>> > So to focus the discussion on the specific strategy I'm
>>>> >> >>> > suggesting, documented at
>>>> >> >>> >
>>>> >> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>> >> >>> >
>>>> >> >>> > "Goals: What must this allow people to do, that they can't
>>>> >> >>> > currently?"
>>>> >> >>> >
>>>> >> >>> > Is it unclear that this is focusing specifically on
>>>> >> >>> > people-visible
>>>> >> >>> > behavior?
>>>> >> >>> >
>>>> >> >>> > Rejected goals - are important because otherwise people keep
>>>> >> >>> > trying to argue about scope.  Of course you can change things
>>>> >> >>> > later with a different SIP and different vote, the point is to
>>>> >> >>> > focus.
>>>> >> >>> >
>>>> >> >>> > Use cases - are something that people are going to bring up in
>>>> >> >>> > discussion.  If they aren't clearly documented as a goal ("This
>>>> >> >>> > must allow me to connect using SSL"), they should be added.
>>>> >> >>> >
>>>> >> >>> > Internal architecture - if the people who need specific behavior
>>>> >> >>> > are implementers of other parts of the system, that's fine.
>>>> >> >>> >
>>>> >> >>> > Rejected strategies - If you have none of these, you have no
>>>> >> >>> > evidence that the proponent didn't just go with the first thing
>>>> >> >>> > they had in mind (or have already implemented), which is a big
>>>> >> >>> > problem currently.  Approval isn't binding as to specifics of
>>>> >> >>> > implementation, so these aren't handcuffs.  The goals are the
>>>> >> >>> > contract, the strategy is evidence that contract can actually be
>>>> >> >>> > met.
>>>> >> >>> >
>>>> >> >>> > Design docs - I'm not touching design docs.  The markdown file I
>>>> >> >>> > linked specifically says of the strategy section "This is not a
>>>> >> >>> > full design document."  Is this unclear?  Design docs can be
>>>> >> >>> > worked on obviously, but that's not what I'm concerned with here.
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <[hidden email]>
>>>> >> >>> > wrote:
>>>> >> >>> >> Hi Cody,
>>>> >> >>> >>
>>>> >> >>> >> I think this would be a lot more concrete if we had a more
>>>> >> >>> >> detailed template for SIPs. Right now, it's not super clear
>>>> >> >>> >> what's in scope -- e.g. are they a way to solicit feedback on
>>>> >> >>> >> the user-facing behavior or on the internals? "Goals" can cover
>>>> >> >>> >> both things. I've been thinking of SIPs more as Product
>>>> >> >>> >> Requirements Docs (PRDs), which focus on *what* a code change
>>>> >> >>> >> should do as opposed to how.
>>>> >> >>> >>
>>>> >> >>> >> In particular, here are some things that you may or may not
>>>> >> >>> >> consider in scope for SIPs:
>>>> >> >>> >>
>>>> >> >>> >> - Goals and non-goals: This is definitely in scope, and IMO
>>>> >> >>> >> should focus on user-visible behavior (e.g. "system supports SQL
>>>> >> >>> >> window functions" or "system continues working if one node
>>>> >> >>> >> fails"). BTW I wouldn't say "rejected goals" because some of
>>>> >> >>> >> them might become goals later, so we're not definitively
>>>> >> >>> >> rejecting them.
>>>> >> >>> >>
>>>> >> >>> >> - Public API: Probably should be included in most SIPs unless
>>>> >> >>> >> it's too large to fully specify then (e.g. "let's add an ML
>>>> >> >>> >> library").
>>>> >> >>> >>
>>>> >> >>> >> - Use cases: I usually find this very useful in PRDs to better
>>>> >> >>> >> communicate the goals.
>>>> >> >>> >>
>>>> >> >>> >> - Internal architecture: This is usually *not* a thing users can
>>>> >> >>> >> easily comment on and it sounds more like a design doc item. Of
>>>> >> >>> >> course it's important to show that the SIP is feasible to
>>>> >> >>> >> implement. One exception, however, is that I think we'll have
>>>> >> >>> >> some SIPs primarily on internals (e.g. if somebody wants to
>>>> >> >>> >> refactor Spark's query optimizer or something).
>>>> >> >>> >>
>>>> >> >>> >> - Rejected strategies: I personally wouldn't put this, because
>>>> >> >>> >> what's the point of voting to reject a strategy before you've
>>>> >> >>> >> really begun designing and implementing something? What if you
>>>> >> >>> >> discover that the strategy is actually better when you start
>>>> >> >>> >> doing stuff?
>>>> >> >>> >>
>>>> >> >>> >> At a super high level, it depends on whether you want the SIPs
>>>> >> >>> >> to be PRDs for getting some quick feedback on the goals of a
>>>> >> >>> >> feature before it is designed, or something more like
>>>> >> >>> >> full-fledged design docs (just a more visible design doc for
>>>> >> >>> >> bigger changes). I looked at Kafka's KIPs, and they actually
>>>> >> >>> >> seem to be more like design docs. This can work too but it does
>>>> >> >>> >> require more work from the proposer and it can lead to the same
>>>> >> >>> >> problems you mentioned with people already having a design and
>>>> >> >>> >> implementation in mind.
>>>> >> >>> >>
>>>> >> >>> >> Basically, the question is, are you trying to iterate faster on
>>>> >> >>> >> design by adding a step for user feedback earlier? Or are you
>>>> >> >>> >> just trying to make design docs for key features more visible
>>>> >> >>> >> (and their approval more formal)?
>>>> >> >>> >>
>>>> >> >>> >> BTW note that in either case, I'd like to have a template for
>>>> >> >>> >> design docs too, which should also include goals. I think that
>>>> >> >>> >> would've avoided some of the issues you brought up.
>>>> >> >>> >>
>>>> >> >>> >> Matei
>>>> >> >>> >>
>>>> >> >>> >> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <[hidden email]>
>>>> >> >>> >> wrote:
>>>> >> >>> >>
>>>> >> >>> >> Here's my specific proposal (meta-proposal?)
>>>> >> >>> >>
>>>> >> >>> >> Spark Improvement Proposals (SIP)
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Background:
>>>> >> >>> >>
>>>> >> >>> >> The current problem is that design and implementation of large
>>>> >> >>> >> features are often done in private, before soliciting user
>>>> >> >>> >> feedback.
>>>> >> >>> >>
>>>> >> >>> >> When feedback is solicited, it is often as to detailed design
>>>> >> >>> >> specifics, not focused on goals.
>>>> >> >>> >>
>>>> >> >>> >> When implementation does take place after design, there is
>>>> >> >>> >> often disagreement as to what goals are or are not in scope.
>>>> >> >>> >>
>>>> >> >>> >> This results in commits that don't fully meet user needs.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Goals:
>>>> >> >>> >>
>>>> >> >>> >> - Ensure user, contributor, and committer goals are clearly
>>>> >> >>> >> identified and agreed upon, before implementation takes place.
>>>> >> >>> >>
>>>> >> >>> >> - Ensure that a technically feasible strategy is chosen that is
>>>> >> >>> >> likely to meet the goals.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Rejected Goals:
>>>> >> >>> >>
>>>> >> >>> >> - SIPs are not for detailed design.  Design by committee
>>>> >> >>> >> doesn't work.
>>>> >> >>> >>
>>>> >> >>> >> - SIPs are not for every change.  We don't need that much
>>>> >> >>> >> process.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Strategy:
>>>> >> >>> >>
>>>> >> >>> >> My suggestion is outlined as a Spark Improvement Proposal
>>>> >> >>> >> process documented at
>>>> >> >>> >>
>>>> >> >>> >> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>> >> >>> >>
>>>> >> >>> >> Specifics of Jira manipulation are an implementation detail we
>>>> >> >>> >> can figure out.
>>>> >> >>> >>
>>>> >> >>> >> I'm suggesting voting; the need here is for a _clear_ outcome.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Rejected Strategies:
>>>> >> >>> >>
>>>> >> >>> >> Having someone who understands the problem implement it first
>>>> >> >>> >> works, but only if significant iteration after user feedback is
>>>> >> >>> >> allowed.
>>>> >> >>> >>
>>>> >> >>> >> Historically this has been problematic due to pressure to limit
>>>> >> >>> >> public API changes.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <[hidden email]>
>>>> >> >>> >> wrote:
>>>> >> >>> >>>
>>>> >> >>> >>> Alright, looks like there is quite a bit of support. We should
>>>> >> >>> >>> wait to hear from more people too.
>>>> >> >>> >>>
>>>> >> >>> >>> To push this forward, Cody and I will be working together in
>>>> >> >>> >>> the next couple of weeks to come up with a concrete, detailed
>>>> >> >>> >>> proposal on what this entails, and then we can discuss the
>>>> >> >>> >>> specific proposal as well.
>>>> >> >>> >>>
>>>> >> >>> >>>
>>>> >> >>> >>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <[hidden email]>
>>>> >> >>> >>> wrote:
>>>> >> >>> >>>>
>>>> >> >>> >>>> Yeah, in case it wasn't clear, I was talking about SIPs for
>>>> >> >>> >>>> major user-facing or cross-cutting changes, not minor feature
>>>> >> >>> >>>> adds.
>>>> >> >>> >>>>
>>>> >> >>> >>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos
>>>> >> >>> >>>> <[hidden email]> wrote:
>>>> >> >>> >>>>>
>>>> >> >>> >>>>> +1 to the SIP label as long as it does not slow things down
>>>> >> >>> >>>>> and it targets optimizing efforts, coordination etc. For
>>>> >> >>> >>>>> example, really small features (assuming they don't touch
>>>> >> >>> >>>>> public interfaces) or refactorings should not need to go
>>>> >> >>> >>>>> through this process, and I hope it will be kept this way.
>>>> >> >>> >>>>> So a guideline doc should be provided, like in the KIP case.
>>>> >> >>> >>>>>
>>>> >> >>> >>>>> IMHO, aside from tagging things and linking them elsewhere,
>>>> >> >>> >>>>> simply having design docs and prototype implementations in PRs
>>>> >> >>> >>>>> is not something that has not worked so far. What is really a
>>>> >> >>> >>>>> pain in many projects out there is discontinuity in progress
>>>> >> >>> >>>>> of PRs, missing features, and slow reviews, which is
>>>> >> >>> >>>>> understandable to some extent... it is not only about Spark,
>>>> >> >>> >>>>> but things can be improved for sure for this project in
>>>> >> >>> >>>>> particular, as already stated.
>>>> >> >>> >>>>>
>>>> >> >>> >>>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <[hidden email]>
>>>> >> >>> >>>>> wrote:
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> +1 to adding an SIP label and linking it from the website.
>>>> >> >>> >>>>>> I think it needs
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> - template that focuses it towards soliciting user goals /
>>>> >> >>> >>>>>> non goals
>>>> >> >>> >>>>>> - clear resolution as to which strategy was chosen to pursue.
>>>> >> >>> >>>>>> I'd recommend a vote.
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> Matei asked me to clarify what I meant by changing interfaces,
>>>> >> >>> >>>>>> I think it's directly relevant to the SIP idea so I'll
>>>> >> >>> >>>>>> clarify here, and split a thread for the other discussion per
>>>> >> >>> >>>>>> Nicholas' request.
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> I meant changing public user interfaces.  I think the first
>>>> >> >>> >>>>>> design is unlikely to be right, because it's done at a time
>>>> >> >>> >>>>>> when you have the least information.  As a user, I find it
>>>> >> >>> >>>>>> considerably more frustrating to be unable to use a tool to
>>>> >> >>> >>>>>> get my job done, than I do having to make minor changes to my
>>>> >> >>> >>>>>> code in order to take advantage of features.  I've seen
>>>> >> >>> >>>>>> committers be seriously reluctant to allow changes to
>>>> >> >>> >>>>>> @experimental code that are needed in order for it to really
>>>> >> >>> >>>>>> work right.  You need to be able to iterate, and if people on
>>>> >> >>> >>>>>> both sides of the fence aren't going to respect that some
>>>> >> >>> >>>>>> newer apis are subject to change, then why even mark them as
>>>> >> >>> >>>>>> such?
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> Ideally a finished SIP should give me a checklist of things
>>>> >> >>> >>>>>> that an implementation must do, and things that it doesn't
>>>> >> >>> >>>>>> need to do.  Contributors/committers should be seriously
>>>> >> >>> >>>>>> discouraged from putting out a version 0.1 that doesn't have
>>>> >> >>> >>>>>> at least a prototype implementation of all those things,
>>>> >> >>> >>>>>> especially if they're then going to argue against interface
>>>> >> >>> >>>>>> changes necessary to get the rest of the things done in the
>>>> >> >>> >>>>>> 0.2 version.
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden email]>
>>>> >> >>> >>>>>> wrote:
>>>> >> >>> >>>>>>> I like the lightweight proposal to add a SIP label.
>>>> >> >>> >>>>>>>
>>>> >> >>> >>>>>>> During Spark 2.0 development, Tom (Graves) and I suggested
>>>> >> >>> >>>>>>> using the wiki to track the list of major changes, but that
>>>> >> >>> >>>>>>> never really materialized due to the overhead. Adding a SIP
>>>> >> >>> >>>>>>> label on major JIRAs and then linking to them prominently on
>>>> >> >>> >>>>>>> the Spark website makes a lot of sense.
>>>> >> >>> >>>>>>>
>>>> >> >>> >>>>>>>
>>>> >> >>> >>>>>>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia
>>>> >> >>> >>>>>>> <[hidden email]>
>>>> >> >>> >>>>>>> wrote:
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> For the improvement proposals, I think one major point was
>>>> >> >>> >>>>>>>> to make them really visible to users who are not
>>>> >> >>> >>>>>>>> contributors, so we should do more than sending stuff to
>>>> >> >>> >>>>>>>> dev@. One very lightweight idea is to have a new type of
>>>> >> >>> >>>>>>>> JIRA called a SIP and have a link to a filter that shows all
>>>> >> >>> >>>>>>>> such JIRAs from http://spark.apache.org. I also like the
>>>> >> >>> >>>>>>>> idea of SIP and design doc templates (in fact many projects
>>>> >> >>> >>>>>>>> have them).
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> Matei
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]>
>>>> >> >>> >>>>>>>> wrote:
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> I called Cody last night and talked about some of the topics
>>>> >> >>> >>>>>>>> in his email. It became clear to me Cody genuinely cares
>>>> >> >>> >>>>>>>> about the project.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> Some of the frustrations come from the success of the
>>>> >> >>> >>>>>>>> project itself becoming very "hot", and it is difficult to
>>>> >> >>> >>>>>>>> get clarity from people who don't dedicate all their time to
>>>> >> >>> >>>>>>>> Spark. In fact, it is in some ways similar to scaling an
>>>> >> >>> >>>>>>>> engineering team in a successful startup: old processes that
>>>> >> >>> >>>>>>>> worked well might not work so well when it gets to a certain
>>>> >> >>> >>>>>>>> size, cultures can get diluted, building culture vs building
>>>> >> >>> >>>>>>>> process, etc.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> I would also really like to have a more visible process for
>>>> >> >>> >>>>>>>> larger changes, especially major user-facing API changes.
>>>> >> >>> >>>>>>>> Historically we upload design docs for major changes, but
>>>> >> >>> >>>>>>>> this is not always done consistently, and it is difficult to
>>>> >> >>> >>>>>>>> control the quality of the docs, due to the volunteering
>>>> >> >>> >>>>>>>> nature of the organization.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> Some of the more concrete ideas we discussed focus on
>>>> >> >>> >>>>>>>> building a culture to improve clarity:
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Process: Large changes should have design docs posted on
>>>> >> >>> >>>>>>>> JIRA. One thing Cody and I didn't discuss but an idea that
>>>> >> >>> >>>>>>>> just came to me is we should create a design doc template
>>>> >> >>> >>>>>>>> for the project and ask everybody to follow it. The design
>>>> >> >>> >>>>>>>> doc template should also explicitly list goals and
>>>> >> >>> >>>>>>>> non-goals, to make design docs more consistent.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Process: Email dev@ to solicit feedback. We have done this
>>>> >> >>> >>>>>>>> with some changes, but again very inconsistently. Just
>>>> >> >>> >>>>>>>> posting something on JIRA isn't sufficient, because there
>>>> >> >>> >>>>>>>> are simply too many JIRAs and the signal gets lost in the
>>>> >> >>> >>>>>>>> noise. While this is generally impossible to enforce because
>>>> >> >>> >>>>>>>> we can't force all volunteers to conform to a process (or
>>>> >> >>> >>>>>>>> they might not even be aware of this), those who are more
>>>> >> >>> >>>>>>>> familiar with the project can help by emailing dev@ when
>>>> >> >>> >>>>>>>> they see something that hasn't been.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Culture: The design doc author(s) should be open to
>>>> >> >>> >>>>>>>> feedback. A design doc should serve as the base for
>>>> >> >>> >>>>>>>> discussion and is by no means the final design. Of course,
>>>> >> >>> >>>>>>>> this does not mean the author has to accept every piece of
>>>> >> >>> >>>>>>>> feedback. They should also be comfortable accepting /
>>>> >> >>> >>>>>>>> rejecting ideas on technical grounds.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Process / Culture: For major ongoing projects, it can be
>>>> >> >>> >>>>>>>> useful to have some monthly Google hangouts that are open to
>>>> >> >>> >>>>>>>> the world. I am actually not sure how well this will work,
>>>> >> >>> >>>>>>>> because of the volunteering nature and the need to adjust
>>>> >> >>> >>>>>>>> for timezones for people across the globe, but it seems
>>>> >> >>> >>>>>>>> worth trying.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Culture: Contributors (including committers) should be
>>>> >> >>> >>>>>>>> more direct in setting expectations, including whether they
>>>> >> >>> >>>>>>>> are working on a specific issue, whether they will be
>>>> >> >>> >>>>>>>> working on a specific issue, and whether an issue or PR or
>>>> >> >>> >>>>>>>> JIRA should be rejected. Most people I know in this
>>>> >> >>> >>>>>>>> community are nice and don't enjoy telling other people no,
>>>> >> >>> >>>>>>>> but it is often more annoying to a contributor to not know
>>>> >> >>> >>>>>>>> anything than to get a no.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia
>>>> >> >>> >>>>>>>> <[hidden email]>
>>>> >> >>> >>>>>>>> wrote:
>>>> >> >>> >>>>>>>>>
>>>> >> >>> >>>>>>>>>
>>>> >> >>> >>>>>>>>> Love the idea of a more visible "Spark Improvement
>>>> >> >>> >>>>>>>>> Proposal" process that solicits user input on new APIs. For
>>>> >> >>> >>>>>>>>> what it's worth, I don't think committers are trying to
>>>> >> >>> >>>>>>>>> minimize their own work -- every committer cares about
>>>> >> >>> >>>>>>>>> making the software useful for users. However, it is always
>>>> >> >>> >>>>>>>>> hard to get user input and so it helps to have this kind of
>>>> >> >>> >>>>>>>>> process. I've certainly looked at the *IPs a lot in other
>>>> >> >>> >>>>>>>>> software I use just to see the biggest things on the
>>>> >> >>> >>>>>>>>> roadmap.
>>>> >> >>> >>>>>>>>>
>>>> >> >>> >>>>>>>>> When you're talking about "changing interfaces", are
>>>> >> >>> >>>>>>>>> you
>>>> >> >>> >>>>>>>>> talking
>>>> >> >>> >>>>>>>>> about
>>>> >> >>> >>>>>>>>> public or internal APIs? I do think many people hate
>>>> >> >>> >>>>>>>>> changing
>>>> >> >>> >>>>>>>>> public APIs
>>>> >> >>> >>>>>>>>> and I actually think that's for the best of the
>>>> >> >>> >>>>>>>>> project.
>>>> >> >>> >>>>>>>>> That's
>>>> >> >>> >>>>>>>>> a
>>>> >> >>> >>>>>>>>> technical
>>>> >> >>> >>>>>>>>> debate, but basically, the worst thing when you're
>>>> >> >>> >>>>>>>>> using a
>>>> >> >>> >>>>>>>>> piece
>>>> >> >>> >>>>>>>>> of
>>>> >> >>> >>>>>>>>> software
>>>> >> >>> >>>>>>>>> is that the developers constantly ask you to rewrite
>>>> >> >>> >>>>>>>>> your
>>>> >> >>> >>>>>>>>> app
>>>> >> >>> >>>>>>>>> to
>>>> >> >>> >>>>>>>>> update to a
>>>> >> >>> >>>>>>>>> new version (and thus benefit from bug fixes, etc). Cue
>>>> >> >>> >>>>>>>>> anyone
>>>> >> >>> >>>>>>>>> who's used
>>>> >> >>> >>>>>>>>> Protobuf, or Guava. The "let's get everyone to change
>>>> >> >>> >>>>>>>>> their
>>>> >> >>> >>>>>>>>> code
>>>> >> >>> >>>>>>>>> this
>>>> >> >>> >>>>>>>>> release" model works well within a single large
>>>> >> >>> >>>>>>>>> company, but
>>>> >> >>> >>>>>>>>> doesn't work
>>>> >> >>> >>>>>>>>> well for a community, which is why nearly all *very*
>>>> >> >>> >>>>>>>>> widely
>>>> >> >>> >>>>>>>>> used
>>>> >> >>> >>>>>>>>> programming
>>>> >> >>> >>>>>>>>> interfaces (I'm talking things like Java standard
>>>> >> >>> >>>>>>>>> library,
>>>> >> >>> >>>>>>>>> Windows
>>>> >> >>> >>>>>>>>> API, etc)
>>>> >> >>> >>>>>>>>> almost *never* break backwards compatibility. All this
>>>> >> >>> >>>>>>>>> is
>>>> >> >>> >>>>>>>>> done
>>>> >> >>> >>>>>>>>> within reason
>>>> >> >>> >>>>>>>>> though, e.g. we do change things in major releases
>>>> >> >>> >>>>>>>>> (2.x,
>>>> >> >>> >>>>>>>>> 3.x,
>>>> >> >>> >>>>>>>>> etc).
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> ---------------------------------------------------------------------
>>>> >> >>> >>>>>> To unsubscribe e-mail: [hidden email]
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>
>>>> >> >>> >>>>>
>>>> >> >>> >>>>>
>>>> >> >>> >>>>> --
>>>> >> >>> >>>>> Stavros Kontopoulos
>>>> >> >>> >>>>> Senior Software Engineer
>>>> >> >>> >>>>> Lightbend, Inc.
>>>> >> >>> >>>>> p:  +30 6977967274
>>>> >> >>> >>>>> e: [hidden email]
>>>> >> >>> >>>>>
>>>> >> >>> >>>>>
>>>> >> >>> >>>>
>>>> >> >>> >>>
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>>
>>>> >> >>
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Ryan Blue
>>>> > Software Engineer
>>>> > Netflix
>>>>
>>>>
>>>
>






--
Ryan Blue
Software Engineer
Netflix

Re: Spark Improvement Proposals

kant kodali
Some of you may have already seen this, but in case you haven't, you may want to check it out.




On Tue, Oct 11, 2016 at 1:57 PM, Ryan Blue <[hidden email]> wrote:
I don't think we will have trouble with whatever rule that is adopted for accepting proposals. Considering committers' votes binding (if that is what we choose) is an established practice as long as it isn't for specific votes, like a release vote. From the Apache docs: "Who is permitted to vote is, to some extent, a community-specific thing." [1] And, I also don't see why it would be a problem to choose consensus, as long as we have an open discussion and vote about these rules.

rb

On Mon, Oct 10, 2016 at 4:15 PM, Cody Koeninger <[hidden email]> wrote:
If someone wants to tell me that it's OK and "The Apache Way" for
Kafka and Flink to have a proposal process that ends in a lazy
majority, but it's not OK for Spark to have a proposal process that
ends in a non-lazy consensus...

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#KafkaImprovementProposals-Process

In practice any PMC member can stop a proposal they don't like, so I'm
not sure how much it matters.



On Mon, Oct 10, 2016 at 5:59 PM, Mark Hamstra <[hidden email]> wrote:
> There is a larger issue to keep in mind, and that is that what you are
> proposing is a procedure that, as far as I am aware, hasn't previously been
> adopted in an Apache project, and thus is not an easy or exact fit with
> established practices that have been blessed as "The Apache Way".  As such,
> we need to be careful, because we have run into some trouble in the past
> with some inside the ASF but essentially outside the Spark community who
> didn't like the way we were doing things.
>
> On Mon, Oct 10, 2016 at 3:53 PM, Cody Koeninger <[hidden email]> wrote:
>>
>> Apache documents say lots of confusing stuff, including that committers are
>> in practice given a vote.
>>
>> https://www.apache.org/foundation/voting.html
>>
>> I don't care either way, if someone wants me to sub committer for PMC in
>> the voting section, fine, we just need a clear outcome.
>>
>>
>> On Oct 10, 2016 17:36, "Mark Hamstra" <[hidden email]> wrote:
>>>
>>> If I'm correctly understanding the kind of voting that you are talking
>>> about, then to be accurate, it is only the PMC members that have a vote, not
>>> all committers:
>>> https://www.apache.org/foundation/how-it-works.html#pmc-members
>>>
>>> On Mon, Oct 10, 2016 at 12:02 PM, Cody Koeninger <[hidden email]>
>>> wrote:
>>>>
>>>> I think the main value is in being honest about what's going on.  No
>>>> one other than committers can cast a meaningful vote, that's the
>>>> reality.  Beyond that, if people think it's more open to allow formal
>>>> proposals from anyone, I'm not necessarily against it, but my main
>>>> question would be this:
>>>>
>>>> If anyone can submit a proposal, are committers actually going to
>>>> clearly reject and close proposals that don't meet the requirements?
>>>>
>>>> Right now we have a serious problem with lack of clarity regarding
>>>> contributions, and that cannot spill over into goal-setting.
>>>>
>>>> On Mon, Oct 10, 2016 at 1:54 PM, Ryan Blue <[hidden email]> wrote:
>>>> > +1 to votes to approve proposals. I agree that proposals should have
>>>> > an
>>>> > official mechanism to be accepted, and a vote is an established means
>>>> > of
>>>> > doing that well. I like that it includes a period to review the
>>>> > proposal and
>>>> > I think proposals should have been discussed enough ahead of a vote to
>>>> > survive the possibility of a veto.
>>>> >
>>>> > I also like the names that are short and (mostly) unique, like SEP.
>>>> >
>>>> > Where I disagree is with the requirement that a committer must
>>>> > formally
>>>> > propose an enhancement. I don't see the value of restricting this: if
>>>> > someone has the will to write up a proposal then they should be
>>>> > encouraged
>>>> > to do so and start a discussion about it. Even if there is a political
>>>> > reality as Cody says, what is the value of codifying that in our
>>>> > process? I
>>>> > think restricting who can submit proposals would only undermine them
>>>> > by
>>>> > pushing contributors out. Maybe I'm missing something here?
>>>> >
>>>> > rb
>>>> >
>>>> >
>>>> >
>>>> > On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <[hidden email]>
>>>> > wrote:
>>>> >>
>>>> >> Yes, users suggesting SIPs is a good thing and is explicitly called
>>>> >> out in the linked document under the Who? section.  Formally
>>>> >> proposing
>>>> >> them, not so much, because of the political realities.
>>>> >>
>>>> >> Yes, implementation strategy definitely affects goals.  There are all
>>>> >> kinds of examples of this, I'll pick one that's my fault so as to
>>>> >> avoid sounding like I'm blaming:
>>>> >>
>>>> >> When I implemented the Kafka DStream, one of my (not explicitly
>>>> >> agreed
>>>> >> upon by the community) goals was to make sure people could use the
>>>> >> DStream with however they were already using Kafka at work.  The lack
>>>> >> of explicit agreement on that goal led to all kinds of fighting with
>>>> >> committers, that could have been avoided.  The lack of explicit
>>>> >> up-front strategy discussion led to the DStream not really working
>>>> >> with compacted topics.  I knew about compacted topics, but don't have
>>>> >> a use for them, so had a blind spot there.  If there was explicit
>>>> >> up-front discussion that my strategy was "assume that batches can be
>>>> >> defined on the driver solely by beginning and ending offsets",
>>>> >> there's
>>>> >> a greater chance that a user would have seen that and said, "hey,
>>>> >> what
>>>> >> about non-contiguous offsets in a compacted topic".
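The blind spot described above can be illustrated with a minimal sketch in plain Python. This is not Spark or Kafka code: `OffsetRange`, `read_batch`, and both sample logs are hypothetical names and data made up for illustration. The sketch assumes a batch is described only by a beginning and an ending offset, and shows that the implied record count holds for a contiguous log but not for a compacted one, where some offsets no longer exist.

```python
from typing import List, NamedTuple


class OffsetRange(NamedTuple):
    """Hypothetical batch descriptor: offsets [from_offset, until_offset) of one partition."""
    from_offset: int
    until_offset: int

    def expected_count(self) -> int:
        # Only valid if every offset in the range still exists in the log.
        return self.until_offset - self.from_offset


def read_batch(log_offsets: List[int], rng: OffsetRange) -> List[int]:
    """Return the offsets actually present in the log that fall inside the range."""
    return [o for o in log_offsets if rng.from_offset <= o < rng.until_offset]


# Made-up logs: one contiguous, one where compaction dropped superseded records.
contiguous_log = list(range(10))
compacted_log = [0, 1, 4, 5, 9]

rng = OffsetRange(0, 10)
contiguous = read_batch(contiguous_log, rng)
compacted = read_batch(compacted_log, rng)

print(len(contiguous) == rng.expected_count())  # assumption holds
print(len(compacted) == rng.expected_count())   # assumption breaks
```

Stating the "batches are defined by beginning and ending offsets" assumption this explicitly is the kind of strategy note a reader could have challenged before implementation.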
>>>> >>
>>>> >> This kind of thing is only going to happen smoothly if we have a
>>>> >> lightweight user-visible process with clear outcomes.
>>>> >>
>>>> >> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>>> >> <[hidden email]> wrote:
>>>> >> > I agree with most of what Cody said.
>>>> >> >
>>>> >> > Two things:
>>>> >> >
>>>> >> > First we can always have other people suggest SIPs but mark them as
>>>> >> > “unreviewed” and have committers basically move them forward. The
>>>> >> > problem is
>>>> >> > that writing a good document takes time. This way we can leverage
>>>> >> > non
>>>> >> > committers to do some of this work (it is just another way to
>>>> >> > contribute).
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > As for strategy, in many cases implementation strategy can affect
>>>> >> > the
>>>> >> > goals.
>>>> >> > I will give  a small example: In the current structured streaming
>>>> >> > strategy,
>>>> >> > we group by the time to achieve a sliding window. This is
>>>> >> > definitely an
>>>> >> > implementation decision and not a goal. However, I can think of
>>>> >> > several
>>>> >> > aggregation functions which have the time inside their calculation
>>>> >> > buffer.
>>>> >> > For example, let’s say we want to return a set of all distinct
>>>> >> > values.
>>>> >> > One
>>>> >> > way to implement this would be to make the set into a map and have
>>>> >> > the
>>>> >> > value
>>>> >> > contain the last time seen. Multiplying it across the groupby would
>>>> >> > cost
>>>> >> > a
>>>> >> > lot in performance. So adding such a strategy would have a great
>>>> >> > effect
>>>> >> > on
>>>> >> > the type of aggregations and their performance which does affect
>>>> >> > the
>>>> >> > goal.
>>>> >> > Without adding the strategy, it is easy for whoever goes to the
>>>> >> > design
>>>> >> > document to not think about these cases. Furthermore, it might be
>>>> >> > decided
>>>> >> > that these cases are rare enough so that the strategy is still good
>>>> >> > enough
>>>> >> > but how would we know it without user feedback?
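One possible reading of the distinct-values example above, sketched in plain Python rather than Spark: the first function keeps a separate set per sliding window (the "group by time" strategy), the second keeps a single map from value to last-seen time. All function names, the window parameters, and the sample events are hypothetical and chosen only to make the state-size difference visible; this is not how structured streaming is implemented.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple

Event = Tuple[int, Hashable]  # (timestamp, value)


def window_starts(ts: int, length: int, slide: int) -> List[int]:
    """Start times of every sliding window [s, s + length) that contains ts."""
    first = ((ts - length) // slide + 1) * slide
    last = (ts // slide) * slide
    return list(range(first, last + 1, slide))


def distinct_by_group(events: List[Event], length: int, slide: int) -> Dict[int, Set[Hashable]]:
    """Group-by-time strategy: each event is copied into every window it belongs to."""
    state: Dict[int, Set[Hashable]] = defaultdict(set)
    for ts, value in events:
        for start in window_starts(ts, length, slide):
            state[start].add(value)
    return dict(state)


def distinct_by_last_seen(events: List[Event]) -> Dict[Hashable, int]:
    """Alternative strategy: one map entry per value, holding the last time it was seen."""
    last_seen: Dict[Hashable, int] = {}
    for ts, value in events:
        last_seen[value] = max(ts, last_seen.get(value, ts))
    return last_seen


def distinct_at(last_seen: Dict[Hashable, int], now: int, length: int) -> Set[Hashable]:
    """Distinct values seen in the window of the given length ending at `now`."""
    return {v for v, ts in last_seen.items() if now - length < ts <= now}


events = [(1, "a"), (12, "b"), (25, "a"), (27, "c")]
grouped = distinct_by_group(events, length=30, slide=10)
grouped_entries = sum(len(values) for values in grouped.values())
last_seen = distinct_by_last_seen(events)

print(grouped_entries, len(last_seen))
```

With a window three slides long, each value is stored up to three times in the grouped state, while the last-seen map stores it once. The last-seen map can only answer for the current window, which is exactly the sort of trade-off that is easy to miss unless the strategy is written down.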
>>>> >> >
>>>> >> > I believe this example is exactly what Cody was talking about.
>>>> >> > Since
>>>> >> > many
>>>> >> > times implementation strategies have a large effect on the goal, we
>>>> >> > should
>>>> >> > have it discussed when discussing the goals. In addition, while it
>>>> >> > is
>>>> >> > often
>>>> >> > easy to throw out completely infeasible goals, it is often much
>>>> >> > harder
>>>> >> > to
>>>> >> > figure out that the goals are unfeasible without fine tuning.
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > Assaf.
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > From: Cody Koeninger-2 [via Apache Spark Developers List]
>>>> >> > [mailto:[hidden email][hidden email]]
>>>> >> > Sent: Monday, October 10, 2016 2:25 AM
>>>> >> > To: Mendelson, Assaf
>>>> >> > Subject: Re: Spark Improvement Proposals
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > Only committers should formally submit SIPs because in an Apache
>>>> >> > project only committers have explicit political power.  If a user can't
>>>> >> > find a committer willing to sponsor an SIP idea, they have no way to
>>>> >> > get the idea passed in any case.  If I can't find a committer to
>>>> >> > sponsor this meta-SIP idea, I'm out of luck.
>>>> >> >
>>>> >> > I do not believe unrealistic goals can be found solely by
>>>> >> > inspection.
>>>> >> > We've managed to ignore unrealistic goals even after
>>>> >> > implementation!
>>>> >> > Focusing on APIs can allow people to think they've solved
>>>> >> > something,
>>>> >> > when there's really no way of implementing that API while meeting
>>>> >> > the
>>>> >> > goals.  Rapid iteration is clearly the best way to address this,
>>>> >> > but
>>>> >> > we've already talked about why that hasn't really worked.  If
>>>> >> > adding a
>>>> >> > non-binding API section to the template is important to you, I'm
>>>> >> > not
>>>> >> > against it, but I don't think it's sufficient.
>>>> >> >
>>>> >> > On your PRD vs design doc spectrum, I'm saying this is closer to a
>>>> >> > PRD.  Clear agreement on goals is the most important thing and
>>>> >> > that's
>>>> >> > why it's the thing I want binding agreement on.  But I cannot agree
>>>> >> > to
>>>> >> > goals unless I have enough minimal technical info to judge whether
>>>> >> > the
>>>> >> > goals are likely to actually be accomplished.
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]>
>>>> >> > wrote:
>>>> >> >
>>>> >> >
>>>> >> >> Well, I think there are a few things here that don't make sense.
>>>> >> >> First,
>>>> >> >> why
>>>> >> >> should only committers submit SIPs? Development in the project
>>>> >> >> should
>>>> >> >> be
>>>> >> >> open to all contributors, whether they're committers or not.
>>>> >> >> Second, I
>>>> >> >> think
>>>> >> >> unrealistic goals can be found just by inspecting the goals, and
>>>> >> >> I'm
>>>> >> >> not
>>>> >> >> super worried that we'll accept a lot of SIPs that are then
>>>> >> >> infeasible
>>>> >> >> --
>>>> >> >> we
>>>> >> >> can then submit new ones. But this depends on whether you want
>>>> >> >> this
>>>> >> >> process
>>>> >> >> to be a "design doc lite", where people also agree on
>>>> >> >> implementation
>>>> >> >> strategy, or just a way to agree on goals. This is what I asked
>>>> >> >> earlier
>>>> >> >> about PRDs vs design docs (and I'm open to either one but I'd just
>>>> >> >> like
>>>> >> >> clarity). Finally, both as a user and designer of software, I
>>>> >> >> always
>>>> >> >> want
>>>> >> >> to
>>>> >> >> give feedback on APIs, so I'd really like a culture of having
>>>> >> >> those
>>>> >> >> early.
>>>> >> >> People don't argue about prettiness when they discuss APIs, they
>>>> >> >> argue
>>>> >> >> about
>>>> >> >> the core concepts to expose in order to meet various goals, and
>>>> >> >> then
>>>> >> >> they're
>>>> >> >> stuck maintaining those for a long time.
>>>> >> >>
>>>> >> >> Matei
>>>> >> >>
>>>> >> >> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <[hidden email]> wrote:
>>>> >> >>
>>>> >> >> Users instead of people, sure.  Committers and contributors are (or at
>>>> >> >> least should be) a subset of users.
>>>> >> >>
>>>> >> >> Non goals, sure. I don't care what the name is, but we need to
>>>> >> >> clearly
>>>> >> >> say
>>>> >> >> e.g. 'no we are not maintaining compatibility with XYZ right now'.
>>>> >> >>
>>>> >> >> API, what I care most about is whether it allows me to accomplish
>>>> >> >> the
>>>> >> >> goals.
>>>> >> >> Arguing about how ugly or pretty it is can be saved for design/
>>>> >> >> implementation imho.
>>>> >> >>
>>>> >> >> Strategy, this is necessary because otherwise goals can be out of
>>>> >> >> line
>>>> >> >> with
>>>> >> >> reality.  Don't propose goals you don't have at least some idea of
>>>> >> >> how
>>>> >> >> to
>>>> >> >> implement.
>>>> >> >>
>>>> >> >> Rejected strategies, given that committers are the only ones I'm saying
>>>> >> >> should formally submit SPARKLIs or SIPs, if they put junk in a required
>>>> >> >> section then slap them down for it and tell them to fix it.
>>>> >> >>
>>>> >> >>
>>>> >> >> On Oct 9, 2016 4:36 PM, "Matei Zaharia" <[hidden email]> wrote:
>>>> >> >>>
>>>> >> >>> Yup, this is the stuff that I found unclear. Thanks for
>>>> >> >>> clarifying
>>>> >> >>> here,
>>>> >> >>> but we should also clarify it in the writeup. In particular:
>>>> >> >>>
>>>> >> >>> - Goals needs to be about user-facing behavior ("people" is
>>>> >> >>> broad)
>>>> >> >>>
>>>> >> >>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will
>>>> >> >>> dig
>>>> >> >>> up
>>>> >> >>> one of these and say "Spark's developers have officially rejected
>>>> >> >>> X,
>>>> >> >>> which
>>>> >> >>> our awesome system has".
>>>> >> >>>
>>>> >> >>> - For user-facing stuff, I think you need a section on API.
>>>> >> >>> Virtually
>>>> >> >>> all
>>>> >> >>> other *IPs I've seen have that.
>>>> >> >>>
>>>> >> >>> - I'm still not sure why the strategy section is needed if the
>>>> >> >>> purpose
>>>> >> >>> is
>>>> >> >>> to define user-facing behavior -- unless this is the strategy for
>>>> >> >>> setting
>>>> >> >>> the goals or for defining the API. That sounds squarely like a
>>>> >> >>> design
>>>> >> >>> doc
>>>> >> >>> issue. In some sense, who cares whether the proposal is
>>>> >> >>> technically
>>>> >> >>> feasible
>>>> >> >>> right now? If it's infeasible, that will be discovered later
>>>> >> >>> during
>>>> >> >>> design
>>>> >> >>> and implementation. Same thing with rejected strategies --
>>>> >> >>> listing
>>>> >> >>> some
>>>> >> >>> of
>>>> >> >>> those is definitely useful sometimes, but if you make this a
>>>> >> >>> *required*
>>>> >> >>> section, people are just going to fill it in with bogus stuff
>>>> >> >>> (I've
>>>> >> >>> seen
>>>> >> >>> this happen before).
>>>> >> >>>
>>>> >> >>> Matei
>>>> >> >>>
>>>> >> >
>>>> >> >>> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <[hidden email]>
>>>> >> >>> > wrote:
>>>> >> >>> >
>>>> >> >>> > So to focus the discussion on the specific strategy I'm
>>>> >> >>> > suggesting,
>>>> >> >>> > documented at
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>> >> >>> >
>>>> >> >>> > "Goals: What must this allow people to do, that they can't
>>>> >> >>> > currently?"
>>>> >> >>> >
>>>> >> >>> > Is it unclear that this is focusing specifically on
>>>> >> >>> > people-visible
>>>> >> >>> > behavior?
>>>> >> >>> >
>>>> >> >>> > Rejected goals -  are important because otherwise people keep
>>>> >> >>> > trying
>>>> >> >>> > to argue about scope.  Of course you can change things later
>>>> >> >>> > with a
>>>> >> >>> > different SIP and different vote, the point is to focus.
>>>> >> >>> >
>>>> >> >>> > Use cases - are something that people are going to bring up in
>>>> >> >>> > discussion.  If they aren't clearly documented as a goal ("This
>>>> >> >>> > must
>>>> >> >>> > allow me to connect using SSL"), they should be added.
>>>> >> >>> >
>>>> >> >>> > Internal architecture - if the people who need specific
>>>> >> >>> > behavior are
>>>> >> >>> > implementers of other parts of the system, that's fine.
>>>> >> >>> >
>>>> >> >>> > Rejected strategies - If you have none of these, you have no
>>>> >> >>> > evidence
>>>> >> >>> > that the proponent didn't just go with the first thing they had
>>>> >> >>> > in
>>>> >> >>> > mind (or have already implemented), which is a big problem
>>>> >> >>> > currently.
>>>> >> >>> > Approval isn't binding as to specifics of implementation, so
>>>> >> >>> > these
>>>> >> >>> > aren't handcuffs.  The goals are the contract, the strategy is
>>>> >> >>> > evidence that contract can actually be met.
>>>> >> >>> >
>>>> >> >>> > Design docs - I'm not touching design docs.  The markdown file
>>>> >> >>> > I
>>>> >> >>> > linked specifically says of the strategy section "This is not a
>>>> >> >>> > full
>>>> >> >>> > design document."  Is this unclear?  Design docs can be worked
>>>> >> >>> > on
>>>> >> >>> > obviously, but that's not what I'm concerned with here.
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <[hidden email]>
>>>> >> >>> > wrote:
>>>> >> >>> >> Hi Cody,
>>>> >> >>> >>
>>>> >> >>> >> I think this would be a lot more concrete if we had a more
>>>> >> >>> >> detailed
>>>> >> >>> >> template
>>>> >> >>> >> for SIPs. Right now, it's not super clear what's in scope --
>>>> >> >>> >> e.g.
>>>> >> >>> >> are
>>>> >> >>> >> they
>>>> >> >>> >> a way to solicit feedback on the user-facing behavior or on
>>>> >> >>> >> the
>>>> >> >>> >> internals?
>>>> >> >>> >> "Goals" can cover both things. I've been thinking of SIPs more
>>>> >> >>> >> as
>>>> >> >>> >> Product
>>>> >> >>> >> Requirements Docs (PRDs), which focus on *what* a code change
>>>> >> >>> >> should
>>>> >> >>> >> do
>>>> >> >>> >> as
>>>> >> >>> >> opposed to how.
>>>> >> >>> >>
>>>> >> >>> >> In particular, here are some things that you may or may not
>>>> >> >>> >> consider
>>>> >> >>> >> in
>>>> >> >>> >> scope for SIPs:
>>>> >> >>> >>
>>>> >> >>> >> - Goals and non-goals: This is definitely in scope, and IMO
>>>> >> >>> >> should
>>>> >> >>> >> focus on
>>>> >> >>> >> user-visible behavior (e.g. "system supports SQL window
>>>> >> >>> >> functions"
>>>> >> >>> >> or
>>>> >> >>> >> "system continues working if one node fails"). BTW I wouldn't
>>>> >> >>> >> say
>>>> >> >>> >> "rejected
>>>> >> >>> >> goals" because some of them might become goals later, so we're
>>>> >> >>> >> not
>>>> >> >>> >> definitively rejecting them.
>>>> >> >>> >>
>>>> >> >>> >> - Public API: Probably should be included in most SIPs unless
>>>> >> >>> >> it's
>>>> >> >>> >> too
>>>> >> >>> >> large
>>>> >> >>> >> to fully specify then (e.g. "let's add an ML library").
>>>> >> >>> >>
>>>> >> >>> >> - Use cases: I usually find this very useful in PRDs to better
>>>> >> >>> >> communicate
>>>> >> >>> >> the goals.
>>>> >> >>> >>
>>>> >> >>> >> - Internal architecture: This is usually *not* a thing users
>>>> >> >>> >> can
>>>> >> >>> >> easily
>>>> >> >>> >> comment on and it sounds more like a design doc item. Of
>>>> >> >>> >> course
>>>> >> >>> >> it's
>>>> >> >>> >> important to show that the SIP is feasible to implement. One
>>>> >> >>> >> exception,
>>>> >> >>> >> however, is that I think we'll have some SIPs primarily on
>>>> >> >>> >> internals
>>>> >> >>> >> (e.g.
>>>> >> >>> >> if somebody wants to refactor Spark's query optimizer or
>>>> >> >>> >> something).
>>>> >> >>> >>
>>>> >> >>> >> - Rejected strategies: I personally wouldn't put this, because
>>>> >> >>> >> what's
>>>> >> >>> >> the
>>>> >> >>> >> point of voting to reject a strategy before you've really
>>>> >> >>> >> begun
>>>> >> >>> >> designing
>>>> >> >>> >> and implementing something? What if you discover that the
>>>> >> >>> >> strategy
>>>> >> >>> >> is
>>>> >> >>> >> actually better when you start doing stuff?
>>>> >> >>> >>
>>>> >> >>> >> At a super high level, it depends on whether you want the SIPs
>>>> >> >>> >> to
>>>> >> >>> >> be
>>>> >> >>> >> PRDs
>>>> >> >>> >> for getting some quick feedback on the goals of a feature
>>>> >> >>> >> before it
>>>> >> >>> >> is
>>>> >> >>> >> designed, or something more like full-fledged design docs
>>>> >> >>> >> (just a
>>>> >> >>> >> more
>>>> >> >>> >> visible design doc for bigger changes). I looked at Kafka's
>>>> >> >>> >> KIPs,
>>>> >> >>> >> and
>>>> >> >>> >> they
>>>> >> >>> >> actually seem to be more like design docs. This can work too
>>>> >> >>> >> but it
>>>> >> >>> >> does
>>>> >> >>> >> require more work from the proposer and it can lead to the
>>>> >> >>> >> same
>>>> >> >>> >> problems you
>>>> >> >>> >> mentioned with people already having a design and
>>>> >> >>> >> implementation in
>>>> >> >>> >> mind.
>>>> >> >>> >>
>>>> >> >>> >> Basically, the question is, are you trying to iterate faster
>>>> >> >>> >> on
>>>> >> >>> >> design
>>>> >> >>> >> by
>>>> >> >>> >> adding a step for user feedback earlier? Or are you just
>>>> >> >>> >> trying to
>>>> >> >>> >> make
>>>> >> >>> >> design docs for key features more visible (and their approval
>>>> >> >>> >> more
>>>> >> >>> >> formal)?
>>>> >> >>> >>
>>>> >> >>> >> BTW note that in either case, I'd like to have a template for
>>>> >> >>> >> design
>>>> >> >>> >> docs
>>>> >> >>> >> too, which should also include goals. I think that would've
>>>> >> >>> >> avoided
>>>> >> >>> >> some of
>>>> >> >>> >> the issues you brought up.
>>>> >> >>> >>
>>>> >> >>> >> Matei
>>>> >> >>> >>
>>>> >> >>> >> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <[hidden email]>
>>>> >> >>> >> wrote:
>>>> >> >>> >>
>>>> >> >>> >> Here's my specific proposal (meta-proposal?)
>>>> >> >>> >>
>>>> >> >>> >> Spark Improvement Proposals (SIP)
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Background:
>>>> >> >>> >>
>>>> >> >>> >> The current problem is that design and implementation of large
>>>> >> >>> >> features
>>>> >> >>> >> are
>>>> >> >>> >> often done in private, before soliciting user feedback.
>>>> >> >>> >>
>>>> >> >>> >> When feedback is solicited, it is often as to detailed design
>>>> >> >>> >> specifics, not
>>>> >> >>> >> focused on goals.
>>>> >> >>> >>
>>>> >> >>> >> When implementation does take place after design, there is
>>>> >> >>> >> often
>>>> >> >>> >> disagreement as to what goals are or are not in scope.
>>>> >> >>> >>
>>>> >> >>> >> This results in commits that don't fully meet user needs.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Goals:
>>>> >> >>> >>
>>>> >> >>> >> - Ensure user, contributor, and committer goals are clearly
>>>> >> >>> >> identified
>>>> >> >>> >> and
>>>> >> >>> >> agreed upon, before implementation takes place.
>>>> >> >>> >>
>>>> >> >>> >> - Ensure that a technically feasible strategy is chosen that
>>>> >> >>> >> is
>>>> >> >>> >> likely
>>>> >> >>> >> to
>>>> >> >>> >> meet the goals.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Rejected Goals:
>>>> >> >>> >>
>>>> >> >>> >> - SIPs are not for detailed design.  Design by committee
>>>> >> >>> >> doesn't
>>>> >> >>> >> work.
>>>> >> >>> >>
>>>> >> >>> >> - SIPs are not for every change.  We don't need that much process.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Strategy:
>>>> >> >>> >>
>>>> >> >>> >> My suggestion is outlined as a Spark Improvement Proposal
>>>> >> >>> >> process
>>>> >> >>> >> documented
>>>> >> >>> >> at
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>> >> >>> >>
>>>> >> >>> >> Specifics of Jira manipulation are an implementation detail we
>>>> >> >>> >> can
>>>> >> >>> >> figure
>>>> >> >>> >> out.
>>>> >> >>> >>
>>>> >> >>> >> I'm suggesting voting; the need here is for a _clear_ outcome.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> Rejected Strategies:
>>>> >> >>> >>
>>>> >> >>> >> Having someone who understands the problem implement it first
>>>> >> >>> >> works,
>>>> >> >>> >> but
>>>> >> >>> >> only if significant iteration after user feedback is allowed.
>>>> >> >>> >>
>>>> >> >>> >> Historically this has been problematic due to pressure to
>>>> >> >>> >> limit
>>>> >> >>> >> public
>>>> >> >>> >> api
>>>> >> >>> >> changes.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <[hidden email]>
>>>> >> >>> >> wrote:
>>>> >> >>> >>>
>>>> >> >>> >>> Alright, looks like there is quite a bit of support. We should wait
>>>> >> >>> >>> to hear from more people too.
>>>> >> >>> >>>
>>>> >> >>> >>> To push this forward, Cody and I will be working together in the
>>>> >> >>> >>> next couple of weeks to come up with a concrete, detailed proposal
>>>> >> >>> >>> on what this entails, and then we can discuss the specific proposal
>>>> >> >>> >>> as well.
>>>> >> >>> >>>
>>>> >> >>> >>>
>>>> >> >>> >>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <[hidden
>>>> >> >>> >>> email]>
>>>> >> >>> >>> wrote:
>>>> >> >>> >>>>
>>>> >> >>> >>>> Yeah, in case it wasn't clear, I was talking about SIPs for
>>>> >> >>> >>>> major
>>>> >> >>> >>>> user-facing or cross-cutting changes, not minor feature
>>>> >> >>> >>>> adds.
>>>> >> >>> >>>>
>>>> >> >>> >>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos
>>>> >> >>> >>>> <[hidden email]> wrote:
>>>> >> >>> >>>>>
>>>> >> >>> >>>>> +1 to the SIP label as long as it does not slow things down and it
>>>> >> >>> >>>>> targets optimizing efforts, coordination etc. For example, really
>>>> >> >>> >>>>> small features (assuming they don't touch public interfaces) or
>>>> >> >>> >>>>> refactorings should not need to go through this process, and I
>>>> >> >>> >>>>> hope it will be kept this way. So a guideline doc should be
>>>> >> >>> >>>>> provided, like in the KIP case.
>>>> >> >>> >>>>>
>>>> >> >>> >>>>> IMHO so far, aside from tagging things and linking them elsewhere,
>>>> >> >>> >>>>> simply having design docs and prototype implementations in PRs is
>>>> >> >>> >>>>> not something that has not worked so far. What is really a pain in
>>>> >> >>> >>>>> many projects out there is discontinuity in progress of PRs,
>>>> >> >>> >>>>> missing features, and slow reviews, which is understandable to some
>>>> >> >>> >>>>> extent... it is not only about Spark, but things can be improved
>>>> >> >>> >>>>> for sure for this project in particular, as already stated.
>>>> >> >>> >>>>>
>>>> >> >>> >>>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <[hidden
>>>> >> >>> >>>>> email]>
>>>> >> >>> >>>>> wrote:
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> +1 to adding an SIP label and linking it from the website.
>>>> >> >>> >>>>>> I think it needs
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> - template that focuses it towards soliciting user goals /
>>>> >> >>> >>>>>> non goals
>>>> >> >>> >>>>>> - clear resolution as to which strategy was chosen to
>>>> >> >>> >>>>>> pursue. I'd recommend a vote.
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> Matei asked me to clarify what I meant by changing
>>>> >> >>> >>>>>> interfaces. I think it's directly relevant to the SIP idea
>>>> >> >>> >>>>>> so I'll clarify here, and split a thread for the other
>>>> >> >>> >>>>>> discussion per Nicholas' request.
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> I meant changing public user interfaces.  I think the first
>>>> >> >>> >>>>>> design is unlikely to be right, because it's done at a time
>>>> >> >>> >>>>>> when you have the least information.  As a user, I find it
>>>> >> >>> >>>>>> considerably more frustrating to be unable to use a tool to
>>>> >> >>> >>>>>> get my job done, than I do having to make minor changes to
>>>> >> >>> >>>>>> my code in order to take advantage of features. I've seen
>>>> >> >>> >>>>>> committers be seriously reluctant to allow changes to
>>>> >> >>> >>>>>> @experimental code that are needed in order for it to
>>>> >> >>> >>>>>> really work right.  You need to be able to iterate, and if
>>>> >> >>> >>>>>> people on both sides of the fence aren't going to respect
>>>> >> >>> >>>>>> that some newer apis are subject to change, then why even
>>>> >> >>> >>>>>> mark them as such?
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> Ideally a finished SIP should give me a checklist of things
>>>> >> >>> >>>>>> that an implementation must do, and things that it doesn't
>>>> >> >>> >>>>>> need to do. Contributors/committers should be seriously
>>>> >> >>> >>>>>> discouraged from putting out a version 0.1 that doesn't
>>>> >> >>> >>>>>> have at least a prototype implementation of all those
>>>> >> >>> >>>>>> things, especially if they're then going to argue against
>>>> >> >>> >>>>>> interface changes necessary to get the rest of the things
>>>> >> >>> >>>>>> done in the 0.2 version.
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden email]>
>>>> >> >>> >>>>>> wrote:
>>>> >> >>> >>>>>>> I like the lightweight proposal to add a SIP label.
>>>> >> >>> >>>>>>>
>>>> >> >>> >>>>>>> During Spark 2.0 development, Tom (Graves) and I suggested
>>>> >> >>> >>>>>>> using the wiki to track the list of major changes, but that
>>>> >> >>> >>>>>>> never really materialized due to the overhead. Adding a SIP
>>>> >> >>> >>>>>>> label on major JIRAs and then linking to them prominently
>>>> >> >>> >>>>>>> on the Spark website makes a lot of sense.
>>>> >> >>> >>>>>>>
>>>> >> >>> >>>>>>>
>>>> >> >>> >>>>>>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia
>>>> >> >>> >>>>>>> <[hidden email]>
>>>> >> >>> >>>>>>> wrote:
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> For the improvement proposals, I think one major point was
>>>> >> >>> >>>>>>>> to make them really visible to users who are not
>>>> >> >>> >>>>>>>> contributors, so we should do more than sending stuff to
>>>> >> >>> >>>>>>>> dev@. One very lightweight idea is to have a new type of
>>>> >> >>> >>>>>>>> JIRA called a SIP and have a link to a filter that shows
>>>> >> >>> >>>>>>>> all such JIRAs from http://spark.apache.org. I also like
>>>> >> >>> >>>>>>>> the idea of SIP and design doc templates (in fact many
>>>> >> >>> >>>>>>>> projects have them).
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> Matei
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]>
>>>> >> >>> >>>>>>>> wrote:
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> I called Cody last night and talked about some of the
>>>> >> >>> >>>>>>>> topics in his email. It became clear to me Cody genuinely
>>>> >> >>> >>>>>>>> cares about the project.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> Some of the frustrations come from the success of the
>>>> >> >>> >>>>>>>> project itself becoming very "hot", and it is difficult to
>>>> >> >>> >>>>>>>> get clarity from people who don't dedicate all their time
>>>> >> >>> >>>>>>>> to Spark. In fact, it is in some ways similar to scaling
>>>> >> >>> >>>>>>>> an engineering team in a successful startup: old processes
>>>> >> >>> >>>>>>>> that worked well might not work so well when it gets to a
>>>> >> >>> >>>>>>>> certain size, cultures can get diluted, building culture
>>>> >> >>> >>>>>>>> vs building process, etc.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> I also really like to have a more visible process for
>>>> >> >>> >>>>>>>> larger changes, especially major user facing API changes.
>>>> >> >>> >>>>>>>> Historically we upload design docs for major changes, but
>>>> >> >>> >>>>>>>> it is not always consistent and difficult to quality of
>>>> >> >>> >>>>>>>> the docs, due to the volunteering nature of the
>>>> >> >>> >>>>>>>> organization.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> Some of the more concrete ideas we discussed focus on
>>>> >> >>> >>>>>>>> building a culture to improve clarity:
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Process: Large changes should have design docs posted on
>>>> >> >>> >>>>>>>> JIRA. One thing Cody and I didn't discuss but an idea that
>>>> >> >>> >>>>>>>> just came to me is we should create a design doc template
>>>> >> >>> >>>>>>>> for the project and ask everybody to follow. The design
>>>> >> >>> >>>>>>>> doc template should also explicitly list goals and
>>>> >> >>> >>>>>>>> non-goals, to make design docs more consistent.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Process: Email dev@ to solicit feedback. We have done
>>>> >> >>> >>>>>>>> this with some changes, but again very inconsistently.
>>>> >> >>> >>>>>>>> Just posting something on JIRA isn't sufficient, because
>>>> >> >>> >>>>>>>> there are simply too many JIRAs and the signal gets lost
>>>> >> >>> >>>>>>>> in the noise. While this is generally impossible to
>>>> >> >>> >>>>>>>> enforce because we can't force all volunteers to conform
>>>> >> >>> >>>>>>>> to a process (or they might not even be aware of this),
>>>> >> >>> >>>>>>>> those who are more familiar with the project can help by
>>>> >> >>> >>>>>>>> emailing the dev@ when they see something that hasn't
>>>> >> >>> >>>>>>>> been.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Culture: The design doc author(s) should be open to
>>>> >> >>> >>>>>>>> feedback. A design doc should serve as the base for
>>>> >> >>> >>>>>>>> discussion and is by no means the final design. Of course,
>>>> >> >>> >>>>>>>> this does not mean the author has to accept every
>>>> >> >>> >>>>>>>> feedback. They should also be comfortable accepting /
>>>> >> >>> >>>>>>>> rejecting ideas on technical grounds.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Process / Culture: For major ongoing projects, it can be
>>>> >> >>> >>>>>>>> useful to have some monthly Google hangouts that are open
>>>> >> >>> >>>>>>>> to the world. I am actually not sure how well this will
>>>> >> >>> >>>>>>>> work, because of the volunteering nature and we need to
>>>> >> >>> >>>>>>>> adjust for timezones for people across the globe, but it
>>>> >> >>> >>>>>>>> seems worth trying.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> - Culture: Contributors (including committers) should be
>>>> >> >>> >>>>>>>> more direct in setting expectations, including whether
>>>> >> >>> >>>>>>>> they are working on a specific issue, whether they will be
>>>> >> >>> >>>>>>>> working on a specific issue, and whether an issue or pr or
>>>> >> >>> >>>>>>>> jira should be rejected. Most people I know in this
>>>> >> >>> >>>>>>>> community are nice and don't enjoy telling other people
>>>> >> >>> >>>>>>>> no, but it is often more annoying to a contributor to not
>>>> >> >>> >>>>>>>> know anything than getting a no.
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia
>>>> >> >>> >>>>>>>> <[hidden email]>
>>>> >> >>> >>>>>>>> wrote:
>>>> >> >>> >>>>>>>>>
>>>> >> >>> >>>>>>>>>
>>>> >> >>> >>>>>>>>> Love the idea of a more visible "Spark Improvement
>>>> >> >>> >>>>>>>>> Proposal" process that solicits user input on new APIs.
>>>> >> >>> >>>>>>>>> For what it's worth, I don't think committers are trying
>>>> >> >>> >>>>>>>>> to minimize their own work -- every committer cares about
>>>> >> >>> >>>>>>>>> making the software useful for users. However, it is
>>>> >> >>> >>>>>>>>> always hard to get user input and so it helps to have
>>>> >> >>> >>>>>>>>> this kind of process. I've certainly looked at the *IPs a
>>>> >> >>> >>>>>>>>> lot in other software I use just to see the biggest
>>>> >> >>> >>>>>>>>> things on the roadmap.
>>>> >> >>> >>>>>>>>>
>>>> >> >>> >>>>>>>>> When you're talking about "changing interfaces", are you
>>>> >> >>> >>>>>>>>> talking about public or internal APIs? I do think many
>>>> >> >>> >>>>>>>>> people hate changing public APIs and I actually think
>>>> >> >>> >>>>>>>>> that's for the best of the project. That's a technical
>>>> >> >>> >>>>>>>>> debate, but basically, the worst thing when you're using
>>>> >> >>> >>>>>>>>> a piece of software is that the developers constantly ask
>>>> >> >>> >>>>>>>>> you to rewrite your app to update to a new version (and
>>>> >> >>> >>>>>>>>> thus benefit from bug fixes, etc). Cue anyone who's used
>>>> >> >>> >>>>>>>>> Protobuf, or Guava. The "let's get everyone to change
>>>> >> >>> >>>>>>>>> their code this release" model works well within a single
>>>> >> >>> >>>>>>>>> large company, but doesn't work well for a community,
>>>> >> >>> >>>>>>>>> which is why nearly all *very* widely used programming
>>>> >> >>> >>>>>>>>> interfaces (I'm talking things like Java standard
>>>> >> >>> >>>>>>>>> library, Windows API, etc) almost *never* break backwards
>>>> >> >>> >>>>>>>>> compatibility. All this is done within reason though,
>>>> >> >>> >>>>>>>>> e.g. we do change things in major releases (2.x, 3.x,
>>>> >> >>> >>>>>>>>> etc).
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>>
>>>> >> >>> >>>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>> ---------------------------------------------------------------------
>>>> >> >>> >>>>>> To unsubscribe e-mail: [hidden email]
>>>> >> >>> >>>>>>
>>>> >> >>> >>>>>
>>>> >> >>> >>>>>
>>>> >> >>> >>>>>
>>>> >> >>> >>>>> --
>>>> >> >>> >>>>> Stavros Kontopoulos
>>>> >> >>> >>>>> Senior Software Engineer
>>>> >> >>> >>>>> Lightbend, Inc.
>>>> >> >>> >>>>> p:  +30 6977967274
>>>> >> >>> >>>>> e: [hidden email]
>>>> >> >>> >>>>>
>>>> >> >>> >>>>>
>>>> >> >>> >>>>
>>>> >> >>> >>>
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>>
>>>> >> >>
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > ---------------------------------------------------------------------
>>>> >> > To unsubscribe e-mail: [hidden email]
>>>> >> >
>>>> >> >
>>>> >>
>>>> >> ---------------------------------------------------------------------
>>>> >> To unsubscribe e-mail: [hidden email]
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Ryan Blue
>>>> > Software Engineer
>>>> > Netflix
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: [hidden email]
>>>>
>>>
>






--
Ryan Blue
Software Engineer
Netflix


Re: Spark Improvement Proposals

Tomasz Gawęda
In reply to this post by Ryan Blue
Hi everyone,

I'm quite late with my answer, but I think my suggestions may help a
little bit. :) Many technical and organizational topics were mentioned,
but I want to focus on the negative posts about Spark and about "haters".

I really like Spark. Ease of use, speed, a very good community - it has
everything. But every project has to "fight" on the "framework market"
to stay no. 1. I follow many Spark and Big Data communities, so
maybe my mail will inspire someone :)

You (every Spark developer; so far I haven't had enough time to start
contributing to Spark) have done an excellent job. So why are some people
saying that Flink (or another framework) is better, as was posted on
this mailing list? No, not because that framework is better in all
cases. In my opinion, many of these discussions were started after
Flink marketing-like posts. Please look at the StackOverflow "Flink vs ...."
posts: almost every one is "won" by Flink. Answers sometimes
say nothing about other frameworks; Flink's users (often PMC members) are
just posting the same information about real-time streaming, about delta
iterations, etc. It looks smart and very often it is marked as an answer,
even if - in my opinion - the whole truth wasn't told.


My suggestion: I don't have enough money and knowledge to perform a huge
performance test. Maybe some company that supports Spark (Databricks,
Cloudera? - just saying you're the most visible in the community :) ) could
perform performance tests of:

- streaming engine - Spark will probably lose because of the mini-batch
model, however the difference should now be much lower than in
previous versions

- Machine Learning models

- batch jobs

- Graph jobs

- SQL queries

People will see that Spark is evolving and is also a modern framework,
because after reading the posts mentioned above people may think "it is
outdated, the future is in framework X".

Matei Zaharia posted an excellent blog post about how Spark Structured
Streaming beats every other framework in terms of ease of use and
reliability. Performance tests, done in various environments (for
example: laptop, small 2-node cluster, 10-node cluster, 20-node
cluster), could also be very good marketing material to say "hey, you
say that you're better, but Spark is still faster and is still
getting even faster!". This would be based on facts (just numbers),
not opinions. It would be good for companies, for marketing purposes and
for every Spark developer.


Second: real-time streaming. I wrote some time ago about real-time
streaming support in Spark Structured Streaming. Some work should be
done to make SSS more low-latency, but I think it's possible. Maybe
Spark could look at Gearpump, which is also built on top of Akka? I don't
know yet; it is a good topic for a SIP. However, I think that Spark should
have real-time streaming support. Currently I see many posts/comments
saying that "Spark has too big latency". Spark Streaming does a very good
job with micro-batches, however I think it is also possible to add more
real-time processing.

Other people said much more and I agree with the SIP proposal. I'm also
happy that the PMC members are not saying that they will not listen to
users, but that they really want to make Spark better for every user.


What do you think about these two topics? I'm especially looking at Cody
(who started this topic) and the PMC members :)

Best regards,

Tomasz


On 2016-10-07 at 04:51, Cody Koeninger wrote:

> I love Spark.  3 or 4 years ago it was the first distributed computing
> environment that felt usable, and the community was welcoming.
>
> But I just got back from the Reactive Summit, and this is what I observed:
>
> - Industry leaders on stage making fun of Spark's streaming model
> - Open source project leaders saying they looked at Spark's governance
> as a model to avoid
> - Users saying they chose Flink because it was technically superior
> and they couldn't get any answers on the Spark mailing lists
>
> Whether you agree with the substance of any of this, when this stuff
> gets repeated enough people will believe it.
>
> Right now Spark is suffering from its own success, and I think
> something needs to change.
>
> - We need a clear process for planning significant changes to the codebase.
> I'm not saying you need to adopt Kafka Improvement Proposals exactly,
> but you need a documented process with a clear outcome (e.g. a vote).
> Passing around google docs after an implementation has largely been
> decided on doesn't cut it.
>
> - All technical communication needs to be public.
> Things getting decided in private chat, or when 1/3 of the committers
> work for the same company and can just talk to each other...
> Yes, it's convenient, but it's ultimately detrimental to the health of
> the project.
> The way structured streaming has played out has shown that there are
> significant technical blind spots (myself included).
> One way to address that is to get the people who have domain knowledge
> involved, and listen to them.
>
> - We need more committers, and more committer diversity.
> Per committer there are, what, more than 20 contributors and 10 new
> jira tickets a month?  It's too much.
> There are people (I am _not_ referring to myself) who have been around
> for years, contributed thousands of lines of code, helped educate the
> public around Spark... and yet are never going to be voted in.
>
> - We need a clear process for managing volunteer work.
> Too many tickets sit around unowned, unclosed, uncertain.
> If someone proposed something and it isn't up to snuff, tell them and
> close it.  It may be blunt, but it's clearer than "silent no".
> If someone wants to work on something, let them own the ticket and set
> a deadline. If they don't meet it, close it or reassign it.
>
> This is not me putting on an Apache Bureaucracy hat.  This is me
> saying, as a fellow hacker and loyal dissenter, something is wrong
> with the culture and process.
>
> Please, let's change it.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [hidden email]
>



Re: Spark Improvement Proposals

debasish83
Thanks Cody for bringing up a valid point. I picked up Spark in 2014 as soon as I looked into it, since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun. But now that we have gone deeper with Spark and the real-time streaming use case is getting more prominent, I think it is time to bring in a messaging model in conjunction with the batch/micro-batch API that Spark is good at. Close integration of akka-streams with Spark's micro-batching APIs looks like a great direction to stay in the game with Apache Flink. Spark 2.0 integrated streaming with batch on the assumption that micro-batching is sufficient to run SQL commands on a stream, but do we really have time to do SQL processing on streaming data within 1-2 seconds?

After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do detailing Spark internals so that more people from the community start to take an active role in fixing the issues, so that Spark stays strong compared to Flink.

https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals

https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals

Spark is no longer an engine that works only for micro-batch and batch. We (and I am sure many others) are pushing Spark as an engine for stream and query processing. We need to make it a state-of-the-art engine for high-speed streaming data and user queries as well!



Re: Spark Improvement Proposals

Cody Koeninger-2
I think narrowly focusing on Flink or benchmarks is missing my point.

My point is evolve or die.  Spark's governance and organization are
hampering its ability to evolve technologically, and they need to
change.

On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <[hidden email]> wrote:

> Thanks Cody for bringing up a valid point...I picked up Spark in 2014 as
> soon as I looked into it since compared to writing Java map-reduce and
> Cascading code, Spark made writing distributed code fun...But now as we went
> deeper with Spark and real-time streaming use-case gets more prominent, I
> think it is time to bring a messaging model in conjunction with the
> batch/micro-batch API that Spark is good at....akka-streams close
> integration with spark micro-batching APIs looks like a great direction to
> stay in the game with Apache Flink...Spark 2.0 integrated streaming with
> batch with the assumption is that micro-batching is sufficient to run SQL
> commands on stream but do we really have time to do SQL processing at
> streaming data within 1-2 seconds ?
>
> After reading the email chain, I started to look into Flink documentation
> and if you compare it with Spark documentation, I think we have major work
> to do detailing out Spark internals so that more people from community start
> to take active role in improving the issues so that Spark stays strong
> compared to Flink.
>
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>
> Spark is no longer just an engine for micro-batch and batch. We (and
> I am sure many others) are pushing Spark as an engine for stream and query
> processing. We need to make it a state-of-the-art engine for high-speed
> streaming data and user queries as well!
>
> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <[hidden email]>
> wrote:
>>
>> Hi everyone,
>>
>> I'm quite late with my answer, but I think my suggestions may help a
>> little bit. :) Many technical and organizational topics were mentioned,
>> but I want to focus on the negative posts about Spark and about "haters".
>>
>> I really like Spark. Ease of use, speed, a very good community - it has
>> everything. But every project has to "fight" in the "framework market"
>> to stay number one. I'm following many Spark and Big Data communities;
>> maybe my mail will inspire someone :)
>>
>> You (every Spark developer; so far I haven't had enough time to start
>> contributing to Spark) have done an excellent job. So why are some people
>> saying that Flink (or another framework) is better, as was posted on
>> this mailing list? No, not because that framework is better in all
>> cases. In my opinion, many of these discussions were started after
>> Flink marketing-like posts. Please look at StackOverflow "Flink vs ...."
>> posts: almost every post is "won" by Flink. Answers sometimes say
>> nothing about other frameworks; Flink's users (often PMC members)
>> just post the same information about real-time streaming, about delta
>> iterations, etc. It looks smart and very often it is marked as an answer,
>> even if - in my opinion - the whole truth wasn't told.
>>
>>
>> My suggestion: I don't have enough money and knowledge to perform a huge
>> performance test. Maybe some company that supports Spark (Databricks,
>> Cloudera? - just saying you're the most visible in the community :) ) could
>> perform performance tests of:
>>
>> - streaming engine - Spark will probably lose because of the mini-batch
>> model, however the difference should now be much smaller than in
>> previous versions
>>
>> - Machine Learning models
>>
>> - batch jobs
>>
>> - Graph jobs
>>
>> - SQL queries
>>
>> People will see that Spark is evolving and is also a modern framework,
>> because after reading the posts mentioned above people may think "it is
>> outdated, the future is in framework X".
>>
>> Matei Zaharia posted an excellent blog post about how Spark Structured
>> Streaming beats every other framework in terms of ease of use and
>> reliability. Performance tests, done in various environments (for
>> example: laptop, small 2-node cluster, 10-node cluster, 20-node
>> cluster), could also be very good marketing material to say "hey, you
>> say that you're better, but Spark is still faster and is still
>> getting even faster!". This would be based on facts (just numbers),
>> not opinions. It would be good for companies, for marketing purposes and
>> for every Spark developer.
>>
>>
>> Second: real-time streaming. Some time ago I wrote about real-time
>> streaming support in Spark Structured Streaming. Some work should be
>> done to make SSS more low-latency, but I think it's possible. Maybe
>> Spark could look at Gearpump, which is also built on top of Akka? I don't
>> know yet; it is a good topic for a SIP. However, I think that Spark should
>> have real-time streaming support. Currently I see many posts/comments
>> saying that "Spark has too high latency". Spark Streaming is doing a very
>> good job with micro-batches, however I think it is possible to also add
>> more real-time processing.
>>
>> Other people said much more and I agree with the SIP proposal. I'm also
>> happy that PMC members are not saying that they will not listen to users,
>> but that they really want to make Spark better for every user.
>>
>>
>> What do you think about these two topics? I'm especially looking at Cody
>> (who started this topic) and the PMC members :)
>>
>> Best regards,
>>
>> Tomasz
>>
>>
>> On 2016-10-07 at 04:51, Cody Koeninger wrote:
>> > I love Spark.  3 or 4 years ago it was the first distributed computing
>> > environment that felt usable, and the community was welcoming.
>> >
>> > But I just got back from the Reactive Summit, and this is what I
>> > observed:
>> >
>> > - Industry leaders on stage making fun of Spark's streaming model
>> > - Open source project leaders saying they looked at Spark's governance
>> > as a model to avoid
>> > - Users saying they chose Flink because it was technically superior
>> > and they couldn't get any answers on the Spark mailing lists
>> >
>> > Whether you agree with the substance of any of this, when this stuff
>> > gets repeated enough people will believe it.
>> >
>> > Right now Spark is suffering from its own success, and I think
>> > something needs to change.
>> >
>> > - We need a clear process for planning significant changes to the
>> > codebase.
>> > I'm not saying you need to adopt Kafka Improvement Proposals exactly,
>> > but you need a documented process with a clear outcome (e.g. a vote).
>> > Passing around google docs after an implementation has largely been
>> > decided on doesn't cut it.
>> >
>> > - All technical communication needs to be public.
>> > Things getting decided in private chat, or when 1/3 of the committers
>> > work for the same company and can just talk to each other...
>> > Yes, it's convenient, but it's ultimately detrimental to the health of
>> > the project.
>> > The way structured streaming has played out has shown that there are
>> > significant technical blind spots (myself included).
>> > One way to address that is to get the people who have domain knowledge
>> > involved, and listen to them.
>> >
>> > - We need more committers, and more committer diversity.
>> > Per committer there are, what, more than 20 contributors and 10 new
>> > jira tickets a month?  It's too much.
>> > There are people (I am _not_ referring to myself) who have been around
>> > for years, contributed thousands of lines of code, helped educate the
>> > public around Spark... and yet are never going to be voted in.
>> >
>> > - We need a clear process for managing volunteer work.
>> > Too many tickets sit around unowned, unclosed, uncertain.
>> > If someone proposed something and it isn't up to snuff, tell them and
>> > close it.  It may be blunt, but it's clearer than "silent no".
>> > If someone wants to work on something, let them own the ticket and set
>> > a deadline. If they don't meet it, close it or reassign it.
>> >
>> > This is not me putting on an Apache Bureaucracy hat.  This is me
>> > saying, as a fellow hacker and loyal dissenter, something is wrong
>> > with the culture and process.
>> >
>> > Please, let's change it.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: [hidden email]
>> >
>
>



Re: Spark Improvement Proposals

Tomasz Gawęda

Maybe my mail was not clear enough.


I didn't want to write "let's focus on Flink" or any other framework. The idea with benchmarks was to show two things:

- why some people are giving Spark bad PR

- how - in an easy way - we can change that and show that Spark is still on top


No more, no less. Benchmarks will be helpful, but I don't think they're the most important thing in Spark :) On the Spark main page there is still the "Spark vs Hadoop" chart. It is important to show that the framework is not just the same Spark with a different API, but is much faster and more optimized, comparable to or even faster than other frameworks.


About real-time streaming, I think it would just be good to see it in Spark. I really like the current Spark model, but there are many voices saying "we need more" - the community should also listen to them and try to help them. With SIPs it would be easier; I just posted this example as a "thing that may be changed with a SIP".


I really like the unification via Datasets, but there are a lot of algorithms inside - let's make an easy API, but with a strong background (articles, benchmarks, descriptions, etc.) that shows that Spark is still a modern framework.


Maybe now my intention will be clearer :) As I said, organizational ideas were already mentioned and I agree with them; my mail was just to show some aspects from my side, that is, from the side of a developer and a person who is trying to help others with Spark (via StackOverflow or other ways).


Best regards,

Tomasz



From: Cody Koeninger <[hidden email]>
Sent: 17 October 2016 16:46
To: Debasish Das
Cc: Tomasz Gawęda; [hidden email]
Subject: Re: Spark Improvement Proposals
 
I think narrowly focusing on Flink or benchmarks is missing my point.

My point is evolve or die.  Spark's governance and organization is
hampering its ability to evolve technologically, and it needs to
change.




Re: Re: Spark Improvement Proposals

Cody Koeninger-2
Now that Spark Summit Europe is over, are any committers interested in
moving forward with this?

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

Or are we going to let this discussion die on the vine?

On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
<[hidden email]> wrote:

> Maybe my mail was not clear enough.
>
>
> I didn't want to write "let's focus on Flink" or any other framework. The
> idea with benchmarks was to show two things:
>
> - why some people are giving Spark bad PR
>
> - how - in an easy way - we can change that and show that Spark is still on
> top
>
>
> No more, no less. Benchmarks will be helpful, but I don't think they're the
> most important thing in Spark :) On the Spark main page there is still the
> "Spark vs Hadoop" chart. It is important to show that the framework is not
> just the same Spark with a different API, but is much faster and more
> optimized, comparable to or even faster than other frameworks.
>
>
> About real-time streaming, I think it would just be good to see it in Spark.
> I really like the current Spark model, but there are many voices saying "we
> need more" - the community should also listen to them and try to help them.
> With SIPs it would be easier; I just posted this example as a "thing that
> may be changed with a SIP".
>
>
> I really like the unification via Datasets, but there are a lot of
> algorithms inside - let's make an easy API, but with a strong background
> (articles, benchmarks, descriptions, etc.) that shows that Spark is still a
> modern framework.
>
>
> Maybe now my intention will be clearer :) As I said, organizational ideas
> were already mentioned and I agree with them; my mail was just to show some
> aspects from my side, that is, from the side of a developer and a person who
> is trying to help others with Spark (via StackOverflow or other ways).
>
>
> Best regards,
>
> Tomasz
>
>
> ________________________________
> From: Cody Koeninger <[hidden email]>
> Sent: 17 October 2016 16:46
> To: Debasish Das
> Cc: Tomasz Gawęda; [hidden email]
> Subject: Re: Spark Improvement Proposals
>
> I think narrowly focusing on Flink or benchmarks is missing my point.
>
> My point is evolve or die.  Spark's governance and organization is
> hampering its ability to evolve technologically, and it needs to
> change.
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [hidden email]
>



Re: Re: Spark Improvement Proposals

Ryan Blue
I agree, we should push forward on this. I think there is enough consensus to call a vote, unless someone else thinks that there is more to discuss?

rb

On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <[hidden email]> wrote:
Now that Spark Summit Europe is over, are any committers interested in
moving forward with this?

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

Or are we going to let this discussion die on the vine?

On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
<[hidden email]> wrote:
> Maybe my mail was not clear enough.
>
>
> I didn't want to write "let's focus on Flink" or any other framework. The
> idea with benchmarks was to show two things:
>
> - why some people are giving Spark bad PR
>
> - how - in an easy way - we can change that and show that Spark is still on
> top
>
>
> No more, no less. Benchmarks will be helpful, but I don't think they're the
> most important thing in Spark :) On the Spark main page there is still the
> "Spark vs Hadoop" chart. It is important to show that the framework is not
> just the same Spark with a different API, but is much faster and more
> optimized, comparable to or even faster than other frameworks.
>
>
> About real-time streaming, I think it would just be good to see it in Spark.
> I really like the current Spark model, but there are many voices saying "we
> need more" - the community should also listen to them and try to help them.
> With SIPs it would be easier; I just posted this example as a "thing that
> may be changed with a SIP".
>
>
> I really like the unification via Datasets, but there are a lot of
> algorithms inside - let's make an easy API, but with a strong background
> (articles, benchmarks, descriptions, etc.) that shows that Spark is still a
> modern framework.
>
>
> Maybe now my intention will be clearer :) As I said, organizational ideas
> were already mentioned and I agree with them; my mail was just to show some
> aspects from my side, that is, from the side of a developer and a person who
> is trying to help others with Spark (via StackOverflow or other ways).
>
>
> Best regards,
>
> Tomasz
>
>
> ________________________________
> From: Cody Koeninger <[hidden email]>
> Sent: 17 October 2016 16:46
> To: Debasish Das
> Cc: Tomasz Gawęda; [hidden email]
> Subject: Re: Spark Improvement Proposals
>
> I think narrowly focusing on Flink or benchmarks is missing my point.
>
> My point is evolve or die.  Spark's governance and organization is
> hampering its ability to evolve technologically, and it needs to
> change.
>
> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <[hidden email]>
> wrote:
>> Thanks, Cody, for bringing up a valid point. I picked up Spark in 2014 as
>> soon as I looked into it, since compared to writing Java MapReduce and
>> Cascading code, Spark made writing distributed code fun. But now that we
>> have gone deeper with Spark and real-time streaming use cases are becoming
>> more prominent, I think it is time to bring in a messaging model alongside
>> the batch/micro-batch API that Spark is good at. Close integration of
>> akka-streams with Spark's micro-batching APIs looks like a great direction
>> for staying in the game with Apache Flink. Spark 2.0 integrated streaming
>> with batch on the assumption that micro-batching is sufficient to run SQL
>> commands on a stream, but do we really have time to do SQL processing on
>> streaming data within 1-2 seconds?
>>
>> After reading the email chain, I started to look into the Flink
>> documentation, and if you compare it with the Spark documentation, I think
>> we have major work to do in detailing Spark internals so that more people
>> from the community start to take an active role in working on the issues,
>> so that Spark stays strong compared to Flink.
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>
>> Spark is no longer an engine that works only for micro-batch and batch. We
>> (and I am sure many others) are pushing Spark as an engine for stream and
>> query processing. We need to make it a state-of-the-art engine for
>> high-speed streaming data and user queries as well!
>>
>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <[hidden email]>
>> wrote:
>>>
>>> Hi everyone,
>>>
>>> I'm quite late with my answer, but I think my suggestions may help a
>>> little bit. :) Many technical and organizational topics were mentioned,
>>> but I want to focus on the negative posts about Spark and on the
>>> "haters".
>>>
>>> I really like Spark. Ease of use, speed, a very good community - it's
>>> all here. But every project has to "fight" on the "framework market"
>>> to stay no. 1. I'm following many Spark and Big Data communities, so
>>> maybe my mail will inspire someone :)
>>>
>>> You (every Spark developer; so far I haven't had enough time to start
>>> contributing to Spark) have done an excellent job. So why are some people
>>> saying that Flink (or another framework) is better, as was posted on
>>> this mailing list? No, not because that framework is better in all
>>> cases. In my opinion, many of these discussions were started after
>>> marketing-like posts about Flink. Please look at the StackOverflow
>>> "Flink vs ...." posts: almost every one is "won" by Flink. The answers
>>> sometimes say nothing about the other frameworks; Flink's users (often
>>> PMC members) are just posting the same information about real-time
>>> streaming, about delta iterations, etc. It looks smart and very often it
>>> is marked as the answer, even if - in my opinion - the whole truth
>>> wasn't told.
>>>
>>>
>>> My suggestion: I don't have enough money or knowledge to perform a huge
>>> performance test. Maybe some company that supports Spark (Databricks,
>>> Cloudera? - just saying you're the most visible in the community :) )
>>> could perform a performance test of:
>>>
>>> - streaming engine - Spark will probably lose because of its mini-batch
>>> model, however the difference should now be much smaller than in
>>> previous versions
>>>
>>> - Machine Learning models
>>>
>>> - batch jobs
>>>
>>> - Graph jobs
>>>
>>> - SQL queries
>>>
>>> People will see that Spark is evolving and is also a modern framework,
>>> because after reading the posts mentioned above people may think "it is
>>> outdated, the future is in framework X".
>>>
>>> Matei Zaharia posted an excellent blog post about how Spark Structured
>>> Streaming beats every other framework in terms of ease of use and
>>> reliability. Performance tests, done in various environments (for
>>> example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node
>>> cluster), could also be very good marketing material to say "hey, you're
>>> saying that you're better, but Spark is still faster and is still
>>> getting even faster!". This would be based on facts (just numbers),
>>> not opinions. It would be good for companies, for marketing purposes and
>>> for every Spark developer.
>>>
>>>
>>> Second: real-time streaming. Some time ago I wrote about real-time
>>> streaming support in Spark Structured Streaming. Some work should be
>>> done to make SSS lower-latency, but I think it's possible. Maybe
>>> Spark could look at Gearpump, which is also built on top of Akka? I don't
>>> know yet; it is a good topic for a SIP. However, I think that Spark should
>>> have real-time streaming support. Currently I see many posts/comments
>>> saying that "Spark has too big latency". Spark Streaming is doing a very
>>> good job with micro-batches, however I think it is also possible to add
>>> more real-time processing.
>>>
>>> Other people said much more and I agree with the SIP proposal. I'm also
>>> happy that the PMC members are not saying that they will not listen to
>>> users, but that they really want to make Spark better for every user.
>>>
>>>
>>> What do you think about these two topics? I'm looking especially at Cody
>>> (who started this topic) and the PMC members :)
>>>
>>> Best regards,
>>>
>>> Tomasz
>>>
>>>
>>> On 2016-10-07 at 04:51, Cody Koeninger wrote:
>>> > I love Spark.  3 or 4 years ago it was the first distributed computing
>>> > environment that felt usable, and the community was welcoming.
>>> >
>>> > But I just got back from the Reactive Summit, and this is what I
>>> > observed:
>>> >
>>> > - Industry leaders on stage making fun of Spark's streaming model
>>> > - Open source project leaders saying they looked at Spark's governance
>>> > as a model to avoid
>>> > - Users saying they chose Flink because it was technically superior
>>> > and they couldn't get any answers on the Spark mailing lists
>>> >
>>> > Whether you agree with the substance of any of this, when this stuff
>>> > gets repeated enough people will believe it.
>>> >
>>> > Right now Spark is suffering from its own success, and I think
>>> > something needs to change.
>>> >
>>> > - We need a clear process for planning significant changes to the
>>> > codebase.
>>> > I'm not saying you need to adopt Kafka Improvement Proposals exactly,
>>> > but you need a documented process with a clear outcome (e.g. a vote).
>>> > Passing around google docs after an implementation has largely been
>>> > decided on doesn't cut it.
>>> >
>>> > - All technical communication needs to be public.
>>> > Things getting decided in private chat, or when 1/3 of the committers
>>> > work for the same company and can just talk to each other...
>>> > Yes, it's convenient, but it's ultimately detrimental to the health of
>>> > the project.
>>> > The way structured streaming has played out has shown that there are
>>> > significant technical blind spots (myself included).
>>> > One way to address that is to get the people who have domain knowledge
>>> > involved, and listen to them.
>>> >
>>> > - We need more committers, and more committer diversity.
>>> > Per committer there are, what, more than 20 contributors and 10 new
>>> > jira tickets a month?  It's too much.
>>> > There are people (I am _not_ referring to myself) who have been around
>>> > for years, contributed thousands of lines of code, helped educate the
>>> > public around Spark... and yet are never going to be voted in.
>>> >
>>> > - We need a clear process for managing volunteer work.
>>> > Too many tickets sit around unowned, unclosed, uncertain.
>>> > If someone proposed something and it isn't up to snuff, tell them and
>>> > close it.  It may be blunt, but it's clearer than "silent no".
>>> > If someone wants to work on something, let them own the ticket and set
>>> > a deadline. If they don't meet it, close it or reassign it.
>>> >
>>> > This is not me putting on an Apache Bureaucracy hat.  This is me
>>> > saying, as a fellow hacker and loyal dissenter, something is wrong
>>> > with the culture and process.
>>> >
>>> > Please, let's change it.
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe e-mail: [hidden email]
>>> >
>>
>>

--
Ryan Blue
Software Engineer
Netflix

Re: Odp.: Spark Improvement Proposals

Marcelo Vanzin
In reply to this post by Cody Koeninger-2
The proposal looks OK to me. I assume, even though it's not explicitly
called out, that voting would happen by e-mail? A template for the
proposal document (instead of just a bullet list) would also be nice,
but that can be done at any time.

BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
for a SIP, given the scope of the work. The document attached even
somewhat matches the proposed format. So if anyone wants to try out
the process...

On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <[hidden email]> wrote:

> Now that Spark Summit Europe is over, are any committers interested in
> moving forward with this?
>
> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>
> Or are we going to let this discussion die on the vine?
>
> On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
> <[hidden email]> wrote:
>> Maybe my mail was not clear enough.
>>
>>
>> I didn't want to write "let's focus on Flink" or any other framework. The
>> idea with benchmarks was to show two things:
>>
>> - why some people are doing bad PR for Spark
>>
>> - how - in an easy way - we can change it and show that Spark is still on
>> top
>>
>>
>> No more, no less. Benchmarks will be helpful, but I don't think they're the
>> most important thing in Spark :) On the Spark main page there is still the
>> chart "Spark vs Hadoop". It is important to show that the framework is not
>> the same Spark with another API, but much faster and optimized, comparable
>> to or even faster than other frameworks.
>>
>>
>> About real-time streaming, I think it would just be good to see it in Spark.
>> I like the current Spark model very much, but there are many voices that say
>> "we need more" - the community should also listen to them and try to help
>> them. With SIPs it would be easier; I've just posted this example as a
>> "thing that may be changed with a SIP".
>>
>>
>> I like the unification via Datasets very much, but there are a lot of
>> algorithms inside - let's make an easy API, but with a strong background
>> (articles, benchmarks, descriptions, etc.) that shows that Spark is still a
>> modern framework.
>>
>>
>> Maybe now my intention will be clearer :) As I said, organizational ideas
>> were already mentioned and I agree with them; my mail was just to show some
>> aspects from my side, that is, from the side of a developer and a person who
>> is trying to help others with Spark (via StackOverflow or other ways).
>>
>>
>> Best regards,
>>
>> Tomasz



--
Marcelo


Re: Odp.: Spark Improvement Proposals

rxin
Most things looked OK to me too, although I do plan to take a closer look after Nov 1st when we cut the release branch for 2.1.


On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <[hidden email]> wrote:
The proposal looks OK to me. I assume, even though it's not explicitly
called out, that voting would happen by e-mail? A template for the
proposal document (instead of just a bullet list) would also be nice,
but that can be done at any time.

BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
for a SIP, given the scope of the work. The document attached even
somewhat matches the proposed format. So if anyone wants to try out
the process...

On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <[hidden email]> wrote:
> Now that Spark Summit Europe is over, are any committers interested in
> moving forward with this?
>
> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>
> Or are we going to let this discussion die on the vine?