Apache Training contribution for Spark - Feedback welcome

Apache Training contribution for Spark - Feedback welcome

Lars Francke
Hi Spark community,

you may or may not have heard of a new-ish (February 2019) project at Apache: Apache Training (incubating). We aim to develop training material about various projects inside and outside the ASF: <http://training.apache.org/>

One of our users wants to contribute material on Spark[1].

We've done something similar for ZooKeeper[2] in the past and the ZooKeeper community provided excellent feedback which helped make the product much better[3].

That's why I'd like to invite everyone here to provide any kind of feedback on the content donation. It is currently in PowerPoint format which makes it a bit harder to review so we're happy to accept feedback in any form.

The idea is to convert the material to AsciiDoc at some point.

Cheers,
Lars

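(I didn't want to cross post to user@ as well but this is obviously not limited to dev@ users)

[1] <https://issues.apache.org/jira/browse/TRAINING-17>
[2] <https://issues.apache.org/jira/browse/TRAINING-13>
[3] You can see the content here <https://github.com/apache/incubator-training/blob/master/content/ZooKeeper/src/main/asciidoc/index_en.adoc>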


Re: Apache Training contribution for Spark - Feedback welcome

Sean Owen-2
Generally speaking, I think we want to encourage more training and
tutorial content out there, for sure, so, the more the merrier.

My reservation here is that as an Apache project, it might appear to
'bless' one set of materials as authoritative over all the others out
there. And there are already lots of good ones. For example, Jacek has
long maintained a very comprehensive set of free Spark training
materials at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
In comparison the slides I see proposed so far only seem like
outlines?

It's also a separate project from Spark. We might have trouble
ensuring the info is maintained and up to date, and sometimes outdated
or incorrect info is worse than none - especially if it appears quasi
official. The Spark project already maintains and updates its docs
(which can always be better), so already has its hands full there.

Personally, no strong objection here, but, what's the upside to
running this as an ASF project vs just letting people continue to
publish quality tutorials online?



On Fri, Jul 26, 2019 at 9:00 AM Lars Francke <[hidden email]> wrote:

>
> Hi Spark community,
>
> you may or may not have heard of a new-ish (February 2019) project at Apache: Apache Training (incubating). We aim to develop training material about various projects inside and outside the ASF: <http://training.apache.org/>
>
> One of our users wants to contribute material on Spark[1].
>
> We've done something similar for ZooKeeper[2] in the past and the ZooKeeper community provided excellent feedback which helped make the product much better[3].
>
> That's why I'd like to invite everyone here to provide any kind of feedback on the content donation. It is currently in PowerPoint format which makes it a bit harder to review so we're happy to accept feedback in any form.
>
> The idea is to convert the material to AsciiDoc at some point.
>
> Cheers,
> Lars
>
> (I didn't want to cross post to user@ as well but this is obviously not limited to dev@ users)
>
> [1] <https://issues.apache.org/jira/browse/TRAINING-17>
> [2] <https://issues.apache.org/jira/browse/TRAINING-13>
> [3] You can see the content here <https://github.com/apache/incubator-training/blob/master/content/ZooKeeper/src/main/asciidoc/index_en.adoc>

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Apache Training contribution for Spark - Feedback welcome

Lars Francke
Sean,

thanks for taking the time to comment.

We've discussed those issues during the proposal stage for the Incubator as others brought them up as well. I can't remember all the details but let me go through your points inline.

> My reservation here is that as an Apache project, it might appear to
> 'bless' one set of materials as authoritative over all the others out
> there.

I understand why it might be seen that way and we need to make sure to point out that we have no intention of becoming "The official Apache Spark training" because that's not our intention at all.
 
> And there are already lots of good ones. For example, Jacek has
> long maintained a very comprehensive set of free Spark training
> materials at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> In comparison the slides I see proposed so far only seem like
> outlines?

Jacek is indeed doing a fantastic job (and I'm sure others as well).

In this case, however, a company decided to donate their internal material - they didn't create this from scratch for the Apache Training project.
We want to encourage contributions and just because someone else has already created material shouldn't stop us from accepting this.

The opposite in fact: There's very little collaboration - in general - around training material.
Every company creates its own material as an asset to sell. There's very little quality open-source material out there.
I'm not sure how many companies have created Spark training courses. I wouldn't be surprised if it goes into the hundreds. And everyone draws the same or very similar slides (what's an RDD, what's a DataFrame etc.)
We hope to change that and this contribution can be a first start.

We did some research around training and especially open-source training before we started the initiative and there are some projects out there that do this but all we found were silos with a relatively narrow focus and no greater community.

Regarding your "outlines" comment: No, this is the "final" material (pending review of course). With "Training" we mean training in the sense that Cloudera, Databricks et al. sell as well where an instructor-led course is being given using slides. These slides can, but don't have to, speak for themselves. We're fine with the requirement that an experienced instructor needs to give this training. But this is just this content. We're also happy to accept other forms of content that are meant for a different way of consumption (self-serve). We don't intend to write exhaustive or authoritative documentation for projects.

It just frees people from having to do the tedious work of creating (and updating) hundreds of slides.

> It's also a separate project from Spark. We might have trouble
> ensuring the info is maintained and up to date, and sometimes outdated
> or incorrect info is worse than none - especially if it appears quasi
> official. The Spark project already maintains and updates its docs
> (which can always be better), so already has its hands full there.

Definitely. Outdated information is always a danger and I have no guarantee that this isn't going to happen here.
The fact that this is hosted and governed by the ASF makes it less likely to be completely abandoned though as there are clear processes in place for collaboration that don't depend on a single person (which might be the case with some of the other things that already exist).
We also hope that communities like Spark are interested in collaborating: patches are always welcome, and so is a Jira issue pointing out outdated information.
 
> Personally, no strong objection here, but, what's the upside to
> running this as an ASF project vs just letting people continue to
> publish quality tutorials online?

Some points come to mind. This list is not exhaustive, nor do all points apply equally to all the material that others have published:

- Clear and easy guidelines for collaboration
- Not a "bus factor" of one
- Everything is open-source with a friendly license and customizable
- We're still just getting started but because we already have four or five different contributions we can share one technology stack between all of them making it easier to collaborate ("everything looks familiar") and every piece of content benefits from improvements in the technical stack
- We hope to have non-tool focused sessions later as well (e.g. Ingesting data from Kafka into Elasticsearch using Spark [okay, this would maybe be a bit too specific for now but something along the lines of a "Data Ingestion" training]) where we can mix and match from the content we have

I'd have to dig into the original discuss threads in the incubator to find more but I hope this helps a bit?

Cheers,
Lars





Re: Apache Training contribution for Spark - Feedback welcome

Sean Owen-2
On Fri, Jul 26, 2019 at 4:01 PM Lars Francke <[hidden email]> wrote:
> I understand why it might be seen that way and we need to make sure to point out that we have no intention of becoming "The official Apache Spark training" because that's not our intention at all.

Of course that's the intention; the problem is perception, and I think
that's a real problem no matter the intention.


> In this case, however, a company decided to donate their internal material - they didn't create this from scratch for the Apache Training project.
> We want to encourage contributions and just because someone else has already created material shouldn't stop us from accepting this.

This much doesn't seem like a compelling motive. Anyone can already
donate their materials to the public domain or publish under the ALv2.
The existence of an Apache project around it doesn't do anything...
except your point below maybe:


> Every company creates its own material as an asset to sell. There's very little quality open-source material out there.

(Except the example I already gave, among many others! There's a lot
of free content)


> We did some research around training and especially open-source training before we started the initiative and there are some projects out there that do this but all we found were silos with a relatively narrow focus and no greater community.

I think your premise is that people will _collaborate_ on training
materials if there's an ASF project around it. Maybe so but see below.


> Regarding your "outlines" comment: No, this is the "final" material (pending review of course). With "Training" we mean training in the sense that Cloudera, Databricks et al. sell as well where an instructor-led course is being given using slides. These slides can, but don't have to, speak for themselves. We're fine with the requirement that an experienced instructor needs to give this training. But this is just this content. We're also happy to accept other forms of content that are meant for a different way of consumption (self-serve). We don't intend to write exhaustive or authoritative documentation for projects.

Are we talking about the content attached at TRAINING-17? It doesn't
look nearly complete or comprehensive enough to endorse as Spark
training material, IMHO. Again compare to even Jacek's site and
content for an example of what I think that would look like. It's
orders of magnitude more complete. I speak for myself, but I would not
want to endorse that as Spark training with my Apache hat.

I know the premise is, I think, these are _slides_ that trainers can
deliver, but by themselves there is not enough content for trainers to
know what to train.

What is the need this solves -- is there really demand for 'open
source' training materials? My experience is that training is by
definition professional services, and has to be delivered by people as
a for-pay business, and they need to differentiate on the quality they
provide. It's just materially different from having open standard
software.


Re: Apache Training contribution for Spark - Feedback welcome

Lars Francke
Happy to discuss this here but you're also invited to bring those points up at dev@training as other projects might have similar concerns.

The request for assistance still stands. If anyone here is interested in helping out reviewing and improving the material please reach out.


On Sat, Jul 27, 2019 at 12:01 AM Sean Owen <[hidden email]> wrote:
> On Fri, Jul 26, 2019 at 4:01 PM Lars Francke <[hidden email]> wrote:
>> I understand why it might be seen that way and we need to make sure to point out that we have no intention of becoming "The official Apache Spark training" because that's not our intention at all.
>
> Of course that's the intention; the problem is perception, and I think
> that's a real problem no matter the intention.

Agreed. But that won't stop us from accepting or publishing content. If that were a dealbreaker then we could move the Training project to the Attic now, along with Livy, Toree, Phoenix, Hivemall and probably dozens of other ASF projects which provide things on top of other ASF projects.
None of those are endorsed as "The official X for Y".
 
>> In this case, however, a company decided to donate their internal material - they didn't create this from scratch for the Apache Training project.
>> We want to encourage contributions and just because someone else has already created material shouldn't stop us from accepting this.
>
> This much doesn't seem like a compelling motive. Anyone can already
> donate their materials to the public domain or publish under the ALv2.
> The existence of an Apache project around it doesn't do anything...
> except your point below maybe:
>
>> Every company creates its own material as an asset to sell. There's very little quality open-source material out there.
>
> (Except the example I already gave, among many others! There's a lot
> of free content)

The way I read your point is that anyone can publish material (which includes source code) under the ALv2 outside of the ASF so why should they donate anything to the ASF?
If that's what you meant, why have Apache Spark or any other Apache project for that matter?

But I don't think that's what you're trying to say.
Hence I believe I must be misunderstanding you, and would ask you to rephrase or reiterate your point, please.
 
>> We did some research around training and especially open-source training before we started the initiative and there are some projects out there that do this but all we found were silos with a relatively narrow focus and no greater community.
>
> I think your premise is that people will _collaborate_ on training
> materials if there's an ASF project around it. Maybe so but see below.

That's our hope, yes. Should we not do this because it _could_ fail?
 
>> Regarding your "outlines" comment: No, this is the "final" material (pending review of course). With "Training" we mean training in the sense that Cloudera, Databricks et al. sell as well where an instructor-led course is being given using slides. These slides can, but don't have to, speak for themselves. We're fine with the requirement that an experienced instructor needs to give this training. But this is just this content. We're also happy to accept other forms of content that are meant for a different way of consumption (self-serve). We don't intend to write exhaustive or authoritative documentation for projects.
>
> Are we talking about the content attached at TRAINING-17? It doesn't
> look nearly complete or comprehensive enough to endorse as Spark
> training material, IMHO. Again compare to even Jacek's site and
> content for an example of what I think that would look like. It's
> orders of magnitude more complete. I speak for myself, but I would not
> want to endorse that as Spark training with my Apache hat.
>
> I know the premise is, I think, these are _slides_ that trainers can
> deliver, but by themselves there is not enough content for trainers to
> know what to train.

No one wants to endorse anything as "official" anything.
And yes: This material is not perfect but that's how open source works, isn't it?
This is an initial patch which can be used to collaborate and improve upon.
This is how Spark works too; otherwise it would have been perfect from version 0.1.

Again: I agree Jacek's material is more complete and we could reach out to him (assuming he reads this anyway) but the fact is that this company did so first and I want to encourage contributions.

All we're asking for here is help from the Spark community in making our content better hoping that someone is interested. If not we'll do the best we can ourselves. But this is where the experts are.
 
> What is the need this solves -- is there really demand for 'open
> source' training materials? My experience is that training is by
> definition professional services, and has to be delivered by people as
> a for-pay business, and they need to differentiate on the quality they
> provide. It's just materially different from having open standard
> software.

Yes, there is a demand and I disagree that it's materially different from having open standard software.
I have not compared Jacek's material to the one in TRAINING-17 or to my own but I'm willing to bet that there are lots and lots of redundancies.
The same concepts explained over and over in similar terms.
What's the value in that?

We - as a company - have created material and sold it for years but every time I give a training I see something that I should have updated and it's become impossible to keep up. I see the same outdated material from other organizations, we've talked to half a dozen or so training companies and they all have the same problem. To create quality training material you really need someone with deep insider knowledge, and those people are hard to come by.
So we're trying to shift and collaborate on the material and then differentiate ourselves by the trainer itself.
We'll see how that works out.

Cheers,
Lars

Re: Apache Training contribution for Spark - Feedback welcome

Sean Owen-2
TL;DR is: take the below as feedback to consider, and proceed as you
see fit. Nobody's suggesting you can't do this.

On Mon, Jul 29, 2019 at 2:58 AM Lars Francke <[hidden email]> wrote:
> The way I read your point is that anyone can publish material (which includes source code) under the ALv2 outside of the ASF so why should they donate anything to the ASF?
> If that's what you meant, why have Apache Spark or any other Apache project for that matter?
>> I think your premise is that people will _collaborate_ on training
>> materials if there's an ASF project around it. Maybe so but see below.
> That's our hope, yes. Should we not do this because it _could_ fail?

Yep this is the answer to your question. The ASF exists to facilitate
collaboration, not just host. I think the dynamics around
collaboration on open standard software vs training materials are
materially different.

> We - as a company - have created material and sold it for years but every time I give a training I see something that I should have updated and it's become impossible to keep up. I see the same outdated material from other organizations, we've talked to half a dozen or so training companies and they all have the same problem. To create quality training material you really need someone with deep insider knowledge, and those people are hard to come by.
> So we're trying to shift and collaborate on the material and then differentiate ourselves by the trainer itself.

I think this hand-waves past a lot of the concern raised here, but OK
it's an experiment.
I don't think it's 'wrong' to try to get people to collaborate on
slides, sure. It may work well. If it doesn't for reasons raised here,
well, worse things have happened.
Consider how you might mitigate possible problems:
a) what happens when another company wants to donate its Spark content?
b) can you enshrine some best practices like making sure the content
disclaims official association with the ASF? e.g. a trainer delivering
it has to note the source but make clear it's not Apache training,
etc.


Re: Apache Training contribution for Spark - Feedback welcome

Lars Francke
On Mon, Jul 29, 2019 at 2:46 PM Sean Owen <[hidden email]> wrote:
> TL;DR is: take the below as feedback to consider, and proceed as you
> see fit. Nobody's suggesting you can't do this.
>
> On Mon, Jul 29, 2019 at 2:58 AM Lars Francke <[hidden email]> wrote:
>> The way I read your point is that anyone can publish material (which includes source code) under the ALv2 outside of the ASF so why should they donate anything to the ASF?
>> If that's what you meant, why have Apache Spark or any other Apache project for that matter?
>>> I think your premise is that people will _collaborate_ on training
>>> materials if there's an ASF project around it. Maybe so but see below.
>> That's our hope, yes. Should we not do this because it _could_ fail?
>
> Yep this is the answer to your question. The ASF exists to facilitate
> collaboration, not just host. I think the dynamics around
> collaboration on open standard software vs training materials are
> materially different.

I don't see a big difference between the two things.
Content is already being collaborated on today (see documentation, websites and the few instances of training that exist or Wikipedia for that matter).
I'm afraid we'll need to agree to disagree on this one.
 
>> We - as a company - have created material and sold it for years but every time I give a training I see something that I should have updated and it's become impossible to keep up. I see the same outdated material from other organizations, we've talked to half a dozen or so training companies and they all have the same problem. To create quality training material you really need someone with deep insider knowledge, and those people are hard to come by.
>> So we're trying to shift and collaborate on the material and then differentiate ourselves by the trainer itself.
>
> I think this hand-waves past a lot of the concern raised here, but OK
> it's an experiment.
> I don't think it's 'wrong' to try to get people to collaborate on
> slides, sure. It may work well. If it doesn't for reasons raised here,
> well, worse things have happened.
> Consider how you might mitigate possible problems:
> a) what happens when another company wants to donate its Spark content?

This has been decided at the ASF level already (allow competing projects, e.g. Flink & Spark). At the Apache Training level we briefly talked about that as well. I don't want to go into details of the process but the short version is: We'd accept anything and would then try to incorporate it into existing stuff.

> b) can you enshrine some best practices like making sure the content
> disclaims official association with the ASF? e.g. a trainer delivering
> it has to note the source but make clear it's not Apache training,
> etc.

Yes.