Quantcast

RFC: deprecate SparkStatusTracker, remove JobProgressListener

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RFC: deprecate SparkStatusTracker, remove JobProgressListener

Marcelo Vanzin
Hello all,

For those not following, I'm working on SPARK-18085, where my goal is
to decouple the storage of UI data from the actual UI implementation.
This is mostly targeted at the history server, so that it's possible
to quickly load a "database" with UI information instead of the
existing way of re-parsing event logs, but I think it also helps with
the live UI, since it doesn't require storing UI information in memory
and thus relieves some memory pressure on the driver. (I may still add
an in-memory database in that project, but that's digressing from the
topic at hand.)

One of my (unwritten) goals in that project was to get rid of
JobProgressListener. Now that I'm at a point where I can do that from
the UI's p.o.v., I ran into SparkStatusTracker. So I'd like to get
people's views on two topics.

(i) deprecate SparkStatusTracker, provide a new API based on the
public REST types.

SparkStatusTracker provides yet another way of getting job, stage and
executor information (aside from the UI and the API). It has its own
types that model those, which are based on the existing UI types but
not the same. It could be replaced by making REST calls to the UI
endpoint, but that's sub-optimal since it doesn't make a lot of sense
to do that when you already have an instance of SparkContext to play
with.

Since that's a public, stable API, it can't be removed right away. But
I'd like to propose that we deprecate it, and provide a new API that
is based on the REST types (which, with my work, are also used in the
UI). The existing "SparkStatusTracker" would still exist until we can
remove it, of course.

What do people think about this approach? Another option is to not add
the new API, but keep SparkStatusTracker around using the new UI
database to back it.

(ii) Remove JobProgressListener

I didn't notice it before, but JobProgressListener is public-ish
(@DeveloperApi). I'm not sure why that is, and it's a weird thing
because it exposes non-public types (from UIData.scala) in its API.
With the work I'm doing, and the above suggestion about
SparkStatusTracker, JobProgressListener becomes unused in Spark
itself, and keeping it would just mean the driver keeps using unneeded
memory.

Are there concerns about removing that class? Its functionality is
available in both SparkStatusTracker and the REST API, so it's mostly
redundant.


So, thoughts?


Note to self: (i) above means I'd have to scale back some of my goals
for SPARK-18085. More specifically, the code that creates the UI
database will always need to run (just like JobProgressListener always
exists now), so that SparkStatusTracker still works. Which also means
moving some of the code I was hoping to keep in a separate module into
core/.

--
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFC: deprecate SparkStatusTracker, remove JobProgressListener

Josh Rosen-2
I think that it should be safe to remove JobProgressListener but I'd like to keep the SparkStatusTracker API.

SparkStatusTracker was originally developed to provide a stable programmatic status API for use by Hive on Spark. SparkStatusTracker predated the Spark REST APIs for status tracking which is why there's some overlap of functionality between those APIs. Given that SparkStatusTracker is a longstanding stable public API I'd prefer to not remove it because there may be a lot of existing user code that depends on it. It's also a relatively easy-to-support API because it presents a clean query abstraction and doesn't expose mutable data structures via its public interface, so we should be able to support this interface with an implementation based on the new UI database.

JobProgressListener, on the other hand, has a messy interface which was not designed for use by code outside of Spark. This interface was marked as @DeveloperAPI as part of Spark 1.0 (see https://github.com/apache/spark/pull/648) but I think that decision was a mistake because the interface exposes mutable internal state. For example, if a user wanted to query completed stages using JobProgressListener then they would access a field declared as 

  val completedStages = ListBuffer[StageInfo]()

which requires the user to explicitly synchronize on the JobProgressListener instance in order to safely access this field. This is a bad API and it's not really possible to cleanly present this same interface with a database-backed implementation. In addition, this interface has not been fully stable over time and there's currently no public / DeveloperAPI mechanism to get access to the Spark-constructed instance of JobProgressListener.

Given all of this this, I think that it's unlikely that users are relying on JobProgressListener since Spark has other APIs for status tracking which are more stable and easier to work with. If anyone is relying on this then they could inline the JobProgressListener source in their own project and instantiate and register the listener themselves.

Thus I think it's fine to remove JobProgressListener but think we should keep SparkStatusTracker. I think that the decision of whether we want to make a next-generation "V2" programmatic status API based on the REST API types can happen later / independently.

On Thu, Mar 23, 2017 at 1:32 PM Marcelo Vanzin <[hidden email]> wrote:
Hello all,

For those not following, I'm working on SPARK-18085, where my goal is
to decouple the storage of UI data from the actual UI implementation.
This is mostly targeted at the history server, so that it's possible
to quickly load a "database" with UI information instead of the
existing way of re-parsing event logs, but I think it also helps with
the live UI, since it doesn't require storing UI information in memory
and thus relieves some memory pressure on the driver. (I may still add
an in-memory database in that project, but that's digressing from the
topic at hand.)

One of my (unwritten) goals in that project was to get rid of
JobProgressListener. Now that I'm at a point where I can do that from
the UI's p.o.v., I ran into SparkStatusTracker. So I'd like to get
people's views on two topics.

(i) deprecate SparkStatusTracker, provide a new API based on the
public REST types.

SparkStatusTracker provides yet another way of getting job, stage and
executor information (aside from the UI and the API). It has its own
types that model those, which are based on the existing UI types but
not the same. It could be replaced by making REST calls to the UI
endpoint, but that's sub-optimal since it doesn't make a lot of sense
to do that when you already have an instance of SparkContext to play
with.

Since that's a public, stable API, it can't be removed right away. But
I'd like to propose that we deprecate it, and provide a new API that
is based on the REST types (which, with my work, are also used in the
UI). The existing "SparkStatusTracker" would still exist until we can
remove it, of course.

What do people think about this approach? Another option is to not add
the new API, but keep SparkStatusTracker around using the new UI
database to back it.

(ii) Remove JobProgressListener

I didn't notice it before, but JobProgressListener is public-ish
(@DeveloperApi). I'm not sure why that is, and it's a weird thing
because it exposes non-public types (from UIData.scala) in its API.
With the work I'm doing, and the above suggestion about
SparkStatusTracker, JobProgressListener becomes unused in Spark
itself, and keeping it would just mean the driver keeps using unneeded
memory.

Are there concerns about removing that class? Its functionality is
available in both SparkStatusTracker and the REST API, so it's mostly
redundant.


So, thoughts?


Note to self: (i) above means I'd have to scale back some of my goals
for SPARK-18085. More specifically, the code that creates the UI
database will always need to run (just like JobProgressListener always
exists now), so that SparkStatusTracker still works. Which also means
moving some of the code I was hoping to keep in a separate module into
core/.

--
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFC: deprecate SparkStatusTracker, remove JobProgressListener

Marcelo Vanzin
On Fri, Mar 24, 2017 at 12:07 PM, Josh Rosen <[hidden email]> wrote:
> I think that it should be safe to remove JobProgressListener but I'd like to
> keep the SparkStatusTracker API.

Thanks Josh. I can work with that. My main concern would be keeping
the listener around.

Is it worth it to add a deprecated annotation to it?

--
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFC: deprecate SparkStatusTracker, remove JobProgressListener

Josh Rosen-2
Let's not deprecate SparkStatusTracker (at least not until there's a suitable replacement that's just as easy to use). Deprecating it now would leave users with an unactionable warning and that's not great for users who try to keep their build warnings clean.

If you want to be really proactive in notifying anyone who might be affected by JobProgressListener's removal we could mark JobProgressListener (and the other listener implementations made into DeveloperAPIs in https://github.com/apache/spark/pull/648) as deprecated now and mention the deprecation in the 2.2.0 release notes.

On Fri, Mar 24, 2017 at 1:05 PM Marcelo Vanzin <[hidden email]> wrote:
On Fri, Mar 24, 2017 at 12:07 PM, Josh Rosen <[hidden email]> wrote:
> I think that it should be safe to remove JobProgressListener but I'd like to
> keep the SparkStatusTracker API.

Thanks Josh. I can work with that. My main concern would be keeping
the listener around.

Is it worth it to add a deprecated annotation to it?

--
Marcelo
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFC: deprecate SparkStatusTracker, remove JobProgressListener

Marcelo Vanzin
On Fri, Mar 24, 2017 at 1:13 PM, Josh Rosen <[hidden email]> wrote:
> Let's not deprecate SparkStatusTracker (at least not until there's a
> suitable replacement that's just as easy to use). Deprecating it now would
> leave users with an unactionable warning and that's not great for users who
> try to keep their build warnings clean.

I'm not planning on changing that API right now; we can look at it
later if desired.

> If you want to be really proactive in notifying anyone who might be affected
> by JobProgressListener's removal we could mark JobProgressListener

I'll make a change for that later. It will add warnings in the Spark
build, but then we have a ton of those already...

--
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFC: deprecate SparkStatusTracker, remove JobProgressListener

Ryan Blue
In reply to this post by Marcelo Vanzin
+1 for deprecating JobProgressListener and removing it eventually. Thanks for working toward a better back-end for the UI.

For the status tracker, I'd like to see it replaced with something better before deprecating it. I've been looking at it for implementing better feedback in notebooks, and it looks sufficient for that at the moment. Is there a reason to remove it other than that it differs from the rest API?

rb

On Fri, Mar 24, 2017 at 1:05 PM, Marcelo Vanzin <[hidden email]> wrote:
On Fri, Mar 24, 2017 at 12:07 PM, Josh Rosen <[hidden email]> wrote:
> I think that it should be safe to remove JobProgressListener but I'd like to
> keep the SparkStatusTracker API.

Thanks Josh. I can work with that. My main concern would be keeping
the listener around.

Is it worth it to add a deprecated annotation to it?

--
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]




--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFC: deprecate SparkStatusTracker, remove JobProgressListener

Marcelo Vanzin
On Fri, Mar 24, 2017 at 1:18 PM, Ryan Blue <[hidden email]> wrote:
> For the status tracker, I'd like to see it replaced with something better
> before deprecating it. I've been looking at it for implementing better
> feedback in notebooks, and it looks sufficient for that at the moment. Is
> there a reason to remove it other than that it differs from the rest API?

My original thinking was to eventually replace it with a programmatic
API that exposes all the REST API information. Basically the REST API
without the need to have an HTTP client.

That still might be worth it in the future (just to avoid having sort
of duplicate APIs), but at this point I'm happy with just changing the
backend of the status tracker to not use JobProgressListener.


--
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Loading...