[SparkR] - options around setting up SparkSession / SparkContext

[SparkR] - options around setting up SparkSession / SparkContext

Vin

I need to make an R environment available in which the SparkSession/SparkContext is set up in a specific way. The user simply accesses this environment and executes his/her code. If the user code does not access any Spark functions, I do not want to create a SparkContext unnecessarily.

In Scala/Python environments, the user can't access Spark without first referencing the SparkContext / SparkSession classes. So the above (lazy and/or custom SparkSession/Context creation) is easily met by offering sparkContext/sparkSession handles to the user that are either wrappers around Spark's classes or have lazy evaluation semantics. That way the SparkSession/Context is actually set up only when the user accesses these handles, without the user needing to know the details of initializing it.

However, achieving the same doesn't appear to be so straightforward in R. From what I see, executing sparkR.session(...) sets up private variables in SparkR:::.sparkREnv (.sparkRjsc, .sparkRsession). The way the SparkR API works, a user doesn't need a handle to the Spark session as such. A call like "df <- as.DataFrame(...)" implicitly accesses the private variables in SparkR:::.sparkREnv to get at the SparkContext etc., which are expected to have been created by a prior call to sparkR.session() / sparkR.init().
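
For illustration, this is the usage pattern I mean (a minimal sketch; the appName and the built-in "faithful" data set are just placeholders):

    library(SparkR)

    # Explicit setup: starts the JVM backend, creates the SparkContext and
    # SparkSession, and stashes them in the package-private SparkR:::.sparkREnv
    sparkR.session(appName = "example")

    # No session handle is passed here; as.DataFrame() looks up the session
    # from SparkR:::.sparkREnv internally
    df <- as.DataFrame(faithful)
    head(df)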

Therefore, to inject any custom/lazy behavior into this, I don't see a way other than having my code (which sits outside of Spark) apply a delayedAssign() or a makeActiveBinding() to the SparkR:::.sparkRsession / .sparkRjsc variables. That way, when Spark code internally references them, my wrapper/lazy code gets executed to do whatever I need done.
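
Roughly along these lines (a sketch of the idea only; the config value is a placeholder, and this is exactly where I hit the limitations described below, e.g. sparkR.session() itself assigns into .sparkREnv):

    # Lazily create the session the first time SparkR's internals look up
    # ".sparkRsession" in the package-private environment.
    delayedAssign(
      ".sparkRsession",
      {
        # user-specific/custom initialization goes here
        sparkR.session(sparkConfig = list(spark.executor.memory = "2g"))
      },
      assign.env = SparkR:::.sparkREnv
    )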

However, I am seeing some limitations in applying even this approach to SparkR - it will not work unless some minor changes are made in the SparkR code. But before I open a PR with those changes, I wanted to check whether there is a better way to achieve this. I am far from an R expert and could be missing something here.

If you'd rather see this in a JIRA and a PR, let me know and I'll go ahead and open one.

Regards,
Vin.



Re: [SparkR] - options around setting up SparkSession / SparkContext

Felix Cheung
How would you handle this in Scala?

If you are adding a wrapper func like getSparkSession for Scala, and have your users call it, can't you do the same in SparkR? After all, while it's true that you don't need a SparkSession object to call the R API, someone still needs to call sparkR.session() to initialize the current session, right?
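
Something like this, for instance (just a sketch; the function name and settings are placeholders, and sparkR.session() returns the existing session if one has already been created):

    # Hypothetical wrapper pre-loaded into the users' environment
    getSparkSession <- function() {
      sparkR.session(
        appName = "notebook",
        sparkConfig = list(spark.executor.memory = "2g")  # custom setup goes here
      )
    }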

Also, what Spark environment settings do you want to customize?

Can these be set via environment variables or in spark-defaults.conf? See spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties
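
For example (property names and values are illustrative only), statically:

    # spark-defaults.conf
    spark.executor.memory            2g
    spark.dynamicAllocation.enabled  true

or dynamically, per session, from R:

    sparkR.session(sparkConfig = list(spark.executor.memory = "2g",
                                      spark.dynamicAllocation.enabled = "true"))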



Re: [SparkR] - options around setting up SparkSession / SparkContext

Vin
This is for a notebook env that has the Spark session/context bootstrapped for the user. Some settings are user-specific, so not all of them can go into spark-defaults.conf; such settings need to be applied dynamically when creating the session/context.

In Scala/Python, I would bootstrap a "spark" handle similar to what the spark-shell / pyspark startup scripts do. In my case the bootstrapped object could be of a wrapper class that takes care of whatever customization I need while exposing the regular SparkSession Scala/Python API. The user uses this object as he/she would use a regular SparkSession to submit work to the Spark cluster. Since I am certain there is no other way for users to perform Spark work except via the bootstrapped object, I can achieve my objective of delaying creation of the SparkSession/Context until a call comes to my custom spark object.

If I want to do the same in R, and let users write SparkR code as they normally would while bootstrapping a SparkContext/Session for them, then I hit the issues I explained earlier. There is no single entry point for the SparkContext/Session in the SparkR API, so to achieve lazy creation of the SparkContext/Session it looks like the only option is to do some trickery with the SparkR:::.sparkREnv$.sparkRjsc and SparkR:::.sparkREnv$.sparkRsession variables.

Regards,
Vin.


Re: [SparkR] - options around setting up SparkSession / SparkContext

Felix Cheung
This seems somewhat unique. Most notebook environments that I know of have a preset processing engine tied to the notebook; in other words, when Spark is selected as the engine it is always initialized, not lazily as you describe.

What notebook platform are you using?


Re: [SparkR] - options around setting up SparkSession / SparkContext

Vin
This is a Jupyter-based environment where we would like to put off binding a Spark session/context to the notebook until it is needed. In a YARN cluster, simply bootstrapping the Spark context/session requires a couple of containers to be allocated, which is wasteful unless the user actually performs the (optional) Spark processing.

I opened JIRA https://issues.apache.org/jira/browse/SPARK-20440 and attached PR 17731 to it, as I think it conveys both the problem and the solution better.

Regards,
Vin.
