[Discuss] Datasource v2 support for Kerberos

[Discuss] Datasource v2 support for Kerberos

tigerquoll
The current v2 datasource API provides support for querying a portion of the
Spark config namespace (spark.datasource.*) via the SessionConfigSupport API.
This was designed with the assumption that the configuration for each v2
data source should be kept separate from the others.
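
For reference, the existing hook is roughly the following (paraphrased here
as a Scala trait; the real interface lives in
org.apache.spark.sql.sources.v2):

    // A source opts in by declaring a key prefix; session configs of the
    // form spark.datasource.<keyPrefix>.<key> are then forwarded to the
    // source as plain <key> -> value options.
    trait SessionConfigSupport extends DataSourceV2 {
      def keyPrefix(): String
    }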

Unfortunately, there are some cross-cutting concerns such as authentication
that touch multiple data sources - this means that common configuration
items need to be shared amongst multiple data sources.
In particular, Kerberos setup can use the following configuration items:

* userPrincipal
* userKeytabPath
* krb5ConfPath
* kerberos debugging flags
* spark.security.credentials.${service}.enabled
* JAAS config
* ZKServerPrincipal ??
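
For context, today these end up being set individually at submit time,
along the lines of (values illustrative):

    spark-submit \
      --conf spark.yarn.keytab=/path/to/user.keytab \
      --conf spark.yarn.principal=user@REALM \
      --conf spark.security.credentials.hive.enabled=true \
      ...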

So the potential solutions I can think of for passing this information to
the various data sources are:

* Pass the entire SparkContext object to data sources (not likely)
* Pass the entire SparkConfig Map object to data sources
* Pass all required configuration via environment variables
* Extend SessionConfigSupport to support passing specific white-listed
configuration values
* Add a specific data source v2 API "SupportsKerberos" so that a data source
can indicate that it supports Kerberos, and also provide the means to pass
the needed configuration info (see the sketch after this list)
* Expand out all Kerberos configuration items to be in each data source
config namespace that needs it.
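
As a straw man for the "SupportsKerberos" option, the mix-in might look
something like this (all names hypothetical):

    // Hypothetical: a source declares that it needs Kerberos configuration,
    // and Spark injects the relevant items before the source is used.
    trait SupportsKerberos extends DataSourceV2 {
      def setKerberosConfig(
          userPrincipal: String,
          userKeytabPath: String,
          krb5ConfPath: Option[String]): Unit
    }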

If the data source requires TLS support, then we also need to support
passing all the configuration values under "spark.ssl.*".

What do people think? A placeholder issue has been filed as SPARK-25329.




Re: [Discuss] Datasource v2 support for Kerberos

cloud0fan
I'm +1 for this proposal: "Extend SessionConfigSupport to support passing specific white-listed configuration values"

One goal of the data source v2 API is to not depend on any high-level APIs like SparkSession, SQLConf, etc. If users do want to access these high-level APIs, there is a workaround: calling `SparkSession.getActiveSession` or `SQLConf.get`.
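
For example, from code running where a session is active (sketch only; "spark.yarn.keytab" is just an example key):

    import org.apache.spark.sql.SparkSession

    // only works on the driver, where an active session exists
    val keytab: Option[String] = SparkSession.getActiveSession
      .map(_.conf.get("spark.yarn.keytab", ""))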

In the meantime, I think your use case makes sense. `SessionConfigSupport` was created for this use case, but it's not powerful enough yet. I think it should support multiple key prefixes and a whitelist.
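
A rough sketch of what that extension could look like (method names hypothetical):

    // Hypothetical extension: besides the single keyPrefix, a source could
    // declare extra prefixes plus a whitelist of fully-qualified keys that
    // should also be forwarded to it as options.
    trait SessionConfigSupportV2 extends SessionConfigSupport {
      def extraKeyPrefixes(): Array[String] = Array.empty
      def whitelistedKeys(): Array[String] = Array.empty
    }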

Feel free to submit a patch, and thanks for looking into it!


Re: [Discuss] Datasource v2 support for Kerberos

Ryan Blue

I’m not a huge fan of special cases for configuration values like this. Is there something that we can do to pass a set of values to all sources (and catalogs for #21306)?

I would prefer adding a special prefix for options that are passed to all sources, like this:

spark.sql.catalog.shared.shared-property = value0
spark.sql.catalog.jdbc-prod.prop = value1
spark.datasource.source-name.prop = value2

All of the properties in the shared namespace would be passed to all catalogs and sources. What do you think?
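
In sketch form, the resolution Spark would do when building a source's options could be as simple as this (assuming a SparkConf `conf` and a source name `sourceName`; prefixes illustrative):

    // shared options are merged under every source's options, with
    // source-specific keys winning on conflict
    val shared   = conf.getAllWithPrefix("spark.sql.catalog.shared.").toMap
    val specific = conf.getAllWithPrefix(s"spark.datasource.$sourceName.").toMap
    val options  = shared ++ specific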





--
Ryan Blue
Software Engineer
Netflix

Re: [Discuss] Datasource v2 support for Kerberos

tigerquoll
I believe the current Spark config system is unfortunate in the way it has
grown - you have no way of telling which sub-systems use which
configuration options without a direct and detailed reading of the code.

Isolating config items for datasources into separate namespaces (rather
than using a whitelist) is a nice idea - unfortunately, in this case we are
dealing with configuration items that have been exposed to end-users in
their current form for a significant amount of time, and Kerberos cross-cuts
not only datasources but also things like YARN.

So given that fact, the best options for a way forward I can think of are:
1. Whitelisting specific sub-sections of the configuration space
2. Just passing in a Map[String,String] of all config values
3. Implementing a specific interface for data sources to indicate/implement
Kerberos support

Option (1) is pretty arbitrary, and more than likely the whitelist will
change from version to version as additional items get added to it. Data
sources will develop dependencies on certain configuration values being
present in the whitelist.

Option (2) would work, but it continues the practice of having a vaguely
specified grab-bag of config items as a dependency for practically all
Spark code.

I am beginning to warm to option (3): it would be a clean way of declaring
that a data source supports Kerberos, and also a cleanly specified way of
injecting the relevant Kerberos configuration information into the data
source - and we would not need to change any user-facing configuration
items either.





Re: [Discuss] Datasource v2 support for Kerberos

Ryan Blue
Dale, what do you think about the option that I suggested? I think that's different from the ones that you just listed.

Basically, the idea is to have a "shared" set of options that are passed to all sources. This would not be a whitelist; it would be a namespace that ends up passed in everywhere. That way, Kerberos options would be set in the shared space, but could also be set directly on a source if you want to override.

The problem I have with your option 1 is that it requires a whitelist, which is difficult to maintain and doesn't have obvious behavior. If a user wants to share an option, it has to be a special one. Otherwise the user has to wait until we add it to the whitelist, which is slow.

I don't think your option 2 works because that's no better than what we do today. And as you said, isolating config is a good goal.

Your option 3 is basically a whitelist, but with additional interfaces to activate the option sets to forward. I think that's a bit too intrusive and shares the problems that a whitelist has.

The option I'm proposing gets around those issues because it is obvious what is happening. Any option under the shared namespace is copied to all sources and catalogs. That doesn't require Spark to do anything to support specific sets of options and is predictable behavior for users to understand. It also allows us to maintain separation instead of passing all options. I think this is a good option overall.

What do you think?

rb




--
Ryan Blue
Software Engineer
Netflix

Re: [Discuss] Datasource v2 support for Kerberos

tigerquoll
I like the shared namespace option better than the whitelisting option for
any newly defined configuration information.

All of the Kerberos options already exist in their own legacy locations
though - changing their location could break a lot of systems.

Perhaps we can use the shared namespace option for any new options and
whitelisting for the existing ones?




Re: [Discuss] Datasource v2 support for Kerberos

cloud0fan
> All of the Kerberos options already exist in their own legacy locations
> though - changing their location could break a lot of systems.

We can define the prefix for shared options, and we can strip the prefix when passing these options to the data source. Will this work for your case?
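
i.e. something like this (sketch only; `allConf` stands for the full session config as a Map[String, String], and the prefix name is just an example):

    // options under the shared prefix are forwarded with the prefix
    // removed, so sources keep seeing the legacy key names
    val prefix = "spark.datasource.shared."
    val forwarded = allConf.collect {
      case (key, value) if key.startsWith(prefix) =>
        key.stripPrefix(prefix) -> value
    }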


Re: [Discuss] Datasource v2 support for Kerberos

tigerquoll
To give some Kerberos-specific examples, the spark-submit args:

    --conf spark.yarn.keytab=path_to_keytab
    --conf spark.yarn.principal=[hidden email]

are currently not passed through to the data sources.






Re: [Discuss] Datasource v2 support for Kerberos

Ryan Blue
I agree with Wenchen that we'd remove the prefix when passing to a source, so you could use the same "spark.yarn.keytab" option in both places. But I think the problem is that "spark.yarn.keytab" still needs to be set, and it clearly isn't in a shared namespace for catalog options. So I think we would still need a solution for existing options. I'm more comfortable with a whitelist for existing options that we want to maintain compatibility with.
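
As a sketch, the compatibility path could be a fixed set of fully-qualified legacy keys that get forwarded unchanged (contents illustrative; `allConf` stands for the full session config as a Map):

    // legacy keys forwarded to every source as-is, for compatibility;
    // newly defined options would live under the shared prefix instead
    val legacyWhitelist = Set("spark.yarn.keytab", "spark.yarn.principal")
    val legacyOptions = allConf.filter { case (k, _) => legacyWhitelist(k) }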

rb






--
Ryan Blue
Software Engineer
Netflix

Re: [Discuss] Datasource v2 support for Kerberos

Steve Loughran


> On 25 Sep 2018, at 07:52, tigerquoll <[hidden email]> wrote:
>
> To give some Kerberos-specific examples, the spark-submit args:
>
>     --conf spark.yarn.keytab=path_to_keytab
>     --conf spark.yarn.principal=[hidden email]
>
> are currently not passed through to the data sources.


I'm not sure why the data sources would need to know the Kerberos login details. I certainly wouldn't give them the keytab path (or indeed, access to it), and as for the principal, UserGroupInformation.getCurrentUser() should return that, including support for UGI.doAs() and the ability to issue calls as different users from the same process.

I'd also be reluctant to blindly pass Kerberos secrets over the network. What does matter is that code interacting with a data source, destination, filesystem, etc. should be executing in the context of the intended caller, which UGI.getCurrentUser() should reflect.
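
For example, the standard Hadoop UGI pattern (readFromDataSource() is a hypothetical stand-in for whatever the source actually does):

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation

    // execute the read in the security context of the current caller
    val ugi = UserGroupInformation.getCurrentUser
    val result = ugi.doAs(new PrivilegedExceptionAction[String] {
      override def run(): String = readFromDataSource()
    })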

What also matters is that whatever authentication information is needed to authenticate with a data source gets passed to it. That's done in the spark-submit code for YARN by asking the filesystems, Hive & HBase; I don't know about ZooKeeper there.

I think what might be good here is to enumerate what datasources are expected to need from Kerberos (JIRA? Google doc), and from any forms of service tokens, then see how that could be handled in a way which fits into the existing world of Kerberos ticket & Hadoop service token creation on submission or in the job driver, with hand-off to the workers which need them.

-Steve





Re: [Discuss] Datasource v2 support for Kerberos

tigerquoll
Hi Steve,
I think that passing a Kerberos keytab around is one of those bad ideas
that is entirely appropriate to re-question every single time you come
across it. It has been used already in Spark when interacting with Kerberos
systems that do not support delegation tokens. Any such system will
eventually stop talking to Spark once the passed Kerberos tickets expire
and are unable to be renewed.

It is one of those "best bad idea we have" type situations that has arisen,
been discussed to death, and finally, grudgingly, an interim-only solution
settled on: passing the keytab to the worker to renew Kerberos tickets. A
long-time notable offender in this area is secure Kafka. Thankfully, Kafka
delegation tokens are soon to be supported in Spark, removing the need to
pass keytabs around when interacting with Kafka.

This particular thread could probably be better renamed "Generic
Datasource v2 support for Kerberos configuration" - I would like to steer
away from discussion of alternate architectures that could handle a lack of
delegation tokens (it is a worthwhile conversation, but a long and involved
one that would distract from this narrowly defined topic), and focus just
on configuration information. A very quick look through various client code
has identified at least the following configuration information that could
potentially be of use to a datasource that uses Kerberos:

* krb5ConfPath
* kerberos debugging flags
* spark.security.credentials.${service}.enabled
* JAAS config
* ZKServerPrincipal ??

It is entirely feasible that each datasource may require its own unique
Kerberos configuration (e.g. you are pulling from an external datasource
that has a different KDC than the YARN cluster you are running on).




Re: [Discuss] Datasource v2 support for Kerberos

Steve Loughran


> On 2 Oct 2018, at 04:44, tigerquoll <[hidden email]> wrote:
>
> Hi Steve,
> I think that passing a Kerberos keytab around is one of those bad ideas
> that is entirely appropriate to re-question every single time you come
> across it. It has been used already in Spark when interacting with
> Kerberos systems that do not support delegation tokens. Any such system
> will eventually stop talking to Spark once the passed Kerberos tickets
> expire and are unable to be renewed.
>
> It is one of those "best bad idea we have" type situations that has
> arisen, been discussed to death, and finally, grudgingly, an interim-only
> solution settled on: passing the keytab to the worker to renew Kerberos
> tickets.

The keytab generally stays with the Spark AM, which pushes out tickets to the workers - I don't believe the workers themselves get to see the keytab, do they?

Gabor's illustration of this in the Kafka SPIP is probably the best I've ever seen.


> A long-time notable offender in this area is secure Kafka. Thankfully,
> Kafka delegation tokens are soon to be supported in Spark, removing the
> need to pass keytabs around when interacting with Kafka.
>
> This particular thread could probably be better renamed "Generic
> Datasource v2 support for Kerberos configuration" - I would like to steer
> away from discussion of alternate architectures that could handle a lack
> of delegation tokens (it is a worthwhile conversation, but a long and
> involved one that would distract from this narrowly defined topic), and
> focus just on configuration information. A very quick look through
> various client code has identified at least the following configuration
> information that could potentially be of use to a datasource that uses
> Kerberos:
>
> * krb5ConfPath
> * kerberos debugging flags

mmm. https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/secrets.html

FWIW, Hadoop 2.8+ has the KDiag entry point, which can also be run inside an application - though there's always the risk that going near UGI too early "collapses" the Kerberos state prematurely.
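
e.g. the basic command-line form (see the KDiag docs for the full option list):

    hadoop org.apache.hadoop.security.KDiag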


If Spark needs something like that for Hadoop 2.7.x too, copying & repackaging that class would be a place to start.


> * spark.security.credentials.${service}.enabled
> * JAAS config
> * ZKServerPrincipal ??
>
> It is entirely feasible that each datasource may require its own unique
> Kerberos configuration (e.g. you are pulling from an external datasource
> that has a different KDC than the YARN cluster you are running on).

This is a use-case I've never encountered; instead, everyone relies on cross-AD trust. That's complex enough as it is.