Request to document the direct relationship between other configurations


Request to document the direct relationship between other configurations

Hyukjin Kwon
Hi all,

I happened to review some PRs and noticed that some configurations are missing necessary information.

To be explicit, I would like to make sure we document the direct relationships between configurations. For example, `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled` only takes effect when `spark.sql.adaptive.enabled` is enabled, and that is clearly documented. In general we are good about this in Apache Spark; see `spark.task.reaper.enabled`, `spark.dynamicAllocation.enabled`, `spark.sql.parquet.filterPushdown`, etc.
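To illustrate the kind of relationship being discussed, here is a minimal sketch (plain Python, not Spark's actual internals; the function name is hypothetical) of a sub-configuration that is gated by a parent `.enabled` flag:

```python
# Hypothetical sketch: a parent ".enabled" flag gates whether a
# sub-configuration has any effect at all.

def reduce_post_shuffle_partitions(conf: dict) -> bool:
    """Return True only when both the parent feature flag and the
    sub-configuration are enabled, mirroring the documented relationship."""
    adaptive_on = conf.get("spark.sql.adaptive.enabled", "false") == "true"
    sub_on = conf.get(
        "spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled",
        "false") == "true"
    # The sub-configuration is silently ignored when the parent is off.
    return adaptive_on and sub_on

# Setting the sub-flag alone has no effect:
conf = {"spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled": "true"}
assert reduce_post_shuffle_partitions(conf) is False
conf["spark.sql.adaptive.enabled"] = "true"
assert reduce_post_shuffle_partitions(conf) is True
```

Without the relationship spelled out in the docs, a user in the first situation above has no way to tell why the setting appears to do nothing.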

However, I noticed that this information is missing in some components, for example `spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and `spark.history.ui.acls.*`.

I hope we all start to document such information. Users cannot infer these relationships on their own, and I myself had to read the code to confirm them while reviewing.
I don't plan to make this an official guideline yet, because it seems like a fairly obvious request to me; however, let me know if you have different opinions.

Thanks.



Re: Request to document the direct relationship between other configurations

Hyukjin Kwon
> I don't plan to document this officially yet
Just to prevent confusion: I meant that I don't yet plan to add, as a code/review guideline at https://spark.apache.org/contributing.html, that we should document these relationships between configurations.





Re: Request to document the direct relationship between other configurations

Jungtaek Lim-2
I'm sorry if I'm missing something, but this would ideally have been started as a [DISCUSS] thread, since I haven't seen any reference establishing consensus on this practice.

To me, there are simply two different practices co-existing in the codebase, which means this is closer to an individual preference (with an implicit acknowledgement that others have different preferences), or it just hasn't been discussed thoroughly.

Personally, I don't think anyone would fail to understand that setting `.enabled` to `false` disables the functionality and effectively disables all of its sub-configurations. E.g., when `spark.sql.adaptive.enabled` is `false`, all the `spark.sql.adaptive.*` configurations are implicitly no-ops. To me this is quite intuitive, and one of the major benefits of structured configurations.
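The implicit convention argued for here can be sketched as a toy helper (hypothetical; Spark does not implement gating generically like this): a key is effective only if no ancestor prefix has its `.enabled` flag set to `false`.

```python
def is_effective(conf: dict, key: str) -> bool:
    """Toy model of the 'structured configuration' convention: a key is
    effective only if every ancestor prefix with an `.enabled` flag is on.
    (Hypothetical helper, for illustration only.)"""
    parts = key.split(".")
    for i in range(2, len(parts)):
        parent = ".".join(parts[:i]) + ".enabled"
        # Skip the key itself; only ancestor flags gate it.
        if parent != key and conf.get(parent) == "false":
            return False
    return True

conf = {"spark.sql.adaptive.enabled": "false",
        "spark.sql.adaptive.coalescePartitions.minPartitionNum": "4"}
# Everything under spark.sql.adaptive.* is implicitly a no-op:
assert not is_effective(conf, "spark.sql.adaptive.coalescePartitions.minPartitionNum")
```

Under this reading, the name alone carries the relationship, which is the crux of the disagreement in this thread.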

If we want to make it explicit, "all" sub-configurations would need a redundant passage in their docs, and even more redundancy when the condition is nested. I agree this would be a good step toward "be kind", but it is less pragmatic.

I'd be happy to follow whatever consensus we reach in this thread. I'd appreciate more voices.

Thanks,
Jungtaek Lim (HeartSaVioR)





Re: Request to document the direct relationship between other configurations

Hyukjin Kwon
Sure, adding "[DISCUSS]" would have been good practice for labelling it. I should have done so, even though it is arguably "redundant" :-) since anyone can give feedback on any thread on the Spark dev mailing list and discuss it.

This pattern is actually more prevalent than it may seem, given my rough reading of the configuration files. I would like to treat this missing relationship information as a bad pattern that started from a personal preference.

> Personally I'd rather not think someone won't understand setting `.enabled` to `false` means the functionality is disabled and effectively it disables all sub-configurations.
> E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations for `spark.sql.adaptive.*` are implicitly no-op. For me this is pretty intuitive and the one of major
> benefits of the structured configurations.

I don't think it is a good idea to assume users know such context. One might think `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can partially enable the feature. It is better to document it explicitly, since for some configurations it is difficult for users to even confirm whether they are working.
For instance, one might think that setting `spark.eventLog.rolling.maxFileSize` automatically enables rolling, and only realise the log is not rolling later, after the file grows large.
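The event-log trap described above can be sketched as follows (a hypothetical model of the behaviour, not Spark code; the point is that the size setting tunes rolling but a separate flag enables it):

```python
# Sketch: `spark.eventLog.rolling.maxFileSize` tunes rolling but does not
# enable it; a separate `spark.eventLog.rolling.enabled` flag controls it.

def should_roll(conf: dict, current_size_bytes: int) -> bool:
    if conf.get("spark.eventLog.rolling.enabled", "false") != "true":
        return False  # maxFileSize is silently ignored
    max_size = int(conf.get("spark.eventLog.rolling.maxFileSize",
                            str(128 * 1024 * 1024)))
    return current_size_bytes >= max_size

# The user sets only maxFileSize and expects rolling; it never happens:
conf = {"spark.eventLog.rolling.maxFileSize": str(10 * 1024 * 1024)}
assert should_roll(conf, 11 * 1024 * 1024) is False
conf["spark.eventLog.rolling.enabled"] = "true"
assert should_roll(conf, 11 * 1024 * 1024) is True
```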





Re: Request to document the direct relationship between other configurations

Jungtaek Lim-2
I'm looking into the case of `spark.dynamicAllocation`, and it seems to support my view.

I don't disagree with adding "This requires spark.shuffle.service.enabled to be set." to the description of `spark.dynamicAllocation.enabled`. That cannot be inferred implicitly, so it is better to have it.

Here is why I favor structured configuration and implicit effect over describing everything explicitly:

1. There are 10 configurations (if the doc doesn't miss any) besides `spark.dynamicAllocation.enabled`, and only 4 of them are referenced in the description of `spark.dynamicAllocation.enabled` - the majority of the config keys are missing.
2. I think it's intentional, but the table starts with `spark.dynamicAllocation.enabled`, which conveys implicitly yet intuitively that if you disable it, nothing under dynamic allocation will work. The missing references to the other config keys don't make it hard to understand.
3. Even `spark.dynamicAllocation` has a bad case - see `spark.dynamicAllocation.shuffleTracking.enabled` and `spark.dynamicAllocation.shuffleTimeout`. The latter doesn't respect the structure of the configuration, which I think is worse than not mentioning the relationship in the description. Suppose the name had been `spark.dynamicAllocation.shuffleTracking.timeout` - wouldn't it be intuitive that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false` effectively disables `spark.dynamicAllocation.shuffleTracking.timeout`?
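Point 3 can be made concrete with a toy prefix rule (a hypothetical helper, not how Spark resolves configs): disabling a `...enabled` flag only implies disabling the keys under that flag's prefix, so a mis-structured name escapes the convention.

```python
# Toy rule: disabling `<prefix>.enabled` implicitly disables keys
# starting with `<prefix>.` -- and nothing else.

def implied_disabled(disabled_flag: str, key: str) -> bool:
    prefix = disabled_flag.rsplit(".enabled", 1)[0] + "."
    return key.startswith(prefix)

flag = "spark.dynamicAllocation.shuffleTracking.enabled"
# The actual name does not sit under the prefix, so the rule misses it:
assert not implied_disabled(flag, "spark.dynamicAllocation.shuffleTimeout")
# The hypothetical structured name would be covered automatically:
assert implied_disabled(flag, "spark.dynamicAllocation.shuffleTracking.timeout")
```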

By the way - maybe off-topic - `spark.dynamicAllocation` has another practical issue: whether to duplicate descriptions between the configuration code and the docs. I have been asked to add descriptions in the configuration code regardless, while the existing codebase doesn't have them, and this is a widely-used configuration.





Re: Request to document the direct relationship between other configurations

Hyukjin Kwon
To do that, we would have to explicitly document the structured-configuration convention and its implicit effect, which is currently missing.
I would be more than happy to document such implied relationships once we are sure that all configurations are structured coherently and correctly.
Until then, I think it is more practical to simply document the relationship for now.

> Btw, maybe off-topic, `spark.dynamicAllocation` is having another issue on practice - whether to duplicate description between configuration code and doc. I have been asked to add description on configuration code regardlessly, and existing codebase doesn't. This configuration is widely-used one.
This is actually something we should fix too. For the SQL configurations we no longer have such duplication as of https://github.com/apache/spark/pull/27459, which generates the documentation from the configuration code. We should do the same for the other configurations.
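The "generate the docs from the config code" approach mentioned above can be sketched roughly as follows (the real generator for SQL configs lives in the Spark repo; the class and field names here are illustrative only, and the default value shown is a placeholder):

```python
# Rough sketch: keep the description in one place (the config definition)
# and render the user-facing doc table from it.

from dataclasses import dataclass

@dataclass
class ConfigEntry:
    key: str
    default: str
    doc: str

ENTRIES = [
    ConfigEntry("spark.sql.adaptive.enabled", "false",
                "When true, enable adaptive query execution."),
]

def to_markdown_table(entries) -> str:
    lines = ["| Key | Default | Meaning |", "| --- | --- | --- |"]
    for e in entries:
        lines.append(f"| {e.key} | {e.default} | {e.doc} |")
    return "\n".join(lines)

print(to_markdown_table(ENTRIES))
```

With a single source of truth like this, the dependency note only has to be written once, in the config definition, to show up everywhere.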





Re: Request to document the direct relationship between other configurations

Hyukjin Kwon
Also, I would like to hear other people's thoughts here. Could I ask what you all think about this in general?




Re: Request to document the direct relationship between other configurations

cloud0fan
In general I think it's better to have more detailed documentation, but we don't have to force everyone to do it when the config name is structured. I would +1 documenting the relationship when it can't be inferred from the config names, e.g. `spark.shuffle.service.enabled` and `spark.dynamicAllocation.enabled`.
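The proposed rule of thumb amounts to a simple check (hypothetical helper, for illustration): if two related configs share a structured feature prefix, the names carry the relationship; if they don't, it should be spelled out in the docs.

```python
def same_feature_prefix(a: str, b: str) -> bool:
    """Heuristic for the suggestion above: compare the first two
    dot-separated components of the config keys. (Hypothetical helper.)"""
    return a.split(".")[:2] == b.split(".")[:2]

# Cross-feature dependency: not inferable from the names, so document it.
assert not same_feature_prefix("spark.dynamicAllocation.enabled",
                               "spark.shuffle.service.enabled")
```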

On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <[hidden email]> wrote:
Also, I would like to hear other people' thoughts on here. Could I ask what you guys think about this in general?

2020년 2월 12일 (수) 오후 12:02, Hyukjin Kwon <[hidden email]>님이 작성:
To do that, we should explicitly document such structured configuration and implicit effect, which is currently missing.
I would be more than happy if we document such implied relationship, and if we are very sure all configurations are structured correctly coherently.
Until that point, I think it might be more practical to simply document it for now.

> Btw, maybe off-topic, `spark.dynamicAllocation` is having another issue on practice - whether to duplicate description between configuration code and doc. I have been asked to add description on configuration code regardlessly, and existing codebase doesn't. This configuration is widely-used one.
This is actually something we should fix too. in SQL configuration, now we don't have such duplications as of https://github.com/apache/spark/pull/27459 as it generates. We should do it in other configurations.


2020년 2월 12일 (수) 오전 11:47, Jungtaek Lim <[hidden email]>님이 작성:
I'm looking into the case of `spark.dynamicAllocation` and this seems to be the thing to support my voice.


I don't disagree with adding "This requires spark.shuffle.service.enabled to be set." in the description of `spark.dynamicAllocation.enabled`. This cannot be inferred implicitly, hence it should be better to have it.

Why I'm in favor of structured configuration & implicit effect over describing everything explicitly is there. 

1. There are 10 configurations besides `spark.dynamicAllocation.enabled` (assuming the doc doesn't miss any), and only 4 of them are referenced in the description of `spark.dynamicAllocation.enabled` - the majority of config keys are missing.
2. I think it's intentional that the table starts with `spark.dynamicAllocation.enabled`, which conveys implicitly but intuitively that if you disable it, nothing in dynamic allocation will work. Missing the majority of config key references doesn't make it hard to understand.
3. Even `spark.dynamicAllocation` has a bad case - see `spark.dynamicAllocation.shuffleTracking.enabled` and `spark.dynamicAllocation.shuffleTimeout`. The latter does not respect the structure of the configuration, which I think is worse than not mentioning the relationship in the description. Suppose the name had been `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false` would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?

Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in practice - whether to duplicate descriptions between configuration code and docs. I have been asked to add descriptions in configuration code regardless, while the existing codebase doesn't. This is a widely-used configuration.


On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <[hidden email]> wrote:
Sure, adding "[DISCUSS]" is a good practice to label it. I should have done that, although it might be "redundant" :-) since anyone can give feedback to, and discuss, any thread on the Spark dev mailing list.

Documenting these relationships is actually the more prevalent practice, given my rough reading of the configuration files. I would like to regard the missing relationships as a bad pattern that started from personal preference.

> Personally, I doubt anyone would fail to understand that setting `.enabled` to `false` means the functionality is disabled, effectively disabling all sub-configurations.
> E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations under `spark.sql.adaptive.*` are implicitly no-ops. For me this is pretty intuitive, and one of the major
> benefits of structured configurations.

I don't think it is a good idea to assume that users know such context. One might think `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
partially enable the feature. It is better to document it explicitly, since for some configurations it is difficult for users to even confirm whether they are working.
For instance, one might think setting 'spark.eventLog.rolling.maxFileSize' automatically enables rolling, and only realise later, after the file
grows bigger, that the log is not rolling.
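
As a concrete illustration of this pitfall, here is a `spark-defaults.conf` sketch (assuming the Spark 3.0 event-log rolling configs):

```properties
# spark-defaults.conf
spark.eventLog.enabled                true
# Setting only the size threshold does NOT turn rolling on; this line alone is a no-op:
spark.eventLog.rolling.maxFileSize    128m
# The gating flag must be enabled explicitly for the threshold to take effect:
spark.eventLog.rolling.enabled        true
```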


2020년 2월 12일 (수) 오전 10:47, Jungtaek Lim <[hidden email]>님이 작성:
I'm sorry if I'm missing something, but ideally this would have been started as a [DISCUSS] thread, as I haven't seen any reference establishing consensus on this practice.

For me, there are simply two different practices co-existing in the codebase, which means this is closer to individual preference (with an implicit agreement that others may have different preferences), or it hasn't been discussed thoroughly.

Personally, I doubt anyone would fail to understand that setting `.enabled` to `false` means the functionality is disabled, effectively disabling all sub-configurations. E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations under `spark.sql.adaptive.*` are implicitly no-ops. For me this is pretty intuitive, and one of the major benefits of structured configurations.
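
The implicit-gate convention described above can be sketched as a toy resolver (purely illustrative - Spark has no such generic mechanism; each gate check is hand-written in the code):

```python
# Toy model of "a parent `.enabled` flag gates every config under its prefix".
# Purely illustrative; NOT how Spark resolves configurations internally.
def is_effective(conf: dict, key: str) -> bool:
    """A config takes effect only if no ancestor-prefix `.enabled` gate is off."""
    parts = key.split(".")
    for i in range(2, len(parts)):  # start at depth 2, e.g. "spark.sql"
        gate = ".".join(parts[:i]) + ".enabled"
        if gate != key and conf.get(gate) == "false":
            return False  # an ancestor gate is explicitly disabled
    return True

conf = {
    "spark.sql.adaptive.enabled": "false",
    "spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled": "true",
}
# The sub-config is a no-op because its parent gate is off:
print(is_effective(conf, "spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled"))  # prints False
```

When names respect the structure, a reader can apply this rule by eye; the disagreement in this thread is whether that inference should also be spelled out in the docs.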

If we want to make it explicit, "all" sub-configurations would need a redundant part of the doc - even more redundant if the condition is nested. I agree this is a good step toward "be kind", but it is less pragmatic.

I'd be happy to follow the consensus we would make in this thread. Appreciate more voices.

Thanks,
Jungtaek Lim (HeartSaVioR)


On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <[hidden email]> wrote:
> I don't plan to document this officially yet
Just to prevent confusion, I meant that I don't yet plan to document, as a code/review guideline in https://spark.apache.org/contributing.html, the expectation that we describe the relationships between configurations.


2020년 2월 12일 (수) 오전 9:57, Hyukjin Kwon <[hidden email]>님이 작성:
Hi all,

I happened to review some PRs and noticed that some configurations are missing necessary
information.

To be explicit, I would like to make sure we document the direct relationships between configurations
in the documentation. For example, `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
can only be enabled when `spark.sql.adaptive.enabled` is enabled, and that's clearly documented.
In general we're good, since we do document these relationships in Apache Spark.
See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled', 'spark.sql.parquet.filterPushdown', etc.

However, I noticed a pattern where such information is missing in some components, for example,
`spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and `spark.history.ui.acls.*`

I hope we all start to document such information. Users can't logically deduce these relationships, and I myself
had to read the code to confirm them when reviewing.
I don't plan to document this officially yet, because it looks like a pretty reasonable request to me; however,
let me know if you have different opinions.

Thanks.



Re: Request to document the direct relationship between other configurations

Hyukjin Kwon
Yeah, that's one of the reasons why I don't want to document this in the guide yet.

I would like to make sure the devs are on the same page that documenting these relationships is the better practice. I don't intend to force it as a hard requirement; however, if it's pointed out in a review, it's better to address it.


On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <[hidden email]> wrote:


Re: Request to document the direct relationship between other configurations

Jules Damji-2
All are valid and valuable observations to put into practice:

* structured and meaningful config names
* explanatory text or a succinct description
* easily accessible or searchable

While these are aspirational, they are gradually doable if we make them part of the dev and review cycle. Often meaningful config names, like security ones, come as an afterthought.

At the AMA at the Amsterdam Spark Summit last year, a few developers lamented the difficulty of finding Spark configs - what they do; what they are used for; when and why; and what their default values are.

Though one can fetch them programmatically, one still has to know which specific config one is looking for.

Cheers
Jules

Sent from my iPhone
Pardon the dumb thumb typos :)

On Feb 12, 2020, at 5:19 AM, Hyukjin Kwon <[hidden email]> wrote:





Re: Request to document the direct relationship between other configurations

Dongjoon Hyun-2
Thank you for raising the issue, Hyukjin.

Judging from the current state of the discussion, it seems we can agree on updating the non-structured configurations and keeping the structured configurations as-is.

I'm +1 for revisiting the configurations if that is our direction. If there are mismatches in the structured configurations, we can fix them together.

Bests,
Dongjoon.

On Wed, Feb 12, 2020 at 8:08 AM Jules Damji <[hidden email]> wrote:


Re: Request to document the direct relationship between other configurations

Hyukjin Kwon
Yes, revisiting the configurations to make them structured, with cleanly deduplicated documentation, is probably our final goal. +1.

One point I would like to add, though: we should keep adding such information to the documentation until we actually reach that final goal,
since it will probably take a while to revisit/fix everything and make it fully structured with the documentation.
If we go more conservatively, we can at least add such information to new configurations from now on, keeping the existing configurations as they are.


On Thu, 13 Feb 2020, 06:12 Dongjoon Hyun, <[hidden email]> wrote:
Thank you for raising the issue, Hyukjin.

According to the current status of discussion, it seems that we are able to agree on updating the non-structured configurations and keeping the structured configuration AS-IS.

I'm +1 for the revisiting the configurations if that is our direction. If there is some mismatch in structured configurations, we may fix them together.

Bests,
Dongjoon.

On Wed, Feb 12, 2020 at 8:08 AM Jules Damji <[hidden email]> wrote:
All are valid and valuable observations to put into practice: 

* structured and meaningful config names 
* explainable text or succinct description
* easily accessible or searchable 

While these are aspirational, they are gradually doable if we make them part of the dev and review cycle. Often meaningful config names, like those for security, come as an afterthought.

At the AMA at the Spark Summit in Amsterdam last year, a few developers lamented how hard it is to find Spark configs: what they do, what they are used for, when and why, and what their default values are.

Though one can fetch them programmatically, one still has to know which specific config one is looking for.

Cheers 
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

On Feb 12, 2020, at 5:19 AM, Hyukjin Kwon <[hidden email]> wrote:


Yeah, that's one of the reasons why I don't want to document this in the guide yet.

I would like to make sure the dev people are on the same page that documenting this is the better practice. I don't intend to force it as a hard requirement; however, if it's pointed out in review, it would be better to address it.


On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <[hidden email]> wrote:
In general I think it's better to have more detailed documents, but we don't have to force everyone to do it if the config name is structured. I would +1 to documenting the relationship if we can't tell it from the config names, e.g. spark.shuffle.service.enabled and spark.dynamicAllocation.enabled.
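To make the cross-hierarchy dependency mentioned here concrete: the two key names share no common prefix, so the relationship cannot be read off the names and has to live in the docs. A minimal spark-defaults.conf sketch (the keys are real; enabling both together is the classic working setup):

```
# Neither key's name hints that it depends on the other; only the
# documentation (or the source) can tell you to set both.
spark.shuffle.service.enabled     true
spark.dynamicAllocation.enabled   true
```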

On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <[hidden email]> wrote:
Also, I would like to hear other people's thoughts here. Could I ask what you all think about this in general?

On Wed, Feb 12, 2020 at 12:02 PM, Hyukjin Kwon <[hidden email]> wrote:
To do that, we should explicitly document the structured configurations and their implicit effects, which is currently missing.
I would be more than happy to rely on such implied relationships, if we were very sure all configurations are structured correctly and coherently.
Until that point, I think it might be more practical to simply document the relationships for now.

> Btw, maybe off-topic, `spark.dynamicAllocation` is having another issue on practice - whether to duplicate description between configuration code and doc. I have been asked to add description on configuration code regardlessly, and existing codebase doesn't. This configuration is widely-used one.
This is actually something we should fix too. In the SQL configuration, we no longer have such duplication as of https://github.com/apache/spark/pull/27459, since the documentation is generated from the code. We should do the same for the other configurations.
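A minimal sketch of the generate-docs-from-code idea, under the assumption that the generator simply walks the registered entries (all names below are illustrative, not the actual code from the linked PR): the description lives only in code, and the doc table is rendered from it, so code and documentation cannot drift apart.

```python
# Hypothetical sketch: a tiny config registry plus a Markdown renderer.
# The real Spark generator emits the documentation page instead.
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfEntry:
    key: str
    default: str
    doc: str


ENTRIES = [
    ConfEntry("spark.sql.adaptive.enabled", "false",
              "When true, enable adaptive query execution."),
    ConfEntry("spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled",
              "true",
              "When true and spark.sql.adaptive.enabled is enabled, "
              "reduce the number of post-shuffle partitions."),
]


def to_markdown_table(entries):
    # Header rows first, then one row per registered entry.
    rows = ["| Property Name | Default | Meaning |", "| --- | --- | --- |"]
    rows += [f"| {e.key} | {e.default} | {e.doc} |" for e in entries]
    return "\n".join(rows)


print(to_markdown_table(ENTRIES))
```

Because each description is written exactly once (in the registry), a reviewer only has to check one place for the dependency wording.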


On Wed, Feb 12, 2020 at 11:47 AM, Jungtaek Lim <[hidden email]> wrote:
I'm looking into the case of `spark.dynamicAllocation`, and it seems to support my point.


I don't disagree with adding "This requires spark.shuffle.service.enabled to be set." to the description of `spark.dynamicAllocation.enabled`. This cannot be inferred implicitly, hence it is better to have it.

Here is why I'm in favor of structured configurations with implicit effects over describing everything explicitly:

1. There are 10 configurations (if the doc doesn't miss any) besides `spark.dynamicAllocation.enabled`, and only 4 of them are referred to in the description of `spark.dynamicAllocation.enabled`; the majority of config keys are missing.
2. I think it's intentional, but the table starts with `spark.dynamicAllocation.enabled`, which implicitly but intuitively says that if you disable it, nothing in dynamic allocation will work. The missing references to the other config keys don't make it hard to understand.
3. Even `spark.dynamicAllocation` has a bad case: see `spark.dynamicAllocation.shuffleTracking.enabled` and `spark.dynamicAllocation.shuffleTimeout`. The latter does not respect the structure of the configuration, and I think this is worse than not mentioning the relationship in the description. Suppose the name had been `spark.dynamicAllocation.shuffleTracking.timeout`: isn't it intuitive that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false` would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?

Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in practice: whether to duplicate the description between the configuration code and the docs. I have been asked to add descriptions to the configuration code regardless, while the existing codebase doesn't have them, and this is a widely-used configuration.


On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <[hidden email]> wrote:
Sure, adding "[DISCUSS]" is a good practice to label it. I should have done it, although it might be "redundant" :-) since anyone can give feedback to, and discuss, any thread on the Spark dev mailing list.

Documenting these relationships is actually the more prevalent style, given my rough reading of the configuration files. I would like to treat the missing relationships as a bad pattern, even if that view started from a personal preference.

> Personally I'd rather not think someone won't understand setting `.enabled` to `false` means the functionality is disabled and effectively it disables all sub-configurations.
> E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations for `spark.sql.adaptive.*` are implicitly no-op. For me this is pretty intuitive and the one of major
> benefits of the structured configurations.

I don't think it's a good idea to assume users know such context. One might think `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
partially enable the feature. It is better to be explicit in the documentation, since for some configurations it is even difficult for users to confirm whether they are taking effect.
For instance, one might think setting 'spark.eventLog.rolling.maxFileSize' automatically enables rolling, and only realise the log is not rolling later,
after the file has grown.
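The event-log example above, as a minimal spark-defaults.conf sketch (the key names are the real ones; the values are only an example):

```
# Without the second line, maxFileSize is silently a no-op:
# setting a sub-key does not enable the parent feature.
spark.eventLog.enabled                true
spark.eventLog.rolling.enabled        true
spark.eventLog.rolling.maxFileSize    128m
```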


On Wed, Feb 12, 2020 at 10:47 AM, Jungtaek Lim <[hidden email]> wrote:
I'm sorry if I'm missing something, but ideally this would have been started as a [DISCUSS] thread, as I haven't seen any reference to a consensus on this practice.

For me, it's just that there are two different practices co-existing in the codebase, meaning it's closer to individual preference (with implicit agreement that others have different preferences), or it hasn't been discussed thoroughly.

Personally, I'd rather not assume someone won't understand that setting `.enabled` to `false` means the functionality is disabled and effectively disables all sub-configurations. E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations under `spark.sql.adaptive.*` are implicitly no-ops. For me this is pretty intuitive, and one of the major benefits of structured configurations.

If we want to make it explicit, "all" sub-configurations need a redundant part in their docs, and more redundancy if the condition is nested. I agree this is a good step toward "be kind", but it is less pragmatic.

I'd be happy to follow whatever consensus we reach in this thread. I'd appreciate more voices.

Thanks,
Jungtaek Lim (HeartSaVioR)





Re: Request to document the direct relationship between other configurations

Hyukjin Kwon
Adding this information is already the more prevalent style at this moment, and it is usual to follow the prevailing side unless there is a specific reason not to.
If there is confusion about this, I will explicitly add it to the guide (https://spark.apache.org/contributing.html). Let me know if this is still confusing or if you disagree.




Re: Request to document the direct relationship between other configurations

Jungtaek Lim-2
I tend to agree that there should come a time when we make things consistent (and I'm very happy to see the new discussion thread), and that we may want to adopt some practice in the interim.

But to me it is not clear what that interim practice is. I pointed out problems with the existing style, and if we all agree the problems are valid, then we may need to fix them before we start using the style widely.

For me the major question is "where" to put it; there are at least two different approaches:

1) Put it in the description of `.enabled` and describe the range of impact, e.g. the description of `spark.A.enabled` says it will affect the configurations under `spark.A`.
(I don't think it's good to enumerate all of the affected configs; `spark.dynamicAllocation.enabled` shows how fragile that is.)

2) Put it in the description of every affected configuration, describing which configuration is the prerequisite for it to take effect, e.g. put it on all configurations under `spark.dynamicAllocation`, mentioning that `spark.dynamicAllocation.enabled` must be enabled for them to take effect.

(I intentionally skipped mentioning "cross references". Let's be pragmatic.)

2) in turn has two ways of describing it:

2-1) Mention it simply, like "When dynamic allocation is enabled, ...", without pointing out the key to toggle. This hugely simplifies the description, though end users may have to second-guess to find the key to toggle. (It will be intuitive once we restructure the configurations.)

2-2) Mention the key to toggle directly, like "This is effective only if spark.A.enabled is set to true.". This gets longer depending on how long the configuration key is. Personally I'd find it verbose unless the key to toggle is somewhere we can't infer, but others may have different opinions.
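To make the two variants concrete, here is how the description of the real config `spark.dynamicAllocation.executorIdleTimeout` could read under each style (the wording is illustrative, not the actual doc text):

```
2-1) "When dynamic allocation is enabled, remove an executor that has been
     idle for more than this duration."

2-2) "Remove an executor that has been idle for more than this duration.
     This is effective only if spark.dynamicAllocation.enabled is set to
     true."
```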

I may be missing some details, so please chime in and add them. Otherwise, we may want to choose the best option and agree on a sample form of description. (Once we reach that point, it may be OK to add it to the contribution doc, as it would then be something we agree on.)

Without those details, this becomes a sort of "preference", where disagreement is natural; hence it doesn't make sense to force someone to do something that turns out to be a "preference".




Re: Request to document the direct relationship between other configurations

Hyukjin Kwon

I think it’s just fine as long as we’re consistent with the instances that already have the description, for instance:

  When true and ‘spark.xx.xx’ is enabled, …

I think this is 2-2 in most cases so far. We can reference other configuration keys in a configuration's documentation by using, for example,
ADAPTIVE_EXECUTION_ENABLED.key, so we don't have a duplication problem in most cases.
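A simplified, hypothetical sketch of the pattern mentioned here (not Spark's actual internal ConfigBuilder API): interpolate the other entry's `.key` into the description, so renaming a key can never leave a stale reference in another config's doc.

```python
# Hypothetical stand-in for a config entry; the real definitions live in
# Spark's internal config builders.
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfEntry:
    key: str
    doc: str


ADAPTIVE_EXECUTION_ENABLED = ConfEntry(
    key="spark.sql.adaptive.enabled",
    doc="When true, enable adaptive query execution.")

# The dependent entry builds its doc from the parent's key constant, so the
# two can never disagree on the key name.
REDUCE_POST_SHUFFLE_PARTITIONS = ConfEntry(
    key="spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled",
    doc=(f"When true and {ADAPTIVE_EXECUTION_ENABLED.key} is enabled, "
         "reduce the number of post-shuffle partitions."))

print(REDUCE_POST_SHUFFLE_PARTITIONS.doc)
```

If the parent key were ever renamed, every interpolated description would pick up the new name automatically, which is the deduplication benefit being described.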

Being consistent with the current codebase is the general guidance, if I am not mistaken. We have identified a problem and a good goal to reach.
If we want to change the style, let's do it as our final goal. Otherwise, let's keep it simple to reduce the overhead, rather than having a policy specifically for the mid-term.



2020년 2월 13일 (목) 오후 12:24, Jungtaek Lim <[hidden email]>님이 작성:
I tend to agree that at some point we should make things consistent (and I'm very happy to see the new thread on discussion), and that we may want to adopt some practice in the interim.

But to me it is not clear what the interim practice should be. I pointed out the problems of the existing style, and if we all agree those problems are valid, then we may need to fix them before we start using that style widely.

For me the major question is "where" to put it - there are at least two different approaches:

1) Put it in the description of `.enabled` and describe the range of impact, e.g. state in the description of "spark.A.enabled" that it affects the configurations under "spark.A".
(I don't think it's good to enumerate all of the affected configs; `spark.dynamicAllocation.enabled` shows how fragile that is.)

2) Put it in the description of every affected configuration and describe which configuration is the prerequisite for it to take effect, e.g. mention on every configuration under `spark.dynamicAllocation` that `spark.dynamicAllocation.enabled` must be enabled for it to take effect.

(I intended to skip mentioning "cross reference". Let's be pragmatic.)

2) can itself be described in two ways:

2-1) Mention it simply - like "When dynamic allocation is enabled," - without pointing out the key to toggle. This hugely simplifies the description, though end users may have to guess which key to toggle. (It will be intuitive once we structure the configurations.)

2-2) Mention the key to toggle directly - like "This is effective only if spark.A.enabled is set to true.". This gets longer depending on how long the configuration key is. Personally I'd find it verbose unless the key to toggle is not in a spot we can infer, but others may have different opinions.

I may be missing some details, so please join in and add them. Then we can choose the best option and agree on a sample form of description. (Once we reach that point, it may be OK to add it to the contribution doc, as that would be something we actually agree on.)

Without those details this is going to be some sort of "preference", on which disagreement is natural, and it doesn't make sense for someone to be forced to do something that turns out to be a "preference".

On Thu, Feb 13, 2020 at 11:10 AM Hyukjin Kwon <[hidden email]> wrote:
Adding this information is already the more prevailing style at this moment, and it is usual to follow the prevailing side unless there is a specific reason not to.
If there is confusion about this, I will explicitly add it to the guide (https://spark.apache.org/contributing.html). Let me know if this is still confusing or if you disagree.

On Thu, Feb 13, 2020 at 9:47 AM, Hyukjin Kwon <[hidden email]> wrote:
Yes, revisiting the configurations to make them structured, with cleanly deduplicated documentation, is probably our final goal. +1.

One point I would like to add, though, is that we should keep adding such information to the documentation until we actually reach that final goal,
since it will probably take a while to revisit and fix everything and make it fully structured with the documentation.
If we want to go more conservatively, we can at least add such information to new configurations from now on, keeping the existing configurations as they are.


On Thu, 13 Feb 2020, 06:12 Dongjoon Hyun, <[hidden email]> wrote:
Thank you for raising the issue, Hyukjin.

According to the current status of the discussion, it seems that we can agree on updating the non-structured configurations and keeping the structured configurations AS-IS.

I'm +1 for revisiting the configurations if that is our direction. If there is some mismatch in the structured configurations, we may fix them together.

Bests,
Dongjoon.

On Wed, Feb 12, 2020 at 8:08 AM Jules Damji <[hidden email]> wrote:
All are valid and valuable observations to put into practice: 

* structured and meaningful config names 
* explainable text or succinct description
* easily accessible or searchable 

These are aspirational but gradually doable if we make them part of the dev and review cycle. Often meaningful config names, like those for security, come as an afterthought. 

At the AMA in Amsterdam Spark Summit last year, a few developers lamented the lack of finding Spark configs—what they do; what they are used for; when and why; and what are their default values. 

Though one can fetch them programmatically, one still has to know which specific config one is looking for. 

Cheers 
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

On Feb 12, 2020, at 5:19 AM, Hyukjin Kwon <[hidden email]> wrote:


Yeah, that's one of the reasons why I don't want to document this in the guide yet.

I would like to make sure dev people are on the same page that documenting this is the better practice. I don't intend to force it as a hard requirement; however, if it's pointed out in review, it would be better to address it.


On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <[hidden email]> wrote:
In general I think it's better to have more detailed documents, but we don't have to force everyone to do it if the config name is structured. I would +1 to documenting the relationship if we can't tell it from the config names, e.g. spark.shuffle.service.enabled and spark.dynamicAllocation.enabled.

On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <[hidden email]> wrote:
Also, I would like to hear other people's thoughts here. Could I ask what you all think about this in general?

On Wed, Feb 12, 2020 at 12:02 PM, Hyukjin Kwon <[hidden email]> wrote:
To do that, we would have to explicitly document the structured-configuration convention and its implicit effect, which is currently missing.
I would be more than happy to document such an implied relationship once we are sure all configurations are structured correctly and coherently.
Until that point, I think it is more practical to simply document the relationship directly for now.

> Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in practice - whether to duplicate descriptions between the configuration code and the docs. I have been asked to add descriptions in the configuration code regardless, and the existing codebase doesn't. This is a widely-used configuration.
This is actually something we should fix too. In the SQL configurations we no longer have such duplication as of https://github.com/apache/spark/pull/27459, which generates the documentation. We should do the same for the other configurations.
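The generation idea can be sketched as follows. This is a hedged illustration only — the `ConfigEntry` type, the entries, and the output format below are simplified assumptions, not Spark's actual generator or API — but it shows why keeping the description only in the config definition removes the duplication:

```scala
// Hypothetical sketch of generating a docs table from config definitions,
// so each description lives in exactly one place (the code). Spark's real
// generator and ConfigEntry API differ; names here are for illustration.
object GenerateConfigDocs {
  final case class ConfigEntry(key: String, default: String, doc: String)

  val entries: Seq[ConfigEntry] = Seq(
    ConfigEntry("spark.sql.adaptive.enabled", "false",
      "When true, enable adaptive query execution."),
    ConfigEntry("spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled", "true",
      "When true and spark.sql.adaptive.enabled is enabled, reduce post-shuffle partitions.")
  )

  // Render one markdown table row per entry; since the docs page is built
  // from this output, the description in code is the single source of truth.
  def markdownTable(entries: Seq[ConfigEntry]): String = {
    val header = "| Key | Default | Meaning |\n| --- | --- | --- |"
    val rows = entries.map(e => s"| ${e.key} | ${e.default} | ${e.doc} |")
    (header +: rows).mkString("\n")
  }
}
```

With something like this wired into the build, editing a description in the config code automatically updates the rendered documentation.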


On Wed, Feb 12, 2020 at 11:47 AM, Jungtaek Lim <[hidden email]> wrote:
I'm looking into the case of `spark.dynamicAllocation`, and it seems to support my point.


I don't disagree with adding "This requires spark.shuffle.service.enabled to be set." to the description of `spark.dynamicAllocation.enabled`. That cannot be inferred implicitly, so it is better to have it.

Here is why I'm in favor of structured configuration & implicit effect over describing everything explicitly:

1. There are 10 configurations (if the doc doesn't miss any) besides `spark.dynamicAllocation.enabled`, and only 4 of them are referred to in the description of `spark.dynamicAllocation.enabled` - the majority of config keys are missing.
2. I think it's intentional, but the table starts with `spark.dynamicAllocation.enabled`, which conveys implicitly but intuitively that disabling it means nothing under dynamic allocation will work. Missing the majority of config key references doesn't make it hard to understand.
3. Even `spark.dynamicAllocation` has a bad case - see `spark.dynamicAllocation.shuffleTracking.enabled` and `spark.dynamicAllocation.shuffleTimeout`. The latter does not respect the structure of the configuration, which I think is worse than not mentioning the relationship in the description. Suppose the name had been `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false` would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?

Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in practice - whether to duplicate descriptions between the configuration code and the docs. I have been asked to add descriptions in the configuration code regardless, and the existing codebase doesn't. This is a widely-used configuration.


On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <[hidden email]> wrote:
Sure, adding "[DISCUSS]" is a good practice to label it. I should have done so, although it might be "redundant" :-) since anyone can give feedback on any thread in the Spark dev mailing list and discuss it.

Documenting the relationship is actually the more prevailing style, given my rough reading of the configuration files. I would like to see the missing relationship as a bad pattern that started from personal preference.

> Personally I'd rather not think someone won't understand setting `.enabled` to `false` means the functionality is disabled and effectively it disables all sub-configurations.
> E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations for `spark.sql.adaptive.*` are implicitly no-op. For me this is pretty intuitive and the one of major
> benefits of the structured configurations.

I don't think it is a good idea to assume users know such context. One might think `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
partially enable the feature. It is better to document it explicitly, since for some configurations it is even difficult for users to confirm whether they are working or not.
For instance, one might think setting 'spark.eventLog.rolling.maxFileSize' automatically enables rolling, and only realise later, after the file
grows, that the log is not rolling.


On Wed, Feb 12, 2020 at 10:47 AM, Jungtaek Lim <[hidden email]> wrote:
I'm sorry if I'm missing something, but this would ideally have been started as [DISCUSS], as I haven't seen any reference establishing consensus on this practice.

For me it's just that there are two different practices co-existing in the codebase, meaning it's closer to individual preference (with implicit agreement that others have different preferences), or it hasn't been discussed thoroughly.

Personally I'd rather not assume someone won't understand that setting `.enabled` to `false` disables the functionality and effectively disables all sub-configurations. E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations under `spark.sql.adaptive.*` are implicitly no-ops. For me this is pretty intuitive, and one of the major benefits of structured configurations.

If we want to make it explicit, "all" sub-configurations would have to carry a redundant part of the doc - more redundant still if the condition is nested. I agree this is a good "be kind" step, but it is less pragmatic.

I'd be happy to follow the consensus we would make in this thread. Appreciate more voices.

Thanks,
Jungtaek Lim (HeartSaVioR)




Re: Request to document the direct relationship between other configurations

Jungtaek Lim-2
Even spark.dynamicAllocation.* doesn't follow 2-2, right? It follows a mix of 1 and 2-1, and even 1 is broken there.

It doesn't matter if we are just discussing a one-time decision - it may be OK not to be strict about consistency then, though it's not ideal. The thing is that these kinds of "preferences" cause confusion in the review phase: reviewers give different comments and try to "educate" contributors in their own preferences. What is expected in such cases depends heavily on who the reviewers of the PR are - there's no value in that kind of education.

The codebase is the reference for the implicit rules/policies that apply to all contributors, including newcomers. Let's just put our best effort into being consistent in the codebase. (We need consensus to do that anyway.)





Re: Request to document the direct relationship between other configurations

Hyukjin Kwon
It's okay to just follow the one prevailing style. The main point I want to make is that we should document the direct relationship between configurations.
For this specific case, I don't think there is much point in deciding on one hard requirement to follow for the mid-term. We have a final goal to reach.

There are always many cases that require a committer's judgement. It is impossible to put every single item into a guideline in practice,
so we usually trust their judgement unless there is an explicit objection. Here, my explicit objection is to having no documentation for the direct relationship
between configurations, and it seems people think we should document this in general as well.

In practice, I don't particularly mind which style is used. I vaguely guess the style isn't an issue for most other committers in general.
If it is, let's open a separate thread to settle on one style - that wasn't the main issue I wanted to address in this thread.

Shall we conclude this thread by deciding to document the direct relationship between configurations, preferably in the one prevailing style?


2020년 2월 14일 (금) 오전 11:36, Jungtaek Lim <[hidden email]>님이 작성:
Even spark.dynamicAllocation.* doesn't follow 2-2, right? It follows the mix of 1 and 2-1, though 1 is even broken there.

It doesn't matter If we just discuss about one-time decision - it may be OK to not to be strict on consistency, though it's not ideal. The thing is that these kind of "preferences" are making confusion on review phase: reviewers provide different comments and try to "educate" contributors on their preferences. Expectations for such cases heavily depends on who is/are reviewers of PR - there's no value on education.

The codebase is the reference of implicit rules/policies which would apply to all contributors including newcomers. Let's just put our best efforts on being consistent on codebase. (We should have consensus to do this anyway.)


On Thu, Feb 13, 2020 at 12:44 PM Hyukjin Kwon <[hidden email]> wrote:

I think it’s just fine as long as we’re consistent with the instances having the description, for instance:

  When true and ‘spark.xx.xx’ is enabled, …

I think this is 2-2 in most cases so far. I think we can reference other configuration keys in another configuration documentation by using
ADAPTIVE_EXECUTION_ENABLED.key as an example. So we don’t have duplication problem in most cases.

Being consistent with the current base is a general guidance if I am not mistaken. We have identified a problem and a good goal to reach.
If we want to change, let's do it as our final goal. Otherwise, let's simplify it to reduce the overhead rather then having a policy for the mid-term specifically.



2020년 2월 13일 (목) 오후 12:24, Jungtaek Lim <[hidden email]>님이 작성:
I tend to agree that there should be a time to make thing be consistent (and I'm very happy to see the new thread on discussion) and we may want to take some practice in the interim.

But for me it is not clear what is the practice in the interim. I pointed out the problems of existing style and if we all agree the problems are valid then we may need to fix it before start using it widely.

For me the major question is "where" to put - at least there're two different approaches:

1) put it to the description of `.enable` and describe the range of impact (e.g.) put the description of "spark.A.enable" saying it will affect the following configurations under "spark.A". 
(I don't think it's good to enumerate all of affected configs, as `spark.dynamicAllocation.enable` is telling it is fragile.)

2) put it to the description of all affected configurations and describe which configuration is the prerequisite to let this be effective. e.g. put it on all configurations under `spark.dynamicAllocation` mentioning `spark.dynamicAllocation.enable` should be enabled to make this be effective.

(I intended to skip mentioning "cross reference". Let's be pragmatic.)

2) has also two ways to describe: 

2-1) Just mention simply - like "When dynamic allocation is enabled,", not pointing out the key to toggle. This hugely simplify the description, though end users may have to do the second guess to find the key to toggle. (It'll be intuitive when we structurize the configurations.)

2-2) Mention the key to toggle directly - like "This is effective only if spark.A.enable is set to true.". It's going to be longer depending on how long the configuration key is. Personally I'd feel verbose unless the key to toggle is not placed to the spot we can infer, but others may have different opinions.

I may be missing some details, so please participate to add the details. Otherwise we may want to choose the best one, and have a sample form of description. (Once we reach here, it may be OK to add to the contribution doc, as that is the thing we agree about.)

Without the details it's going to be a some sort of "preference" which is natural to have disagreement, hence it doesn't make sense someone is forced to do something if something turns out to be "preference".

On Thu, Feb 13, 2020 at 11:10 AM Hyukjin Kwon <[hidden email]> wrote:
Adding those information is already a more prevailing style at this moment, and this is usual to follow prevailing side if there isn't a specific reason.
If there is confusion about this, I will explicitly add it into the guide (https://spark.apache.org/contributing.html). Let me know if this still confuses or disagree.

2020년 2월 13일 (목) 오전 9:47, Hyukjin Kwon <[hidden email]>님이 작성:
Yes, our final goal is probably to revisit the configurations so that they are structured and the documentation is cleanly deduplicated. +1.

One point I would like to add, though, is to add such information to the documentation until we actually reach our final goal,
since it will probably take a while to revisit/fix everything and make it fully structured with the documentation.
If we go more conservatively, we can at least add such information to new configurations being added from now on, and keep the existing configurations as they are.


On Thu, 13 Feb 2020, 06:12 Dongjoon Hyun, <[hidden email]> wrote:
Thank you for raising the issue, Hyukjin.

According to the current status of the discussion, it seems that we are able to agree on updating the non-structured configurations and keeping the structured configurations AS-IS.

I'm +1 for revisiting the configurations if that is our direction. If there is some mismatch in the structured configurations, we may fix them together.

Bests,
Dongjoon.

On Wed, Feb 12, 2020 at 8:08 AM Jules Damji <[hidden email]> wrote:
All are valid and valuable observations to put into practice: 

* structured and meaningful config names 
* explainable text or succinct description
* easily accessible or searchable 

These are aspirational but gradually doable if we make them part of the dev and review cycle. Often meaningful config names, like security, come as an afterthought. 

At the AMA at the Amsterdam Spark Summit last year, a few developers lamented the difficulty of finding Spark configs—what they do; what they are used for; when and why; and what their default values are. 

Though one can fetch them programmatically, one still has to know which specific config one is looking for. 

Cheers 
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

On Feb 12, 2020, at 5:19 AM, Hyukjin Kwon <[hidden email]> wrote:


Yeah, that's one of the reasons why I don't want to document this in the guide yet.

I would like to make sure dev people are on the same page that documenting this is the better practice. I don't intend to force it as a hard requirement; however, if it's pointed out in review, it's better to address it.


On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <[hidden email]> wrote:
In general I think it's better to have more detailed documents, but we don't have to force everyone to do it if the config name is structured. I would +1 documenting the relationship if we can't tell it from the config names, e.g. spark.shuffle.service.enabled and spark.dynamicAllocation.enabled.

On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <[hidden email]> wrote:
Also, I would like to hear other people's thoughts here. Could I ask what you all think about this in general?

On Wed, Feb 12, 2020 at 12:02 PM Hyukjin Kwon <[hidden email]> wrote:
To do that, we should explicitly document the structured-configuration convention and its implicit effect, which is currently missing.
I would be more than happy to document such an implied relationship, once we are very sure all configurations are correctly and coherently structured.
Until that point, I think it is more practical to simply document it for now.

> Btw, maybe off-topic, `spark.dynamicAllocation` is having another issue on practice - whether to duplicate description between configuration code and doc. I have been asked to add description on configuration code regardlessly, and existing codebase doesn't. This configuration is widely-used one.
This is actually something we should fix too. In the SQL configurations, we no longer have such duplication as of https://github.com/apache/spark/pull/27459, since the documentation is generated. We should do the same for the other configurations.
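To illustrate the idea (a minimal hypothetical sketch, not the actual generator from that PR): the doc table can be rendered from the config entries themselves, so each description lives in exactly one place:

```python
# Hypothetical sketch: render a markdown config table from (key, default, doc)
# tuples, in the spirit of generating docs from the config definitions.
def to_markdown_table(entries):
    lines = ["| Key | Default | Meaning |", "| --- | --- | --- |"]
    for key, default, doc in entries:
        lines.append(f"| {key} | {default} | {doc} |")
    return "\n".join(lines)
```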


On Wed, Feb 12, 2020 at 11:47 AM Jungtaek Lim <[hidden email]> wrote:
I'm looking into the case of `spark.dynamicAllocation`, and it seems to support my point.


I don't disagree with adding "This requires spark.shuffle.service.enabled to be set." to the description of `spark.dynamicAllocation.enabled`. This cannot be inferred implicitly, hence it's better to have it.

Here is why I'm in favor of structured configuration & implicit effect over describing everything explicitly: 

1. There are 10 configurations (if the doc doesn't miss any) besides `spark.dynamicAllocation.enabled`, and only 4 of them are referred to in the description of `spark.dynamicAllocation.enabled` - the majority of the config keys are missing.
2. I think it's intentional, but the table starts with `spark.dynamicAllocation.enabled`, which implies, intuitively, that if you disable it then nothing under dynamic allocation will work. Missing the majority of the config-key references doesn't make it hard to understand.
3. Even `spark.dynamicAllocation` has a bad case - see `spark.dynamicAllocation.shuffleTracking.enabled` and `spark.dynamicAllocation.shuffleTimeout`. The latter doesn't respect the structure of the configuration namespace, which I think is worse than not mentioning the relationship explicitly in the description. Suppose the name had been `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false` would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?

Btw, maybe off-topic, but `spark.dynamicAllocation` has another issue in practice - whether to duplicate the description between the configuration code and the doc. I have been asked to add the description to the configuration code regardless, while the existing codebase doesn't. And this is a widely used configuration.


On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <[hidden email]> wrote:
Sure, adding "[DISCUSS]" is a good practice for labeling it. I should have done it, although it might be "redundant" :-) since anyone can give feedback on any thread in the Spark dev mailing list and discuss.

Documenting the relationship is actually the more prevalent style, given my rough reading of the configuration files. I would like to treat the missing relationship as a bad pattern that started from a personal preference.

> Personally I'd rather not think someone won't understand setting `.enabled` to `false` means the functionality is disabled and effectively it disables all sub-configurations.
> E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations for `spark.sql.adaptive.*` are implicitly no-op. For me this is pretty intuitive and the one of major
> benefits of the structured configurations.

I don't think it's a good idea to assume users know such context. One might think `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
partially enable the feature. It is better to document it explicitly, since for some configurations it is even difficult for users to confirm whether they are working or not.
For instance, one might think setting 'spark.eventLog.rolling.maxFileSize' automatically enables rolling, and only realise later, after the file
has grown, that the log is not rolling.


On Wed, Feb 12, 2020 at 10:47 AM Jungtaek Lim <[hidden email]> wrote:
I'm sorry if I'm missing something, but ideally this would have been started as [DISCUSS], since I haven't seen any reference establishing consensus on this practice.

To me, there are simply two different practices coexisting in the codebase, meaning it's closer to individual preference (with implicit agreement that others have different preferences), or it hasn't been discussed thoroughly.

Personally, I don't think anyone will fail to understand that setting `.enabled` to `false` means the functionality is disabled and effectively disables all sub-configurations. E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations under `spark.sql.adaptive.*` are implicitly no-ops. For me this is pretty intuitive, and one of the major benefits of structured configurations.
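The implicit-effect convention could be sketched like this (a hypothetical helper, not a Spark API): a sub-configuration is effective only when every ancestor prefix's `.enabled` flag, if set, is true:

```python
# Hypothetical sketch of the structured-configuration convention:
# walk the ancestor prefixes of a key (e.g. "spark.sql", "spark.sql.adaptive",
# ...) and treat the key as a no-op if any "<prefix>.enabled" is false.
def is_effective(conf: dict, key: str) -> bool:
    parts = key.split(".")
    for i in range(1, len(parts)):
        prefix = ".".join(parts[:i])
        # Absent flags default to "true" for the purpose of this sketch.
        if conf.get(prefix + ".enabled", "true").lower() == "false":
            return False
    return True
```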

If we want to make it explicit, "all" sub-configurations would have to carry a redundant part of the doc - more redundant if the condition is nested. I agree this is a good step towards "being kind", but it is less pragmatic.

I'd be happy to follow whatever consensus we reach in this thread. I'd appreciate more voices.

Thanks,
Jungtaek Lim (HeartSaVioR)


On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <[hidden email]> wrote:
> I don't plan to document this officially yet
Just to prevent confusion, I meant I don't yet plan to document the fact that we should write the relationships in configurations as a code/review guideline in https://spark.apache.org/contributing.html


On Wed, Feb 12, 2020 at 9:57 AM Hyukjin Kwon <[hidden email]> wrote:
Hi all,

I happened to review some PRs and I noticed that some configurations are missing necessary
information.

To be explicit, I would like to make sure we document the direct relationships between configurations
in the documentation. For example, `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
can only be enabled when `spark.sql.adaptive.enabled` is enabled. That's clearly documented.
In general we're good, given that we do document these relationships in Apache Spark.
See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled', 'spark.sql.parquet.filterPushdown', etc. 

However, I noticed a pattern where such information is missing in some components, for example
`spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and `spark.history.ui.acls.*`.

I hope we all start to document such information. Users can't logically deduce the relationship, and I myself
had to read the code to confirm it while reviewing. 
I don't plan to document this officially yet because it looks like a pretty reasonable request to me; however,
let me know if you guys have some different opinions.

Thanks.



Re: Request to document the direct relationship between other configurations

Jungtaek Lim-2
OK, I agree this is going forward: we decided on the final goal, and it seems someone has started to work on it. In the meantime, we agree about documenting the direct relationship; which style to use remains "open".

Thanks again for initiating the discussion - this thread led to the follow-up thread about the final goal.

On Fri, Feb 14, 2020 at 1:43 PM Hyukjin Kwon <[hidden email]> wrote:
It's okay to just follow the one prevailing style. The main point I wanted to make is that we should document the direct relationships of configurations.
For this case specifically, I don't see much point in deciding on one hard requirement to follow for the mid-term. We have a final goal to reach.

There are always many cases that require a committer's judgement. It is impossible to put every single item into a guideline in practice,
so we usually trust their judgement unless there's an explicit objection. Here, my explicit objection is about the lack of documentation of the direct relationship
between configurations. It seems people think we should document this in general as well.

In practice, I don't particularly mind which style is used. I vaguely guess the style isn't an issue for most other committers in general.
If that's the case, let's open a separate thread to confirm one style - this wasn't the main issue I wanted to address in this thread.

Shall we conclude this thread by deciding to document the direct relationship between configurations preferably in one prevailing style?


On Fri, Feb 14, 2020 at 11:36 AM Jungtaek Lim <[hidden email]> wrote:
Even spark.dynamicAllocation.* doesn't follow 2-2, right? It follows a mix of 1) and 2-1), though 1) is even broken there.

It wouldn't matter if we were just discussing a one-time decision - it may be OK not to be strict about consistency, though it's not ideal. The thing is that these kinds of "preferences" create confusion in the review phase: reviewers provide different comments and try to "educate" contributors according to their own preferences. What is expected in such cases depends heavily on who the reviewers of the PR are - there's no value in that kind of education.

The codebase is the reference for the implicit rules/policies that apply to all contributors, including newcomers. Let's just put our best effort into keeping the codebase consistent. (We would need consensus to do this anyway.)


On Thu, Feb 13, 2020 at 12:44 PM Hyukjin Kwon <[hidden email]> wrote:

I think it’s just fine as long as we’re consistent with the instances that have the description, for instance:

  When true and ‘spark.xx.xx’ is enabled, …

I think this is 2-2 in most cases so far. We can reference other configuration keys in a configuration's documentation by using,
for example, ADAPTIVE_EXECUTION_ENABLED.key, so we don't have a duplication problem in most cases.
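A minimal sketch of that cross-referencing style (the `ConfEntry` class here is hypothetical; Spark's actual config entries are defined via `ConfigBuilder` in Scala):

```python
# Hypothetical sketch: referencing another entry's key via a constant rather
# than a hard-coded string, so the cross-reference cannot drift from the key.
from dataclasses import dataclass

@dataclass(frozen=True)
class ConfEntry:
    key: str
    doc: str

ADAPTIVE_EXECUTION_ENABLED = ConfEntry(
    "spark.sql.adaptive.enabled",
    "When true, enable adaptive query execution.")

REDUCE_POST_SHUFFLE_PARTITIONS = ConfEntry(
    "spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled",
    f"When true and '{ADAPTIVE_EXECUTION_ENABLED.key}' is enabled, "
    "reduce the number of post-shuffle partitions adaptively.")
```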

Being consistent with the current base is the general guidance, if I am not mistaken. We have identified a problem and a good goal to reach.
If we want to change the style, let's do it as part of our final goal. Otherwise, let's simplify it to reduce the overhead rather than having a policy for the mid-term specifically.



On Thu, Feb 13, 2020 at 12:24 PM Jungtaek Lim <[hidden email]> wrote:
I tend to agree that there should be a time to make things consistent (and I'm very happy to see the new discussion thread), and that we may want to adopt some practice in the interim.

But to me it is not clear what the practice in the interim is. I pointed out the problems with the existing style, and if we all agree the problems are valid, then we may need to fix them before starting to use the style widely.

For me the major question is "where" to put it - there are at least two different approaches:

1) put it in the description of the `.enable` flag and describe the range of impact; e.g. state in the description of "spark.A.enable" that it affects the configurations under "spark.A". 

Re: Request to document the direct relationship between other configurations

Hyukjin Kwon
Thanks Jungtaek!

2020년 2월 14일 (금) 오후 3:57, Jungtaek Lim <[hidden email]>님이 작성:
OK I agree this is going forward as we decided the final goal and it seems someone starts to work on. In the meanwhile we agree about documenting the direct relationship, and which style to use is "open".

Thanks again to initiate the discussion thread - this thread led the following thread for the final goal.

On Fri, Feb 14, 2020 at 1:43 PM Hyukjin Kwon <[hidden email]> wrote:
It's okay to just follow one prevailing style. The main point I would like to say is the fact that we should document the direct relationship of configurations.
For this case specifically, I don't think there is so much point here to decide one hard requirement to follow for the mid-term. We have the final goal to reach.

There are always many cases that requires committer's judgement. It is impossible to put every single item to a guide line in practice.
So we usually trust their judgement in practice unless there's no explicit objection. Here, my explicit objection is about no documentation for the direct relationship
between configurations. Seems people think we should document this in general as well.

In practice, I don't particularly mind what style is used in fact. I vaguely guess the style isn't an issue to many other committers in general.
If this is the case, let's open a separate thread to discuss to confirm one style - this wasn't the main issue I wanted to address in this thread.

Shall we conclude this thread by deciding to document the direct relationship between configurations preferably in one prevailing style?


2020년 2월 14일 (금) 오전 11:36, Jungtaek Lim <[hidden email]>님이 작성:
Even spark.dynamicAllocation.* doesn't follow 2-2, right? It follows the mix of 1 and 2-1, though 1 is even broken there.

It doesn't matter If we just discuss about one-time decision - it may be OK to not to be strict on consistency, though it's not ideal. The thing is that these kind of "preferences" are making confusion on review phase: reviewers provide different comments and try to "educate" contributors on their preferences. Expectations for such cases heavily depends on who is/are reviewers of PR - there's no value on education.

The codebase is the reference of implicit rules/policies which would apply to all contributors including newcomers. Let's just put our best efforts on being consistent on codebase. (We should have consensus to do this anyway.)


On Thu, Feb 13, 2020 at 12:44 PM Hyukjin Kwon <[hidden email]> wrote:

I think it’s just fine as long as we’re consistent with the instances having the description, for instance:

  When true and ‘spark.xx.xx’ is enabled, …

I think this is 2-2 in most cases so far. I think we can reference other configuration keys in another configuration documentation by using
ADAPTIVE_EXECUTION_ENABLED.key as an example. So we don’t have duplication problem in most cases.

Being consistent with the current base is a general guidance if I am not mistaken. We have identified a problem and a good goal to reach.
If we want to change, let's do it as our final goal. Otherwise, let's simplify it to reduce the overhead rather then having a policy for the mid-term specifically.



2020년 2월 13일 (목) 오후 12:24, Jungtaek Lim <[hidden email]>님이 작성:
I tend to agree that there should be a time to make thing be consistent (and I'm very happy to see the new thread on discussion) and we may want to take some practice in the interim.

But for me it is not clear what is the practice in the interim. I pointed out the problems of existing style and if we all agree the problems are valid then we may need to fix it before start using it widely.

For me the major question is "where" to put - at least there're two different approaches:

1) put it to the description of `.enable` and describe the range of impact (e.g.) put the description of "spark.A.enable" saying it will affect the following configurations under "spark.A". 
(I don't think it's good to enumerate all of affected configs, as `spark.dynamicAllocation.enable` is telling it is fragile.)

2) put it to the description of all affected configurations and describe which configuration is the prerequisite to let this be effective. e.g. put it on all configurations under `spark.dynamicAllocation` mentioning `spark.dynamicAllocation.enable` should be enabled to make this be effective.

(I intended to skip mentioning "cross reference". Let's be pragmatic.)

2) has also two ways to describe: 

2-1) Just mention simply - like "When dynamic allocation is enabled,", not pointing out the key to toggle. This hugely simplify the description, though end users may have to do the second guess to find the key to toggle. (It'll be intuitive when we structurize the configurations.)

2-2) Mention the key to toggle directly - like "This is effective only if spark.A.enable is set to true.". It's going to be longer depending on how long the configuration key is. Personally I'd feel verbose unless the key to toggle is not placed to the spot we can infer, but others may have different opinions.

I may be missing some details, so please participate to add the details. Otherwise we may want to choose the best one, and have a sample form of description. (Once we reach here, it may be OK to add to the contribution doc, as that is the thing we agree about.)

Without the details it's going to be a some sort of "preference" which is natural to have disagreement, hence it doesn't make sense someone is forced to do something if something turns out to be "preference".

On Thu, Feb 13, 2020 at 11:10 AM Hyukjin Kwon <[hidden email]> wrote:
Adding those information is already a more prevailing style at this moment, and this is usual to follow prevailing side if there isn't a specific reason.
If there is confusion about this, I will explicitly add it into the guide (https://spark.apache.org/contributing.html). Let me know if this still confuses or disagree.

2020년 2월 13일 (목) 오전 9:47, Hyukjin Kwon <[hidden email]>님이 작성:
Yes, that's probably our final goal to revisit the configurations to make it structured and deduplicated documentation cleanly. +1.

One point I would like to add is though to add such information to the documentation until we actually manage our final goal
since probably it's going to take a while to revisit/fix and make it fully structured with the documentation.
If we go more conservatively, we can add such information to the new configurations being added from now on at least, and keeping the existing configurations as are.


On Thu, 13 Feb 2020, 06:12 Dongjoon Hyun, <[hidden email]> wrote:
Thank you for raising the issue, Hyukjin.

According to the current status of discussion, it seems that we are able to agree on updating the non-structured configurations and keeping the structured configuration AS-IS.

I'm +1 for revisiting the configurations if that is our direction. If there is some mismatch in the structured configurations, we may fix them together.

Bests,
Dongjoon.

On Wed, Feb 12, 2020 at 8:08 AM Jules Damji <[hidden email]> wrote:
All are valid and valuable observations to put into practice: 

* structured and meaningful config names 
* explainable text or succinct description
* easily accessible or searchable 

These are aspirational but gradually doable if we make them part of the dev and review cycle. Often meaningful config names, like security, come as an afterthought.

At the AMA in Amsterdam Spark Summit last year, a few developers lamented the lack of finding Spark configs—what they do; what they are used for; when and why; and what are their default values. 

Though one can fetch them programmatically, one still has to know which specific config one is looking for.

Cheers 
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

On Feb 12, 2020, at 5:19 AM, Hyukjin Kwon <[hidden email]> wrote:


Yeah, that's one of the reasons why I don't want to document this in the guide yet.

I would like to make sure dev people are on the same page that documenting this is the better practice. I don't intend to force it as a hard requirement; however, if it's pointed out in review, it would be better to address it.


On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <[hidden email]> wrote:
In general I think it's better to have more detailed documents, but we don't have to force everyone to do it if the config name is structured. I would +1 to documenting the relationship if we can't tell it from the config names, e.g. spark.shuffle.service.enabled and spark.dynamicAllocation.enabled.

On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <[hidden email]> wrote:
Also, I would like to hear other people's thoughts here. Could I ask what you all think about this in general?

2020년 2월 12일 (수) 오후 12:02, Hyukjin Kwon <[hidden email]>님이 작성:
To do that, we should explicitly document such structured configurations and their implicit effect, which is currently missing.
I would be more than happy if we documented such implied relationships, and if we were very sure all configurations are structured correctly and coherently.
Until that point, I think it is more practical to simply document the relationship for now.

> Btw, maybe off-topic, `spark.dynamicAllocation` is having another issue on practice - whether to duplicate description between configuration code and doc. I have been asked to add description on configuration code regardlessly, and existing codebase doesn't. This configuration is widely-used one.
This is actually something we should fix too. In the SQL configurations, we no longer have such duplication as of https://github.com/apache/spark/pull/27459, since the doc is now generated. We should do the same for the other configurations.
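To make the deduplication point concrete, here is a rough Python sketch of the generation idea (hypothetical code, not the actual script in that PR): keep each description only in the config registry and render the user-facing markdown table from it, so the code and the docs can't diverge.

```python
# Hypothetical sketch of doc generation, not the actual script from the
# PR: descriptions live in one place (the registry) and the markdown
# table for the docs site is rendered from it.
entries = [
    ("spark.sql.adaptive.enabled", "false",
     "When true, enable adaptive query execution."),
    ("spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled", "true",
     "When true and spark.sql.adaptive.enabled is true, reduce post-shuffle partitions."),
]

def render_table(entries):
    rows = ["| Property Name | Default | Meaning |", "| --- | --- | --- |"]
    rows += [f"| {key} | {default} | {meaning} |" for key, default, meaning in entries]
    return "\n".join(rows)

print(render_table(entries))
```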


2020년 2월 12일 (수) 오전 11:47, Jungtaek Lim <[hidden email]>님이 작성:
I'm looking into the case of `spark.dynamicAllocation`, and this seems to support my point.


I don't disagree with adding "This requires spark.shuffle.service.enabled to be set." to the description of `spark.dynamicAllocation.enabled`. This cannot be inferred implicitly, hence it is better to have it.

This is why I'm in favor of structured configurations with implicit effect over describing everything explicitly:

1. There are 10 configurations (if the doc doesn't miss any) besides `spark.dynamicAllocation.enabled`, and only 4 of them are referred to in the description of `spark.dynamicAllocation.enabled` - the majority of config keys are missing.
2. I think it's intentional, but the table starts with `spark.dynamicAllocation.enabled`, which conveys implicitly but intuitively that if you disable it, nothing in dynamic allocation will work. Missing references to the majority of config keys doesn't make it hard to understand.
3. Even `spark.dynamicAllocation` has a bad case - see `spark.dynamicAllocation.shuffleTracking.enabled` and `spark.dynamicAllocation.shuffleTimeout`. The latter does not respect the structure of the configuration, which I think is worse than not mentioning the relationship in the description. Suppose the name had been `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false` would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?
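The prefix-gating intuition in point 3 can even be checked mechanically. A minimal Python illustration (a hypothetical helper, not part of Spark), assuming a flat key-to-value map and the convention that `<prefix>.enabled = false` disables everything under `<prefix>.*`:

```python
# Hypothetical helper (not part of Spark): given a flat map of config
# key -> value, report sub-configurations that are no-ops because an
# ancestor "<prefix>.enabled" key is set to false.
def find_noop_configs(conf):
    noops = []
    for key in conf:
        parts = key.split(".")
        # Walk every ancestor prefix and look for "<prefix>.enabled" = false.
        for i in range(1, len(parts)):
            gate = ".".join(parts[:i]) + ".enabled"
            if gate != key and conf.get(gate) == "false":
                noops.append(key)
                break
    return noops

conf = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.dynamicAllocation.shuffleTracking.timeout": "30s",
    "spark.eventLog.enabled": "true",
}
print(find_noop_configs(conf))
```

This only works when the naming respects the structure - which is exactly why `spark.dynamicAllocation.shuffleTimeout` is a bad case.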

Btw, maybe off-topic, but `spark.dynamicAllocation` has another issue in practice - whether to duplicate descriptions between the configuration code and the docs. I have been asked to add descriptions in the configuration code regardless, while the existing codebase doesn't. And this is a widely-used configuration.


On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <[hidden email]> wrote:
Sure, adding "[DISCUSS]" is a good practice for labelling it. I should have done so, although it might be "redundant" :-) since anyone can give feedback on, and discuss, any thread on the Spark dev mailing list.

Documenting the relationship is actually the more prevailing style, given my rough reading of the configuration files. I would regard the missing relationships as a bad pattern that started from personal preference.

> Personally I'd rather not think someone won't understand setting `.enabled` to `false` means the functionality is disabled and effectively it disables all sub-configurations.
> E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations for `spark.sql.adaptive.*` are implicitly no-op. For me this is pretty intuitive and the one of major
> benefits of the structured configurations.

I don't think it's a good idea to assume users know such context. One might think `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
partially enable the feature. It is better to document it explicitly, since some configurations are even difficult for users to verify as working or not.
For instance, one might think setting 'spark.eventLog.rolling.maxFileSize' automatically enables rolling, and only realise the log is not rolling later, after the file
size becomes bigger.


2020년 2월 12일 (수) 오전 10:47, Jungtaek Lim <[hidden email]>님이 작성:
I'm sorry if I'm missing something, but ideally this should have been started as a [DISCUSS] thread, as I haven't seen any reference establishing consensus on this practice.

For me it's just that there are two different practices co-existing in the codebase, meaning it's closer to an individual preference (with an implicit agreement that others may have different preferences), or it simply hasn't been discussed thoroughly.

Personally, I doubt anyone would fail to understand that setting `.enabled` to `false` means the functionality is disabled, effectively disabling all sub-configurations. E.g. when `spark.sql.adaptive.enabled` is `false`, all the `spark.sql.adaptive.*` configurations are implicitly no-ops. For me this is quite intuitive, and one of the major benefits of structured configurations.

If we want to make it explicit, "all" sub-configurations would need a redundant part in their doc - more redundant if the condition is nested. I agree this is a good step towards "being kind", but it is less pragmatic.

I'd be happy to follow the consensus we would make in this thread. Appreciate more voices.

Thanks,
Jungtaek Lim (HeartSaVioR)


On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <[hidden email]> wrote:
> I don't plan to document this officially yet
Just to prevent confusion, I meant I don't yet plan to document the fact that we should write the relationships in configurations as a code/review guideline in https://spark.apache.org/contributing.html


2020년 2월 12일 (수) 오전 9:57, Hyukjin Kwon <[hidden email]>님이 작성:
Hi all,

I happened to review some PRs and noticed that some configurations are missing necessary information.

To be explicit, I would like to make sure we document the direct relationships between configurations
in the documentation. For example, `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
can only take effect when `spark.sql.adaptive.enabled` is enabled, and that's clearly documented.
We're good in general, given that we usually document these in Apache Spark.
See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled', 'spark.sql.parquet.filterPushdown', etc.

However, I noticed a pattern where such information is missing in some components, for example
`spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and `spark.history.ui.acls.*`

I hope we all start to document such information. Users can't logically know the relationship, and I myself
had to read the code to confirm it when reviewing.
I don't plan to document this officially yet because it seems like a pretty logical request to me; however,
let me know if you have different opinions.

Thanks.