

Hi Everyone,
I was wondering if we could make any enhancements to the FPGrowth algorithm in spark/pyspark.
Many times I am looking for rules for one particular consequent, so I don't need the rules for all the other consequents. I know I can filter the rules to get the desired output, but if I could pass this to the algorithm itself, the execution time would reduce drastically.
Also, sometimes I want the rules to be small, maybe of length 5-6. Again, I can filter on length, but I was wondering if we could take this as an input to the algorithm. Given the depth-first nature of FP-Growth, I am not sure that is feasible.
I am willing to work on these suggestions, if someone thinks they are feasible. Thanks to the dev team for all the hard work!
Regards, Aditya Addepalli
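For concreteness, the post-hoc filtering described above amounts to something like this (a plain-Python sketch over an already-mined rule list; the items, rules, and numbers are made up for illustration):

```python
# Each rule is (antecedent, consequent, confidence); values are illustrative.
rules = [
    (("bread", "butter"), ("milk",), 0.9),
    (("beer",), ("chips",), 0.7),
    (("bread",), ("milk",), 0.6),
]

target = ("milk",)   # the one consequent we actually care about
max_len = 2          # cap on total rule length (antecedent + consequent)

# Today this filter runs AFTER all rules have been generated; the proposal
# is to push both conditions into the mining step itself.
filtered = [
    (a, c, conf) for (a, c, conf) in rules
    if c == target and len(a) + len(c) <= max_len
]
```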


You could just filter the input for sets containing the desired item,
and discard the rest. That doesn't mean all of the resulting item sets
will have that item, and you'd still have to filter, but it may be much
faster to compute.
Increasing min support might generally have the effect of smaller
rules, though it doesn't impose a cap. That could help perf, if that's
what you're trying to improve.
I don't know if it's worth new params in the implementation, maybe. I
think there would have to be an argument this generalizes.
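In plain Python, the pre-filtering suggested here looks roughly like this (the transactions are made up; the mining step itself is elided):

```python
transactions = [
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "milk"},
    {"beer", "diapers"},
]

target = "milk"

# Keep only transactions containing the item of interest; FP-Growth then
# runs on a (possibly much) smaller input.
reduced = [t for t in transactions if target in t]

# Note: itemsets mined from `reduced` can still omit `target`
# (e.g. {bread, butter}), so a final filter on the consequent is still needed.
```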
On Sat, May 2, 2020 at 3:13 AM Aditya Addepalli <[hidden email]> wrote:
> [...]

To unsubscribe email: [hidden email]


Hi Sean,
I understand your approach, but there's a slight problem.
If we generate rules after filtering for our desired consequent, we introduce bias into the rules: the confidence of a rule on the filtered input can be very high even when it is not on the entire dataset. Thus we can get biased rules which wrongly depict the patterns in the data. This is why I think having a parameter to specify the consequent would help greatly.
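A tiny worked example of that bias (plain Python, made-up transactions): the rule bread -> milk looks perfect on the filtered data but only holds half the time on the full dataset.

```python
full = [
    {"bread", "milk"},
    {"bread"},          # bread without milk: these rows lower the true confidence
    {"bread"},
    {"bread", "milk"},
    {"beer"},
]

def confidence(transactions, antecedent, consequent):
    has_a = [t for t in transactions if antecedent <= t]
    has_both = [t for t in has_a if consequent <= t]
    return len(has_both) / len(has_a)

# Mining only on rows that contain the consequent inflates confidence:
filtered = [t for t in full if "milk" in t]

conf_full = confidence(full, {"bread"}, {"milk"})          # 2 of 4 = 0.5
conf_filtered = confidence(filtered, {"bread"}, {"milk"})  # 2 of 2 = 1.0
```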
Reducing the support doesn't really work in my case, simply because the consequents I am mining for occur very rarely in the data. Sometimes their support can be 1e-4 or 1e-5, so my minSupport has to be less than that to capture the rules for that consequent.
Thanks for your reply. Let me know what you think.
Regards,
Aditya Addepalli


Hi,
I understand that this is not a priority with everything going on, but if you think generating rules for only a single consequent adds value, I would like to contribute.
Thanks & Regards,
Aditya


Yes, you can get the correct support this way by accounting for how
many rows were filtered out, but not the right confidence, as it
depends on counting support in rows without the items of interest.
But computing confidence depends on computing all that support; how
would you optimize it even if you knew the consequent you cared about?
Maybe there's a way, sure; I don't know the code well, but it wasn't
obvious at a glance how to take advantage of it.
I can see how limiting the rule size could help.
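For reference, the dependency described here: confidence is just a ratio of two supports, so the antecedent's support must be counted over all rows, including rows that lack the consequent (plain-Python sketch with made-up counts):

```python
# confidence(A => B) = support(A ∪ B) / support(A)
def rule_confidence(support, antecedent, union):
    return support[union] / support[antecedent]

# support(A) counts every row containing A, with or without B,
# which is why pre-filtering rows breaks the denominator.
support = {
    frozenset({"bread"}): 4,
    frozenset({"bread", "milk"}): 2,
}
c = rule_confidence(support, frozenset({"bread"}), frozenset({"bread", "milk"}))
# c == 0.5
```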
On Sat, May 2, 2020 at 11:04 AM Aditya Addepalli <[hidden email]> wrote:
> [...]
> On Sat, 2 May, 2020, 9:13 pm Sean Owen, <[hidden email]> wrote:
>> [...]



Hi Sean,
1. I was thinking that by specifying the consequent we can (somehow?) skip the confidence calculation for all the other consequents.
This would greatly reduce the time taken as we avoid computation for consequents we don't care about.
2. Is limiting rule size even possible? I thought that because of FP-Growth's depth-first nature it might not be possible.
My experience with FP-Growth has largely been in Python, where the API is limited. I will take a look at the Scala source code and get back to you with more concrete answers.
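One possible shape of that optimization, sketched in plain Python over already-mined frequent itemsets (the helper and its name are hypothetical, not Spark's API; in spark.ml the corresponding rule generation lives on the Scala side):

```python
def rules_for_consequent(freq, consequent, min_conf):
    """freq maps frozenset(itemset) -> support count.

    Only itemsets containing `consequent` are expanded into rules, so
    confidence is never computed for any other consequent."""
    out = []
    for itemset, count in freq.items():
        if consequent not in itemset or len(itemset) < 2:
            continue
        antecedent = itemset - {consequent}
        conf = count / freq[antecedent]
        if conf >= min_conf:
            out.append((antecedent, consequent, conf))
    return out

freq = {
    frozenset({"bread"}): 4,
    frozenset({"milk"}): 3,
    frozenset({"bread", "milk"}): 2,
    frozenset({"beer"}): 2,
}
result = rules_for_consequent(freq, "milk", 0.4)
# Only bread -> milk is ever scored; milk -> bread is skipped entirely.
```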
Thanks & Regards,
Aditya


The confidence calculation is pretty trivial; the work is finding the supports needed. Not sure how to optimize that.


Absolutely. I meant to say that the confidence calculation depends on the support calculations, and hence that is where the time would be saved. Thanks for pointing that out.

