Spark FP-growth

Spark FP-growth

Aditya Addepalli
Hi Everyone,

I was wondering if we could make any enhancements to the FP-Growth algorithm in Spark/PySpark.

Many times I am looking for rules for one particular consequent, so I don't need the rules for all the other consequents. I know I can filter the rules to get the desired output, but if I could pass this to the algorithm itself, the execution time would drop drastically.

Also, sometimes I want the rules to be small, maybe of length 5-6. Again, I can filter on length, but I was wondering if we could take this as an input to the algorithm. Given the depth-first nature of FP-Growth, I am not sure that is feasible.

I am willing to work on these suggestions if someone thinks they are feasible. Thanks to the dev team for all the hard work!

Regards,
Aditya Addepalli
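
(For reference, this is roughly what the filtering workaround looks like today with pyspark.ml.fpm.FPGrowth; the toy data, the thresholds, and the item name "target" are made up for illustration.)

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import array_contains, size
    from pyspark.ml.fpm import FPGrowth

    spark = SparkSession.builder.getOrCreate()

    # toy transaction data; real data would be much larger and sparser
    df = spark.createDataFrame(
        [(0, ["a", "b", "target"]), (1, ["a", "b"]), (2, ["a", "target"])],
        ["id", "items"])

    fp = FPGrowth(itemsCol="items", minSupport=0.001, minConfidence=0.1)
    model = fp.fit(df)

    # post-hoc filtering: keep only rules that predict "target",
    # and only short rules (antecedent of at most 5 items)
    rules = (model.associationRules
             .filter(array_contains("consequent", "target"))
             .filter(size("antecedent") <= 5))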

Re: Spark FP-growth

Sean Owen
You could just filter the input for sets containing the desired item and discard the rest. That doesn't mean all of the item sets have that item, and you'd still have to filter, but it may be much faster to compute.

Increasing min support might generally have the effect of smaller rules, though it doesn't impose a cap. That could help perf, if that's what you're trying to improve.

I don't know if it's worth new params in the implementation, maybe. I think there would have to be an argument that this generalizes.
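
(A minimal sketch of this pre-filtering idea, assuming the same toy df and "target" item as in the sketch above; the variable names are illustrative only.)

    from pyspark.sql.functions import array_contains
    from pyspark.ml.fpm import FPGrowth

    # keep only transactions that contain the item of interest,
    # then mine the (much smaller) filtered data
    filtered = df.filter(array_contains("items", "target"))

    model_small = FPGrowth(itemsCol="items", minSupport=0.001,
                           minConfidence=0.1).fit(filtered)

    # note: supports and confidences here are relative to the filtered
    # data, not to the full dataset -- see the discussion that follows
    rules_small = model_small.associationRules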


Re: Spark FP-growth

Aditya Addepalli
Hi Sean,

I understand your approach, but there's a slight problem.

If we generate rules after filtering for our desired consequent, we introduce some bias into the rules. The confidence of a rule on the filtered input can be very high even when it is not on the entire dataset, so we can get biased rules that wrongly depict the patterns in the data. This is why I think having a parameter to specify the consequent would help greatly.

Reducing the support doesn't really work in my case, simply because the consequents I am mining for occur very rarely in the data. Sometimes this can be 1e-4 or 1e-5, so my minSupport has to be below that to capture the rules for those consequents.

Thanks for your reply. Let me know what you think.

Regards,
Aditya Addepalli
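
(To make the bias concrete, a toy calculation; the counts are invented.)

    # full dataset: 1000 transactions
    # 100 contain antecedent A, and 20 of those also contain consequent B
    conf_full = 20 / 100      # confidence(A -> B) = 0.20 on the full data

    # after filtering to transactions that contain B, every remaining
    # row with A also has B, because rows with A but without B were dropped
    conf_filtered = 20 / 20   # confidence(A -> B) = 1.00 on the filtered data

    # the filtered estimate can only be >= the true confidence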





Re: Spark FP-growth

Aditya Addepalli
Hi,

I understand that this is not a priority with everything going on, but if you think generating rules for only a single consequent adds value, I would like to contribute.

Thanks & Regards,
Aditya


Re: Spark FP-growth

Sean Owen
Yes, you can get the correct support this way by accounting for how many rows were filtered out, but not the right confidence, as that depends on counting support in rows without the items of interest.

But computing confidence depends on computing all that support; how would you optimize it even if you knew the consequent you cared about? Maybe there's a way, sure. I don't know the code well, but it wasn't obvious at a glance how to take advantage of it.

I can see how limiting the rule size could help.
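
(A sketch of the support correction being described, continuing the toy df and "target" item from the first sketch; scaling minSupport and dividing freq by the full row count are my assumptions about how to account for the filtered-out rows, not an existing API.)

    from pyspark.sql.functions import array_contains, col
    from pyspark.ml.fpm import FPGrowth

    n_full = df.count()
    filtered = df.filter(array_contains("items", "target"))
    n_filtered = filtered.count()

    # to mine at an effective support threshold s over the FULL data,
    # scale it up, since FPGrowth measures support over `filtered`
    s = 1e-4
    model_f = FPGrowth(itemsCol="items",
                       minSupport=min(1.0, s * n_full / n_filtered),
                       minConfidence=0.0).fit(filtered)

    # itemset frequencies are absolute counts, so dividing by n_full
    # recovers the correct full-dataset support for itemsets containing
    # "target"; confidence, however, cannot be fixed up this way
    supports = model_f.freqItemsets.withColumn("support", col("freq") / n_full)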


Re: Spark FP-growth

Aditya Addepalli
Hi Sean,

1. I was thinking that by specifying the consequent we can (somehow?) skip the confidence calculation for all the other consequents. This would greatly reduce the time taken, as we avoid computation for consequents we don't care about (a rough sketch of this follows after this message).

2. Is limiting rule size even possible? I thought that because of FP-Growth's depth-first nature it might not be possible.

My experience with FP-Growth has largely been in Python, where the API is limited. I will take a look at the Scala source code and get back to you with more concrete answers.

Thanks & Regards,
Aditya
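
(To make point 1 concrete: a rough sketch of what generating rules for a single consequent could look like as post-processing over the frequent itemsets, assuming the fitted model from the first sketch; this is not an existing parameter, and the target item and confidence threshold are made up.)

    from pyspark.sql import functions as F

    target = "target"        # the one consequent we care about
    fi = model.freqItemsets  # columns: items (array), freq (count)

    # frequent itemsets containing the target; the rest of each itemset
    # becomes the candidate antecedent
    with_target = (fi
        .filter(F.array_contains("items", target))
        .withColumn("antecedent", F.array_sort(F.array_remove("items", target)))
        .filter(F.size("antecedent") > 0))

    # look up each antecedent's own frequency to compute confidence
    ante = fi.select(F.array_sort("items").alias("antecedent"),
                     F.col("freq").alias("ante_freq"))

    rules_for_target = (with_target.join(ante, "antecedent")
        .select("antecedent",
                F.lit(target).alias("consequent"),
                (F.col("freq") / F.col("ante_freq")).alias("confidence"))
        .filter("confidence >= 0.5"))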


Re: Spark FP-growth

Sean Owen
The confidence calculation is pretty trivial; the work is in finding the supports needed. Not sure how to optimize that.


Re: Spark FP-growth

Aditya Addepalli
Absolutely. I meant that since the confidence calculation depends on the support calculations, restricting those is what would reduce the time. Thanks for pointing that out.
