[MLlib] Term Frequency in TF-IDF seems incorrect


invkrh
When computing term frequency, we can use either the HashingTF or the CountVectorizer feature extractor.
However, both of them just use the number of times a term appears in a document.
That is not a true frequency. Actually, the count should be divided by the length of the document.

Is this the intended behavior?

--
Hao Ren

Data Engineer @ leboncoin

Paris, France

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

Yanbo Liang-2
Hi Hao,

HashingTF directly applies a hash function (MurmurHash3) to each term to determine its column index. It does not take relative term frequency or the length of the document into account. It works much like sklearn's FeatureHasher. The result is higher speed and lower memory usage, but it does not remember what the input features looked like, so the output cannot be converted back to the original features. Arguably we misnamed this transformer: it only does feature hashing rather than computing a hashed term frequency.
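To make the hashing idea concrete, here is a minimal pure-Python sketch. It is not MLlib's implementation: the names `hashing_tf` and `num_features` are hypothetical, and `zlib.crc32` stands in for MurmurHash3 only because it ships with the standard library. Each term is hashed to a column index and the raw number of occurrences is accumulated per column.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=16):
    """Hash each token to a column index and count occurrences per column."""
    counts = Counter()
    for tok in tokens:
        # Assumption: crc32 is only a stand-in for MurmurHash3.
        index = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[index] += 1
    # Sparse result: {column index: raw count}. The terms themselves are
    # not stored, so the mapping cannot be inverted.
    return dict(counts)


doc = ["spark", "mllib", "spark", "tf", "idf"]
vec = hashing_tf(doc)
```

Two different terms may land in the same column, in which case their counts are simply added together. The values stay raw counts; nothing is divided by the document length.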

CountVectorizer selects the top vocabSize words, ordered by term frequency across the corpus, to build the vocabulary of features. It therefore consumes more memory than HashingTF, but its output can be converted back to the original features.
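The vocabulary-based approach can be sketched in the same way. Again this is a pure-Python illustration, not MLlib code; `fit_vocabulary` and `count_vectorize` are hypothetical helper names, and the alphabetical tie-breaking is an assumption made only to keep the example deterministic.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size):
    """Keep the vocab_size most frequent terms across the whole corpus."""
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    # Sort by descending corpus count, then alphabetically to break ties.
    ranked = sorted(corpus_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:vocab_size]]


def count_vectorize(doc, vocabulary):
    """Count occurrences of each vocabulary term; other terms are dropped."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vec = [0] * len(vocabulary)
    for tok in doc:
        if tok in index:
            vec[index[tok]] += 1
    return vec


docs = [["a", "b", "a"], ["b", "c"], ["a", "d"]]
vocabulary = fit_vocabulary(docs, vocab_size=2)
vec = count_vectorize(docs[0], vocabulary)

# Because the vocabulary is stored, column indices map back to terms.
recovered = {vocabulary[i]: c for i, c in enumerate(vec) if c > 0}
```

The stored vocabulary is what costs the extra memory, and it is also what makes the reverse lookup in `recovered` possible.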

Neither transformer considers the length of each document. If you want term frequency divided by the length of the document, you will need to write your own function on top of the transformers provided by MLlib.
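As a rough idea of what such a function computes, here is a small pure-Python sketch of length-normalized term frequency. The name `relative_term_frequency` is hypothetical and the code works on plain token lists rather than MLlib vectors.

```python
from collections import Counter


def relative_term_frequency(tokens):
    """Return count(term) / len(document) for every term in the document."""
    total = len(tokens)
    if total == 0:
        return {}  # avoid dividing by zero for an empty document
    counts = Counter(tokens)
    return {term: n / total for term, n in counts.items()}


tf = relative_term_frequency(["spark", "mllib", "spark", "tf"])
# tf["spark"] is 0.5: two occurrences in a four-token document
```

Dividing every count by the document length is the same as L1-normalizing the count vector, so one possible route inside Spark (an assumption on my part, not something stated in this thread) is to apply an L1 `Normalizer` to the transformer's output. With CountVectorizer, out-of-vocabulary terms are dropped before counting, so that sum would be the number of in-vocabulary tokens rather than the full document length.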

Thanks
Yanbo

(In reply to Hao Ren's message of 2016-08-01 15:29 GMT-07:00.)


Re: [MLlib] Term Frequency in TF-IDF seems incorrect

Nick Pentreath
Note that both HashingTF and CountVectorizer are usually used for creating TF-IDF normalized vectors. The definition (https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency in TF-IDF is actually the "number of times the term occurs in the document".

So it's perhaps a bit of a misnomer, but the implementation is correct.
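A small pure-Python sketch may help show what changes between the two definitions. It is an illustration under stated assumptions, not MLlib code: `tf_idf` and `normalize_by_length` are hypothetical names, and the smoothed IDF form log((N + 1) / (df + 1)) is assumed here for concreteness.

```python
import math
from collections import Counter


def tf_idf(docs, normalize_by_length=False):
    """TF-IDF with raw counts, optionally divided by document length."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothed IDF: log((N + 1) / (df + 1)).
    idf = {t: math.log((n_docs + 1) / (d + 1)) for t, d in df.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        scale = 1 / len(doc) if normalize_by_length and doc else 1
        vectors.append({t: c * scale * idf[t] for t, c in counts.items()})
    return vectors


docs = [["a", "a", "b"], ["b", "c"]]
raw = tf_idf(docs)
rel = tf_idf(docs, normalize_by_length=True)
```

Within one document, the two variants differ only by the constant factor 1 / len(doc), so the relative weights of the terms in that document are the same either way. The choice matters mainly when comparing unnormalized vectors of documents with very different lengths.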

(In reply to Yanbo Liang's message of Tue, 2 Aug 2016 at 05:44.)