Definition of the TfIdfEncodingPolicy class.
#include <tf_idf_encoding_policy.hpp>
|
enum | TfTypes { BINARY,
RAW_COUNT,
TERM_FREQUENCY,
SUBLINEAR_TF
} |
| Enum class used to identify the type of the term frequency statistics.
|
|
|
| TfIdfEncodingPolicy (const TfTypes tfType=TfTypes::RAW_COUNT, const bool smoothIdf=true) |
| Construct the policy from the term frequency type and the idf smoothing flag.
|
|
void | Reset () |
| Clear the necessary internal variables.
|
|
template<typename MatType > |
void | Encode (MatType &output, const size_t value, const size_t line, const size_t) |
| Perform tf-idf encoding, i.e. write the encoded token to the output in column-major order.
|
|
template<typename ElemType > |
void | Encode (std::vector< std::vector< ElemType >> &output, const size_t value, const size_t line, const size_t) |
| Perform tf-idf encoding, i.e. write the encoded token to the output in row-major order.
|
|
void | PreprocessToken (const size_t line, const size_t, const size_t value) |
|
const std::vector< std::unordered_map< size_t, size_t > > & | TokensFrequences () const |
| Return token frequencies.
|
|
std::vector< std::unordered_map< size_t, size_t > > & | TokensFrequences () |
| Modify token frequencies.
|
|
const std::unordered_map< size_t, size_t > & | NumContainingStrings () const |
| Return, for each token, the number of strings that contain it.
|
|
std::unordered_map< size_t, size_t > & | NumContainingStrings () |
| Modify, for each token, the number of strings that contain it.
|
|
const std::vector< size_t > & | LinesSizes () const |
| Return the line sizes.
|
|
std::vector< size_t > & | LinesSizes () |
| Modify the line sizes.
|
|
TfTypes | TfType () const |
| Return the term frequency type.
|
|
TfTypes & | TfType () |
| Modify the term frequency type.
|
|
bool | SmoothIdf () const |
| Return the idf algorithm type (whether it is smooth or not).
|
|
bool & | SmoothIdf () |
| Modify the idf algorithm type (whether it is smooth or not).
|
|
template<typename Archive > |
void | serialize (Archive &ar, const uint32_t) |
| Serialize the class to the given archive.
|
|
|
template<typename MatType > |
static void | InitMatrix (MatType &output, const size_t datasetSize, const size_t, const size_t dictionarySize) |
| The function initializes the output matrix.
|
|
template<typename ElemType > |
static void | InitMatrix (std::vector< std::vector< ElemType >> &output, const size_t datasetSize, const size_t, const size_t dictionarySize) |
| The function initializes the output matrix.
|
|
TfIdfEncodingPolicy is used as a helper class for StringEncoding.
Tf-idf is a weighting scheme that takes into account the importance of encoded tokens. The tf-idf statistic equals the term frequency (tf) multiplied by the inverse document frequency (idf). The encoder assigns the corresponding tf-idf value to each token. The order in which the tokens are labeled is defined by the dictionary used by the StringEncoding class. The encoder writes data in either column-major or row-major order, depending on the output data type.
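The product of tf and idf can be illustrated with a minimal standalone sketch. It does not use mlpack: `TfIdfSketch` is a hypothetical helper, token ids are assumed to be zero-based dictionary indices, the term frequency is the raw count, the idf is the smooth variant, and `log` is taken to be the natural logarithm. The result is stored row-major, one inner vector per input string.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of mlpack.
// corpus[i] holds the token ids of string i (assumed zero-based and
// smaller than dictionarySize).
inline std::vector<std::vector<double>> TfIdfSketch(
    const std::vector<std::vector<std::size_t>>& corpus,
    const std::size_t dictionarySize)
{
  const std::size_t n = corpus.size();

  // Per-string token counts and per-token number of containing strings.
  std::vector<std::unordered_map<std::size_t, std::size_t>> counts(n);
  std::unordered_map<std::size_t, std::size_t> docFreq;
  for (std::size_t i = 0; i < n; ++i)
  {
    for (const std::size_t token : corpus[i])
    {
      assert(token < dictionarySize);
      // The first occurrence in this string increments the document frequency.
      if (++counts[i][token] == 1)
        ++docFreq[token];
    }
  }

  // Row-major output: one row per string, one entry per dictionary token.
  std::vector<std::vector<double>> out(
      n, std::vector<double>(dictionarySize, 0.0));
  for (std::size_t i = 0; i < n; ++i)
  {
    for (const auto& kv : counts[i])
    {
      const double tf = static_cast<double>(kv.second);  // raw count
      const double df = static_cast<double>(docFreq.at(kv.first));
      // Smooth idf with the natural logarithm (assumption).
      const double idf =
          std::log((1.0 + static_cast<double>(n)) / (1.0 + df)) + 1.0;
      out[i][kv.first] = tf * idf;
    }
  }
  return out;
}
```

For the two strings `{0, 1, 0}` and `{1}`, token 1 appears in every string, so its idf is 1 and its weight equals its raw count; token 0 appears in only one string and is weighted up.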
◆ TfTypes
Enum class used to identify the type of the term frequency statistics.
The present implementation supports the following types:
BINARY | Term frequency equals 1 if the row contains the encoded token and 0 otherwise. |
RAW_COUNT | Term frequency equals the number of times the encoded token occurs in the row. |
TERM_FREQUENCY | Term frequency equals the number of times the encoded token occurs in the row divided by the total number of tokens in the row. |
SUBLINEAR_TF | Term frequency equals \( 1 + \log(rawCount), \) where rawCount is the number of times the encoded token occurs in the row. |
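The four term frequency types can be sketched as a single function. This is an illustration only, not the mlpack implementation: `TfKind` and `TermFrequencySketch` are hypothetical names, `log` is assumed to be the natural logarithm, and a token that does not occur in the row is assumed to have term frequency 0 for every type.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for TfIdfEncodingPolicy::TfTypes.
enum class TfKind { Binary, RawCount, TermFrequency, SublinearTf };

// rawCount: number of times the token occurs in the row.
// rowSize:  total number of tokens in the row.
inline double TermFrequencySketch(const TfKind kind,
                                  const std::size_t rawCount,
                                  const std::size_t rowSize)
{
  assert(rowSize > 0 && rawCount <= rowSize);
  if (rawCount == 0)
    return 0.0;  // token absent from the row (assumed convention)

  const double count = static_cast<double>(rawCount);
  switch (kind)
  {
    case TfKind::Binary:
      return 1.0;
    case TfKind::RawCount:
      return count;
    case TfKind::TermFrequency:
      return count / static_cast<double>(rowSize);
    case TfKind::SublinearTf:
      return 1.0 + std::log(count);  // natural logarithm (assumption)
  }
  return 0.0;  // not reached for valid enum values
}
```

For a token that occurs 3 times in a row of 10 tokens, the sketch yields 1 for the binary type, 3 for the raw count, and 0.3 for the normalized term frequency.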
◆ TfIdfEncodingPolicy()
mlpack::data::TfIdfEncodingPolicy::TfIdfEncodingPolicy |
( |
const TfTypes |
tfType = TfTypes::RAW_COUNT , |
|
|
const bool |
smoothIdf = true |
|
) |
| |
|
inline |
Construct the policy from the term frequency type and the idf smoothing flag.
- Parameters
-
tfType | Type of the term frequency statistics. |
smoothIdf | Whether to use smooth idf. If idf is smooth, it is calculated as \( idf(T) = \log \frac{1 + N}{1 + df(T)} + 1, \) where \( N \) is the total number of strings in the input dataset, \( T \) is the current encoded token, and \( df(T) \) is the number of strings that contain the token. If idf is not smooth, the following rule applies: \( idf(T) = \log \frac{N}{df(T)} + 1. \) |
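The two idf rules can be compared side by side in a short standalone sketch. `IdfSketch` is a hypothetical helper rather than an mlpack function, and `log` is assumed to be the natural logarithm.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper, not part of mlpack.
// numStrings: N, the total number of strings.
// docFreq:    df(T), the number of strings that contain the token.
inline double IdfSketch(const std::size_t numStrings,
                        const std::size_t docFreq,
                        const bool smooth)
{
  assert(docFreq > 0 && docFreq <= numStrings);
  const double n = static_cast<double>(numStrings);
  const double df = static_cast<double>(docFreq);
  // Natural logarithm is an assumption; the formulas only say "log".
  return smooth ? std::log((1.0 + n) / (1.0 + df)) + 1.0
                : std::log(n / df) + 1.0;
}
```

A token that occurs in every string gets an idf of exactly 1 under both rules, so its tf-idf value equals its term frequency. For rarer tokens the smooth variant grows more slowly, because the added ones damp the ratio.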
◆ Encode() [1/2]
template<typename MatType >
void mlpack::data::TfIdfEncodingPolicy::Encode |
( |
MatType & |
output, |
|
|
const size_t |
value, |
|
|
const size_t |
line, |
|
|
const size_t |
|
|
) |
| |
|
inline |
The function performs the tf-idf encoding algorithm, i.e. it writes the encoded token to the output. The encoder writes data in column-major order.
- Template Parameters
-
MatType | The output matrix type. |
- Parameters
-
output | Output matrix to store the encoded results (sp_mat or mat). |
value | The encoded token. |
line | The line number at which the encoding is performed. |
* | (index) The token index in the line. |
◆ Encode() [2/2]
template<typename ElemType >
void mlpack::data::TfIdfEncodingPolicy::Encode |
( |
std::vector< std::vector< ElemType >> & |
output, |
|
|
const size_t |
value, |
|
|
const size_t |
line, |
|
|
const size_t |
|
|
) |
| |
|
inline |
The function performs the tf-idf encoding algorithm, i.e. it writes the encoded token to the output. The encoder writes data in row-major order.
Overloaded function to accept vector<vector<ElemType>> as the output type.
- Template Parameters
-
ElemType | Type of the output values. |
- Parameters
-
output | Output matrix to store the encoded results. |
value | The encoded token. |
line | The line number at which the encoding is performed. |
* | (index) The token index in the line. |
◆ InitMatrix() [1/2]
template<typename MatType >
static void mlpack::data::TfIdfEncodingPolicy::InitMatrix |
( |
MatType & |
output, |
|
|
const size_t |
datasetSize, |
|
|
const size_t |
, |
|
|
const size_t |
dictionarySize |
|
) |
| |
|
inline static
The function initializes the output matrix.
The encoder writes data in column-major order.
- Template Parameters
-
MatType | The output matrix type. |
- Parameters
-
output | Output matrix to store the encoded results (sp_mat or mat). |
datasetSize | The number of strings in the input dataset. |
* | (maxNumTokens) The maximum number of tokens in the strings of the input dataset (not used). |
dictionarySize | The size of the dictionary. |
◆ InitMatrix() [2/2]
template<typename ElemType >
static void mlpack::data::TfIdfEncodingPolicy::InitMatrix |
( |
std::vector< std::vector< ElemType >> & |
output, |
|
|
const size_t |
datasetSize, |
|
|
const size_t |
, |
|
|
const size_t |
dictionarySize |
|
) |
| |
|
inline static
The function initializes the output matrix.
The encoder writes data in row-major order.
Overloaded function to save the result in vector<vector<ElemType>>.
- Template Parameters
-
ElemType | Type of the output values. |
- Parameters
-
output | Output matrix to store the encoded results. |
datasetSize | The number of strings in the input dataset. |
* | (maxNumTokens) The maximum number of tokens in the strings of the input dataset (not used). |
dictionarySize | The size of the dictionary. |
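The difference between the two output layouts can be sketched without mlpack or Armadillo. The names below are hypothetical, and the shapes are an assumption made for illustration: the matrix form is taken to hold one column per input string (dictionarySize rows, datasetSize columns), and the nested-vector form one inner vector per input string.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense column-major buffer: one column per input string.
struct ColumnMajorSketch
{
  std::size_t rows;          // dictionarySize
  std::size_t cols;          // datasetSize
  std::vector<double> data;  // zero-initialized, column after column

  ColumnMajorSketch(const std::size_t dictionarySize,
                    const std::size_t datasetSize) :
      rows(dictionarySize),
      cols(datasetSize),
      data(dictionarySize * datasetSize, 0.0)
  { }

  // Entry for the given token index in the given line (string).
  double& at(const std::size_t tokenIndex, const std::size_t line)
  {
    assert(tokenIndex < rows && line < cols);
    return data[line * rows + tokenIndex];
  }
};

// Hypothetical row-major initializer: one inner vector per input string.
inline std::vector<std::vector<double>> InitRowMajorSketch(
    const std::size_t datasetSize,
    const std::size_t dictionarySize)
{
  return std::vector<std::vector<double>>(
      datasetSize, std::vector<double>(dictionarySize, 0.0));
}
```

In both layouts all weights of one string are contiguous in memory; only the indexing order differs, `(tokenIndex, line)` for the matrix form and `[line][tokenIndex]` for the nested vectors.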