mlpack
mlpack::data::TfIdfEncodingPolicy Class Reference

Definition of the TfIdfEncodingPolicy class.

#include <tf_idf_encoding_policy.hpp>

Public Types

enum  TfTypes { BINARY, RAW_COUNT, TERM_FREQUENCY, SUBLINEAR_TF }
 Enum class used to identify the type of the term frequency statistics.
 

Public Member Functions

 TfIdfEncodingPolicy (const TfTypes tfType=TfTypes::RAW_COUNT, const bool smoothIdf=true)
 Construct this using the term frequency type and the inverse document frequency type.
 
void Reset ()
 Clear the necessary internal variables.
 
template<typename MatType >
void Encode (MatType &output, const size_t value, const size_t line, const size_t)
 The function performs the TfIdf encoding algorithm, i.e. it writes the encoded token to the output in the column-major order.
 
template<typename ElemType >
void Encode (std::vector< std::vector< ElemType >> &output, const size_t value, const size_t line, const size_t)
 The function performs the TfIdf encoding algorithm, i.e. it writes the encoded token to the output in the row-major order.
 
void PreprocessToken (const size_t line, const size_t, const size_t value)
 
const std::vector< std::unordered_map< size_t, size_t > > & TokensFrequences () const
 Return token frequencies.
 
std::vector< std::unordered_map< size_t, size_t > > & TokensFrequences ()
 Modify token frequencies.
 
const std::unordered_map< size_t, size_t > & NumContainingStrings () const
 Get, for each token, the number of strings that contain it.
 
std::unordered_map< size_t, size_t > & NumContainingStrings ()
 Modify, for each token, the number of strings that contain it.
 
const std::vector< size_t > & LinesSizes () const
 Return the lines sizes.
 
std::vector< size_t > & LinesSizes ()
 Modify the lines sizes.
 
TfTypes TfType () const
 Return the term frequency type.
 
TfTypes & TfType ()
 Modify the term frequency type.
 
bool SmoothIdf () const
 Return the idf algorithm type (whether it's smooth or not).
 
bool & SmoothIdf ()
 Modify the idf algorithm type (whether it's smooth or not).
 
template<typename Archive >
void serialize (Archive &ar, const uint32_t)
 Serialize the class to the given archive.
 

Static Public Member Functions

template<typename MatType >
static void InitMatrix (MatType &output, const size_t datasetSize, const size_t, const size_t dictionarySize)
 The function initializes the output matrix.
 
template<typename ElemType >
static void InitMatrix (std::vector< std::vector< ElemType >> &output, const size_t datasetSize, const size_t, const size_t dictionarySize)
 The function initializes the output matrix.
 

Detailed Description

Definition of the TfIdfEncodingPolicy class.

TfIdfEncodingPolicy is used as a helper class for StringEncoding.

Tf-idf is a weighting scheme that takes into account the importance of encoded tokens. The tf-idf statistic equals the term frequency (tf) multiplied by the inverse document frequency (idf). The encoder assigns the corresponding tf-idf value to each token. The order in which the tokens are labeled is defined by the dictionary used by the StringEncoding class. The encoder writes data in either column-major or row-major order, depending on the output data type.

Member Enumeration Documentation

◆ TfTypes

Enum class used to identify the type of the term frequency statistics.

The present implementation supports the following types:
BINARY: Term frequency equals 1 if the row contains the encoded token and 0 otherwise.
RAW_COUNT: Term frequency equals the number of times the encoded token occurs in the row.
TERM_FREQUENCY: Term frequency equals the number of times the encoded token occurs in the row divided by the total number of tokens in the row.
SUBLINEAR_TF: Term frequency equals \( 1 + \log(\mathrm{rawCount}) \), where rawCount is the number of times the encoded token occurs in the row.
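The four term frequency types can be illustrated with a small standalone sketch. It is not the mlpack implementation: the enum `TfType` and the function `TermFrequencyValue` are hypothetical names introduced here, the logarithm is assumed to be natural, and returning 0 for an absent token under SUBLINEAR_TF is an assumption, since the formula above is only defined for rawCount > 0.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for TfIdfEncodingPolicy::TfTypes (illustration only).
enum class TfType { Binary, RawCount, TermFrequency, SublinearTf };

// rawCount: number of times the token occurs in the row.
// rowSize:  total number of tokens in the row.
inline double TermFrequencyValue(const TfType type,
                                 const std::size_t rawCount,
                                 const std::size_t rowSize)
{
  switch (type)
  {
    case TfType::Binary:
      // 1 if the row contains the token, 0 otherwise.
      return rawCount > 0 ? 1.0 : 0.0;
    case TfType::RawCount:
      return static_cast<double>(rawCount);
    case TfType::TermFrequency:
      // Raw count normalized by the row length.
      assert(rowSize > 0);
      return static_cast<double>(rawCount) / static_cast<double>(rowSize);
    case TfType::SublinearTf:
      // 1 + log(rawCount); absent tokens map to 0 (assumption).
      return rawCount > 0 ?
          1.0 + std::log(static_cast<double>(rawCount)) : 0.0;
  }
  return 0.0;
}
```

For a token that occurs 3 times in a 10-token row, the four types give 1, 3, 0.3 and 1 + log 3 respectively.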

Constructor & Destructor Documentation

◆ TfIdfEncodingPolicy()

mlpack::data::TfIdfEncodingPolicy::TfIdfEncodingPolicy ( const TfTypes  tfType = TfTypes::RAW_COUNT,
const bool  smoothIdf = true 
)
inline

Construct this using the term frequency type and the inverse document frequency type.

Parameters
tfType: Type of the term frequency statistics.
smoothIdf: Indicates whether to use smooth idf. If idf is smooth, it is calculated by the following formula: \( idf(T) = \log \frac{1 + N}{1 + df(T)} + 1, \) where \( N \) is the total number of strings in the document, \( T \) is the current encoded token, and \( df(T) \) equals the number of strings which contain the token. If idf isn't smooth, the following rule applies: \( idf(T) = \log \frac{N}{df(T)} + 1. \)
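The two idf rules can be written out as a short standalone sketch. The function name `InverseDocumentFrequency` and its parameter names are hypothetical and not part of mlpack, and the logarithm is assumed to be natural.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// numStrings:    N, the total number of strings.
// numContaining: df(T), the number of strings that contain the token.
inline double InverseDocumentFrequency(const std::size_t numStrings,
                                       const std::size_t numContaining,
                                       const bool smoothIdf)
{
  if (smoothIdf)
  {
    // idf(T) = log((1 + N) / (1 + df(T))) + 1
    return std::log((1.0 + numStrings) / (1.0 + numContaining)) + 1.0;
  }

  // idf(T) = log(N / df(T)) + 1; undefined when df(T) = 0.
  assert(numContaining > 0);
  return std::log(static_cast<double>(numStrings) / numContaining) + 1.0;
}
```

When every string contains the token, so that df(T) = N, both variants give idf(T) = 1. The smooth variant also stays finite when df(T) = 0, which the non-smooth rule does not.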

Member Function Documentation

◆ Encode() [1/2]

template<typename MatType >
void mlpack::data::TfIdfEncodingPolicy::Encode ( MatType &  output,
const size_t  value,
const size_t  line,
const size_t   
)
inline

The function performs the TfIdf encoding algorithm, i.e. it writes the encoded token to the output. The encoder writes data in the column-major order.

Template Parameters
MatType: The output matrix type.
Parameters
output: Output matrix to store the encoded results (sp_mat or mat).
value: The encoded token.
line: The line number at which the encoding is performed.
*(index): The token index in the line.

◆ Encode() [2/2]

template<typename ElemType >
void mlpack::data::TfIdfEncodingPolicy::Encode ( std::vector< std::vector< ElemType >> &  output,
const size_t  value,
const size_t  line,
const size_t   
)
inline

The function performs the TfIdf encoding algorithm, i.e. it writes the encoded token to the output. The encoder writes data in the row-major order.

Overloaded function to accept vector<vector<ElemType>> as the output type.

Template Parameters
ElemType: Type of the output values.
Parameters
output: Output matrix to store the encoded results.
value: The encoded token.
line: The line number at which the encoding is performed.
*(index): The token index in the line.
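To make the row-major variant concrete, the following standalone sketch encodes a small, already tokenized dataset with raw-count tf and smooth idf. It is a simplified illustration under stated assumptions, not the mlpack code: the function `EncodeRawCountSmooth` is hypothetical, token labels are assumed to start at 1 and to map to column `label - 1`, and the logarithm is assumed to be natural. The two bookkeeping containers play the role that TokensFrequences() and NumContainingStrings() suggest.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

// rows: one vector of 1-based token labels per input string (assumption).
// Returns a datasetSize x dictionarySize table, one row per input string.
inline std::vector<std::vector<double>> EncodeRawCountSmooth(
    const std::vector<std::vector<std::size_t>>& rows,
    const std::size_t dictionarySize)
{
  const std::size_t n = rows.size();

  // Per-row token counts and, per token, the number of rows containing it.
  std::vector<std::unordered_map<std::size_t, std::size_t>> tokenCounts(n);
  std::unordered_map<std::size_t, std::size_t> numContaining;

  // First pass: gather the statistics.
  for (std::size_t line = 0; line < n; ++line)
  {
    for (const std::size_t token : rows[line])
    {
      assert(token >= 1 && token <= dictionarySize);
      // Count the row once per token, on the token's first occurrence.
      if (tokenCounts[line][token]++ == 0)
        ++numContaining[token];
    }
  }

  // Second pass: write tf * idf into the row-major output.
  std::vector<std::vector<double>> output(
      n, std::vector<double>(dictionarySize, 0.0));
  for (std::size_t line = 0; line < n; ++line)
  {
    for (const auto& entry : tokenCounts[line])
    {
      const double tf = static_cast<double>(entry.second);
      const double idf =
          std::log((1.0 + n) / (1.0 + numContaining[entry.first])) + 1.0;
      output[line][entry.first - 1] = tf * idf;
    }
  }
  return output;
}
```

With the rows {1, 1, 2} and {2} and a dictionary of size 2, token 2 appears in both rows, so its idf is 1 and its entries equal its raw counts. Token 1 appears twice in the first row only and receives 2 (log 1.5 + 1) there.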

◆ InitMatrix() [1/2]

template<typename MatType >
static void mlpack::data::TfIdfEncodingPolicy::InitMatrix ( MatType &  output,
const size_t  datasetSize,
const size_t  ,
const size_t  dictionarySize 
)
inline static

The function initializes the output matrix.

The encoder writes data in the column-major order.

Template Parameters
MatType: The output matrix type.
Parameters
output: Output matrix to store the encoded results (sp_mat or mat).
datasetSize: The number of strings in the input dataset.
*(maxNumTokens): The maximum number of tokens in the strings of the input dataset (not used).
dictionarySize: The size of the dictionary.

◆ InitMatrix() [2/2]

template<typename ElemType >
static void mlpack::data::TfIdfEncodingPolicy::InitMatrix ( std::vector< std::vector< ElemType >> &  output,
const size_t  datasetSize,
const size_t  ,
const size_t  dictionarySize 
)
inline static

The function initializes the output matrix.

The encoder writes data in the row-major order.

Overloaded function to save the result in vector<vector<ElemType>>.

Template Parameters
ElemType: Type of the output values.
Parameters
output: Output matrix to store the encoded results.
datasetSize: The number of strings in the input dataset.
*(maxNumTokens): The maximum number of tokens in the strings of the input dataset (not used).
dictionarySize: The size of the dictionary.

The documentation for this class was generated from the following file: