Definition of the TfIdfEncodingPolicy class.

#include <tf_idf_encoding_policy.hpp>

Public Types
enum	TfTypes { BINARY, RAW_COUNT, TERM_FREQUENCY, SUBLINEAR_TF }
	Enum class used to identify the type of the term frequency statistics.

Public Member Functions
	TfIdfEncodingPolicy (const TfTypes tfType=TfTypes::RAW_COUNT, const bool smoothIdf=true)
	Construct this using the term frequency type and the inverse document frequency type.

void	Reset ()
	Clear the necessary internal variables.

template<typename MatType >
void	Encode (MatType &output, const size_t value, const size_t line, const size_t)
	The function performs the TfIdf encoding algorithm, i.e. it writes the encoded token to the output.

template<typename ElemType >
void	Encode (std::vector< std::vector< ElemType >> &output, const size_t value, const size_t line, const size_t)
	The function performs the TfIdf encoding algorithm, i.e. it writes the encoded token to the output.

void	PreprocessToken (const size_t line, const size_t, const size_t value)

const std::vector< std::unordered_map< size_t, size_t > > &	TokensFrequences () const
	Return token frequencies.

std::vector< std::unordered_map< size_t, size_t > > &	TokensFrequences ()
	Modify token frequencies.

const std::unordered_map< size_t, size_t > &	NumContainingStrings () const
	Get, for each token, the number of strings that contain it.

std::unordered_map< size_t, size_t > &	NumContainingStrings ()
	Modify, for each token, the number of strings that contain it.

const std::vector< size_t > &	LinesSizes () const
	Return the lines sizes.

std::vector< size_t > &	LinesSizes ()
	Modify the lines sizes.

TfTypes	TfType () const
	Return the term frequency type.

TfTypes &	TfType ()
	Modify the term frequency type.

bool	SmoothIdf () const
	Determine the idf algorithm type (whether it's smooth or not).

bool &	SmoothIdf ()
	Modify the idf algorithm type (whether it's smooth or not).

template<typename Archive >
void	serialize (Archive &ar, const uint32_t)
	Serialize the class to the given archive.

Static Public Member Functions
template<typename MatType >
static void	InitMatrix (MatType &output, const size_t datasetSize, const size_t, const size_t dictionarySize)
	The function initializes the output matrix.

template<typename ElemType >
static void	InitMatrix (std::vector< std::vector< ElemType >> &output, const size_t datasetSize, const size_t, const size_t dictionarySize)
	The function initializes the output matrix.

Detailed Description

Definition of the TfIdfEncodingPolicy class.

TfIdfEncodingPolicy is used as a helper class for StringEncoding.

Tf-idf is a weighting scheme that takes into account the importance of encoded tokens. The tf-idf statistic is equal to the term frequency (tf) multiplied by the inverse document frequency (idf). The encoder assigns the corresponding tf-idf value to each token. The order in which the tokens are labeled is defined by the dictionary used by the StringEncoding class. The encoder writes data in either column-major or row-major order, depending on the output data type.
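The description above can be illustrated with a minimal, self-contained sketch that does not use mlpack. The function name `TfIdfRows`, the 1-based token labels, and the choice of raw counts with smooth idf are assumptions made for this example only; the sketch is not the class's implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of mlpack): encodes rows of already-labeled
// tokens into row-major tf-idf rows. Assumes labels run from 1 to
// dictionarySize, tf is the raw count, and idf is the smooth variant.
inline std::vector<std::vector<double>> TfIdfRows(
    const std::vector<std::vector<std::size_t>>& corpus,
    const std::size_t dictionarySize)
{
  const std::size_t n = corpus.size();
  // Per-row token counts and per-token number of containing rows.
  std::vector<std::unordered_map<std::size_t, std::size_t>> counts(n);
  std::unordered_map<std::size_t, std::size_t> containing;

  // Pass 1: gather the statistics.
  for (std::size_t line = 0; line < n; ++line)
    for (const std::size_t token : corpus[line])
      if (++counts[line][token] == 1)
        ++containing[token];

  // Pass 2: write tf * idf, one row per string, one column per token.
  std::vector<std::vector<double>> output(
      n, std::vector<double>(dictionarySize, 0.0));
  for (std::size_t line = 0; line < n; ++line)
  {
    for (const auto& entry : counts[line])
    {
      const double df = static_cast<double>(containing[entry.first]);
      const double idf = std::log((1.0 + n) / (1.0 + df)) + 1.0;
      output[line][entry.first - 1] =
          static_cast<double>(entry.second) * idf;
    }
  }
  return output;
}
```

For the two rows `{1, 2, 1}` and `{2, 3}`, token 2 appears in every row, so its idf is 1 and its value equals its raw count; tokens 1 and 3 each appear in a single row and are weighted up.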

Member Enumeration Documentation

◆ TfTypes

enum mlpack::data::TfIdfEncodingPolicy::TfTypes

strong

Enum class used to identify the type of the term frequency statistics.

The present implementation supports the following types:

BINARY	Term frequency equals 1 if the row contains the encoded token and 0 otherwise.
RAW_COUNT	Term frequency equals the number of times the encoded token occurs in the row.
TERM_FREQUENCY	Term frequency equals the number of times the encoded token occurs in the row divided by the total number of tokens in the row.
SUBLINEAR_TF	Term frequency equals \( 1 + \log(rawCount) \), where rawCount is the number of times the encoded token occurs in the row.
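The four term frequency types can be sketched as a single function. The enum `TfKind` and the function `TermFrequencyOf` are hypothetical names for this illustration, the logarithm is assumed to be natural, and returning 0 for an absent token under the sublinear rule is an assumption rather than documented behavior.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for TfIdfEncodingPolicy::TfTypes (illustration only).
enum class TfKind { Binary, RawCount, TermFrequency, SublinearTf };

// rawCount: occurrences of the token in the row.
// rowSize: total number of tokens in the row.
inline double TermFrequencyOf(const TfKind kind,
                              const std::size_t rawCount,
                              const std::size_t rowSize)
{
  switch (kind)
  {
    case TfKind::Binary:
      return rawCount > 0 ? 1.0 : 0.0;
    case TfKind::RawCount:
      return static_cast<double>(rawCount);
    case TfKind::TermFrequency:
      // Guard against an empty row to avoid dividing by zero.
      return rowSize > 0 ?
          static_cast<double>(rawCount) / static_cast<double>(rowSize) : 0.0;
    case TfKind::SublinearTf:
      // Assumption: an absent token contributes 0 instead of 1 + log(0).
      return rawCount > 0 ?
          1.0 + std::log(static_cast<double>(rawCount)) : 0.0;
  }
  return 0.0;
}
```

The sublinear variant grows slowly with the count, which dampens the influence of tokens that repeat many times within one row.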

Constructor & Destructor Documentation

◆ TfIdfEncodingPolicy()

mlpack::data::TfIdfEncodingPolicy::TfIdfEncodingPolicy	(	const TfTypes	tfType = `TfTypes::RAW_COUNT`,
		const bool	smoothIdf = `true`
	)

inline

Construct this using the term frequency type and the inverse document frequency type.

Parameters

tfType Type of the term frequency statistics.

smoothIdf Used to indicate whether to use smooth idf or not. If idf is smooth it's calculated by the following formula: \( idf(T) = \log \frac{1 + N}{1 + df(T)} + 1, \) where \( N \) is the total number of strings in the document, \( T \) is the current encoded token, \( df(T) \) equals the number of strings which contain the token. If idf isn't smooth then the following rule applies: \( idf(T) = \log \frac{N}{df(T)} + 1. \)
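The two idf formulas above can be written as a short standalone function. The name `InverseDocumentFrequency` is hypothetical, and the natural logarithm is assumed since the formulas do not state a base.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of mlpack).
// N: total number of strings; df: number of strings containing the token.
// In the non-smooth case the caller must ensure df > 0.
inline double InverseDocumentFrequency(const std::size_t N,
                                       const std::size_t df,
                                       const bool smoothIdf)
{
  const double n = static_cast<double>(N);
  const double d = static_cast<double>(df);
  if (smoothIdf)
    return std::log((1.0 + n) / (1.0 + d)) + 1.0;
  return std::log(n / d) + 1.0;
}
```

With N = 3 and df = 1, the smooth form gives log(2) + 1 and the non-smooth form gives log(3) + 1. Both forms give 1 when every string contains the token, so a token is never weighted to zero.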

Member Function Documentation

◆ Encode() [1/2]

template<typename MatType >

void mlpack::data::TfIdfEncodingPolicy::Encode	(	MatType &	output,
		const size_t	value,
		const size_t	line,
		const size_t
	)

inline

The function performs the TfIdf encoding algorithm, i.e. it writes the encoded token to the output. The encoder writes data in the column-major order.

Template Parameters

MatType The output matrix type.

Parameters

output	Output matrix to store the encoded results (sp_mat or mat).
value	The encoded token.
line	The line number at which the encoding is performed.
*	(index) The token index in the line.

◆ Encode() [2/2]

template<typename ElemType >

void mlpack::data::TfIdfEncodingPolicy::Encode	(	std::vector< std::vector< ElemType >> &	output,
		const size_t	value,
		const size_t	line,
		const size_t
	)

inline

The function performs the TfIdf encoding algorithm, i.e. it writes the encoded token to the output. The encoder writes data in the row-major order.

Overloaded function to accept vector<vector<ElemType>> as the output type.

Template Parameters

ElemType Type of the output values.

Parameters

output	Output matrix to store the encoded results.
value	The encoded token.
line	The line number at which the encoding is performed.
*	(index) The token index in the line.

◆ InitMatrix() [1/2]

template<typename MatType >

static void mlpack::data::TfIdfEncodingPolicy::InitMatrix	(	MatType &	output,
		const size_t	datasetSize,
		const size_t	,
		const size_t	dictionarySize
	)

inline static

The function initializes the output matrix.

The encoder writes data in the column-major order.

Template Parameters

MatType The output matrix type.

Parameters

output	Output matrix to store the encoded results (sp_mat or mat).
datasetSize	The number of strings in the input dataset.
*	(maxNumTokens) The maximum number of tokens in the strings of the input dataset (not used).
dictionarySize	The size of the dictionary.

◆ InitMatrix() [2/2]

template<typename ElemType >

static void mlpack::data::TfIdfEncodingPolicy::InitMatrix	(	std::vector< std::vector< ElemType >> &	output,
		const size_t	datasetSize,
		const size_t	,
		const size_t	dictionarySize
	)

inline static

The function initializes the output matrix.

The encoder writes data in the row-major order.

Overloaded function to save the result in vector<vector<ElemType>>.

Template Parameters

ElemType Type of the output values.

Parameters

output	Output matrix to store the encoded results.
datasetSize	The number of strings in the input dataset.
*	(maxNumTokens) The maximum number of tokens in the strings of the input dataset (not used).
dictionarySize	The size of the dictionary.
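The difference between the two output layouts can be sketched without mlpack or Armadillo. The helpers below are hypothetical; the shapes (one row per string for the nested vector, one column per string for the matrix) are inferred from the row-major and column-major descriptions on this page and are not a statement about what InitMatrix does internally.

```cpp
#include <cstddef>
#include <vector>

// Row-major container: datasetSize rows, each holding dictionarySize values.
template<typename ElemType>
void InitRowMajor(std::vector<std::vector<ElemType>>& output,
                  const std::size_t datasetSize,
                  const std::size_t dictionarySize)
{
  output.assign(datasetSize,
                std::vector<ElemType>(dictionarySize, ElemType(0)));
}

// Flat column-major buffer standing in for a dense matrix with
// dictionarySize rows and datasetSize columns (one column per string).
template<typename ElemType>
void InitColumnMajor(std::vector<ElemType>& output,
                     const std::size_t datasetSize,
                     const std::size_t dictionarySize)
{
  output.assign(dictionarySize * datasetSize, ElemType(0));
}

// Position of (token row, string column) inside the column-major buffer.
inline std::size_t ColumnMajorIndex(const std::size_t tokenIndex,
                                    const std::size_t line,
                                    const std::size_t dictionarySize)
{
  return line * dictionarySize + tokenIndex;
}
```

In both layouts the values belonging to one input string are contiguous in memory; only the way the string and token indices map to rows and columns differs.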

The documentation for this class was generated from the following file:

src/mlpack/core/data/string_encoding_policies/tf_idf_encoding_policy.hpp
