mlpack
mlpack::data::BagOfWordsEncodingPolicy Class Reference

Definition of the BagOfWordsEncodingPolicy class.

#include <bag_of_words_encoding_policy.hpp>

Public Member Functions

template<typename Archive >
void serialize (Archive &, const uint32_t)
 Serialize the class to the given archive.
 

Static Public Member Functions

static void Reset ()
 Clear the necessary internal variables.
 
template<typename MatType >
static void InitMatrix (MatType &output, const size_t datasetSize, const size_t, const size_t dictionarySize)
 The function initializes the output matrix.
 
template<typename ElemType >
static void InitMatrix (std::vector< std::vector< ElemType >> &output, const size_t datasetSize, const size_t, const size_t dictionarySize)
 The function initializes the output matrix.
 
template<typename MatType >
static void Encode (MatType &output, const size_t value, const size_t line, const size_t)
 The function performs the bag of words encoding, i.e. it writes the encoded token to the output.
 
template<typename ElemType >
static void Encode (std::vector< std::vector< ElemType >> &output, const size_t value, const size_t line, const size_t)
 The function performs the bag of words encoding, i.e. it writes the encoded token to the output.
 
static void PreprocessToken (size_t, size_t, size_t)
 The function is not used by the bag of words encoding policy.
 

Detailed Description

Definition of the BagOfWordsEncodingPolicy class.

BagOfWordsEncodingPolicy is a helper class for StringEncoding. The encoder maps each dataset item to a vector of size N, where N is the total number of unique tokens. The i-th coordinate of the output vector is the number of times the i-th token occurs in the corresponding dataset item. The order in which tokens are labeled is defined by the dictionary used by the StringEncoding class. The encoder writes data in either column-major or row-major order, depending on the output data type.
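
The mapping described above can be illustrated with a short standalone sketch. It does not use mlpack: the helper names `BuildDictionary` and `BagOfWords`, the whitespace tokenization, and the 0-based token labels are all assumptions made for illustration, not the library's API.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: give each distinct token a 0-based label in order of
// first appearance. In mlpack this role is played by the dictionary that
// StringEncoding maintains.
std::unordered_map<std::string, std::size_t> BuildDictionary(
    const std::vector<std::string>& dataset)
{
  std::unordered_map<std::string, std::size_t> dictionary;
  for (const std::string& item : dataset)
  {
    std::istringstream stream(item);
    std::string token;
    while (stream >> token)
      dictionary.emplace(token, dictionary.size());  // No-op if already present.
  }
  return dictionary;
}

// Hypothetical helper: one row per dataset item, one column per dictionary
// entry; each cell counts how often that token occurs in that item.
std::vector<std::vector<std::size_t>> BagOfWords(
    const std::vector<std::string>& dataset,
    const std::unordered_map<std::string, std::size_t>& dictionary)
{
  std::vector<std::vector<std::size_t>> output(
      dataset.size(), std::vector<std::size_t>(dictionary.size(), 0));
  for (std::size_t line = 0; line < dataset.size(); ++line)
  {
    std::istringstream stream(dataset[line]);
    std::string token;
    while (stream >> token)
    {
      const auto it = dictionary.find(token);
      assert(it != dictionary.end());  // Every token must be in the dictionary.
      ++output[line][it->second];
    }
  }
  return output;
}
```

For the dataset `{"a b a", "b c"}` the dictionary has three entries, and the vector for the first item holds a count of 2 in the coordinate assigned to `a`.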

Member Function Documentation

◆ Encode() [1/2]

template<typename MatType >
static void mlpack::data::BagOfWordsEncodingPolicy::Encode ( MatType &  output,
const size_t  value,
const size_t  line,
const size_t   
)
inline static

The function performs the bag of words encoding, i.e. it writes the encoded token to the output. The encoder writes data in column-major order.

Template Parameters
MatType  The output matrix type.
Parameters
output  Output matrix to store the encoded results (sp_mat or mat).
value  The encoded token.
line  The line number at which the encoding is performed.
(index)  The token index in the line (unnamed in the signature, so not used).
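
A minimal sketch of the column-major case follows. It assumes that token labels start at 1, so that label `value` maps to row `value - 1`, and it uses a hypothetical `ColMajorMatrix` struct as a stand-in for an Armadillo matrix. Neither `ColMajorMatrix` nor `EncodeColumnMajor` is part of mlpack. Each dataset item occupies one column and each dictionary entry one row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dense column-major matrix such as arma::mat.
struct ColMajorMatrix
{
  std::size_t rows;
  std::size_t cols;
  std::vector<double> data;  // Element (r, c) lives at data[r + c * rows].

  // All elements start at zero.
  ColMajorMatrix(const std::size_t nRows, const std::size_t nCols) :
      rows(nRows), cols(nCols), data(nRows * nCols, 0.0) { }

  double& operator()(const std::size_t r, const std::size_t c)
  {
    assert(r < rows && c < cols);
    return data[r + c * rows];
  }
};

// Sketch of the encoding step: increment the count of token `value` in the
// column that belongs to `line`. The last parameter (the token index) is
// ignored, mirroring the unnamed parameter in the documented signature.
// Assumption: labels start at 1.
void EncodeColumnMajor(ColMajorMatrix& output,
                       const std::size_t value,
                       const std::size_t line,
                       const std::size_t /* index */)
{
  assert(value >= 1);
  output(value - 1, line) += 1;
}
```

Because the token index is ignored, encoding the same token twice in one line simply raises the same cell by two; word order within a line is lost, which is the defining property of a bag of words.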

◆ Encode() [2/2]

template<typename ElemType >
static void mlpack::data::BagOfWordsEncodingPolicy::Encode ( std::vector< std::vector< ElemType >> &  output,
const size_t  value,
const size_t  line,
const size_t   
)
inline static

The function performs the bag of words encoding, i.e. it writes the encoded token to the output. The encoder writes data in row-major order.

Overloaded function to accept vector<vector<ElemType>> as the output type.

Template Parameters
ElemType  Type of the output values.
Parameters
output  Output matrix to store the encoded results.
value  The encoded token.
line  The line number at which the encoding is performed.
(index)  The token index in the line (unnamed in the signature, so not used).

◆ InitMatrix() [1/2]

template<typename MatType >
static void mlpack::data::BagOfWordsEncodingPolicy::InitMatrix ( MatType &  output,
const size_t  datasetSize,
const size_t  ,
const size_t  dictionarySize 
)
inline static

The function initializes the output matrix.

The encoder writes data in column-major order.

Template Parameters
MatType  The output matrix type.
Parameters
output  Output matrix to store the encoded results (sp_mat or mat).
datasetSize  The number of strings in the input dataset.
(maxNumTokens)  The maximum number of tokens in the strings of the input dataset (not used).
dictionarySize  The size of the dictionary.

◆ InitMatrix() [2/2]

template<typename ElemType >
static void mlpack::data::BagOfWordsEncodingPolicy::InitMatrix ( std::vector< std::vector< ElemType >> &  output,
const size_t  datasetSize,
const size_t  ,
const size_t  dictionarySize 
)
inline static

The function initializes the output matrix.

The encoder writes data in row-major order.

Overloaded function to save the result in vector<vector<ElemType>>.

Template Parameters
ElemType  Type of the output values.
Parameters
output  Output matrix to store the encoded results.
datasetSize  The number of strings in the input dataset.
(maxNumTokens)  The maximum number of tokens in the strings of the input dataset (not used).
dictionarySize  The size of the dictionary.
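
The row-major overloads could fit together roughly as sketched below. This is an illustration under stated assumptions, not mlpack's implementation: `InitRowMajor` and `EncodeRowMajor` are hypothetical names, and token labels are assumed to start at 1. Each dataset item becomes one row of length `dictionarySize`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical counterpart of the row-major InitMatrix overload: allocate one
// zero-filled row per dataset item. The maximum token count is irrelevant for
// bag of words, so that parameter is left unnamed.
template<typename ElemType>
void InitRowMajor(std::vector<std::vector<ElemType>>& output,
                  const std::size_t datasetSize,
                  const std::size_t /* maxNumTokens */,
                  const std::size_t dictionarySize)
{
  output.assign(datasetSize,
                std::vector<ElemType>(dictionarySize, ElemType(0)));
}

// Hypothetical counterpart of the row-major Encode overload: increment the
// count of token `value` in row `line`. Assumption: labels start at 1.
template<typename ElemType>
void EncodeRowMajor(std::vector<std::vector<ElemType>>& output,
                    const std::size_t value,
                    const std::size_t line,
                    const std::size_t /* index */)
{
  assert(line < output.size());
  assert(value >= 1 && value <= output[line].size());
  output[line][value - 1] += 1;
}
```

A caller would run the initialization once per dataset and then the encoding step once per token, passing the token's dictionary label and the index of the line it came from.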

◆ PreprocessToken()

static void mlpack::data::BagOfWordsEncodingPolicy::PreprocessToken ( size_t  ,
size_t  ,
size_t   
)
inline static

The function is not used by the bag of words encoding policy.

Parameters
(line)  The line number at which the encoding is performed.
(index)  The token sequence number in the line.
(value)  The encoded token.

The documentation for this class was generated from the following file: