mlpack
mlpack::data::StringEncoding< EncodingPolicyType, DictionaryType > Class Template Reference

The class translates a set of strings into numbers using various encoding algorithms.

#include <string_encoding.hpp>

Public Member Functions

template<typename ... ArgTypes>
 StringEncoding (ArgTypes &&... args)
 Pass the given arguments to the policy constructor and create the StringEncoding object using the policy.
 
 StringEncoding (EncodingPolicyType encodingPolicy)
 Construct the class from the given encoding policy.
 
 StringEncoding (StringEncoding &)
 A variant of the copy constructor for non-constant objects.
 
 StringEncoding (const StringEncoding &)
 Default copy-constructor.
 
StringEncoding & operator= (const StringEncoding &)=default
 Default copy assignment operator.
 
 StringEncoding (StringEncoding &&)
 Default move-constructor.
 
StringEncoding & operator= (StringEncoding &&)=default
 Default move assignment operator.
 
template<typename TokenizerType >
void CreateMap (const std::string &input, const TokenizerType &tokenizer)
 Initialize the dictionary using the given corpus.
 
void Clear ()
 Clear the dictionary.
 
template<typename OutputType , typename TokenizerType >
void Encode (const std::vector< std::string > &input, OutputType &output, const TokenizerType &tokenizer)
 Encode the given text and write the result to the given output.
 
const DictionaryType & Dictionary () const
 Return the dictionary.
 
DictionaryType & Dictionary ()
 Modify the dictionary.
 
const EncodingPolicyType & EncodingPolicy () const
 Return the encoding policy object.
 
EncodingPolicyType & EncodingPolicy ()
 Modify the encoding policy object.
 
template<typename Archive >
void serialize (Archive &ar, const uint32_t)
 Serialize the class to the given archive.
 
template<typename MatType , typename TokenizerType , typename PolicyType >
void EncodeHelper (const std::vector< std::string > &input, MatType &output, const TokenizerType &tokenizer, PolicyType &policy)
 

Detailed Description

template<typename EncodingPolicyType, typename DictionaryType>
class mlpack::data::StringEncoding< EncodingPolicyType, DictionaryType >

The class translates a set of strings into numbers using various encoding algorithms.

The encoder writes data in either column-major or row-major order, depending on the output data type.

Template Parameters
EncodingPolicyType  Type of the encoding algorithm itself.
DictionaryType  Type of the dictionary.

Constructor & Destructor Documentation

◆ StringEncoding()

template<typename EncodingPolicyType , typename DictionaryType >
mlpack::data::StringEncoding< EncodingPolicyType, DictionaryType >::StringEncoding ( EncodingPolicyType  encodingPolicy)

Construct the class from the given encoding policy.

Parameters
encodingPolicy  The given encoding policy.

Member Function Documentation

◆ CreateMap()

template<typename EncodingPolicyType , typename DictionaryType >
template<typename TokenizerType >
void mlpack::data::StringEncoding< EncodingPolicyType, DictionaryType >::CreateMap ( const std::string &  input,
const TokenizerType &  tokenizer 
)

Initialize the dictionary using the given corpus.

Template Parameters
TokenizerType  Type of the tokenizer.
Parameters
input  Corpus of text to encode.
tokenizer  The tokenizer object.

The tokenization algorithm has to be an object with two public methods:

  1. operator() which accepts a reference to boost::string_view, extracts the next token from the given view, removes the prefix containing the extracted token and returns the token;
  2. IsTokenEmpty() that accepts a token and returns true if the given token is empty.
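
A minimal sketch of a tokenizer that meets these two requirements is shown below. It is illustrative only: SpaceTokenizer and Tokenize are hypothetical names that are not part of mlpack, and std::string_view is used in place of boost::string_view (an assumption) so that the example compiles with the C++17 standard library alone.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical tokenizer that splits text on a single delimiter character.
// std::string_view stands in for boost::string_view (an assumption).
class SpaceTokenizer
{
 public:
  explicit SpaceTokenizer(const char delimiter = ' ') : delimiter(delimiter) {}

  // Extract the next token and remove the consumed prefix from `view`.
  std::string_view operator()(std::string_view& view) const
  {
    // Skip leading delimiters so an empty token only signals end of input.
    const std::size_t start = view.find_first_not_of(delimiter);
    if (start == std::string_view::npos)
    {
      view = std::string_view();
      return std::string_view();
    }
    view.remove_prefix(start);

    // The token runs up to the next delimiter or to the end of the view.
    const std::size_t end = view.find(delimiter);
    const std::string_view token = view.substr(0, end);
    view.remove_prefix(end == std::string_view::npos ? view.size() : end + 1);
    return token;
  }

  // Return true if the given token is empty.
  static bool IsTokenEmpty(const std::string_view token)
  {
    return token.empty();
  }

 private:
  char delimiter;
};

// Hypothetical helper: drain a string through the tokenizer, the way a
// caller of the two methods above might.
std::vector<std::string> Tokenize(std::string_view text,
                                  const SpaceTokenizer& tokenizer)
{
  std::vector<std::string> tokens;
  std::string_view token = tokenizer(text);
  while (!tokenizer.IsTokenEmpty(token))
  {
    tokens.emplace_back(token);
    token = tokenizer(text);
  }
  return tokens;
}
```

Calling Tokenize("hello big world", SpaceTokenizer(' ')) yields the three tokens "hello", "big" and "world". Returning an empty token once the view is exhausted gives the caller a simple stopping condition through IsTokenEmpty().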

◆ Encode()

template<typename EncodingPolicyType , typename DictionaryType >
template<typename OutputType , typename TokenizerType >
void mlpack::data::StringEncoding< EncodingPolicyType, DictionaryType >::Encode ( const std::vector< std::string > &  input,
OutputType &  output,
const TokenizerType &  tokenizer 
)

Encode the given text and write the result to the given output.

The encoder writes data in column-major or row-major order, depending on the output data type.

If the output type is either arma::mat or arma::sp_mat, the function writes the result in column-major order. If the output type is a 2D std::vector, the function writes the result in row-major order.
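
The difference between the two layouts can be sketched without mlpack or Armadillo. The example below is a toy illustration, not mlpack's implementation: EncodeRowMajor, ColumnMajorMatrix and ToColumnMajor are hypothetical names, tokens are split on whitespace, labels are assumed to start at 1, and the column-major form is assumed to store one document per column with zero padding.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Toy dictionary encoding into a row-major 2D std::vector: one row per
// document. Labels start at 1 (an assumption made for this sketch).
std::vector<std::vector<std::size_t>> EncodeRowMajor(
    const std::vector<std::string>& input)
{
  std::unordered_map<std::string, std::size_t> dictionary;
  std::vector<std::vector<std::size_t>> output;
  for (const std::string& document : input)
  {
    std::istringstream stream(document);
    std::vector<std::size_t> row;
    std::string token;
    while (stream >> token)
    {
      // try_emplace keeps the existing label if the token was seen before.
      const auto it =
          dictionary.try_emplace(token, dictionary.size() + 1).first;
      row.push_back(it->second);
    }
    output.push_back(row);
  }
  return output;
}

// Flat column-major buffer: element (r, c) lives at index c * rows + r.
struct ColumnMajorMatrix
{
  std::size_t rows;
  std::size_t cols;
  std::vector<double> data;

  double At(const std::size_t r, const std::size_t c) const
  {
    return data[c * rows + r];
  }
};

// Store each document as one column, padding short documents with zeros
// (an assumption made for this sketch).
ColumnMajorMatrix ToColumnMajor(
    const std::vector<std::vector<std::size_t>>& encoded)
{
  std::size_t maxLength = 0;
  for (const auto& row : encoded)
    maxLength = std::max(maxLength, row.size());

  ColumnMajorMatrix matrix{maxLength, encoded.size(),
      std::vector<double>(maxLength * encoded.size(), 0.0)};
  for (std::size_t c = 0; c < encoded.size(); ++c)
    for (std::size_t r = 0; r < encoded[c].size(); ++r)
      matrix.data[c * matrix.rows + r] = static_cast<double>(encoded[c][r]);
  return matrix;
}
```

In the row-major form, the tokens of document i are read as encoded[i][j]. In the column-major form, the same value is read as matrix.At(j, i), so each document occupies a contiguous block of the flat buffer.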

Template Parameters
OutputType  Type of the output container. The function supports the following types: arma::mat, arma::sp_mat, std::vector<std::vector<>>.
TokenizerType  Type of the tokenizer.
Parameters
input  Corpus of text to encode.
output  Output container to store the result.
tokenizer  The tokenizer object.

The tokenization algorithm has to be an object with two public methods:

  1. operator() which accepts a reference to boost::string_view, extracts the next token from the given view, removes the prefix containing the extracted token and returns the token;
  2. IsTokenEmpty() that accepts a token and returns true if the given token is empty.
