docfun - Apply function to words in documents - MATLAB

Apply function to words in documents

Syntax

newDocuments = docfun(func,documents)
newDocuments = docfun(func,documents1,...,documentsN)

Description

newDocuments = docfun(func,documents) calls the function specified by the function handle func and passes each element of documents to it as a string vector of words.

If func accepts exactly one input argument, then the words of newDocuments(i) are the output of func(string(documents(i))).
If func accepts two input arguments, then the words of newDocuments(i) are the output of func(string(documents(i)),details), where details contains the corresponding token details output by tokenDetails.
If func changes the number of words in the document, then docfun removes the token details from that document.

docfun does not perform the calls to the function func in a specific order.
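
As a rough illustration of the two-input form, the sketch below appends each token's type to the word itself. It assumes that the details table has a categorical Type variable, as in the output of tokenDetails. The sample text and the "/" separator are made up for this illustration.

```matlab
% Sample documents (illustrative text only).
documents = tokenizedDocument([
    "An example of a short sentence."
    "A second short sentence."]);

% Two-input form: docfun passes the words of one document and the
% matching token details table. details.Type is assumed to be the
% categorical token type reported by tokenDetails.
% reshape keeps the tags in the same orientation as words.
func = @(words,details) words + "/" + reshape(string(details.Type),size(words));

% The function returns one output word per input word.
newDocuments = docfun(func,documents);
```

Because this function returns exactly one word per input word, it does not change the number of words in any document.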

newDocuments = docfun(func,documents1,...,documentsN) calls the function specified by the function handle func and passes elements of documents1,…,documentsN as string vectors of words, where N is the number of inputs to the function func. The words of newDocuments(i) are the output of func(string(documents1(i)),...,string(documentsN(i))).

Each of documents1,…,documentsN must be the same size.
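
One possible way to guard against mismatched inputs is to compare the array sizes before calling docfun. In this sketch the sample text, the tag tokens, and the "/" separator are illustrative assumptions. The elementwise concatenation also relies on each pair of corresponding documents having the same number of words.

```matlab
documents1 = tokenizedDocument([
    "a short sentence"
    "another one"]);
documents2 = tokenizedDocument([
    "_det _adj _noun"
    "_det _noun"]);

% docfun expects all document arrays to be the same size.
assert(isequal(size(documents1),size(documents2)), ...
    'Document arrays must be the same size.');

% Two document arrays are passed, so func receives two string vectors.
func = @(words,tags) words + "/" + tags;
newDocuments = docfun(func,documents1,documents2);
```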

Examples

Apply reverse to each word in a document array.

documents = tokenizedDocument([ ...
    "an example of a short sentence"
    "a second short sentence"])

documents = 2×1 tokenizedDocument:

6 tokens: an example of a short sentence
4 tokens: a second short sentence

func = @reverse;
newDocuments = docfun(func,documents)

newDocuments = 2×1 tokenizedDocument:

6 tokens: na elpmaxe fo a trohs ecnetnes
4 tokens: a dnoces trohs ecnetnes

Tag words by combining the words from one document array with another, using the string function plus.

Create the first tokenizedDocument array. Erase the punctuation and convert the text to lowercase.

str = [ ...
    "An example of a short sentence."
    "A second short sentence."];
str = erasePunctuation(str);
str = lower(str);
documents1 = tokenizedDocument(str)

documents1 = 2×1 tokenizedDocument:

6 tokens: an example of a short sentence
4 tokens: a second short sentence

Create the second tokenizedDocument array. The documents have the same number of words as the corresponding documents in documents1. The words of documents2 are POS tags for the corresponding words.

documents2 = tokenizedDocument([ ...
    "_det _noun _prep _det _adj _noun"
    "_det _adj _adj _noun"])

documents2 = 2×1 tokenizedDocument:

6 tokens: _det _noun _prep _det _adj _noun
4 tokens: _det _adj _adj _noun

func = @plus;
newDocuments = docfun(func,documents1,documents2)

newDocuments = 2×1 tokenizedDocument:

6 tokens: an_det example_noun of_prep a_det short_adj sentence_noun
4 tokens: a_det second_adj short_adj sentence_noun

The output is not the same as calling plus on the documents directly, which instead concatenates the tokens of the corresponding documents.

plus(documents1,documents2)

ans = 2×1 tokenizedDocument:

12 tokens: an example of a short sentence _det _noun _prep _det _adj _noun
 8 tokens: a second short sentence _det _adj _adj _noun

Input Arguments

func - Function to apply

Function handle that accepts N string arrays as inputs and outputs a string array. func must accept string(documents1(i)),...,string(documentsN(i)) as input.

Function handle to apply to words in documents. The function must have one of the following syntaxes:

newWords = func(words), where words is a string array of the words of a single document.
newWords = func(words,details), where words is a string array of the words of a single document, and details is the corresponding table of token details given by tokenDetails.
newWords = func(words1,...,wordsN), where words1,...,wordsN are string arrays of words.

Example: @reverse
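
An anonymous function works as well. The following sketch, using made-up sample text and an arbitrary length threshold, keeps only the words longer than two characters. Because it changes the number of words, docfun removes the token details from the affected documents.

```matlab
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);

% One-input form: words is the string vector of a single document.
% Keep only the words longer than two characters.
func = @(words) words(strlength(words) > 2);
newDocuments = docfun(func,documents);
```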

Data Types: function_handle

documents - Input documents

Input documents, specified as a tokenizedDocument array.

Output Arguments

newDocuments - Output documents

Output documents, returned as a tokenizedDocument array.

Version History

Introduced in R2017b