text_tokens.Rd
Segment text into tokens, each of which is an instance of a particular ‘type’.
text_tokens(x, filter = NULL, ...)
text_ntoken(x, filter = NULL, ...)
Argument | Description
---|---
x | object to be tokenized.
filter | if non-NULL, a text filter to use instead of the default text filter for x.
... | additional properties to set on the text filter.
text_tokens splits texts into token sequences. Each token is an instance of a particular type. This operation proceeds in a series of stages, controlled by the filter argument:
First, we segment the text into words and spaces using the boundaries defined by Unicode Standard Annex #29, Section 4, with special handling for @mentions, #hashtags, and URLs.
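For instance, a minimal sketch of the boundary stage (the handle, tag, and URL below are made up for illustration):

# @mentions, #hashtags, and URLs receive special handling at the
# boundary stage instead of being split at their punctuation
text_tokens("follow @nlp_fan at https://example.com #rstats")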
Next, we normalize the words by applying the character mappings indicated by the map_case, map_quote, and remove_ignorable properties. We replace sequences of spaces with a single space (U+0020). At the end of the second stage, we have segmented the text into a sequence of normalized words and spaces, in Unicode composed normal form (NFC).
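As a rough illustration of this stage under the default filter (where map_case and map_quote are TRUE):

# the curly apostrophe should get mapped to a plain quote and the
# letters case-folded, yielding tokens like "don't" and "shout"
text_tokens("Don’t SHOUT")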
In the third stage, if the combine property is non-NULL, we scan the word sequence from left to right, searching for the longest possible match in the combine list. If a match exists, we replace the word sequence with a single token for that term; otherwise, we leave the word as-is. We drop spaces at this point, unless they are part of a multi-word term. See the ‘Combining words’ section below for more details.
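A sketch of the longest-match rule (both phrases are hypothetical entries in the combine list):

f <- text_filter(combine = c("new york", "new york city"))
text_tokens("She moved to New York City", f)
# the scan prefers the longest match, so "new york city" should win
# over "new york" and come back as a single multi-word token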
Next, if the stemmer property is non-NULL, we apply the indicated stemming algorithm to each word that does not match one of the elements of the stem_except character vector. Terms that stem to NA get dropped from the sequence.
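A sketch of this stage with the Snowball English stemmer and one exempted (already-normalized) term:

f <- text_filter(stemmer = "english", stem_except = "running")
text_tokens("Mary is running and jumping", f)
# "running" matches stem_except and is left alone, while "jumping"
# should pass through the stemmer (e.g. to "jump")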
After stemming, we categorize each remaining token as "letter", "number", "punct", or "symbol" according to the first character in the word. For words that start with an extender like the underscore (_), we use the first non-extender character to classify them.
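To see the categories at work, a sketch using the drop_number property described in the next step (the words are invented):

# "3rd" starts with a digit, so it lands in the "number" category;
# "_tag" starts with an extender, so its first non-extender ("t")
# should put it in the "letter" category and keep it from being dropped
text_tokens("the 3rd _tag", text_filter(drop_number = TRUE))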
If any of drop_letter, drop_number, drop_punct, or drop_symbol are TRUE, then we drop the tokens in the corresponding categories. We also drop any terms that match an element of the drop character vector. We can add exceptions to the drop rules by specifying a non-NULL value for the drop_except property: drop_except is a character vector, and we restore tokens that match elements of this vector to their values prior to dropping.
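A sketch of the category-based drops together with a drop_except rescue (this assumes "not" appears in the stopwords_en list):

f <- text_filter(drop_punct = TRUE, drop_symbol = TRUE,
                 drop = stopwords_en, drop_except = "not")
text_tokens("It is not $5, right?", f)
# punctuation and symbol tokens are dropped, words matching
# stopwords_en are dropped, but "not" is restored via drop_except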
Finally, we replace sequences of white-space in the terms with the specified connector, which defaults to a low line character (_, U+005F).
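The connector is itself a text filter property; a sketch swapping the default low line for a hyphen:

f <- text_filter(combine = "new york", connector = "-")
text_tokens("I love New York", f)
# the combined term should come back as "new-york" rather than "new_york"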
Multi-word terms specified by the combine property can be given in raw form, prior to normalization. Terms specified by the stem_except, drop, and drop_except properties need to be normalized and stemmed (if stemmer is non-NULL). Thus, for example, if map_case = TRUE, then a token filter with combine = "Mx." produces the same results as a token filter with combine = "mx.". However, drop = "Mx." behaves differently from drop = "mx.".
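To illustrate the asymmetry (with the default map_case = TRUE):

# combine terms get normalized before matching, so case does not matter here:
text_tokens("Mx. Jones", text_filter(combine = "Mx."))   # same as combine = "mx."

# drop terms are matched against already-normalized tokens, so only the
# lower-case form can match:
text_tokens("Mx. Jones", text_filter(drop = "jones"))   # drops the "jones" token
text_tokens("Mx. Jones", text_filter(drop = "Jones"))   # matches nothing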
The combine property of a text_filter enables transformations that combine two or more words into a single token. For example, specifying combine = "new york" will cause consecutive instances of the words new and york to get replaced by a single token, new york.
text_tokens returns a list of the same length as x, with the same names. Each list item is a character vector with the tokens for the corresponding element of x.
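For instance (the input names here are arbitrary):

x <- c(first = "One fish two fish", second = "Red fish, blue fish")
text_tokens(x)   # a list of length 2 whose elements are named "first" and "second"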
text_ntoken returns a numeric vector of the same length as x, with each element giving the number of tokens in the corresponding text.
text_tokens("The quick ('brown') fox can't jump 32.3 feet, right?")#> [[1]] #> [1] "the" "quick" "(" "'" "brown" "'" ")" "fox" "can't" #> [10] "jump" "32.3" "feet" "," "right" "?" #># count tokens: text_ntoken("The quick ('brown') fox can't jump 32.3 feet, right?")#> [1] 15# don't change case or quotes: f <- text_filter(map_case = FALSE, map_quote = FALSE) text_tokens("The quick ('brown') fox can't jump 32.3 feet, right?", f)#> [[1]] #> [1] "The" "quick" "(" "'" "brown" "'" ")" "fox" "can't" #> [10] "jump" "32.3" "feet" "," "right" "?" #># drop common function words ('stop' words): text_tokens("Able was I ere I saw Elba.", text_filter(drop = stopwords_en))#> [[1]] #> [1] "able" "ere" "saw" "elba" "." #># drop numbers, with some exceptions:" text_tokens("0, 1, 2, 3, 4, 5", text_filter(drop_number = TRUE, drop_except = c("0", "2", "4")))#> [[1]] #> [1] "0" "," "," "2" "," "," "4" "," #>#> [[1]] #> [1] "mari" "is" "run" #># ...except for certain words text_tokens("Mary is running", text_filter(stemmer = "english", stem_except = "mary"))#> [[1]] #> [1] "mary" "is" "run" #># default tokenization text_tokens("Ms. Jones")#> [[1]] #> [1] "ms" "." "jones" #>#> [[1]] #> [1] "ms." "jones" #># add custom combinations text_tokens("Ms. Jones is from New York City, New York.", text_filter(combine = c(abbreviations_en, "new york", "new york city")))#> [[1]] #> [1] "ms." "jones" "is" "from" #> [5] "new_york_city" "," "new_york" "." #>