Corpus

Corpus is an R text processing package with full support for international text (Unicode). It includes functions for reading data from newline-delimited JSON files, for normalizing and tokenizing text, for searching for term occurrences, and for computing term occurrence frequencies (including n-grams).
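As a rough illustration of how these pieces fit together, the sketch below writes two made-up records to a temporary newline-delimited JSON file, reads them back, tokenizes the text, and counts terms. The file contents and the field names `id` and `text` are hypothetical; the calls assume the `read_ndjson`, `text_tokens`, and `term_stats` functions behave as described in the package documentation.

```r
library(corpus)

# Hypothetical sample data: one JSON record per line.
path <- tempfile(fileext = ".json")
writeLines(c('{"id": 1, "text": "The quick brown fox."}',
             '{"id": 2, "text": "Jumps over the lazy dog!"}'), path)

# Read the file, asking for the "text" field to be treated as text.
data <- read_ndjson(path, text = "text")

# Normalize and tokenize each record using the default text filter.
tokens <- text_tokens(data$text)

# Tabulate how often each term occurs across the records.
stats <- term_stats(data$text)
```

Here `tokens` holds one character vector per record, and `stats` is a table of terms with their counts.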

Corpus does not provide any language models, part-of-speech tagging, topic models, or word vectors, but it can be used in conjunction with other packages that provide these features.

Corpus was built and maintained from 2017 to 2020 by Patrick Perry. It is currently maintained by Leslie Huang.

Installation

Stable version

Corpus is available on CRAN. To install the latest released version, run the following command in R:
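For a package hosted on CRAN, the standard installation call is:

```r
# Download and install the released version of corpus from CRAN.
install.packages("corpus")
```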

Development version

To install the latest development version, run the following:

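# Clone the repository, including its git submodule, into a temporary directory.
tmp <- tempfile()
system2("git", c("clone", "--recursive",
                 shQuote("https://github.com/leslie-huang/r-corpus.git"), shQuote(tmp)))
# Build and install the package from the local clone.
devtools::install(tmp)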

Note that corpus uses a git submodule, so you cannot use devtools::install_github.

Usage

Here’s how to get the most common non-punctuation, non-stop-word terms in The Federalist Papers:
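A minimal sketch of such a call, assuming the `federalist` data set and the `stopwords_en` stop word list are both provided by the package:

```r
library(corpus)

# Count terms in The Federalist Papers, dropping punctuation tokens
# and any token that appears in the English stop word list.
term_stats(federalist, drop_punct = TRUE, drop = stopwords_en)
```

The result is a table with one row per term, along with the number of times the term occurs and the number of texts that contain it.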

Here’s how to find all instances of tokens that stem to “power”:

> text_locate(federalist, "power", stemmer = "en")
   text             before              instance              after             
1  1    …ay hazard a diminution of the   power   , emolument,\nand consequence …
2  1    …s. So numerous indeed and so\n powerful  are the causes which serve to…
3  1    … of a temper fond of despotic   power    and\nhostile to the principle…
4  2    …der to vest it with requisite   powers  . It is well worthy\nof consid…
5  2    …head of each the same kind of   powers   which they are advised to\npl…
6  2    …\nwithout having been awed by   power   , or influenced by any passion…
7  3    …ment, vested with sufficient\n  powers   for all general and national …
8  3    … of nations towards all these   powers  , and to me it\nappears eviden…
9  3    …he wrong themselves, nor want   power    or\ninclination to prevent or…
10 3    …it will also be more in their   power    to\naccommodate and settle th…
11 3    …cy of little consideration or   power   .\n\nIn the year 1685, the sta…
12 3    …ain, or Britain, or any other  POWERFUL  nation?\n\nPUBLIUS.\n         
13 4    … our advancement in union, in   power    and\nconsequence by land and …
14 4    …t can apply the resources and   power    of the whole to the\ndefense …
15 4    …\ncombining and directing the   powers   and resources of the whole, w…
16 5    …h tend to beget and\nincrease   power    in one part and to impede its…
17 6    … description are the love of\n  power    or the desire of pre-eminence…
18 6    …nd dominion--the jealousy of\n  power   , or the desire of equality an…
19 6    …rest of this enterprising and  powerful  monarch, he\nprecipitated Eng…
20 6    …rprising a passion as that of   power    or glory? Have there not\nbee…
⋮  (912 rows total)

Here’s how to get a term frequency matrix of all 1-, 2-, 3-, 4-, and 5-grams.
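One way to write this, assuming the same `federalist` data set as above and wrapping the call in `system.time` to measure how long it takes:

```r
library(corpus)

# Build a sparse texts-by-terms matrix whose columns cover every
# n-gram of length one through five, and time the computation.
system.time(x <- term_matrix(federalist, ngrams = 1:5))

# Inspect the dimensions: one row per text, one column per n-gram.
dim(x)
```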

This computation uses only a single CPU, yet it still completes in under three seconds.

For a more complete introduction to the package, see the getting started guide and the other articles at corpustext.com.

Contributing

The project maintainer welcomes contributions in the form of feature requests, bug reports, comments, unit tests, vignettes, or other code. If you’d like to contribute, either fork the repository and submit a pull request or file an issue.

This project is released with a Contributor Code of Conduct, and if you choose to contribute, you must adhere to its terms.

Acknowledgments

The API and feature set for corpus draw inspiration from quanteda, developed by Ken Benoit and collaborators; stringr, developed by Hadley Wickham; and tidytext, developed by Julia Silge and David Robinson.