The corpus package does not define a special corpus object, but it does define a new data type, corpus_text
, for storing a collection of texts. You can create values of this type using the as_corpus_text()
or as_corpus_frame()
function.
Take, for example, the following sample text, created as an R character vector.
# raw text for the first two paragraphs of _The Tale of Peter Rabbit_,
# by Beatrix Potter
raw <- c(
para1 =
paste("Once upon a time there were four little Rabbits,",
"and their names were: Flopsy, Mopsy, Cottontail, and Peter.",
"They lived with their Mother in a sandbank,",
"underneath the root of a very big fir tree.",
sep = "\n"),
para2 =
paste("'Now, my dears,' said old Mrs. Rabbit one morning,",
"'you may go into the fields or down the lane,",
"but don't go into Mr. McGregor's garden --",
"your Father had an accident there;",
"he was put in a pie by Mrs. McGregor.'",
sep = "\n"))
We can convert the text to a corpus_text
object using the as_corpus_text()
function:
text <- as_corpus_text(raw)
Alternatively, we can convert it to a data frame with a column named "text"
of type corpus_text
using the as_corpus_frame()
function:
data <- as_corpus_frame(raw)
Both as_corpus_frame()
and as_corpus_text()
are generic; they work on character vectors, data frames, tm Corpus
objects, and quanteda corpus
objects. Whenever you call a corpus function expecting text, that function calls as_corpus_text()
on its input. We will see some examples of converting tm and quanteda objects below.
The corpus_text
type behaves like an R character vector in most respects, but using this type enables some new features.
Each corpus_text
object has a text_filter
property that can be get or set using the text_filter()
generic function. This property allows us to specify the preprocessing decisions that define the text normalization, token, and sentence boundaries.
# get the text filter
print(text_filter(text))
Text filter with the following options:
map_case: TRUE
map_quote: TRUE
remove_ignorable: TRUE
combine: NULL
stemmer: NULL
stem_dropped: FALSE
stem_except: NULL
drop_letter: FALSE
drop_number: FALSE
drop_punct: FALSE
drop_symbol: FALSE
drop: NULL
drop_except: NULL
connector: _
sent_crlf: FALSE
sent_suppress: chr [1:155] "A." "A.D." "a.m." "A.M." "A.S." "AA." "AB." "Abs." "AD." ...
# set a new text filter
text_filter(text) <- text_filter(drop_punct = TRUE, drop = stopwords_en)
# update a text filter property
text_filter(text)$drop_punct <- TRUE
# switch to the default text filter
text_filter(text) <- NULL
All corpus_text
objects contain valid UTF-8 data. It is impossible to create a corpus_text
object with invalid UTF-8:
# the input is encoded in Latin-1, but declared as UTF-8
x <- "fa\xE7ile"
Encoding(x) <- "UTF-8"
as_corpus_text(x) ## fails
Error in as_corpus_text.character(x): argument entry 1 is incorrectly marked as "UTF-8": leading byte 0xE7 followed by invalid continuation byte (0x69) at position 4
as_corpus_text(iconv(x, "Latin1", "UTF-8")) ## succeeds
[1] "façile"
All corpus_text
objects have unique names. When you coerce an object with duplicate names to corpus_text
, the duplicates get renamed with a warning:
x <- as_corpus_text(c(a = "hello", b = "world", a = "repeated name")) # fails
Warning in as_corpus_text.character(c(a = "hello", b = "world", a = "repeated name")): renaming
entries with duplicate names
names(x)
[1] "a" "b" "a.1"
Setting repeated names fails:
Error in `names<-.corpus_text`(`*tmp*`, value = c("a", "b", "a")): duplicate 'names' are not allowed
You can set the names
to NULL
if you don’t want them:
NULL
corpus_text
objects can manage collections of texts that do not fit into memory, through the memory-map interface provided by read_ndjson()
. When dealing with such objects, internally the corpus_text
object just stores an offset into the file containing the text, not the text itself.
# store some sample data in a newline-delimited JSON file
tmp <- tempfile()
writeLines(c('{"text": "A sample text", "metadata": 7 }',
'{"text": "Another text."}',
'{"text": "A third text.", "metadata": -100}'),
tmp)
# memory-map the "text" field
data <- read_ndjson(tmp, text = "text", mmap = TRUE)
# display the internal representation of the resulting 'corpus_text' object
unclass(data$text)
$handle
<pointer: 0x7fba65dd4ba0>
$sources
$sources[[1]]
JSON data set with 3 rows of type text
$table
source row start stop
1 1 1 1 13
2 1 2 1 13
3 1 3 1 13
$names
NULL
$filter
NULL
Above, the handle
component of the text object is a memory-mapped view of the file. The table
component stores the row numbers for the data, along with the start and stop indices of the text.
corpus_text
objects can reference substrings of other R character objects, without copying the parent object. This allows functions like text_split()
and text_locate()
to quickly split a text into smaller segments without allocating new character strings.
# segment text into chunks of at most 10 tokens
(chunks <- text_split(text, "tokens", 10))
parent index text
1 para1 1 Once upon a time there were four little Rabbits
2 para1 2 ,\nand their names were: Flopsy, Mopsy
3 para1 3 , Cottontail, and Peter.\nThey lived with
4 para1 4 their Mother in a sandbank,\nunderneath the
5 para1 5 root of a very big fir tree.
6 para2 1 'Now, my dears,' said old Mrs
7 para2 2 . Rabbit one morning,\n'you may go into
8 para2 3 the fields or down the lane,\nbut don't
9 para2 4 go into Mr. McGregor's garden --\nyour
10 para2 5 Father had an accident there;\nhe was put
11 para2 6 in a pie by Mrs. McGregor.'
# segmenting does not allocate new objects for the segments
unclass(chunks$text) # inspect text internals
$handle
<pointer: 0x7fba65fa0fb0>
$sources
$sources[[1]]
para1
"Once upon a time there were four little Rabbits,\nand their names were: Flopsy, Mopsy, Cottontail, and Peter.\nThey lived with their Mother in a sandbank,\nunderneath the root of a very big fir tree."
para2
"'Now, my dears,' said old Mrs. Rabbit one morning,\n'you may go into the fields or down the lane,\nbut don't go into Mr. McGregor's garden --\nyour Father had an accident there;\nhe was put in a pie by Mrs. McGregor.'"
$table
source row start stop
1 1 1 1 47
2 1 1 48 84
3 1 1 85 125
4 1 1 126 168
5 1 1 169 196
6 1 2 1 29
7 1 2 30 68
8 1 2 69 107
9 1 2 108 145
10 1 2 146 186
11 1 2 187 213
$names
NULL
$filter
NULL
The call to text_split()
did not allocate any new character objects; the result uses the same sources, but with different start and stop indices.
Printing corpus_text
objects truncates the output at the end of the line instead of printing the entire contents.
# print a character object
print(raw)
para1
"Once upon a time there were four little Rabbits,\nand their names were: Flopsy, Mopsy, Cottontail, and Peter.\nThey lived with their Mother in a sandbank,\nunderneath the root of a very big fir tree."
para2
"'Now, my dears,' said old Mrs. Rabbit one morning,\n'you may go into the fields or down the lane,\nbut don't go into Mr. McGregor's garden --\nyour Father had an accident there;\nhe was put in a pie by Mrs. McGregor.'"
# print a corpus_text object
print(text)
para1
"Once upon a time there were four little Rabbits,\nand their names were: Flopsy, Mopsy, Cotto…"
para2
"'Now, my dears,' said old Mrs. Rabbit one morning,\n'you may go into the fields or down the …"
If you’d like to print the entire contents, you can convert the text to character and then print:
# print entire contents
print(as.character(text))
[1] "Once upon a time there were four little Rabbits,\nand their names were: Flopsy, Mopsy, Cottontail, and Peter.\nThey lived with their Mother in a sandbank,\nunderneath the root of a very big fir tree."
[2] "'Now, my dears,' said old Mrs. Rabbit one morning,\n'you may go into the fields or down the lane,\nbut don't go into Mr. McGregor's garden --\nyour Father had an accident there;\nhe was put in a pie by Mrs. McGregor.'"
The one disadvantage of using a corpus_text
object instead of an R character vector is that many functions from other packages that expect R character vectors will not work on corpus_text
. To get around this, you can call as.character
on a corpus_text
object to convert it to an R character vector.
# some R methods coerce their inputs to character; 'message' is one example
message(as_corpus_text("hello world"))
hello world
# others do not;
cat(as_corpus_text("hello world"), "\n") # fails
Error in cat(as_corpus_text("hello world"), "\n"): argument 1 (type 'list') cannot be handled by 'cat'
# we must call 'as.character' on the inputs
cat(as.character(as_corpus_text("hello world")), "\n")
hello world
All corpus functions expecting text work on quanteda corpus objects:
uk2010immigCorpus <-
quanteda::corpus(quanteda::data_char_ukimmig2010,
docvars = data.frame(party = names(quanteda::data_char_ukimmig2010)),
metacorpus = list(notes = "Immigration-related sections of 2010 UK party manifestos"))
# search for terms in a quanteda corpus that stem to "immigr":
text_locate(uk2010immigCorpus, "immigr", stemmer = "en")
text before instance after
1 BNP IMMIGRATION : AN UNPARALLELED CRISIS WHICH ONLY …
2 BNP …THE BNP CAN SOLVE. \n\n- At current immigration and birth rates, indigenous British…
3 BNP … will include a halt to all further immigration , the deportation of all illegal imm…
4 BNP …ion, the deportation of all illegal immigrants , a halt to the "asylum" swindle and…
5 BNP …mes in Britain, regardless of their immigration status.\n\n- The BNP will review al…
6 BNP …mission that they orchestrated mass immigration to change forcibly Britain's demogr…
7 BNP …ce is in grave peril, threatened by immigration and multiculturalism. In the absenc…
8 BNP …Statistics (ONS), legal Third World immigrants made up 14.7 percent (7.5 million) …
9 BNP …rths to second and third generation immigrant mothers. Figures released by the ON…
10 BNP …hen these figures are added in, the immigrant birth rate is estimated to be aroun…
11 BNP …ales.\n\n- The majority of the 'new immigrants ' are not from Eastern Europe, as is…
12 BNP …imed. According to the ONS figures, immigrants from Eastern Europe had 25,000 chil…
13 BNP …er exponentially, and given current immigration and birth rates, will utterly overw…
14 BNP …s.\n\nThe Disastrous Effect of Mass Immigration on British Society\n\nThere is no e…
15 BNP …here that it is 'racist' to discuss immigration and population density.\n\nThe word…
16 BNP …between 300,000-500,000 Third World immigrants each year is an issue that all thre…
17 BNP …d so on.\n\nA Case Study: Crime and Immigration \n\nImmigration has had a dramatic e…
18 BNP …ase Study: Crime and Immigration\n\n Immigration has had a dramatic effect on Britai…
19 BNP …onately involving ethnic groups.\n\n Immigration Has Harmed British Jobs\n\nThe conc…
20 BNP …s working than in 1997.\n\nOverall, immigrants have taken up more than 1.64 millio…
⋮ (87 rows total)
You can convert a quanteda corpus to a data frame using as_corpus_frame()
, or extract its text using as_corpus_text()
:
# convert to corpus_text:
print(as_corpus_text(uk2010immigCorpus))
BNP
"IMMIGRATION: AN UNPARALLELED CRISIS WHICH ONLY THE BNP CAN SOLVE. \n\n- At current immigrati…"
Coalition
"IMMIGRATION. \n\nThe Government believes that immigration has enriched our culture and stren…"
Conservative
"Attract the brightest and best to our country.\n\nImmigration has enriched our nation over t…"
Greens
"Immigration.\n\nMigration is a fact of life. People have always moved from one country to a…"
Labour
"Crime and immigration\n\nThe challenge for Britain\n\nWe will control immigration with our n…"
LibDem
"firm but fair immigration system\n\nBritain has always been an open, welcoming country, and …"
PC
"As a welcoming nation, Plaid Cymru recognises the invaluable contribution that migrants have…"
SNP
"And we will argue for Scotland to take responsibility for immigration so that we can develop…"
UKIP
"Immigration & Asylum.\n\nAs a member of the EU, Britain has lost control of her borders. Som…"
# convert to data frame:
print(as_corpus_frame(uk2010immigCorpus))
party text
BNP BNP IMMIGRATION: AN UNPARALLELED CRISIS WHICH ONLY THE BNP CAN SOLVE. \n…
Coalition Coalition IMMIGRATION. \n\nThe Government believes that immigration has enrich…
Conservative Conservative Attract the brightest and best to our country.\n\nImmigration has en…
Greens Greens Immigration.\n\nMigration is a fact of life. People have always mov…
Labour Labour Crime and immigration\n\nThe challenge for Britain\n\nWe will contro…
LibDem LibDem firm but fair immigration system\n\nBritain has always been an open,…
PC PC As a welcoming nation, Plaid Cymru recognises the invaluable contrib…
SNP SNP And we will argue for Scotland to take responsibility for immigratio…
UKIP UKIP Immigration & Asylum.\n\nAs a member of the EU, Britain has lost con…
All corpus functions expecting text also work on tm Corpus objects:
data("crude", package = "tm")
# get the top terms in a tm corpus:
term_stats(crude, drop_punct = TRUE, drop = stopwords_en)
term count support
1 oil 85 20
2 said 73 20
3 reuter 20 20
4 prices 48 15
5 last 24 12
6 one 17 12
7 dlrs 23 11
8 opec 44 10
9 barrel 15 10
10 mln 31 9
11 crude 21 9
12 new 14 9
13 pct 14 9
14 price 13 9
15 also 9 9
16 petroleum 9 9
17 market 20 8
18 barrels 11 8
19 industry 10 8
20 world 10 8
⋮ (1042 rows total)
The as_corpus_frame()
and as_corpus_text()
functions also work on tm Corpus objects:
# convert to corpus_text
print(as_corpus_text(crude))
127
"Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oi…"
144
"OPEC may be forced to meet before a\nscheduled June session to readdress its production cutt…"
191
"Texaco Canada said it lowered the\ncontract price it will pay for crude oil 64 Canadian cts …"
194
"Marathon Petroleum Co said it reduced\nthe contract price it will pay for all grades of crud…"
211
"Houston Oil Trust said that independent\npetroleum engineers completed an annual study that …"
236
"Kuwait\"s Oil Minister, in remarks\npublished today, said there were no plans for an emergen…"
237
"Indonesia appears to be nearing a\npolitical crossroads over measures to deregulate its prot…"
242
"Saudi riyal interbank deposits were\nsteady at yesterday's higher levels in a quiet market.…"
246
"The Gulf oil state of Qatar, recovering\nslightly from last year's decline in world oil pric…"
248
"Saudi Arabian Oil Minister Hisham Nazer\nreiterated the kingdom's commitment to last Decembe…"
273
"Saudi crude oil output last month fell\nto an average of 3.5 mln barrels per day (bpd) from …"
349
"Deputy oil ministers from six Gulf\nArab states will meet in Bahrain today to discuss coordi…"
352
"Saudi Arabian Oil Minister Hisham Nazer\nreiterated the kingdom's commitment to last Decembe…"
353
"Kuwait's oil minister said in a newspaper\ninterview that there were no plans for an emergen…"
368
"The port of Philadelphia was closed\nwhen a Cypriot oil tanker, Seapride II, ran aground aft…"
489
"A study group said the United States\nshould increase its strategic petroleum reserve to one…"
502
"A study group said the United States\nshould increase its strategic petroleum reserve to one…"
543
"Unocal Corp's Union Oil Co said it\nlowered its posted prices for crude oil one to 1.50 dlrs…"
704
"The New York Mercantile Exchange set\nApril one for the debut of a new procedure in the ener…"
708
"Argentine crude oil production was\ndown 10.8 pct in January 1987 to 12.32 mln barrels, from…"
# convert to data frame
print(as_corpus_frame(crude))
text
127 Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude …
144 OPEC may be forced to meet before a\nscheduled June session to readdress its production cu…
191 Texaco Canada said it lowered the\ncontract price it will pay for crude oil 64 Canadian ct…
194 Marathon Petroleum Co said it reduced\nthe contract price it will pay for all grades of cr…
211 Houston Oil Trust said that independent\npetroleum engineers completed an annual study tha…
236 Kuwait"s Oil Minister, in remarks\npublished today, said there were no plans for an emerge…
237 Indonesia appears to be nearing a\npolitical crossroads over measures to deregulate its pr…
242 Saudi riyal interbank deposits were\nsteady at yesterday's higher levels in a quiet market…
246 The Gulf oil state of Qatar, recovering\nslightly from last year's decline in world oil pr…
248 Saudi Arabian Oil Minister Hisham Nazer\nreiterated the kingdom's commitment to last Decem…
273 Saudi crude oil output last month fell\nto an average of 3.5 mln barrels per day (bpd) fro…
349 Deputy oil ministers from six Gulf\nArab states will meet in Bahrain today to discuss coor…
352 Saudi Arabian Oil Minister Hisham Nazer\nreiterated the kingdom's commitment to last Decem…
353 Kuwait's oil minister said in a newspaper\ninterview that there were no plans for an emerg…
368 The port of Philadelphia was closed\nwhen a Cypriot oil tanker, Seapride II, ran aground a…
489 A study group said the United States\nshould increase its strategic petroleum reserve to o…
502 A study group said the United States\nshould increase its strategic petroleum reserve to o…
543 Unocal Corp's Union Oil Co said it\nlowered its posted prices for crude oil one to 1.50 dl…
704 The New York Mercantile Exchange set\nApril one for the debut of a new procedure in the en…
708 Argentine crude oil production was\ndown 10.8 pct in January 1987 to 12.32 mln barrels, fr…