Select vocabulary using cross-validated out-of-sample prediction

Selects the optimal vocabulary quantile(s) for model fitting, based on cross-validated performance in predicting the speakers of out-of-sample texts.

stylest_select_vocab(
  x,
  speaker,
  filter = NULL,
  smooth = 0.5,
  nfold = 5,
  cutoff_pcts = c(50, 60, 70, 80, 90, 99),
  cutoffs_term_weights = NULL,
  fill_method = "value",
  fill_weight = 1,
  weight_varname = "mean_distance"
)

Arguments

x	Corpus as text vector. May be a `corpus_frame` object
speaker	Vector of speaker labels. Should be the same length as `x`
filter	If not `NULL`, a `corpus` text_filter
smooth	Smoothing value. Defaults to 0.5
nfold	Number of folds for cross-validation. Defaults to 5
cutoff_pcts	Vector of cutoff percentages to test. Defaults to `c(50, 60, 70, 80, 90, 99)`
cutoffs_term_weights	Named list of data frames of term weights, where the names correspond to the `cutoff_pcts`. Each data frame should have one column named `word` and a second column, named by `weight_varname`, containing the weight for each word. See the vignette for details.
fill_method	If `"value"` (default), `fill_weight` is used to fill any terms with `NA` weight. If `"mean"`, the mean term weight is used as the fill value
fill_weight	Numeric value used as the weight for any term that does not have a weight specified in the term weights. Defaults to `1.0`
weight_varname	Name of the column in each term weights data frame that contains the weights. Defaults to `"mean_distance"`
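
As a rough illustration of the shape `cutoffs_term_weights` is described as taking, the base R sketch below builds a named list of data frames and shows the two fill rules. The words and weights are invented, the use of the percentages as character names (`"50"`, `"90"`) is an assumption, and `fill_weights()` is a hypothetical helper written for this sketch, not a function from the package. It only handles `NA` weights, not terms that are absent from the data frame.

```r
# Hypothetical term weights for two cutoffs; words and values are invented.
weights_50 <- data.frame(
  word = c("the", "whale", "sea"),
  mean_distance = c(0.2, NA, 0.6),
  stringsAsFactors = FALSE
)
weights_90 <- data.frame(
  word = c("whale", "sea"),
  mean_distance = c(0.9, NA),
  stringsAsFactors = FALSE
)

# Assumption: list names are the cutoff percentages written as strings.
cutoffs_term_weights <- list("50" = weights_50, "90" = weights_90)

# Sketch of the two fill rules described above (not the package's own code):
# "value" replaces NA with fill_weight, "mean" replaces NA with the mean
# of the weights that are present.
fill_weights <- function(w, fill_method = "value", fill_weight = 1) {
  fill <- if (fill_method == "mean") mean(w, na.rm = TRUE) else fill_weight
  w[is.na(w)] <- fill
  w
}

filled_value <- fill_weights(weights_50$mean_distance)
filled_mean <- fill_weights(weights_50$mean_distance, fill_method = "mean")
```

With `fill_method = "value"` the missing weight becomes `fill_weight`; with `"mean"` it becomes the average of the weights that are present.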

Value

A list containing: the cutoff percentage with the best speaker classification rate; the cutoff percentages that were tested; a matrix of the mean percentage of incorrectly identified speakers for each cutoff percentage and fold; and the number of folds used for cross-validation
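
To make the selection rule concrete, the base R sketch below picks the cutoff with the lowest mean miss percentage across folds. The matrix values are invented for illustration, and the layout (one row per fold, one column per cutoff) is an assumption about the returned matrix rather than something this page confirms.

```r
# Illustrative miss percentages: one row per fold, one column per cutoff.
# The numbers are invented for this sketch, not output from the package.
cutoff_pcts <- c(50, 90)
miss_pct <- matrix(
  c(30, 20,
    40, 20,
    35, 35),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("fold", 1:3), as.character(cutoff_pcts))
)

# Average the miss percentage over folds for each cutoff.
mean_miss <- colMeans(miss_pct)

# The best cutoff is the one with the lowest average miss percentage.
best_cutoff <- cutoff_pcts[which.min(mean_miss)]
```

The same `colMeans()` and `which.min()` pattern can be applied to the matrix in the returned list to compare the tested cutoffs side by side.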

Examples

if (FALSE) {
library(stylest)
# Load the example corpus and compare two vocabulary cutoffs
data(novels_excerpts)
stylest_select_vocab(novels_excerpts$text, novels_excerpts$author, cutoff_pcts = c(50, 90))
}
