Selects optimal vocabulary quantile(s) for model fitting using performance on predicting out-of-sample texts.

stylest_select_vocab(
  x,
  speaker,
  filter = NULL,
  smooth = 0.5,
  nfold = 5,
  cutoff_pcts = c(50, 60, 70, 80, 90, 99),
  cutoffs_term_weights = NULL,
  fill_method = "value",
  fill_weight = 1,
  weight_varname = "mean_distance"
)

Arguments

x

Corpus as text vector. May be a corpus_frame object

speaker

Vector of speaker labels. Should be the same length as x

filter

If not NULL, a corpus text_filter

smooth

Value for smoothing. Defaults to 0.5

nfold

Number of folds for cross-validation. Defaults to 5

cutoff_pcts

Vector of cutoff percentages to test. Defaults to c(50, 60, 70, 80, 90, 99)

cutoffs_term_weights

Named list of data frames of term weights, where the names correspond to the values in cutoff_pcts. Each data frame should have a column word and a second column, named by weight_varname, containing the weight for each word. See the vignette for details.
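
As a minimal sketch of the expected shape (an illustration, not taken from the package vignette), the list below holds one data frame per cutoff. The words and weights are invented, and naming the list elements by the cutoff percentages written as character strings is an assumption.

```r
# Hypothetical term weights; the words and values are made up for illustration.
weights_50 <- data.frame(
  word = c("the", "whale", "estate"),
  mean_distance = c(0.10, 0.85, 0.60),
  stringsAsFactors = FALSE
)
weights_90 <- data.frame(
  word = c("whale", "estate"),
  mean_distance = c(0.85, 0.60),
  stringsAsFactors = FALSE
)

# Assumption: list names are the cutoff percentages written as strings.
cutoffs_term_weights <- list("50" = weights_50, "90" = weights_90)

# Check that every data frame has a word column and the column named by
# weight_varname.
weight_varname <- "mean_distance"
has_cols <- vapply(
  cutoffs_term_weights,
  function(df) all(c("word", weight_varname) %in% names(df)),
  logical(1)
)
```

A list built this way would presumably be passed together with cutoff_pcts = c(50, 90), so that each tested cutoff has a matching set of weights.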

fill_method

If "value" (default), fill_weight is used to fill any terms with NA weight. If "mean", the mean term weight is used as the fill value

fill_weight

Numeric value to use as the weight for any term that does not have a weight specified in term_weights. Defaults to 1
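
The interaction between fill_method and fill_weight can be sketched in base R as follows. The helper fill_missing_weights is hypothetical and is not part of the package; it only mirrors the behavior described above, and the package's internal implementation may differ.

```r
# Hypothetical helper, not a stylest function: replace NA weights either with
# a fixed value or with the mean of the weights that are present.
fill_missing_weights <- function(weights, fill_method = "value", fill_weight = 1) {
  if (fill_method == "mean") {
    fill <- mean(weights, na.rm = TRUE)
  } else {
    fill <- fill_weight
  }
  weights[is.na(weights)] <- fill
  weights
}

# Invented weights with one missing entry.
w <- c(the = 0.2, whale = NA, estate = 0.6)

# "value": the NA is replaced by fill_weight (1 by default).
filled_value <- fill_missing_weights(w)

# "mean": the NA is replaced by the mean of the non-missing weights.
filled_mean <- fill_missing_weights(w, fill_method = "mean")
```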

weight_varname

Name of the column in each term_weights data frame containing the weights. Defaults to "mean_distance"

Value

List of: the cutoff percent with the best speaker classification rate; the cutoff percentages that were tested; a matrix of the mean percentage of incorrectly identified speakers for each cutoff percent and fold; and the number of folds used for cross-validation
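
To illustrate how a best cutoff can follow from such a matrix, here is a small sketch with invented miss rates. It assumes folds in rows and cutoff percentages in columns, and takes "best" to mean the lowest miss rate averaged over folds. The variable names are illustrative and are not confirmed names of the returned list's elements.

```r
# Made-up miss rates (percent). Assumption: rows are folds, columns are
# cutoff percentages.
cutoff_pcts <- c(50, 90)
miss_pct <- matrix(
  c(30, 20,
    34, 22,
    32, 24),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, as.character(cutoff_pcts))
)

# Average over folds, then pick the cutoff with the lowest mean miss rate.
mean_miss <- colMeans(miss_pct)
best_cutoff <- cutoff_pcts[which.min(mean_miss)]
```

With these numbers the 90 percent cutoff has the lower average miss rate, so it would be the one selected.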

Examples

if (FALSE) {
  data(novels_excerpts)
  stylest_select_vocab(novels_excerpts$text, novels_excerpts$author,
                       cutoff_pcts = c(50, 90))
}