R/stylest_select_vocab.R
stylest_select_vocab.Rd
Selects the optimal vocabulary quantile(s) for model fitting, based on performance in predicting out-of-sample texts.
stylest_select_vocab(x, speaker, filter = NULL, smooth = 0.5, nfold = 5, cutoff_pcts = c(50, 60, 70, 80, 90, 99), cutoffs_term_weights = NULL, fill_method = "value", fill_weight = 1, weight_varname = "mean_distance")
Argument | Description |
---|---|
x | Corpus as text vector. May be a |
speaker | Vector of speaker labels. Should be the same length as x |
filter | if not |
smooth | Value for smoothing. Defaults to 0.5 |
nfold | Number of folds for cross-validation. Defaults to 5 |
cutoff_pcts | Vector of cutoff percentages to test. Defaults to c(50, 60, 70, 80, 90, 99) |
cutoffs_term_weights | Named list of dataframes of term weights, where the names correspond to the |
fill_method | if |
fill_weight | Numeric value to fill in as weight for any term which does not have a weight specified in |
weight_varname | Name of the column in each term_weights dataframe containing the weights. Defaults to "mean_distance" |
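
The role of fill_weight and weight_varname can be illustrated with a minimal base R sketch. It is not the package's internal code: the vocabulary, the data frame, and the term column name `word` are all hypothetical, and the sketch only assumes that a term without a listed weight receives the constant fill value.

```r
# Hypothetical vocabulary and term-weight table (illustrative names only).
vocab <- c("the", "whale", "sea", "captain")
term_weights <- data.frame(
  word = c("whale", "sea"),          # assumed name of the term column
  mean_distance = c(0.8, 0.3),       # column selected by weight_varname
  stringsAsFactors = FALSE
)

fill_weight <- 1
weight_varname <- "mean_distance"

# Look up each vocabulary term; terms absent from the table yield NA.
idx <- match(vocab, term_weights$word)
weights <- term_weights[[weight_varname]][idx]

# Replace the missing weights with the constant fill value.
weights[is.na(weights)] <- fill_weight
names(weights) <- vocab
print(weights)
```

In this sketch "the" and "captain" have no entry in the table, so they end up with the fill value, while "whale" and "sea" keep their listed weights.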
A list containing: the cutoff percentage with the best speaker classification rate; the cutoff percentages that were tested; a matrix of the mean percentage of incorrectly identified speakers for each cutoff percentage and fold; and the number of folds used for cross-validation.
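
As a rough illustration of how a best cutoff could be derived from such a matrix, the base R sketch below averages the miss percentages over folds and picks the cutoff with the lowest mean. The numbers are made up, the variable names are hypothetical, and the layout (one row per fold, one column per cutoff) is an assumption rather than a statement about the package's internals.

```r
# Hypothetical cutoffs and miss percentages; not output from the package.
cutoff_pcts <- c(50, 70, 90)

# matrix() fills column by column, so each group of three values
# holds the three folds for one cutoff.
miss_pct <- matrix(
  c(40, 35, 45,   # cutoff 50
    30, 25, 35,   # cutoff 70
    38, 42, 40),  # cutoff 90
  nrow = 3,
  dimnames = list(paste0("fold", 1:3), as.character(cutoff_pcts))
)

# Average the miss percentage over folds for each cutoff.
mean_miss <- colMeans(miss_pct)

# The best cutoff is taken here to be the one with the lowest mean miss rate.
cutoff_pct_best <- cutoff_pcts[which.min(mean_miss)]
print(mean_miss)
print(cutoff_pct_best)
```

With these made-up values the middle cutoff has the lowest average miss percentage. Note that `which.min` returns the first minimum, so ties resolve to the smallest cutoff in this sketch.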