API¶
Equivalence Module¶
AnalysisPipeline¶
-
class
surveyequivalence.equivalence.AnalysisPipeline(W: pandas.core.frame.DataFrame, sparse_experts: bool = True, expert_cols: Sequence[str] = [], amateur_cols: Sequence[str] = [], classifier_predictions: Optional[pandas.core.frame.DataFrame] = None, combiner: Optional[surveyequivalence.combiners.Combiner] = None, scorer: Optional[surveyequivalence.scoring_functions.Scorer] = None, allowable_labels: Optional[Sequence[str]] = None, min_k=0, num_bootstrap_item_samples=100, max_rater_subsets=200, max_K=10, ratersets_memo=None, predictions_memo=None, item_samples=None, verbosity=1, run_on_creation=True, procs=1)¶ Bases:
object
The main class for running an analysis
- Parameters
W (pd.DataFrame) -- The ratings dataframe with one column for each rater, one row for each item
sparse_experts (bool) -- True (default) if some raters may not have rated all items
expert_cols (Sequence[str] = []) -- A list of column names, one for each potential "reference rater" whose label the classifier is trying to predict. These are also the columns used for computing the power curve for survey equivalence
amateur_cols (Sequence[str] = []) -- A list of column names, one for each potential "other rater". Their ratings are not used for evaluating the classifier, but a separate power curve may be computed for them, using surveys of k of them to predict a reference rater's label. Survey equivalences can also be calculated between j "other raters" and k reference raters.
classifier_predictions (pd.DataFrame = None) -- A dataframe with one column for each classifier for which we want to compute survey equivalences. One row for each item; row indexes should be the same as for W
combiner (Combiner = None) -- A combiner that is used to make a prediction about the next label for an item, given labels from some other raters.
scorer (Scorer = None) -- A scorer that takes a vector of predictions and a vector of realized reference rater labels and yields a numeric score.
allowable_labels (Sequence[str] = None) -- A list of the potential label strings that a rater is permitted to assign to an item
min_k=0 -- When computing power curves, the smallest survey size to include
num_bootstrap_item_samples=100 -- When computing error bars, how many bootstrap samples of items to create
max_rater_subsets=200 -- When computing power curves, we compute the average score over predictions made from many subsets of reference raters of size k. When k is small, we choose all subsets of size k. For larger k, we take a sample from the powerset. This parameter determines how many subsets to select.
max_K=10 -- When computing power curves, the largest survey size to include. Cannot be larger than the number of reference raters in W, minus one.
ratersets_memo=None -- While running, a dictionary is created to memoize certain computations, for efficiency. A value may be passed in to reuse the memoized computations from a previous run.
predictions_memo=None -- While running, a dictionary is created to memoize certain computations, for efficiency. A value may be passed in to reuse the memoized computations from a previous run.
item_samples=None -- If specified, the set of bootstrap item samples to use for computing error bars. If not specified, a new set of bootstrap item samples will be created.
verbosity=1 -- Controls how much information is printed to the console during execution. Set a higher number to help with debugging.
run_on_creation=True -- Whether to actually run the analysis pipeline
procs=pathos.helpers.cpu_count() - 1 -- How many processors are available for parallel execution
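The max_rater_subsets rule (all subsets of size k when there are few, a random sample otherwise) can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the helper name choose_rater_subsets, the seed argument, and the rater names are hypothetical.

```python
import itertools
import math
import random


def choose_rater_subsets(raters, k, max_rater_subsets=200, seed=0):
    """Return subsets of size k: all of them if few enough, else a random sample.

    Hypothetical helper for illustration; not part of surveyequivalence.
    """
    raters = list(raters)
    total = math.comb(len(raters), k)
    if total <= max_rater_subsets:
        # Few subsets: enumerate every subset of size k.
        return [frozenset(c) for c in itertools.combinations(raters, k)]
    # Many subsets: draw distinct random subsets until we have enough.
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < max_rater_subsets:
        chosen.add(frozenset(rng.sample(raters, k)))
    return list(chosen)


raters = [f"e_{i}" for i in range(1, 13)]   # 12 hypothetical reference raters
small = choose_rater_subsets(raters, k=1)   # only 12 subsets exist, so all are used
large = choose_rater_subsets(raters, k=6)   # 924 subsets exist, so 200 are sampled
print(len(small), len(large))
```

Capping the number of subsets keeps the cost of each point on the power curve bounded even when the number of possible subsets grows combinatorially.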
-
output_csv(fname)¶ output the dataframe and the expert predictions
-
path_for_saving(dirname_base='analysis_pipeline', include_timestamp=True)¶ - Parameters
dirname_base -- A name that describes the analysis; / will be treated as a subdirectory
include_timestamp -- Whether to make a folder indicating the timestamp at which the run was done.
- Returns
A path of the form {self.run_timestamp}/{dirname_base}
If the path does not exist yet, it is created.
-
run()¶ Create the power curve(s); normally invoked during __init__ but can be called separately.
-
save(path=None, msg='', save_results=True)¶ Save instance and results to files
- Parameters
dirname_base="analysis_pipeline" -- A subdirectory name in which to store saved results
msg -- A text string to write in a README file that is generated
save_results=True -- If True, generates a results_summary.txt file with power curve and survey equivalence summary stats
-
surveyequivalence.equivalence.load_saved_pipeline(path)¶ Loads dataset, predictions, classifiers scores, and power curve(s) previously saved using
surveyequivalence.equivalence.AnalysisPipeline.save()
Plot¶
-
class
surveyequivalence.equivalence.Plot(ax, expert_power_curve, amateur_power_curve=None, classifier_scores=None, color_map={'amateur_power_curve': 'blue', 'classifier': 'green', 'expert_power_curve': 'black'}, y_axis_label='Agreement with reference rater', center_on=None, y_range=None, name='powercurve', legend_label='Expert raters', amateur_legend_label='Lay raters', verbosity=1, generate_pgf=False)¶ Bases:
object
Generates visual display of power curve(s) and classifier scores, as matplotlib objects and as pgf for embedding in LaTeX. First run AnalysisPipeline to generate the PowerCurve and ClassifierResults objects to pass in to the constructor.
- Parameters
ax (matplotlib.axes.Axes) --
expert_power_curve (PowerCurve) -- a PowerCurve with scores for combinations of k reference raters in predicting a held-out reference rater
amateur_power_curve=None (PowerCurve) -- a PowerCurve with scores for combinations of k other raters in predicting a held-out reference rater
classifier_scores=None (ClassifierResults) --
color_map={'expert_power_curve': 'black', 'amateur_power_curve': 'blue', 'classifier': 'green'} -- a dictionary specifying colors to use for the different elements of the graph to be plotted
y_axis_label='Agreement with reference rater' --
center_on=None (float) -- If a value is provided, it will be subtracted from all scores for classifiers and power curve values
y_range=None -- If specified, a tuple of two values, the min and max y-values for the graph
name='powercurve' -- A name for the plot
legend_label='Expert raters' -- Legend label for the power curve for reference raters
amateur_legend_label='Lay raters' -- Legend label for the power curve for other raters
verbosity=1 -- Controls how much information is printed to the console during execution. Set a higher number to help with debugging.
generate_pgf=False -- If True, also populate data to enable creation of pgf format, suitable for inclusion in LaTeX. After calling .plot(), run self.template.substitute(**self.template_dict)
-
plot(include_expert_points='all', connect_expert_points=True, include_classifiers=True, include_classifier_equivalences=True, include_classifier_amateur_equivalences=False, other_rater_equivalences_to_include=[], include_droplines=True, include_amateur_curve=True, include_classifier_cis=True, include_seq_cis=True, x_ticks=None, legend_loc=None)¶ The method that fills in the contents of the matplotlib Axes object
- Parameters
include_expert_points='all' -- 'all' means to plot all reference rater survey sizes on the x-axis of the power curve. Alternatively, pass a list of numbers indicating which survey sizes to include.
connect_expert_points=True -- Whether to draw straight lines connecting the dots for survey power for surveys of successive sizes
include_classifiers=True -- Whether to include horizontal lines showing the classifier score(s)
include_classifier_equivalences=True -- Whether to include calculation of the equivalent number of reference raters for each classifier, based on the intersection point of the classifier line and the reference raters' power curve
include_classifier_amateur_equivalences=False -- Whether to include calculation of the equivalent number of other raters for each classifier, based on the intersection point of the classifier line and the other raters' power curve
other_rater_equivalences_to_include=[] -- A list of survey sizes for non-reference raters. For each one, compute the equivalent number of reference raters yielding the same score.
include_droplines=True -- Whether to include vertical lines from the intersection points (survey equivalences) to the x-axis
include_amateur_curve=True -- Whether to include a power curve for the other, non-reference raters
include_classifier_cis=True -- Whether to include error bars around the classifier horizontal lines
include_seq_cis=True -- Whether to include error bars around the survey equivalence values
x_ticks=None -- If provided, a list of x values for which tick marks should be shown. If None, then it will be automatically calculated.
legend_loc=None -- String indicating where to place the legend (uses default if None). Options as documented for matplotlib.axes.Axes.legend.
-
save(path: str, fig: matplotlib.figure, plotname='plot')¶ Wrapper for the matplotlib save_plot function. Saves all data to the ./plots directory as png and tex files.
- Parameters
fig (matplotlib figure object to be saved) --
name (Name for the file) --
Equivalences¶
-
class
surveyequivalence.equivalence.Equivalences(df)¶ Bases:
object
Contains a dataframe with one row for each bootstrap sample of items and one column for each classifier. Each cell contains the survey equivalence value (the equivalent number of reference raters whose combined ratings yield the same score as the classifier).
-
property
lower_bounds¶ Returns a pandas Series with a lower bound on the survey equivalence for each classifier. The bound is computed from the interval covering 95% of the bootstrap samples of items if there are more than 200 of them; otherwise it is computed from two standard deviations of the scores on bootstrap samples.
-
property
upper_bounds¶ Returns a pandas Series with an upper bound on the survey equivalence for each classifier. The bound is computed from the interval covering 95% of the bootstrap samples of items if there are more than 200 of them; otherwise it is computed from two standard deviations of the scores on bootstrap samples.
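The rule for the bounds can be sketched in plain Python. This is only an illustration of the description, not the library's implementation: the function name bootstrap_bounds, the exact percentile indexing, and centering the fallback interval on the sample mean are assumptions.

```python
import statistics


def bootstrap_bounds(values, min_samples_for_percentiles=200):
    """Return (lower, upper) for one classifier's values across bootstrap samples.

    Hypothetical helper; the library may index percentiles differently.
    """
    values = sorted(values)
    n = len(values)
    if n > min_samples_for_percentiles:
        # Central 95% interval: skip 2.5% of the samples in each tail.
        lo_idx = (n * 25) // 1000
        hi_idx = (n * 975) // 1000 - 1
        return values[lo_idx], values[hi_idx]
    # Few samples: mean plus or minus two standard deviations (assumed centering).
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - 2 * sd, mean + 2 * sd


few = bootstrap_bounds([1.0, 2.0, 3.0])                  # falls back to mean +/- 2 sd
many = bootstrap_bounds([float(i) for i in range(1000)])  # uses the 95% interval
print(few, many)
```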
ClassifierResults¶
-
class
surveyequivalence.equivalence.ClassifierResults(runs: Optional[Sequence[Dict]] = None, df=None)¶ Bases:
object
-
property
lower_bounds¶ Returns a pandas Series with a lower bound on the survey equivalence for each classifier. The bound is computed from the interval covering 95% of the bootstrap samples of items if there are more than 200 of them; otherwise it is computed from two standard deviations of the scores on bootstrap samples.
-
property
upper_bounds¶ Returns a pandas Series with an upper bound on the survey equivalence for each classifier. The bound is computed from the interval covering 95% of the bootstrap samples of items if there are more than 200 of them; otherwise it is computed from two standard deviations of the scores on bootstrap samples.
-
property
values¶ Returns a Series of classifier scores for the first row (the actual item set), omitting results for all bootstrap item sets.
PowerCurve¶
-
class
surveyequivalence.equivalence.PowerCurve(runs: Optional[Sequence[Dict]] = None, df=None)¶ Bases:
surveyequivalence.equivalence.ClassifierResults
A special case of ClassifierResults where there is one column for each integer value k, representing the mean score, over many samples of k raters, of the predictions generated by combining ratings from k raters, scored against a reference rater.
-
compute_equivalence_at_actuals(classifier_score)¶ Compute the equivalence of the score of the classifier on the actual item sample based on the survey power curve computed for the actual item sample
-
compute_equivalence_at_mean(classifier_score)¶ Compute the equivalence of the mean score of the classifier across the bootstrap item samples based on the mean survey power curve computed across the bootstrap item samples
-
compute_equivalences(other, columns=None)¶ - Parameters
self --
other -- The classifier scores that are compared against this PowerCurve to find equivalences may either be an instance of ClassifierResults or a PowerCurve. Must have same row indexes as self, one for each item sample
columns -- a subset of the column names from other.df; if not specified, use all of them
- Returns
- Return type
a df with one row for each bootstrap run, and columns as specified by the columns parameter. Each cell is a float: the survey equivalence value for that column from other, that is, the x such that the expected score with x raters from self equals the classifier score from other.
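To make the idea of "the x such that the expected score with x raters equals the classifier score" concrete, here is a minimal sketch for a single row. It assumes linear interpolation between adjacent survey sizes and a power curve that increases with k; the function name survey_equivalence and the example numbers are hypothetical, and the library may handle out-of-range scores differently.

```python
def survey_equivalence(power_curve, classifier_score):
    """Find x where the linearly interpolated power curve equals classifier_score.

    power_curve maps survey size k to expected score.
    Returns None if the score is outside the range covered by the curve.
    """
    ks = sorted(power_curve)
    for k_lo, k_hi in zip(ks, ks[1:]):
        s_lo, s_hi = power_curve[k_lo], power_curve[k_hi]
        if s_hi > s_lo and s_lo <= classifier_score <= s_hi:
            # Position of the score within this segment, from 0 to 1.
            frac = (classifier_score - s_lo) / (s_hi - s_lo)
            return k_lo + frac * (k_hi - k_lo)
    return None


curve = {1: 0.5, 2: 0.75, 3: 0.875}      # hypothetical mean scores for k = 1, 2, 3
inside = survey_equivalence(curve, 0.625)  # halfway between k=1 and k=2
outside = survey_equivalence(curve, 0.99)  # better than any survey on the curve
print(inside, outside)
```

Applying the same computation to every bootstrap row would give one equivalence value per item sample, which is the shape of result the method describes.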
-
reliability_of_beating_classifier(other, k=1, other_col=1)¶ - Parameters
other -- the other ClassifierResults or PowerCurve
self_col -- the survey size (column) for self
other_col -- the survey size (column) for other to compare, with matching bootstrap samples as rows
- Returns
- Return type
fraction of bootstrap runs where self power is higher than other power
-
reliability_of_difference(other, k=1)¶ - Parameters
other -- another PowerCurve
k -- survey size
- Returns
- Return type
fraction of bootstrap runs where power@k higher for self than other power curve
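A rough sketch of this fraction, assuming each list holds the power at survey size k for the same sequence of bootstrap item samples. The function below is a standalone illustration; the strict comparison and the example numbers are assumptions.

```python
def reliability_of_difference(self_scores, other_scores):
    """Fraction of matched bootstrap runs where self scores strictly higher."""
    if len(self_scores) != len(other_scores):
        raise ValueError("bootstrap samples must be matched one-to-one")
    wins = sum(1 for s, o in zip(self_scores, other_scores) if s > o)
    return wins / len(self_scores)


expert_at_k = [0.62, 0.60, 0.65, 0.61]    # hypothetical power@k, one per bootstrap run
amateur_at_k = [0.58, 0.61, 0.60, 0.59]
fraction = reliability_of_difference(expert_at_k, amateur_at_k)
print(fraction)  # 0.75
```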
-
Combiners¶
-
class
surveyequivalence.combiners.AnonymousBayesianCombiner(*args, **kwargs)¶ Bases:
surveyequivalence.combiners.Combiner
Anonymous Bayesian Combiner class
-
static
D_k(labels: numpy.array, W: numpy.matrix, allowable_labels: Sequence[str]) -> (<class 'float'>, <class 'int'>)¶ Compute the joint distribution over k anonymous ratings
- Parameters
labels (item labels from several raters) --
W (item and rating dataset) --
allowable_labels (The set of labels that can be entered by the raters.) --
- Returns
- Return type
joint distribution, and num_items
-
static
D_k_item_contribution(labels: numpy.array, item: numpy.array, allowable_labels: Sequence[str]) -> (<class 'float'>, <class 'float'>)¶ ProbabilityOfOneItem function in Algorithm 5. Computes the contribution of a single item to the combiner
- Parameters
labels (item labels from several raters) --
item (The item under current consideration) --
allowable_labels (The set of labels that can be entered by the raters.) --
- Returns
- Return type
The contribution of this item.
-
combine(allowable_labels: Sequence[str], labels: Sequence[Tuple[str, str]], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) → surveyequivalence.combiners.DiscreteDistributionPrediction¶ Algorithm 6: computes the anonymous Bayesian combiner. Combines rater labels like frequency_combiner, but also uses the information from the item/rating dataset W.
- Parameters
allowable_labels (the set of labels/ratings allowed) --
labels (the k ratings) --
W (item and rating dataset) --
item_id (item index in W) --
to_predict_for (not used currently) --
- Returns
- Return type
Prediction based on anonymous bayesian combiner
-
class
surveyequivalence.combiners.Combiner(allowable_labels: Optional[Sequence[str]] = None, verbosity=0)¶ Bases:
abc.ABC
Abstract class defining a combiner.
A combiner selects a single label from a bag/multiset of labels (and possibly other information) according to some function. For example, the PluralityCombiner accepts a bag of labels and returns the label that is most frequent.
-
abstract
combine(allowable_labels: Sequence[str], labels: Sequence[Tuple[str, str]], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) → surveyequivalence.combiners.DiscreteDistributionPrediction¶
-
class
surveyequivalence.combiners.DiscreteDistributionPrediction(label_names, probabilities, extreme_cutoff=0.02, normalize=True)¶ Bases:
surveyequivalence.combiners.Prediction
A discrete distribution prediction where labels are associated with probabilities. Value takes the label with the highest probability.
-
draw_discrete_label()¶ Return one of the labels, drawn according to the distribution
- Returns
- Return type
A label
-
label_probability(label)¶ Returns the probability associated with an input label
- Parameters
label (label to query) --
- Returns
- Return type
Probability associated with label.
-
property
value¶ Return the single label that has the highest predicted probability. Break ties by taking the first one
>>> DiscreteDistributionPrediction(['a', 'b', 'c'], [.3, .4, .3]).value
'b'
>>> DiscreteDistributionPrediction(['a', 'b', 'c'], [.4, .4, .2]).value
'a'
- Returns
- Return type
label with highest probability
-
property
value_prob¶ Return the probability of the majority class
>>> DiscreteDistributionPrediction(['a', 'b', 'c'], [.3, .4, .3]).value_prob
.4
>>> DiscreteDistributionPrediction(['a', 'b', 'c'], [.4, .4, .2]).value_prob
.4
- Returns
- Return type
highest probability
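The tie-breaking rule for value, and the matching probability reported by value_prob, can be illustrated without the library. The function most_likely_label below is a hypothetical standalone sketch, not the class's actual code.

```python
def most_likely_label(label_names, probabilities):
    """Return (label, probability) for the highest-probability label.

    Ties go to the label that appears first in label_names.
    """
    best = 0
    for i, p in enumerate(probabilities):
        if p > probabilities[best]:  # strict > keeps the first of any tie
            best = i
    return label_names[best], probabilities[best]


clear_winner = most_likely_label(['a', 'b', 'c'], [.3, .4, .3])
tie = most_likely_label(['a', 'b', 'c'], [.4, .4, .2])
print(clear_winner, tie)  # ('b', 0.4) ('a', 0.4)
```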
-
-
class
surveyequivalence.combiners.DiscretePrediction(label)¶ Bases:
surveyequivalence.combiners.Prediction
A discrete prediction. value is defined as a label
-
property
value¶
-
class
surveyequivalence.combiners.FrequencyCombiner(allowable_labels: Optional[Sequence[str]] = None, verbosity=0)¶ Bases:
surveyequivalence.combiners.Combiner
Returns a vector of frequencies for each label
-
combine(allowable_labels: Sequence[str], labels: Sequence[Tuple[str, str]], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) → surveyequivalence.combiners.DiscreteDistributionPrediction¶ Returns the frequency vector for labels
>>> FrequencyCombiner().combine(['pos', 'neg'], np.array([(1, 'pos'), (2, 'neg'), (4, 'neg')])).probabilities
[0.3333333333333333, 0.6666666666666666]
>>> FrequencyCombiner().combine(['pos', 'neg'], np.array([(1, 'neg'), (2, 'neg'), (4, 'neg')])).probabilities
[0.0, 1.0]
- Parameters
allowable_labels (not used in this combiner) --
labels (numeric values from particular rater ids; rater ids are ignored) --
W (not used in this combiner) --
item_id (not used in this combiner) --
to_predict_for (not used in this combiner) --
- Returns
- Return type
Frequency vector of labels
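The frequency vector in the doctest can be reproduced with a few lines of plain Python. This sketch is an assumption about the computation rather than the library's code; in particular, it uses allowable_labels to fix the order of the output, and the name frequency_vector is hypothetical.

```python
def frequency_vector(allowable_labels, labels):
    """Relative frequency of each allowable label among (rater_id, label) pairs.

    Rater ids are ignored; the output follows the order of allowable_labels.
    """
    if not labels:
        raise ValueError("need at least one label to compute frequencies")
    counts = {label: 0 for label in allowable_labels}
    for _rater_id, label in labels:
        counts[label] += 1
    return [counts[label] / len(labels) for label in allowable_labels]


mixed = frequency_vector(['pos', 'neg'], [(1, 'pos'), (2, 'neg'), (4, 'neg')])
unanimous = frequency_vector(['pos', 'neg'], [(1, 'neg'), (2, 'neg'), (4, 'neg')])
print(mixed, unanimous)
```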
-
-
class
surveyequivalence.combiners.MeanCombiner(allowable_labels: Optional[Sequence[str]] = None, verbosity=0)¶ Bases:
surveyequivalence.combiners.Combiner
Combiner that returns the mean of all the labels.
-
combine(allowable_labels: Optional[Sequence[str]] = None, labels: Sequence[Tuple[str, float]] = [], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) → surveyequivalence.combiners.NumericPrediction¶ Returns the mean of the labels
- Parameters
allowable_labels (not used in this combiner) --
labels (numeric values from particular rater ids; rater ids are ignored) --
W (not used in this combiner) --
item_id (not used in this combiner) --
to_predict_for (not used in this combiner) --
- Returns
- Return type
The mean of the labels
-
-
class
surveyequivalence.combiners.NumericPrediction(num)¶ Bases:
surveyequivalence.combiners.Prediction
A numeric prediction. value is defined as a number
-
property
value¶
-
class
surveyequivalence.combiners.PluralityVote(allowable_labels: Optional[Sequence[str]] = None, verbosity=0)¶ Bases:
surveyequivalence.combiners.Combiner
Combiner that returns the single label that is most frequent
-
combine(allowable_labels: Optional[Sequence[str]] = None, labels: Sequence[Tuple[str, float]] = [], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) → surveyequivalence.combiners.NumericPrediction¶ Returns the single label that is most frequent
- Parameters
allowable_labels (not used in this combiner) --
labels (numeric values from particular rater ids; rater ids are ignored) --
W (not used in this combiner) --
item_id (not used in this combiner) --
to_predict_for (not used in this combiner) --
- Returns
- Return type
The most common label
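A plurality vote over (rater_id, label) pairs can be sketched with collections.Counter. This is an illustration only: the name plurality_label is hypothetical, and how the library breaks ties is not stated here, so the example avoids ties.

```python
from collections import Counter


def plurality_label(labels):
    """Return the most frequent label among (rater_id, label) pairs."""
    counts = Counter(label for _rater_id, label in labels)
    label, _count = counts.most_common(1)[0]
    return label


winner = plurality_label([(1, 'pos'), (2, 'neg'), (4, 'neg')])
print(winner)  # neg
```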
-
Scoring Functions¶
-
class
surveyequivalence.scoring_functions.AUCScore¶ Bases:
surveyequivalence.scoring_functions.Scorer
-
static
score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0) → float¶ AUC score. This function uses sklearn's AUC function, but does not work in many cases with multiple labels.
- Parameters
classifier_predictions (numeric values) --
rater_labels (sequence of labels, which should be numeric values) --
verbosity --
- Returns
- Return type
AUC Score
-
class
surveyequivalence.scoring_functions.AgreementScore¶ Bases:
surveyequivalence.scoring_functions.Scorer
Agreement Scorer
-
static
score(classifier_predictions: Sequence[str], rater_labels: Sequence[str], verbosity=0)¶ Agreement score measures the normalized number of times that the predictor matched the label. Akin to a typical accuracy score.
- Parameters
classifier_predictions (numeric values) --
rater_labels (sequence of labels, which should be numeric values) --
verbosity --
- Returns
- Return type
Agreement score
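As a minimal sketch of "the normalized number of times that the predictor matched the label", the score is the fraction of positions where prediction and label agree. The standalone function below is an assumption about the computation, not the library's code.

```python
def agreement_score(predictions, rater_labels):
    """Fraction of items where the predicted label equals the rater's label."""
    if len(predictions) != len(rater_labels):
        raise ValueError("predictions and labels must have the same length")
    matches = sum(1 for p, r in zip(predictions, rater_labels) if p == r)
    return matches / len(rater_labels)


score = agreement_score(['a', 'b', 'b', 'a'], ['a', 'b', 'a', 'a'])
print(score)  # 0.75
```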
-
class
surveyequivalence.scoring_functions.Correlation¶ Bases:
surveyequivalence.scoring_functions.Scorer
Computes the Pearson correlation coefficient.
-
static
score(classifier_predictions: Sequence[surveyequivalence.combiners.NumericPrediction], rater_labels: Sequence[str], verbosity=0)¶ - Parameters
classifier_predictions (numeric values) --
rater_labels (sequence of labels, which should be numeric values) --
verbosity --
- Returns
- Return type
Pearson correlation coefficient
-
class
surveyequivalence.scoring_functions.CrossEntropyScore¶ Bases:
surveyequivalence.scoring_functions.Scorer
Cross Entropy Scorer
-
static
score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0)¶ Calculates the Cross Entropy of the two labels.
>>> CrossEntropyScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['b', 'b', 'b'])
0.594597099859
>>> CrossEntropyScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['a', 'b', 'b'])
0.87702971998
- Parameters
classifier_predictions (numeric values) --
rater_labels (sequence of labels, which should be numeric values) --
verbosity --
- Returns
- Return type
Cross Entropy score
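The magnitudes in the doctests are consistent with a mean negative natural-log likelihood of the realized labels. The sketch below reproduces them under that assumption; the sign convention, the log base, and any clipping of extreme probabilities used by the library are not confirmed here, and the function name is hypothetical.

```python
import math


def mean_negative_log_likelihood(predictions, rater_labels):
    """Average of -ln(p) where p is the probability given to the realized label.

    predictions is a list of dicts mapping each label to its probability.
    """
    total = 0.0
    for probs, label in zip(predictions, rater_labels):
        total += -math.log(probs[label])  # natural logarithm
    return total / len(rater_labels)


preds = [{'a': .3, 'b': .7}, {'a': .4, 'b': .6}, {'a': .6, 'b': .4}]
all_b = mean_negative_log_likelihood(preds, ['b', 'b', 'b'])
first_a = mean_negative_log_likelihood(preds, ['a', 'b', 'b'])
print(round(all_b, 6), round(first_a, 6))  # 0.594597 0.87703
```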
-
class
surveyequivalence.scoring_functions.F1Score¶ Bases:
surveyequivalence.scoring_functions.Scorer
-
static
score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0, average: str = 'micro') → float¶ F1 score. This function uses sklearn's F1 function.
>>> F1Score.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['b', 'b', 'b'], 'micro')
0.6666666666666666
>>> F1Score.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['b', 'b', 'b'], 'macro')
0.39759036144
>>> F1Score.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['a', 'b', 'b'], 'micro')
0.3333333333333333
>>> F1Score.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['a', 'b', 'b'], 'macro')
0.25
- Parameters
classifier_predictions (numeric values) --
rater_labels (sequence of labels, which should be numeric values) --
verbosity --
average (macro or micro averaging) --
- Returns
- Return type
F1 Score
-
class
surveyequivalence.scoring_functions.PrecisionScore¶ Bases:
surveyequivalence.scoring_functions.Scorer
-
static
score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0, average: str = 'micro') → float¶ Precision score. This function uses sklearn's precision function.
>>> PrecisionScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['b', 'b', 'b'], 'micro')
0.6666666666666666
>>> PrecisionScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['a', 'b', 'b'], 'micro')
0.3333333333333333
- Parameters
classifier_predictions (numeric values) --
rater_labels (sequence of labels, which should be numeric values) --
verbosity --
average (macro or micro averaging) --
- Returns
- Return type
Precision Score
-
class
surveyequivalence.scoring_functions.RecallScore¶ Bases:
surveyequivalence.scoring_functions.Scorer
-
static
score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0, average: str = 'micro') → float¶ Recall score. This function uses sklearn's recall function.
>>> RecallScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['b', 'b', 'b'], 'micro')
0.6666666666666666
>>> RecallScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['b', 'b', 'b'], 'macro')
0.5
>>> RecallScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['a', 'b', 'b'], 'micro')
0.3333333333333333
>>> RecallScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['a', 'b', 'b'], 'macro')
0.25
- Parameters
classifier_predictions (numeric values) --
rater_labels (sequence of labels, which should be numeric values) --
verbosity --
average (macro or micro averaging) --
- Returns
- Return type
Recall Score
-
class
surveyequivalence.scoring_functions.Scorer¶ Bases:
abc.ABC
Base class that defines a Scorer as having a score() function. The scorer computes the goodness of a predictor against the average human rater.
-
abstract static
score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction]) → float¶
-
score_classifier(classifier_predictions: Sequence, raters: Sequence, W, verbosity=0)¶ Driver function that computes the mean score over all predictions
- Parameters
classifier_predictions (Scoring predictions) --
raters (The reference ratings. Score will compare classifier predictions with each rater in turn.) --
W (The item and rating dataset) --
verbosity (verbosity value from 1 to 4 indicating increased verbosity.) --
- Returns
- Return type
Mean score over all predictions for all raters.
Synthetic Dataset Generation¶
States¶
-
class
surveyequivalence.synthetic_datasets.DiscreteState(state_name: str, labels: Sequence[str], probabilities: Sequence[float])¶ Bases:
surveyequivalence.synthetic_datasets.State
A discrete distribution over possible labels
- Parameters
state_name --
labels -- A sequence of strings; the allowable labels
probabilities -- A sequence of the same length, with values adding to one, giving probabilities for each of the label strings
-
draw_labels(n: int)¶ Make n iid draws of discrete labels from the distribution
- Parameters
n -- How many labels to draw from the distribution
- Returns
- Return type
a single item or a numpy array
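Drawing n i.i.d. labels from a discrete distribution can be sketched with the standard library. The function below is a hypothetical standalone version that returns a plain list; the optional rng argument is an assumption added so the draws can be made reproducible.

```python
import random


def draw_labels(labels, probabilities, n, rng=None):
    """Make n independent draws of labels according to the given probabilities."""
    rng = rng or random.Random()
    return rng.choices(labels, weights=probabilities, k=n)


drawn = draw_labels(['pos', 'neg'], [0.8, 0.2], n=1000, rng=random.Random(42))
share_pos = drawn.count('pos') / len(drawn)
print(share_pos)  # close to 0.8, but not exactly, because each draw is random
```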
Distributions Over States¶
-
class
surveyequivalence.synthetic_datasets.DistributionOverStates¶ Bases:
abc.ABC
Abstract base class
-
class
surveyequivalence.synthetic_datasets.DiscreteDistributionOverStates(states: Sequence[surveyequivalence.synthetic_datasets.State], probabilities: Sequence[float])¶ Bases:
surveyequivalence.synthetic_datasets.DistributionOverStates
- Parameters
states -- a sequence of State objects
probabilities -- a same length sequence of floats representing probabilities of the item states
-
draw_states(n: int) → Sequence[surveyequivalence.synthetic_datasets.DiscreteState]¶ - Parameters
n --
- Returns
- Return type
a single item or numpy array of State instances, drawn iid from the probability distribution
-
class
surveyequivalence.synthetic_datasets.FixedStateGenerator(states: Sequence[surveyequivalence.synthetic_datasets.State], probabilities: Sequence[float])¶ Bases:
surveyequivalence.synthetic_datasets.DiscreteDistributionOverStates
-
draw_states(n: int)¶ Draw exactly in proportion to the probabilities, rather than making each draw at random according to the probabilities.
- Parameters
n -- How many items to draw
- Returns
- Return type
list of State instances
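Drawing "exactly in proportion" can be sketched with a largest-remainder allocation, so that the counts always sum to n. This is one plausible way to do it, not necessarily how FixedStateGenerator rounds or orders its output; the function name and state names are hypothetical.

```python
def draw_in_exact_proportion(states, probabilities, n):
    """Return n states whose counts match the probabilities as closely as possible."""
    exact = [p * n for p in probabilities]
    counts = [int(x) for x in exact]  # round each share down first
    # Give the leftover slots to the states with the largest fractional parts.
    leftovers = n - sum(counts)
    by_remainder = sorted(
        range(len(states)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftovers]:
        counts[i] += 1
    result = []
    for state, count in zip(states, counts):
        result.extend([state] * count)
    return result


fixed = draw_in_exact_proportion(['high', 'med', 'low'], [0.7, 0.1, 0.2], n=10)
print(fixed.count('high'), fixed.count('med'), fixed.count('low'))  # 7 1 2
```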
-
Mock Classifiers¶
-
class
surveyequivalence.synthetic_datasets.MockClassifier(name: str, label_predictors: Dict[str, surveyequivalence.combiners.Prediction])¶ Bases:
object
A mock classifier has access to each item's state when generating a prediction, something that a real classifier would not have access to
- Parameters
name --
label_predictors -- a dictionary mapping from item state names to Predictions
-
make_predictions(item_states: Sequence[surveyequivalence.synthetic_datasets.State]) → Sequence[surveyequivalence.combiners.Prediction]¶ - Parameters
item_states -- a sequence of State objects, representing the states of some items
- Returns
- Return type
a sequence of Prediction objects, one for each item
-
class
surveyequivalence.synthetic_datasets.MappedDiscreteMockClassifier(name, label_predictors: Dict[str, surveyequivalence.combiners.Prediction], prediction_map: Dict[str, surveyequivalence.combiners.Prediction])¶ Bases:
surveyequivalence.synthetic_datasets.MockClassifier
A mock classifier that maps an item state to a Prediction, draws a discrete label from that, and then maps that discrete label to another Prediction.
- Parameters
name --
label_predictors -- a dictionary mapping from item state names to Predictions
-
make_predictions(item_states)¶ - Parameters
item_states -- a sequence of State objects, representing the states of some items
- Returns
- Return type
a sequence of Prediction objects, one for each item
Dataset Generators¶
-
class
surveyequivalence.synthetic_datasets.SyntheticDatasetGenerator(item_state_generator: surveyequivalence.synthetic_datasets.DistributionOverStates, num_items_per_dataset=1000, num_labels_per_item=10, mock_classifiers=None, name='')¶ Bases:
object
Generator for a set of items with some raters per item. Items are defined by States, which are drawn from a DistributionOverStates. Each State is a distribution over labels. Each label is an i.i.d. draw from the State.
- Parameters
item_state_generator --
num_items_per_dataset --
num_labels_per_item -- How many raters to generate labels for, for each item
mock_classifiers -- A list of MockClassifier instances, which generate label predictions based on the item state
name -- A text string naming this dataset generator
-
generate_labels(item_states, num_labels_per_item=None, rater_prefix='e')¶ Normally called with item_states=self.reference_rater_item_states
- Parameters
self --
item_states -- a list of States, one for each item
num_labels_per_item=None -- if None, use self.num_labels_per_item
rater_prefix="e" -- Rater columns are named as f"{rater_prefix}_{i}" where i is an integer
- Returns
- Return type
A pandas DataFrame with one row for each item and one column for each rater. Cells are labels.
-
class
surveyequivalence.synthetic_datasets.SyntheticBinaryDatasetGenerator(item_state_generator, num_items_per_dataset=50, num_labels_per_item=3, mock_classifiers=None, name=None, pct_noise=0.0, k_other_raters_per_label=1)¶ Bases:
surveyequivalence.synthetic_datasets.SyntheticDatasetGenerator
Dataset generator for binary labels
Only additional parameters for this subclass are documented here.
- Parameters
pct_noise=0 -- In addition to the reference rater labels, this generator can generate labels from "other" raters. With probability pct_noise the binary labels will be drawn from a 50-50 coin flip, and otherwise from the item's State. If pct_noise==0, the other raters' labels will always be i.i.d. draws from the same distribution as the reference rater labels.
k_other_raters_per_label=1 -- The number of other raters to generate labels for.
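The pct_noise mixture can be sketched as follows. This is an illustration under assumptions: the label names 'pos' and 'neg', the function name, and passing the state's positive-label probability as p_positive are all hypothetical.

```python
import random


def draw_other_rater_label(p_positive, pct_noise, rng):
    """Draw one binary label for an "other" rater.

    With probability pct_noise the label is a fair coin flip; otherwise it is
    drawn from the item's state, which yields 'pos' with probability p_positive.
    """
    if rng.random() < pct_noise:
        return 'pos' if rng.random() < 0.5 else 'neg'
    return 'pos' if rng.random() < p_positive else 'neg'


rng = random.Random(0)
noise_free = [draw_other_rater_label(1.0, 0.0, rng) for _ in range(100)]
noisy = [draw_other_rater_label(1.0, 0.5, rng) for _ in range(2000)]
print(noise_free.count('pos'), noisy.count('pos') / len(noisy))
```

Under this reading, the expected share of positive labels is (1 - pct_noise) * p_positive + pct_noise * 0.5, so noise pulls every state toward 50/50.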
-
make_histogram(ax)¶ - Parameters
ax -- A matplotlib Axes instance
-
plot_item_state_distribution()¶ called if you are making a standalone graph; for insets, .make_histogram is called directly
Dataset¶
-
class
surveyequivalence.synthetic_datasets.Dataset¶ Bases:
object
A Dataset
-
class
surveyequivalence.synthetic_datasets.SyntheticDataset(ds_generator: surveyequivalence.synthetic_datasets.SyntheticBinaryDatasetGenerator)¶ Bases:
surveyequivalence.synthetic_datasets.Dataset
Sets all the attributes by running the SyntheticBinaryDatasetGenerator.
- Parameters
ds_generator --
-
save(dirname='running_example')¶ Save ratings and predictions to csv files
- Parameters
dirname -- A subdirectory name in which to store saved results
include_timestamp_in_dirname -- Whether to append the current timestamp to the directory name
-
surveyequivalence.synthetic_datasets.make_running_example_dataset(num_items_per_dataset=10, num_labels_per_item=10, minimal=False, include_hard_classifier=False, include_soft_classifier=False) → surveyequivalence.synthetic_datasets.SyntheticDataset¶ This generates the running example dataset used in the original Survey Equivalence paper.
Three states: 70% high = 80/20; 10% med = 50/50; 20% low = 10/90
- Parameters
num_items_per_dataset --
num_labels_per_item --
minimal -- If minimal, use FixedStateGenerator, which generates labels in exact proportion to probabilities specified in the state, rather than each label being an iid draw from the State.
include_hard_classifier -- Includes a hard classifier which draws labels 90/10 for high state; 50/50 for medium; 5/95 for low state
include_soft_classifier -- Includes a soft classifier which runs the hard_classifier to generate a label and then maps it to a calibrated prediction (.7681 when the label is positive; .3226 when the label is negative). Also includes an ideal classifier that always predicts the probability given by the State of the item.
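The calibrated values .7681 and .3226 can be recovered from the state and hard-classifier proportions by Bayes' rule. The sketch below assumes that the first number in each pair (for example 80 in 80/20) is the probability of a positive label, and that the rater's label and the classifier's label are independent given the item's state; the variable names are hypothetical.

```python
# State name -> (share of items, P(rater label is positive | state))
states = {'high': (0.7, 0.8), 'med': (0.1, 0.5), 'low': (0.2, 0.1)}
# P(hard classifier outputs positive | state)
hard_classifier = {'high': 0.9, 'med': 0.5, 'low': 0.05}

# Marginal probability that the hard classifier outputs positive.
p_clf_pos = sum(share * hard_classifier[s] for s, (share, _) in states.items())
# Joint probabilities of a positive rater label with each classifier output.
p_pos_and_clf_pos = sum(
    share * hard_classifier[s] * p for s, (share, p) in states.items()
)
p_pos_and_clf_neg = sum(
    share * (1 - hard_classifier[s]) * p for s, (share, p) in states.items()
)

# P(rater label is positive | classifier output), i.e. the calibrated prediction.
calibrated_if_pos = p_pos_and_clf_pos / p_clf_pos
calibrated_if_neg = p_pos_and_clf_neg / (1 - p_clf_pos)
print(round(calibrated_if_pos, 4), round(calibrated_if_neg, 4))  # 0.7681 0.3226
```

That the result matches the documented values supports this reading of the proportions, though it does not show how the library itself computes them.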