API¶

Equivalence Module¶

AnalysisPipeline¶

class surveyequivalence.equivalence.AnalysisPipeline(W: pandas.core.frame.DataFrame, sparse_experts: bool = True, expert_cols: Sequence[str] = [], amateur_cols: Sequence[str] = [], classifier_predictions: Optional[pandas.core.frame.DataFrame] = None, combiner: Optional[surveyequivalence.combiners.Combiner] = None, scorer: Optional[surveyequivalence.scoring_functions.Scorer] = None, allowable_labels: Optional[Sequence[str]] = None, min_k=0, num_bootstrap_item_samples=100, max_rater_subsets=200, max_K=10, ratersets_memo=None, predictions_memo=None, item_samples=None, verbosity=1, run_on_creation=True, procs=1)¶

Bases: object

The main class for running an analysis

Parameters

W (pd.DataFrame) -- The ratings dataframe with one column for each rater, one row for each item
sparse_experts (bool) -- True (default) if some raters may not have rated all items
expert_cols (Sequence[str] = []) -- A list of column names, one for each potential "reference rater" whose label the classifier is trying to predict. These are also the columns used for computing the power curve for survey equivalence
amateur_cols (Sequence[str] = []) -- A list of column names, one for each potential "other rater". Their ratings are not used for evaluating the classifier, but a separate power curve may be computed for them, using surveys of k of them to predict a reference rater's label. Survey equivalences can also be calculated between j "other raters" and k reference raters.
classifier_predictions (pd.DataFrame = None) -- A dataframe with one column for each classifier for which we want to compute survey equivalences. One row for each item; row indexes should be the same as for W
combiner (Combiner = None) -- A combiner that is used to make a prediction about the next label for an item, given labels from some other raters.
scorer (Scorer = None) -- A scorer that takes a vector of predictions and a vector of realized reference rater labels and yields a numeric score.
allowable_labels (Sequence[str] = None) -- A list of the potential label strings that a rater is permitted to assign to an item
min_k=0 -- When computing power curves, the smallest survey size to include
num_bootstrap_item_samples=100 -- When computing error bars, how many bootstrap samples of items to create
max_rater_subsets=200 -- When computing power curves, we compute the average score over predictions made from many subsets of reference raters of size k. When k is small, we choose all subsets of size k. For larger k, we take a sample from the powerset. This parameter determines how many subsets to select.
max_K=10 -- When computing power curves, the largest survey size to include. Cannot be larger than the number of reference raters in W, minus one.
ratersets_memo=None -- While running, a dictionary is created to memoize certain computations, for efficiency. A value can be passed in to reuse the memoized computations from a previous run.
predictions_memo=None -- While running, a dictionary is created to memoize certain computations, for efficiency. A value can be passed in to reuse the memoized computations from a previous run.
item_samples=None -- If specified, the set of bootstrap item samples to use for computing error bars. If not specified, a new set of bootstrap item samples will be created.
verbosity=1 -- Controls how much information is printed to the console during execution. Set a higher number to help with debugging.
run_on_creation=True -- Whether to actually run the analysis pipeline
procs=pathos.helpers.cpu_count() - 1 -- How many processors are available for parallel execution
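
The subset selection described for max_rater_subsets can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name choose_rater_subsets, the seed argument, and the sampling strategy are all hypothetical.

```python
import itertools
import math
import random


def choose_rater_subsets(raters, k, max_rater_subsets=200, seed=0):
    """Return all size-k subsets if there are few enough, else a random sample.

    Hypothetical helper for illustration; not part of surveyequivalence.
    """
    n_subsets = math.comb(len(raters), k)
    if n_subsets <= max_rater_subsets:
        # Few enough subsets: enumerate every subset of size k.
        return list(itertools.combinations(raters, k))
    # Otherwise draw distinct random subsets until we have enough.
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < max_rater_subsets:
        chosen.add(tuple(sorted(rng.sample(raters, k))))
    return sorted(chosen)
```

With 12 raters, k=1 yields all 12 singletons, while k=6 (924 possible subsets) yields a sample of 200.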

output_csv(fname)¶: output the dataframe and the expert predictions

path_for_saving(dirname_base='analysis_pipeline', include_timestamp=True)¶

Parameters

dirname_base -- A name that describes the analysis; / will be treated as a subdirectory
include_timestamp -- Whether to make a folder indicating the timestamp at which the run was done.

Returns

A path of the form {self.run_timestamp}/{dirname_base}
If the path does not exist yet, it is created.

run()¶: Create the power curve(s); normally invoked during __init__ but can be called separately.

save(path=None, msg='', save_results=True)¶

Save instance and results to files

Parameters

dirname_base="analysis_pipeline" -- A subdirectory name in which to store saved results
msg -- A text string to write in a README file that is generated
save_results=True -- If True, generates a results_summary.txt file with power curve and survey equivalence summary stats

surveyequivalence.equivalence.load_saved_pipeline(path)¶: Loads dataset, predictions, classifier scores, and power curve(s) previously saved using surveyequivalence.equivalence.AnalysisPipeline.save()

Plot¶

class surveyequivalence.equivalence.Plot(ax, expert_power_curve, amateur_power_curve=None, classifier_scores=None, color_map={'amateur_power_curve': 'blue', 'classifier': 'green', 'expert_power_curve': 'black'}, y_axis_label='Agreement with reference rater', center_on=None, y_range=None, name='powercurve', legend_label='Expert raters', amateur_legend_label='Lay raters', verbosity=1, generate_pgf=False)¶

Bases: object

Generates visual display of power curve(s) and classifier scores, as matplotlib objects and as pgf for embedding in latex. First run AnalysisPipeline to generate the PowerCurve and ClassifierResults objects to pass in to constructor.

Parameters

ax (matplotlib.axes.Axes) --
expert_power_curve (PowerCurve) -- a PowerCurve with scores for combinations of k reference raters in predicting a held-out reference rater
amateur_power_curve=None (PowerCurve) -- a PowerCurve with scores for combinations of k other raters in predicting a held-out reference rater
classifier_scores=None (ClassifierResults) --
color_map={'expert_power_curve': 'black', 'amateur_power_curve': 'blue', 'classifier': 'green'} -- a dictionary specifying colors to use for the different elements of the graph to be plotted
y_axis_label='Agreement with reference rater' --
center_on=None (float) -- If a value is provided, it will be subtracted from all scores for classifiers and power curve values
y_range=None -- If specified, a tuple of two values, the min and max y-values for the graph
name='powercurve' -- A name for the plot
legend_label='Expert raters' -- Legend label for the power curve for reference raters
amateur_legend_label='Lay raters' -- Legend label for the power curve for other raters
verbosity=1 -- Controls how much information is printed to the console during execution. Set a higher number to help with debugging.
generate_pgf=False -- If True, also populate the data needed to create pgf output, suitable for inclusion in LaTeX. After calling .plot(), run self.template.substitute(**self.template_dict)

plot(include_expert_points='all', connect_expert_points=True, include_classifiers=True, include_classifier_equivalences=True, include_classifier_amateur_equivalences=False, other_rater_equivalences_to_include=[], include_droplines=True, include_amateur_curve=True, include_classifier_cis=True, include_seq_cis=True, x_ticks=None, legend_loc=None)¶

The method that fills in the contents of the matplotlib Axes object

Parameters

include_expert_points='all' -- 'all' means to plot all reference rater survey sizes on the x-axis of the power curve. Alternatively, pass a list of numbers indicating which survey sizes to include
connect_expert_points=True -- Whether to draw straight lines connecting the dots for survey power for surveys of successive sizes
include_classifiers=True -- Whether to include horizontal lines showing the classifier score(s)
include_classifier_equivalences=True -- Whether to include calculation of the equivalent number of reference raters for each classifier, based on the intersection point of the classifier line and the reference raters' power curve
include_classifier_amateur_equivalences=False -- Whether to include calculation of the equivalent number of other raters for each classifier, based on the intersection point of the classifier line and the other raters' power curve
other_rater_equivalences_to_include=[] -- A list of survey sizes for non-reference raters. For each one, compute the equivalent number of reference raters yielding the same score.
include_droplines=True -- Whether to include vertical lines from the intersection points (survey equivalences) to the x-axis
include_amateur_curve=True -- Whether to include a power curve for the other, non-reference raters
include_classifier_cis=True -- Whether to include error bars around the classifier horizontal lines
include_seq_cis=True -- Whether to include error bars around the survey equivalence values
x_ticks=None -- If provided, a list of x values for which tick marks should be shown. If None, then it will be automatically calculated.
legend_loc=None -- String indicating where to place the legend (uses default if None). Options as documented for matplotlib.axes.Axes.legend.

save(path: str, fig: matplotlib.figure, plotname='plot')¶

Wrapper for the matplotlib save_plot function. Saves all data to the ./plots directory as png and tex files.

Parameters

fig -- matplotlib figure object to be saved
name -- Name for the file

Equivalences¶

class surveyequivalence.equivalence.Equivalences(df)¶

Bases: object

Contains a dataframe with one row for each bootstrap sample of items and one column for each classifier. Each cell contains the survey equivalence value (the equivalent number of reference raters whose combined ratings yield the same score as the classifier).

property lower_bounds¶: Returns a pandas Series with a lower bound on the survey equivalence for each classifier. Computed from the interval covering 95% of the bootstrap samples of items, if there are more than 200 of them. Otherwise computed from two standard deviations of the scores on bootstrap samples.

property upper_bounds¶: Returns a pandas Series with an upper bound on the survey equivalence for each classifier. Computed from the interval covering 95% of the bootstrap samples of items, if there are more than 200 of them. Otherwise computed from two standard deviations of the scores on bootstrap samples.
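
The bounds rule for lower_bounds and upper_bounds can be sketched roughly as below. This is an illustration only: the helper name bootstrap_bounds is hypothetical, and centering the two-standard-deviation interval on the sample mean is an assumption that the text above does not confirm.

```python
import statistics


def bootstrap_bounds(samples, min_samples_for_percentiles=200):
    """Return (lower, upper) for a list of bootstrap values.

    Hypothetical helper; centering the fallback interval on the
    sample mean is an assumption.
    """
    if len(samples) > min_samples_for_percentiles:
        # Enough samples: take the central 95% interval.
        cuts = statistics.quantiles(samples, n=40, method="inclusive")
        return cuts[0], cuts[-1]  # 2.5th and 97.5th percentiles
    # Few samples: mean plus or minus two standard deviations.
    mean = statistics.mean(samples)
    spread = 2 * statistics.stdev(samples)
    return mean - spread, mean + spread
```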

ClassifierResults¶

class surveyequivalence.equivalence.ClassifierResults(runs: Optional[Sequence[Dict]] = None, df=None)¶

Bases: object

property lower_bounds¶: Returns a pandas Series with a lower bound on the survey equivalence for each classifier. Computed from the interval covering 95% of the bootstrap samples of items, if there are more than 200 of them. Otherwise computed from two standard deviations of the scores on bootstrap samples.

property upper_bounds¶: Returns a pandas Series with an upper bound on the survey equivalence for each classifier. Computed from the interval covering 95% of the bootstrap samples of items, if there are more than 200 of them. Otherwise computed from two standard deviations of the scores on bootstrap samples.

property values¶: Returns a Series of classifier scores for the first row, the actual item set, omitting results for all bootstrap item sets.

PowerCurve¶

class surveyequivalence.equivalence.PowerCurve(runs: Optional[Sequence[Dict]] = None, df=None)¶

Bases: surveyequivalence.equivalence.ClassifierResults

A special case of ClassifierResults where there is one column for each integer value k, representing the mean score, over many samples of k raters, of the predictions generated by combining ratings from k raters, scored against a reference rater.

compute_equivalence_at_actuals(classifier_score)¶: Compute the equivalence of the score of the classifier on the actual item sample based on the survey power curve computed for the actual item sample

compute_equivalence_at_mean(classifier_score)¶: Compute the equivalence of the mean score of the classifier across the bootstrap item samples based on the mean survey power curve computed across the bootstrap item samples

compute_equivalences(other, columns=None)¶

Parameters

self --
other -- The classifier scores that are compared against this PowerCurve to find equivalences. May be an instance of either ClassifierResults or PowerCurve. Must have the same row indexes as self, one for each item sample
columns -- a subset of the column names from other.df; if not specified, use all of them

Returns

Return type

a df with one row for each bootstrap run, and columns as specified by the columns parameter. Each cell is a float, the survey equivalence value for that column from other. That is, the x such that the expected score with x raters from self == classifier_score from other.
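
To make the idea of "the x such that the expected score with x raters equals the classifier score" concrete, here is a minimal sketch for a single power curve. It assumes linear interpolation between adjacent survey sizes, which may not match what the library does; the helper name survey_equivalence and the dict representation of the curve are hypothetical.

```python
def survey_equivalence(power_curve, classifier_score):
    """Find x such that the interpolated power curve equals classifier_score.

    power_curve maps survey size k to the expected score with k raters.
    Hypothetical helper: linear interpolation is an assumption, and
    scores outside the range of the curve return None.
    """
    ks = sorted(power_curve)
    for lo_k, hi_k in zip(ks, ks[1:]):
        lo, hi = power_curve[lo_k], power_curve[hi_k]
        if lo == classifier_score:
            return float(lo_k)
        if min(lo, hi) < classifier_score <= max(lo, hi):
            # Linear interpolation within the bracketing segment.
            frac = (classifier_score - lo) / (hi - lo)
            return lo_k + frac * (hi_k - lo_k)
    return None
```

For a curve {1: 0.5, 2: 0.75, 3: 0.875}, a classifier score of 0.625 sits halfway between the scores for one and two raters, giving an equivalence of 1.5.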

reliability_of_beating_classifier(other, k=1, other_col=1)¶

Parameters

other -- the other ClassifierResults or PowerCurve
self_col -- the survey size (column) for self
other_col -- the survey size (column) for other to compare, with matching bootstrap samples as rows

Returns

Return type

fraction of bootstrap runs where the power of self is higher than the power of other

reliability_of_difference(other, k=1)¶

Parameters

other -- another PowerCurve
k -- survey size

Returns

Return type

fraction of bootstrap runs where power@k is higher for self than for the other power curve
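
Both reliability methods report a fraction of paired bootstrap runs. A minimal sketch of that calculation, assuming the scores for the two curves are available as plain lists in matching bootstrap order (the helper name fraction_of_runs_higher is hypothetical):

```python
def fraction_of_runs_higher(self_scores, other_scores):
    """Fraction of paired bootstrap runs where self scores higher than other.

    Hypothetical helper; both lists hold one score per bootstrap item
    sample, in the same order.
    """
    if len(self_scores) != len(other_scores):
        raise ValueError("score lists must cover the same bootstrap samples")
    wins = sum(s > o for s, o in zip(self_scores, other_scores))
    return wins / len(self_scores)
```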

Combiners¶

class surveyequivalence.combiners.AnonymousBayesianCombiner(*args, **kwargs)¶

Bases: surveyequivalence.combiners.Combiner

Anonymous Bayesian Combiner Class

static D_k(labels: numpy.array, W: numpy.matrix, allowable_labels: Sequence[str]) → (float, int)¶

Compute the joint distribution over k anonymous ratings

Parameters

labels -- item labels from several raters
W -- item and rating dataset
allowable_labels -- The set of labels that can be entered by the raters.

Returns

Return type

joint distribution, and num_items

static D_k_item_contribution(labels: numpy.array, item: numpy.array, allowable_labels: Sequence[str]) → (float, float)¶

ProbabilityOfOneItem function in Algorithm 5. Computes the contribution of a single item to the combiner

Parameters

labels -- item labels from several raters
item -- The item under current consideration
allowable_labels -- The set of labels that can be entered by the raters.

Returns

Return type

The contribution of this item.

combine(allowable_labels: Sequence[str], labels: Sequence[Tuple[str, str]], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) → surveyequivalence.combiners.DiscreteDistributionPrediction ¶

Algorithm 6: compute the anonymous Bayesian combiner. Combines rater labels like frequency_combiner, but also uses the information from the item/rating dataset W.

Parameters

allowable_labels -- the set of labels/ratings allowed
labels -- the k ratings
W -- item and rating dataset
item_id -- item index in W
to_predict_for -- not used currently

Returns

Return type

Prediction based on anonymous bayesian combiner

class surveyequivalence.combiners.Combiner(allowable_labels: Optional[Sequence[str]] = None, verbosity=0)¶

Bases: abc.ABC

Abstract class defining a combiner.

A combiner selects a single label from a bag/multiset of labels (and possibly other information) according to some function. For example, the PluralityVote combiner accepts a bag of labels and returns the label that is most frequent.

abstract combine(allowable_labels: Sequence[str], labels: Sequence[Tuple[str, str]], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) → surveyequivalence.combiners.DiscreteDistributionPrediction ¶

class surveyequivalence.combiners.DiscreteDistributionPrediction(label_names, probabilities, extreme_cutoff=0.02, normalize=True)¶

Bases: surveyequivalence.combiners.Prediction

A discrete distribution prediction where labels are associated with probabilities. Value takes the label with the highest probability.

draw_discrete_label()¶

Return one of the labels, drawn according to the distribution

Returns
Return type: A label

label_probability(label)¶

Returns the probability associated with an input label

Parameters: label -- label to query
Returns
Return type: Probability associated with label.

property value¶

Return the single label that has the highest predicted probability. Break ties by taking the first one

>>> DiscreteDistributionPrediction(['a', 'b', 'c'], [.3, .4, .3]).value
'b'
>>> DiscreteDistributionPrediction(['a', 'b', 'c'], [.4, .4, .2]).value
'a'

Returns
Return type: label with highest probability

property value_prob¶

Return the probability of the majority class

>>> DiscreteDistributionPrediction(['a', 'b', 'c'], [.3, .4, .3]).value_prob
0.4
>>> DiscreteDistributionPrediction(['a', 'b', 'c'], [.4, .4, .2]).value_prob
0.4

Returns
Return type: highest probability
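
The tie-breaking rule for value ("break ties by taking the first one") can be illustrated with a small standalone function. This is a sketch only; the helper name most_likely_label is hypothetical and it skips the normalization and extreme_cutoff handling that the class constructor accepts.

```python
def most_likely_label(label_names, probabilities):
    """Return (label, probability) for the highest-probability label.

    Hypothetical helper mirroring the documented tie-break: when several
    labels share the highest probability, the first one wins.
    """
    best_index = 0
    for i, p in enumerate(probabilities):
        if p > probabilities[best_index]:  # strict > keeps the first on ties
            best_index = i
    return label_names[best_index], probabilities[best_index]
```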

class surveyequivalence.combiners.DiscretePrediction(label)¶

Bases: surveyequivalence.combiners.Prediction

A discrete prediction. value is defined as a label

property value¶

class surveyequivalence.combiners.FrequencyCombiner(allowable_labels: Optional[Sequence[str]] = None, verbosity=0)¶

Bases: surveyequivalence.combiners.Combiner

Returns a vector of frequencies for each label

combine(allowable_labels: Sequence[str], labels: Sequence[Tuple[str, str]], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) → surveyequivalence.combiners.DiscreteDistributionPrediction ¶

Returns the frequency vector for labels

>>> FrequencyCombiner().combine(['pos', 'neg'], np.array([(1, 'pos'), (2, 'neg'), (4, 'neg')])).probabilities
[0.3333333333333333, 0.6666666666666666]

>>> FrequencyCombiner().combine(['pos', 'neg'], np.array([(1, 'neg'), (2, 'neg'), (4, 'neg')])).probabilities
[0.0, 1.0]

Parameters

allowable_labels -- not used in this combiner
labels -- numeric values from particular rater ids; rater ids are ignored
W -- not used in this combiner
item_id -- not used in this combiner
to_predict_for -- not used in this combiner

Returns

Return type

Frequency vector of labels
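
A frequency vector of this kind can be computed with a few lines of standard-library Python. The sketch below is not the library's code; the helper name frequency_vector is hypothetical, and it assumes at least one rating carries an allowable label.

```python
from collections import Counter


def frequency_vector(allowable_labels, labels):
    """Return the share of ratings for each allowable label, in order.

    labels is a sequence of (rater_id, label) pairs; rater ids are ignored.
    Hypothetical helper; assumes at least one rating is present.
    """
    counts = Counter(label for _rater_id, label in labels)
    total = sum(counts[label] for label in allowable_labels)
    return [counts[label] / total for label in allowable_labels]
```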

class surveyequivalence.combiners.MeanCombiner(allowable_labels: Optional[Sequence[str]] = None, verbosity=0)¶

Bases: surveyequivalence.combiners.Combiner

Combiner that returns the mean of all the labels.

combine(allowable_labels: Optional[Sequence[str]] = None, labels: Sequence[Tuple[str, float]] = [], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) → surveyequivalence.combiners.NumericPrediction ¶

Returns the mean of the labels

Parameters

allowable_labels -- not used in this combiner
labels -- numeric values from particular rater ids; rater ids are ignored
W -- not used in this combiner
item_id -- not used in this combiner
to_predict_for -- not used in this combiner

Returns

Return type

The mean of the labels

class surveyequivalence.combiners.NumericPrediction(num)¶

Bases: surveyequivalence.combiners.Prediction

A numeric prediction. value is defined as a number

property value¶

class surveyequivalence.combiners.PluralityVote(allowable_labels: Optional[Sequence[str]] = None, verbosity=0)¶

Bases: surveyequivalence.combiners.Combiner

Combiner that returns the single label that is most frequent

combine(allowable_labels: Optional[Sequence[str]] = None, labels: Sequence[Tuple[str, float]] = [], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) → surveyequivalence.combiners.NumericPrediction ¶

Returns the single label that is most frequent

Parameters

allowable_labels -- not used in this combiner
labels -- numeric values from particular rater ids; rater ids are ignored
W -- not used in this combiner
item_id -- not used in this combiner
to_predict_for -- not used in this combiner

Returns

Return type

The most common label
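
Plurality voting over (rater_id, label) pairs can be sketched as below. The helper name plurality_label is hypothetical, and resolving ties in favor of the label seen first is an assumption rather than documented behavior.

```python
from collections import Counter


def plurality_label(labels):
    """Return the most frequent label among (rater_id, label) pairs.

    Hypothetical helper; ties go to the label seen first, which is an
    assumption rather than documented behaviour.
    """
    counts = Counter(label for _rater_id, label in labels)
    return counts.most_common(1)[0][0]
```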

class surveyequivalence.combiners.Prediction¶

Bases: abc.ABC

Abstract class that defines a value for many types of Predictions

abstract property value¶

Scoring Functions¶

class surveyequivalence.scoring_functions.AUCScore¶

Bases: surveyequivalence.scoring_functions.Scorer

static score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0) → float¶

AUC score. This function uses sklearn's AUC function, but does not work in many cases with multiple labels.

Parameters

classifier_predictions -- numeric values
rater_labels -- sequence of labels, which should be numeric values
verbosity --

Returns

Return type

AUC Score

class surveyequivalence.scoring_functions.AgreementScore¶

Bases: surveyequivalence.scoring_functions.Scorer

Agreement Scorer

static score(classifier_predictions: Sequence[str], rater_labels: Sequence[str], verbosity=0)¶

Agreement score measures the normalized number of times that the predictor matched the label. Akin to a typical accuracy score.

Parameters

classifier_predictions -- numeric values
rater_labels -- sequence of labels, which should be numeric values
verbosity --

Returns

Return type

Agreement score
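
As a plain-Python illustration of "the normalized number of times that the predictor matched the label", the following sketch computes the fraction of matching positions. The helper name agreement_score is hypothetical and this is not the library's implementation.

```python
def agreement_score(predictions, rater_labels):
    """Fraction of items where the prediction equals the rater's label.

    Hypothetical helper for illustration only.
    """
    if len(predictions) != len(rater_labels):
        raise ValueError("predictions and labels must have the same length")
    matches = sum(p == r for p, r in zip(predictions, rater_labels))
    return matches / len(rater_labels)
```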

class surveyequivalence.scoring_functions.Correlation¶

Bases: surveyequivalence.scoring_functions.Scorer

Computes the Pearson correlation coefficient.

static score(classifier_predictions: Sequence[surveyequivalence.combiners.NumericPrediction], rater_labels: Sequence[str], verbosity=0)¶

Parameters

classifier_predictions -- numeric values
rater_labels -- sequence of labels, which should be numeric values
verbosity --

Returns

Return type

Pearson correlation coefficient

class surveyequivalence.scoring_functions.CrossEntropyScore¶

Bases: surveyequivalence.scoring_functions.Scorer

Cross Entropy Scorer

static score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0)¶

Calculates the Cross Entropy of the two labels.

>>> CrossEntropyScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['b', 'b', 'b'])
0.594597099859

>>> CrossEntropyScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['a', 'b', 'b'])
0.87702971998

Parameters

classifier_predictions -- numeric values
rater_labels -- sequence of labels, which should be numeric values
verbosity --

Returns

Return type

Cross Entropy score
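
The two doctest values above are consistent with averaging the negative natural log of the probability assigned to each realized label. The sketch below reproduces them under that reading; the helper name, the dict representation of a prediction, and the choice of log base and sign are assumptions, not a statement of how the library computes its score.

```python
import math


def mean_negative_log_probability(predicted_probs, rater_labels):
    """Average of -ln(p), where p is the probability given to the realized label.

    predicted_probs is a list of {label: probability} dicts, one per item.
    Hypothetical helper; natural log and the sign convention are chosen
    to match the magnitudes in the doctests above.
    """
    total = 0.0
    for probs, label in zip(predicted_probs, rater_labels):
        total -= math.log(probs[label])
    return total / len(rater_labels)
```

With predictions [.3, .7], [.4, .6], [.6, .4] over labels 'a' and 'b', this gives about 0.5946 against ['b', 'b', 'b'] and about 0.8770 against ['a', 'b', 'b'].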

class surveyequivalence.scoring_functions.F1Score¶

Bases: surveyequivalence.scoring_functions.Scorer

static score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0, average: str = 'micro') → float¶

F1 score. This function uses sklearn's F1 function.

>>> F1Score.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['b', 'b', 'b'], average='micro')
0.6666666666666666
>>> F1Score.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['b', 'b', 'b'], average='macro')
0.39759036144

>>> F1Score.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['a', 'b', 'b'], average='micro')
0.3333333333333333
>>> F1Score.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['a', 'b', 'b'], average='macro')
0.25

Parameters

classifier_predictions -- numeric values
rater_labels -- sequence of labels, which should be numeric values
verbosity --
average -- macro or micro averaging

Returns

Return type

F1 Score
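
The difference between micro and macro averaging can be seen in a small from-scratch calculation. This sketch follows the standard textbook definitions and does not call sklearn, so it may differ from the library in edge cases; the helper name f1_scores is hypothetical.

```python
def f1_scores(predicted, actual):
    """Return (micro_f1, macro_f1) for two equal-length label sequences.

    Hypothetical helper written from the standard definitions; it does
    not call sklearn and may differ from it in edge cases.
    """
    labels = sorted(set(predicted) | set(actual))
    pairs = list(zip(predicted, actual))
    per_label_f1 = []
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = sum(p == label and a == label for p, a in pairs)
        fp = sum(p == label and a != label for p, a in pairs)
        fn = sum(p != label and a == label for p, a in pairs)
        denom = 2 * tp + fp + fn
        per_label_f1.append(2 * tp / denom if denom else 0.0)
        total_tp += tp
        total_fp += fp
        total_fn += fn
    # Micro: pool the counts over labels, then compute one F1.
    micro_denom = 2 * total_tp + total_fp + total_fn
    micro = 2 * total_tp / micro_denom if micro_denom else 0.0
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(per_label_f1) / len(per_label_f1)
    return micro, macro
```

For hard predictions ['b', 'b', 'a'] against labels ['a', 'b', 'b'], this gives a micro F1 of 1/3 and a macro F1 of 0.25.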

class surveyequivalence.scoring_functions.PrecisionScore¶

Bases: surveyequivalence.scoring_functions.Scorer

static score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0, average: str = 'micro') → float¶

Precision score. This function uses sklearn's precision function.

>>> PrecisionScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['b', 'b', 'b'], average='micro')
0.6666666666666666

>>> PrecisionScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['a', 'b', 'b'], average='micro')
0.3333333333333333

Parameters

classifier_predictions -- numeric values
rater_labels -- sequence of labels, which should be numeric values
verbosity --
average -- macro or micro averaging

Returns

Return type

Precision Score

class surveyequivalence.scoring_functions.RecallScore¶

Bases: surveyequivalence.scoring_functions.Scorer

static score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0, average: str = 'micro') → float¶

Recall score. This function uses sklearn's recall function.

>>> RecallScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['b', 'b', 'b'], average='micro')
0.6666666666666666
>>> RecallScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['b', 'b', 'b'], average='macro')
0.5

>>> RecallScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['a', 'b', 'b'], average='micro')
0.3333333333333333
>>> RecallScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]], ['a', 'b', 'b'], average='macro')
0.25

Parameters

classifier_predictions -- numeric values
rater_labels -- sequence of labels, which should be numeric values
verbosity --
average -- macro or micro averaging

Returns

Return type

Recall Score

class surveyequivalence.scoring_functions.Scorer¶

Bases: abc.ABC

Abstract class that defines a scorer as having a score() function. The scorer computes the goodness of a predictor against the average human rater.

abstract static score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction]) → float¶

score_classifier(classifier_predictions: Sequence, raters: Sequence, W, verbosity=0)¶

Driver function that computes the mean score over all predictions

Parameters

classifier_predictions -- Scoring predictions
raters -- The reference ratings. Score will compare classifier predictions with each rater in turn.
W -- The item and rating dataset
verbosity -- verbosity value from 1 to 4 indicating increased verbosity.

Returns

Return type

Mean score over all predictions for all raters.

Synthetic Dataset Generation¶

States¶

class surveyequivalence.synthetic_datasets.DiscreteState(state_name: str, labels: Sequence[str], probabilities: Sequence[float])¶

Bases: surveyequivalence.synthetic_datasets.State

A discrete distribution over possible labels

Parameters

state_name --
labels -- A sequence of strings; the allowable labels
probabilities -- A sequence of the same length, with values adding to one, giving probabilities for each of the label strings

draw_labels(n: int)¶

Make n iid draws of discrete labels from the distribution

Parameters: n -- How many labels to draw from the distribution
Returns
Return type: a single item or a numpy array
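
Drawing iid labels from a discrete distribution can be sketched with the standard library. This is an illustration, not the class's implementation: the function below is a hypothetical stand-in, it always returns a list, and its seed argument is an addition for reproducibility.

```python
import random


def draw_labels(labels, probabilities, n, seed=None):
    """Make n iid draws from a discrete distribution over labels.

    Hypothetical helper built on random.choices; the seed argument is
    an addition for reproducibility.
    """
    rng = random.Random(seed)
    return rng.choices(labels, weights=probabilities, k=n)
```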

Distributions Over States¶

class surveyequivalence.synthetic_datasets.DistributionOverStates¶

Bases: abc.ABC

Abstract base class

class surveyequivalence.synthetic_datasets.DiscreteDistributionOverStates(states: Sequence[surveyequivalence.synthetic_datasets.State], probabilities: Sequence[float])¶

Bases: surveyequivalence.synthetic_datasets.DistributionOverStates

Parameters

states -- a sequence of State objects
probabilities -- a same length sequence of floats representing probabilities of the item states

draw_states(n: int) → Sequence[surveyequivalence.synthetic_datasets.DiscreteState]¶

Parameters: n --
Returns
Return type: a single item or numpy array of State instances, drawn iid from the probability distribution

class surveyequivalence.synthetic_datasets.FixedStateGenerator(states: Sequence[surveyequivalence.synthetic_datasets.State], probabilities: Sequence[float])¶

Bases: surveyequivalence.synthetic_datasets.DiscreteDistributionOverStates

draw_states(n: int)¶

Draw exactly in proportion to probabilities, rather than making each draw at random according to the probabilities

Parameters: n -- How many items to draw

Returns
Return type: list of State instances
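
Drawing "exactly in proportion" can be pictured as allocating a fixed count to each state instead of sampling. The sketch below is one way to do that under stated assumptions: the helper name is hypothetical, and rounding each count to the nearest integer (and rejecting inputs where the counts do not sum to n) is a simplification.

```python
def draw_in_exact_proportion(states, probabilities, n):
    """Return a list of n states with counts proportional to probabilities.

    Hypothetical helper; rounding each count to the nearest integer is an
    assumption and requires that the rounded counts sum to n.
    """
    counts = [round(p * n) for p in probabilities]
    if sum(counts) != n:
        raise ValueError("probabilities do not divide n into whole counts")
    drawn = []
    for state, count in zip(states, counts):
        drawn.extend([state] * count)
    return drawn
```

With probabilities 0.7, 0.1, and 0.2 and n=10, the result always contains exactly 7, 1, and 2 copies of the three states.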

Mock Classifiers¶

class surveyequivalence.synthetic_datasets.MockClassifier(name: str, label_predictors: Dict[str, surveyequivalence.combiners.Prediction])¶

Bases: object

A mock classifier has access to each item's state when generating a prediction, something that a real classifier would not have access to

Parameters

name --
label_predictors -- a dictionary mapping from item state names to Predictions

make_predictions(item_states: Sequence[surveyequivalence.synthetic_datasets.State]) → Sequence[surveyequivalence.combiners.Prediction]¶

Parameters: item_states -- a sequence of State objects, representing the states of some items
Returns
Return type: a sequence of Prediction objects, one for each item

class surveyequivalence.synthetic_datasets.MappedDiscreteMockClassifier(name, label_predictors: Dict[str, surveyequivalence.combiners.Prediction], prediction_map: Dict[str, surveyequivalence.combiners.Prediction])¶

Bases: surveyequivalence.synthetic_datasets.MockClassifier

A mock classifier that maps an item state to a Prediction, draws a discrete label from that, and then maps that discrete label to another Prediction.

Parameters

name --
label_predictors -- a dictionary mapping from item state names to Predictions

make_predictions(item_states)¶

Parameters: item_states -- a sequence of State objects, representing the states of some items
Returns
Return type: a sequence of Prediction objects, one for each item

Dataset Generators¶

class surveyequivalence.synthetic_datasets.SyntheticDatasetGenerator(item_state_generator: surveyequivalence.synthetic_datasets.DistributionOverStates, num_items_per_dataset=1000, num_labels_per_item=10, mock_classifiers=None, name='')¶

Bases: object

Generator for a set of items with some raters per item. Items are defined by States, which are drawn from a DistributionOverStates. Each State is a distribution over labels. Each label is an i.i.d. draw from the State

Parameters

item_state_generator --
num_items_per_dataset --
num_labels_per_item -- How many raters to generate labels for, for each item
mock_classifiers -- A list of MockClassifier instances, which generate label predictions based on the item state
name -- A text string naming this dataset generator

generate_labels(item_states, num_labels_per_item=None, rater_prefix='e')¶

Normally called with item_states=self.reference_rater_item_states

Parameters

self --
item_states -- a list of States, one for each item
num_labels_per_item=None -- if None, use self.num_labels_per_item
rater_prefix="e" -- Rater columns are named as f"{rater_prefix}_{i}" where i is an integer

Returns

Return type

A pandas DataFrame with one row for each item and one column for each rater. Cells are labels.

class surveyequivalence.synthetic_datasets.SyntheticBinaryDatasetGenerator(item_state_generator, num_items_per_dataset=50, num_labels_per_item=3, mock_classifiers=None, name=None, pct_noise=0.0, k_other_raters_per_label=1)¶

Bases: surveyequivalence.synthetic_datasets.SyntheticDatasetGenerator

Dataset generator for binary labels

Only additional parameters for this subclass are documented here.

Parameters

pct_noise=0 -- In addition to the reference rater labels, this generator can generate labels from "other" raters. With probability pct_noise the binary labels will be drawn from a 50-50 coin flip, and otherwise from the item's State. If pct_noise==0, the other raters' labels will always be i.i.d. draws from the same distribution as the reference rater labels.
k_other_raters_per_label=1 -- The number of other raters to generate labels for.
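
The pct_noise mechanism can be sketched for a single label as below. This is an illustration under stated assumptions, not the generator's code: the helper name and the label strings 'pos' and 'neg' are hypothetical.

```python
import random


def draw_other_rater_label(p_positive, pct_noise, rng):
    """Draw one binary label for an "other" rater.

    p_positive is the item State's probability of a positive label.
    Hypothetical helper; the label strings 'pos' and 'neg' are assumptions.
    """
    if rng.random() < pct_noise:
        # Noise branch: ignore the item state and flip a fair coin.
        return 'pos' if rng.random() < 0.5 else 'neg'
    # Otherwise draw from the item's own State.
    return 'pos' if rng.random() < p_positive else 'neg'
```

With pct_noise=0 the label always comes from the item's State; with pct_noise=1 it is a pure coin flip regardless of the State.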

make_histogram(ax)¶

Parameters: ax -- A matplotlib Axes instance

plot_item_state_distribution()¶: called if you are making a standalone graph; for insets, .make_histogram is called directly

Dataset¶

class surveyequivalence.synthetic_datasets.Dataset¶

Bases: object

A Dataset

class surveyequivalence.synthetic_datasets.SyntheticDataset(ds_generator: surveyequivalence.synthetic_datasets.SyntheticBinaryDatasetGenerator)¶

Bases: surveyequivalence.synthetic_datasets.Dataset

Parameters

ds_generator --

Sets all the attributes by running the SyntheticBinaryDatasetGenerator

save(dirname='running_example')¶

Save ratings and predictions to csv files

Parameters

dirname -- A subdirectory name in which to store saved results
include_timestamp_in_dirname -- Whether to append the current timestamp to the directory name

surveyequivalence.synthetic_datasets.make_running_example_dataset(num_items_per_dataset=10, num_labels_per_item=10, minimal=False, include_hard_classifier=False, include_soft_classifier=False) → surveyequivalence.synthetic_datasets.SyntheticDataset ¶

This generates the running example dataset used in the original Survey Equivalence paper.

Three states: 70% high = 80/20, 10% med = 50/50; 20% low = 10/90

Parameters

num_items_per_dataset --
num_labels_per_item --
minimal -- If minimal, use FixedStateGenerator, which generates labels in exact proportion to probabilities specified in the state, rather than each label being an iid draw from the State.
include_hard_classifier -- Includes a hard classifier which draws labels 90/10 for high state; 50/50 for medium; 5/95 for low state
include_soft_classifier -- Includes a soft classifier which runs the hard_classifier to generate a label and then maps it to a calibrated prediction (.7681 when the label is positive; .3226 when the label is negative). Also includes an ideal classifier that always predicts the probability given by the State of the item.
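
The calibrated values .7681 and .3226 can be recovered from the state mixture and the hard classifier's label rates by Bayes' rule. The worked sketch below assumes that each ratio such as 80/20 means positive/negative, and that the rater's label and the hard classifier's label are independent given the item's state; the names STATES and calibrated_prediction are hypothetical.

```python
# (share of items, P(rater says positive), P(hard classifier says positive))
STATES = {
    'high': (0.7, 0.8, 0.90),
    'med': (0.1, 0.5, 0.50),
    'low': (0.2, 0.1, 0.05),
}


def calibrated_prediction(classifier_says_positive):
    """P(rater label is positive | hard classifier output), by Bayes' rule.

    Hypothetical helper; assumes rater and classifier labels are
    independent given the item's state.
    """
    joint = 0.0     # P(classifier output and rater positive)
    marginal = 0.0  # P(classifier output)
    for share, p_rater_pos, p_clf_pos in STATES.values():
        p_clf = p_clf_pos if classifier_says_positive else 1 - p_clf_pos
        marginal += share * p_clf
        joint += share * p_clf * p_rater_pos
    return joint / marginal
```

Here calibrated_prediction(True) is 0.53 / 0.69, about .7681, and calibrated_prediction(False) is 0.10 / 0.31, about .3226.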
