API

Equivalence Module

AnalysisPipeline

class surveyequivalence.equivalence.AnalysisPipeline(W: pandas.core.frame.DataFrame, sparse_experts: bool = True, expert_cols: Sequence[str] = [], amateur_cols: Sequence[str] = [], classifier_predictions: Optional[pandas.core.frame.DataFrame] = None, combiner: Optional[surveyequivalence.combiners.Combiner] = None, scorer: Optional[surveyequivalence.scoring_functions.Scorer] = None, allowable_labels: Optional[Sequence[str]] = None, min_k=0, num_bootstrap_item_samples=100, max_rater_subsets=200, max_K=10, ratersets_memo=None, predictions_memo=None, item_samples=None, verbosity=1, run_on_creation=True, procs=1)

Bases: object

The main class for running an analysis

Parameters
  • W (pd.DataFrame) -- The ratings dataframe with one column for each rater, one row for each item

  • sparse_experts (bool) -- True (default) if some raters may not have rated all items

  • expert_cols (Sequence[str] = []) -- A list of column names, one for each potential "reference rater" whose labels the classifier is trying to predict. These are also the columns used for computing the power curve for survey equivalence

  • amateur_cols (Sequence[str] = []) -- A list of column names, one for each potential "other rater". Their ratings are not used for evaluating the classifier, but a separate power curve may be computed for them, using surveys of k of them to predict a reference rater's label. Survey equivalences can also be calculated between j "other raters" and k reference raters.

  • classifier_predictions (pd.DataFrame = None) -- A dataframe with one column for each classifier for which we want to compute survey equivalences. One row for each item; row indexes should be the same as for W

  • combiner (Combiner = None) -- A combiner that is used to make a prediction about the next label for an item, given labels from some other raters.

  • scorer (Scorer = None) -- A scorer that takes a vector of predictions and a vector of realized reference rater labels and yields a numeric score.

  • allowable_labels (Sequence[str] = None) -- A list of the potential label strings that a rater is permitted to assign to an item

  • min_k=0 -- When computing power curves, the smallest survey size to include

  • num_bootstrap_item_samples=100 -- When computing error bars, how many bootstrap samples of items to create

  • max_rater_subsets=200 -- When computing power curves, we compute the average score over predictions made from many subsets of reference raters of size k. When k is small, we choose all subsets of size k. For larger k, we take a sample from the powerset. This parameter determines how many subsets to select.

  • max_K=10 -- When computing power curves, the largest survey size to include. Cannot be larger than the number of reference raters in W, minus one.

  • ratersets_memo=None -- While running, a dictionary is created to memoize certain computations, for efficiency. A value may be passed in to reuse the memoized computations from a previous run.

  • predictions_memo=None -- While running, a dictionary is created to memoize certain computations, for efficiency. A value may be passed in to reuse the memoized computations from a previous run.

  • item_samples=None -- If specified, the set of bootstrap item samples to use for computing error bars. If not specified, a new set of bootstrap item samples will be created.

  • verbosity=1 -- Controls how much information is printed to the console during execution. Set a higher number to help with debugging.

  • run_on_creation=True -- Whether to actually run the analysis pipeline

  • procs=pathos.helpers.cpu_count() - 1 -- How many processors are available for parallel execution
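The subset selection described for max_rater_subsets can be sketched as follows. This is a minimal standalone illustration, not the library's code: the function name sample_rater_subsets, the seed argument, and the rule "enumerate everything when the number of subsets is at most max_rater_subsets" are assumptions made for the example.

```python
import itertools
import math
import random


def sample_rater_subsets(raters, k, max_rater_subsets=200, seed=0):
    """Return subsets of size k: all of them if few enough, else a random sample.

    Hypothetical helper; the threshold rule is an assumption, not the
    library's documented behavior.
    """
    raters = list(raters)
    total = math.comb(len(raters), k)
    if total <= max_rater_subsets:
        # Small case: enumerate every subset of size k.
        return [tuple(c) for c in itertools.combinations(raters, k)]
    # Large case: draw distinct random subsets until we have enough.
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < max_rater_subsets:
        chosen.add(tuple(sorted(rng.sample(raters, k))))
    return sorted(chosen)


raters = [f"e_{i}" for i in range(1, 13)]   # 12 hypothetical rater columns
small = sample_rater_subsets(raters, 2)     # C(12, 2) = 66, so all subsets
large = sample_rater_subsets(raters, 6)     # C(12, 6) = 924, so a sample of 200
```

With 12 raters, surveys of size 2 are enumerated exhaustively, while surveys of size 6 are capped at max_rater_subsets.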

output_csv(fname)

Outputs the dataframe and the expert predictions.

path_for_saving(dirname_base='analysis_pipeline', include_timestamp=True)
Parameters
  • dirname_base -- A name that describes the analysis; / will be treated as a subdirectory

  • include_timestamp -- Whether to make a folder indicating the timestamp at which the run was done.

Returns

  • A path of the form {self.run_timestamp}/{dirname_base}

  • If the path does not exist yet, it is created.

run()

Create the power curve(s); normally invoked during __init__ but can be called separately.

save(path=None, msg='', save_results=True)

Save instance and results to files

Parameters
  • dirname_base="analysis_pipeline" -- A subdirectory name in which to store saved results

  • msg -- A text string to write in a README file that is generated

  • save_results=True -- If True, generates a results_summary.txt file with power curve and survey equivalence summary stats

surveyequivalence.equivalence.load_saved_pipeline(path)

Loads dataset, predictions, classifiers scores, and power curve(s) previously saved using surveyequivalence.equivalence.AnalysisPipeline.save()

Plot

class surveyequivalence.equivalence.Plot(ax, expert_power_curve, amateur_power_curve=None, classifier_scores=None, color_map={'amateur_power_curve': 'blue', 'classifier': 'green', 'expert_power_curve': 'black'}, y_axis_label='Agreement with reference rater', center_on=None, y_range=None, name='powercurve', legend_label='Expert raters', amateur_legend_label='Lay raters', verbosity=1, generate_pgf=False)

Bases: object

Generates a visual display of power curve(s) and classifier scores, as matplotlib objects and as pgf for embedding in LaTeX. First run AnalysisPipeline to generate the PowerCurve and ClassifierResults objects to pass in to the constructor.

Parameters
  • ax (matplotlib.axes.Axes) --

  • expert_power_curve (PowerCurve) -- a PowerCurve with scores for combinations of k reference raters in predicting a held-out reference rater

  • amateur_power_curve=None (PowerCurve) -- a PowerCurve with scores for combinations of k other raters in predicting a held-out reference rater

  • classifier_scores=None (ClassifierResults) --

  • color_map={'expert_power_curve': 'black', 'amateur_power_curve': 'blue', 'classifier': 'green'} -- a dictionary specifying colors to use for the different elements of the graph to be plotted

  • y_axis_label='Agreement with reference rater' --

  • center_on=None (float) -- If a value is provided, it will be subtracted from all scores for classifiers and power curve values

  • y_range=None -- If specified, a tuple of two values, the min and max y-values for the graph

  • name='powercurve' -- A name for the plot

  • legend_label='Expert raters' -- Legend label for the power curve for reference raters

  • amateur_legend_label='Lay raters' -- Legend label for the power curve for other raters

  • verbosity=1 -- Controls how much information is printed to the console during execution. Set a higher number to help with debugging.

  • generate_pgf=False -- If True, also populate the data needed to create pgf output, suitable for inclusion in LaTeX. After calling .plot(), run self.template.substitute(**self.template_dict)

plot(include_expert_points='all', connect_expert_points=True, include_classifiers=True, include_classifier_equivalences=True, include_classifier_amateur_equivalences=False, other_rater_equivalences_to_include=[], include_droplines=True, include_amateur_curve=True, include_classifier_cis=True, include_seq_cis=True, x_ticks=None, legend_loc=None)

The method that fills in the contents of the matplotlib Axes object

Parameters
  • include_expert_points='all' -- 'all' means to plot all reference rater survey sizes on the x-axis of the power curve. Alternatively, pass a list of numbers indicating which survey sizes to include

  • connect_expert_points=True -- Whether to draw straight lines connecting the dots for survey power for surveys of successive sizes

  • include_classifiers=True -- Whether to include horizontal lines showing the classifier score(s)

  • include_classifier_equivalences=True -- Whether to include calculation of the equivalent number of reference raters for each classifier, based on the intersection point of the classifier line and the reference raters' power curve

  • include_classifier_amateur_equivalences=False -- Whether to include calculation of the equivalent number of other raters for each classifier, based on the intersection point of the classifier line and the other raters' power curve

  • other_rater_equivalences_to_include=[] -- A list of survey sizes for non-reference raters. For each one, compute the equivalent number of reference raters yielding the same score.

  • include_droplines=True -- Whether to include vertical lines from the intersection points (survey equivalences) to the x-axis

  • include_amateur_curve=True -- Whether to include a power curve for the other, non-reference raters

  • include_classifier_cis=True -- Whether to include error bars around the classifier horizontal lines

  • include_seq_cis=True -- Whether to include error bars around the survey equivalence values

  • x_ticks=None -- If provided, a list of x values for which tick marks should be shown. If None, then it will be automatically calculated.

  • legend_loc=None -- String indicating where to place the legend (uses default if None). Options as documented for matplotlib.axes.Axes.legend.

save(path: str, fig: matplotlib.figure, plotname='plot')

Wrapper for the matplotlib save_plot function. Saves all data to the ./plots directory as png and tex files.

Parameters
  • fig -- matplotlib figure object to be saved

  • plotname -- Name for the file

Equivalences

class surveyequivalence.equivalence.Equivalences(df)

Bases: object

Contains a dataframe with one row for each bootstrap sample of items and one column for each classifier. Each cell contains the survey equivalence value (the equivalent number of reference raters whose combined ratings yield the same score as the classifier).

property lower_bounds

Returns a pandas Series with a lower bound on the survey equivalence for each classifier. Computed from the interval covering 95% of the bootstrap samples of items, if there are more than 200 of them. Otherwise computed from two standard deviations of the scores on bootstrap samples.

property upper_bounds

Returns a pandas Series with an upper bound on the survey equivalence for each classifier. Computed from the interval covering 95% of the bootstrap samples of items, if there are more than 200 of them. Otherwise computed from two standard deviations of the scores on bootstrap samples.
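The two-regime rule for the bounds can be sketched on a plain list of bootstrap values. This is an illustrative reading of the description, not the library's implementation: the function name bootstrap_bounds, the simple index-based percentile, and centering the fallback interval on the mean are all assumptions.

```python
import statistics


def bootstrap_bounds(samples, min_samples_for_percentiles=200):
    """Return (lower, upper) bounds from bootstrap values.

    Hypothetical helper. With more than 200 samples, take the interval
    covering the middle 95% of values; otherwise fall back to
    mean +/- 2 standard deviations (the centering is an assumption).
    """
    values = sorted(samples)
    n = len(values)
    if n > min_samples_for_percentiles:
        lower = values[int(0.025 * n)]
        upper = values[int(0.975 * n) - 1]
        return lower, upper
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - 2 * sd, mean + 2 * sd


many = bootstrap_bounds(range(1000))      # percentile regime
few = bootstrap_bounds([1, 2, 3, 4, 5])   # standard-deviation regime
```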

ClassifierResults

class surveyequivalence.equivalence.ClassifierResults(runs: Optional[Sequence[Dict]] = None, df=None)

Bases: object

property lower_bounds

Returns a pandas Series with a lower bound on the survey equivalence for each classifier. Computed from the interval covering 95% of the bootstrap samples of items, if there are more than 200 of them. Otherwise computed from two standard deviations of the scores on bootstrap samples.

property upper_bounds

Returns a pandas Series with an upper bound on the survey equivalence for each classifier. Computed from the interval covering 95% of the bootstrap samples of items, if there are more than 200 of them. Otherwise computed from two standard deviations of the scores on bootstrap samples.

property values

Returns a Series of classifier scores for the first row (the actual item set), omitting results for all bootstrap item sets.

PowerCurve

class surveyequivalence.equivalence.PowerCurve(runs: Optional[Sequence[Dict]] = None, df=None)

Bases: surveyequivalence.equivalence.ClassifierResults

A special case of ClassifierResults where there is one column for each integer value k, representing the mean score, over many samples of k raters, of the predictions generated by combining ratings from k raters, scored against a reference rater.

compute_equivalence_at_actuals(classifier_score)

Compute the equivalence of the score of the classifier on the actual item sample based on the survey power curve computed for the actual item sample

compute_equivalence_at_mean(classifier_score)

Compute the equivalence of the mean score of the classifier across the bootstrap item samples based on the mean survey power curve computed across the bootstrap item samples

compute_equivalences(other, columns=None)
Parameters
  • self --

  • other -- The classifier scores that are compared against this PowerCurve to find equivalences. May be either an instance of ClassifierResults or a PowerCurve. Must have the same row indexes as self, one for each item sample

  • columns -- a subset of the column names from other.df; if not specified, use all of them

Returns

Return type

a df with one row for each bootstrap run, and columns as specified by the columns parameter. Each cell is a float, the survey equivalence value for that column from other; that is, the x such that the expected score with x raters from self equals the classifier score from other.
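For a single row, finding the x at which the power curve equals a classifier score can be sketched with linear interpolation between adjacent survey sizes. This is a simplified illustration under stated assumptions: the function name survey_equivalence, the dict representation of the curve, the increasing curve, and the choice of linear interpolation are not taken from the library.

```python
def survey_equivalence(power_curve, classifier_score):
    """Find x such that the interpolated power curve equals classifier_score.

    power_curve: dict mapping survey size k to expected score (assumed
    increasing in k). Returns None when the score lies outside the range
    covered by the curve. Hypothetical helper for illustration only.
    """
    ks = sorted(power_curve)
    for k_lo, k_hi in zip(ks, ks[1:]):
        s_lo, s_hi = power_curve[k_lo], power_curve[k_hi]
        if s_hi > s_lo and s_lo <= classifier_score <= s_hi:
            # Linear interpolation between the two neighboring survey sizes.
            frac = (classifier_score - s_lo) / (s_hi - s_lo)
            return k_lo + frac * (k_hi - k_lo)
    return None


curve = {1: 0.60, 2: 0.70, 3: 0.75}       # made-up scores for illustration
x_mid = survey_equivalence(curve, 0.65)   # halfway between k=1 and k=2
x_out = survey_equivalence(curve, 0.90)   # above the curve: no equivalence
```

Applying the same computation to each bootstrap row and each column of other would give a table of the shape described above.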

reliability_of_beating_classifier(other, k=1, other_col=1)
Parameters
  • other -- the other ClassifierResults or PowerCurve

  • self_col -- the survey size (column) for self

  • other_col -- the survey size (column) for other to compare, with matching bootstrap samples as rows

Returns

Return type

fraction of bootstrap runs where the power for self is higher than the power for other

reliability_of_difference(other, k=1)
Parameters
  • other -- another PowerCurve

  • k -- survey size

Returns

Return type

fraction of bootstrap runs where power@k is higher for self than for the other power curve
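The returned quantity is a paired comparison across bootstrap runs. A minimal sketch on two plain lists of scores, with matching bootstrap samples at the same positions (the function name fraction_higher and the numbers are hypothetical):

```python
def fraction_higher(self_scores, other_scores):
    """Fraction of paired bootstrap runs where self scores strictly higher."""
    pairs = list(zip(self_scores, other_scores))
    return sum(a > b for a, b in pairs) / len(pairs)


# Made-up power@k values for four bootstrap runs of two power curves.
self_at_k = [0.70, 0.72, 0.69, 0.71]
other_at_k = [0.68, 0.73, 0.65, 0.70]
reliability = fraction_higher(self_at_k, other_at_k)
```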

Combiners

class surveyequivalence.combiners.AnonymousBayesianCombiner(*args, **kwargs)

Bases: surveyequivalence.combiners.Combiner

Anonymous Bayesian Combiner Class

static D_k(labels: numpy.array, W: numpy.matrix, allowable_labels: Sequence[str]) -> (<class 'float'>, <class 'int'>)

Compute the joint distribution over k anonymous ratings

Parameters
  • labels (item labels from several raters) --

  • W (item and rating dataset) --

  • allowable_labels (The set of labels that can be entered by the raters.) --

Returns

Return type

joint distribution, and num_items

static D_k_item_contribution(labels: numpy.array, item: numpy.array, allowable_labels: Sequence[str]) -> (<class 'float'>, <class 'float'>)

ProbabilityOfOneItem function in Algorithm 5. Computes the contribution of a single item to the combiner

Parameters
  • labels (item labels from several raters) --

  • item (The item under current consideration) --

  • allowable_labels (The set of labels that can be entered by the raters.) --

Returns

Return type

The contribution of this item.

combine(allowable_labels: Sequence[str], labels: Sequence[Tuple[str, str]], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) -> surveyequivalence.combiners.DiscreteDistributionPrediction

Algorithm 6: computes the anonymous Bayesian combiner. Combines rater labels like frequency_combiner, but this uses the information from the item/rating dataset W.

Parameters
  • allowable_labels (the set of labels/ratings allowed) --

  • labels (the k ratings) --

  • W (item and rating dataset) --

  • item_id (item index in W) --

  • to_predict_for (not used currently) --

Returns

Return type

Prediction based on anonymous bayesian combiner

class surveyequivalence.combiners.Combiner(allowable_labels: Optional[Sequence[str]] = None, verbosity=0)

Bases: abc.ABC

Abstract class defining a combiner.

A combiner selects a single label from a bag/multiset of labels (and possibly other information) according to some function. For example, the PluralityVote combiner accepts a bag of labels and returns the label that is most frequent.

abstract combine(allowable_labels: Sequence[str], labels: Sequence[Tuple[str, str]], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) -> surveyequivalence.combiners.DiscreteDistributionPrediction

class surveyequivalence.combiners.DiscreteDistributionPrediction(label_names, probabilities, extreme_cutoff=0.02, normalize=True)

Bases: surveyequivalence.combiners.Prediction

A discrete distribution prediction where labels are associated with probabilities. Value takes the label with the highest probability.

draw_discrete_label()

Return one of the labels, drawn according to the distribution

Returns

Return type

A label

label_probability(label)

Returns the probability associated with an input label

Parameters

label (label to query) --

Returns

Return type

Probability associated with label.

property value

Return the single label that has the highest predicted probability. Break ties by taking the first one

>>> DiscreteDistributionPrediction(['a', 'b', 'c'], [.3, .4, .3]).value
'b'
>>> DiscreteDistributionPrediction(['a', 'b', 'c'], [.4, .4, .2]).value
'a'
Returns

Return type

label with highest probability

property value_prob

Return the probability of the majority class

>>> DiscreteDistributionPrediction(['a', 'b', 'c'], [.3, .4, .3]).value_prob
0.4
>>> DiscreteDistributionPrediction(['a', 'b', 'c'], [.4, .4, .2]).value_prob
0.4
Returns

Return type

highest probability

class surveyequivalence.combiners.DiscretePrediction(label)

Bases: surveyequivalence.combiners.Prediction

A discrete prediction. value is defined as a label

property value
class surveyequivalence.combiners.FrequencyCombiner(allowable_labels: Optional[Sequence[str]] = None, verbosity=0)

Bases: surveyequivalence.combiners.Combiner

Returns a vector of frequencies for each label

combine(allowable_labels: Sequence[str], labels: Sequence[Tuple[str, str]], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) -> surveyequivalence.combiners.DiscreteDistributionPrediction

Returns the frequency vector for labels

>>> FrequencyCombiner().combine(['pos', 'neg'], np.array([(1, 'pos'), (2, 'neg'), (4, 'neg')]), ).probabilities
[0.3333333333333333, 0.6666666666666666]
>>> FrequencyCombiner().combine(['pos', 'neg'], np.array([(1, 'neg'), (2, 'neg'), (4, 'neg')])).probabilities
[0.0, 1.0]
Parameters
  • allowable_labels (not used in this combiner) --

  • labels (numeric values from particular rater ids; rater ids are ignored) --

  • W (not used in this combiner) --

  • item_id (not used in this combiner) --

  • to_predict_for (not used in this combiner) --

Returns

Return type

Frequency vector of labels

class surveyequivalence.combiners.MeanCombiner(allowable_labels: Optional[Sequence[str]] = None, verbosity=0)

Bases: surveyequivalence.combiners.Combiner

Combiner that returns the mean of all the labels.

combine(allowable_labels: Optional[Sequence[str]] = None, labels: Sequence[Tuple[str, float]] = [], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) -> surveyequivalence.combiners.NumericPrediction

Returns the mean of the labels

Parameters
  • allowable_labels (not used in this combiner) --

  • labels (numeric values from particular rater ids; rater ids are ignored) --

  • W (not used in this combiner) --

  • item_id (not used in this combiner) --

  • to_predict_for (not used in this combiner) --

Returns

Return type

The mean of the labels

class surveyequivalence.combiners.NumericPrediction(num)

Bases: surveyequivalence.combiners.Prediction

A numeric prediction. value is defined as a number

property value
class surveyequivalence.combiners.PluralityVote(allowable_labels: Optional[Sequence[str]] = None, verbosity=0)

Bases: surveyequivalence.combiners.Combiner

Combiner that returns the single label that is most frequent

combine(allowable_labels: Optional[Sequence[str]] = None, labels: Sequence[Tuple[str, float]] = [], W: Optional[numpy.matrix] = None, item_id=None, to_predict_for=None) -> surveyequivalence.combiners.NumericPrediction

Returns the single label that is most frequent

Parameters
  • allowable_labels (not used in this combiner) --

  • labels (numeric values from particular rater ids; rater ids are ignored) --

  • W (not used in this combiner) --

  • item_id (not used in this combiner) --

  • to_predict_for (not used in this combiner) --

Returns

Return type

The most common label
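The three simple combining rules (frequency vector, mean, plurality) can be sketched in plain Python on (rater id, label) pairs. These standalone functions are hypothetical stand-ins, not the library classes; in particular, using allowable_labels to fix the order of the frequency vector is an assumption inferred from the FrequencyCombiner examples.

```python
from collections import Counter


def frequency_vector(allowable_labels, labels):
    """Share of ratings for each allowable label; rater ids are ignored."""
    counts = Counter(label for _rater, label in labels)
    return [counts[label] / len(labels) for label in allowable_labels]


def mean_label(labels):
    """Mean of numeric ratings; rater ids are ignored."""
    return sum(value for _rater, value in labels) / len(labels)


def plurality_label(labels):
    """The single most frequent label; rater ids are ignored."""
    counts = Counter(label for _rater, label in labels)
    return counts.most_common(1)[0][0]


ratings = [(1, 'pos'), (2, 'neg'), (4, 'neg')]
freqs = frequency_vector(['pos', 'neg'], ratings)
winner = plurality_label(ratings)
average = mean_label([(1, 4.0), (2, 2.0), (3, 3.0)])
```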

class surveyequivalence.combiners.Prediction

Bases: abc.ABC

Abstract class that defines a value for many types of Predictions

abstract property value

Scoring Functions

class surveyequivalence.scoring_functions.AUCScore

Bases: surveyequivalence.scoring_functions.Scorer

static score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0) -> float

AUC score. This function uses sklearn's AUC function, but does not work in many cases with multiple labels.

Parameters
  • classifier_predictions (numeric values) --

  • rater_labels (sequence of labels, which should be numeric values) --

  • verbosity --

Returns

Return type

AUC Score

class surveyequivalence.scoring_functions.AgreementScore

Bases: surveyequivalence.scoring_functions.Scorer

Agreement Scorer

static score(classifier_predictions: Sequence[str], rater_labels: Sequence[str], verbosity=0)

Agreement score measures the normalized number of times that the predictor matched the label. Akin to a typical accuracy score.

Parameters
  • classifier_predictions (numeric values) --

  • rater_labels (sequence of labels, which should be numeric values) --

  • verbosity --

Returns

Return type

Agreement score
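As a rough illustration, the agreement score is the fraction of items on which the prediction equals the rater's label. The sketch below works on plain label sequences; the function name agreement is hypothetical and the library's handling of missing labels is not modeled.

```python
def agreement(predictions, rater_labels):
    """Fraction of positions where the predicted label equals the rater's label."""
    matches = sum(p == r for p, r in zip(predictions, rater_labels))
    return matches / len(rater_labels)


score = agreement(['a', 'b', 'b', 'a'], ['a', 'b', 'a', 'a'])
```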

class surveyequivalence.scoring_functions.Correlation

Bases: surveyequivalence.scoring_functions.Scorer

Computes the Pearson correlation coefficient.

static score(classifier_predictions: Sequence[surveyequivalence.combiners.NumericPrediction], rater_labels: Sequence[str], verbosity=0)
Parameters
  • classifier_predictions (numeric values) --

  • rater_labels (sequence of labels, which should be numeric values) --

  • verbosity --

Returns

Return type

Pearson correlation coefficient
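For reference, the Pearson correlation coefficient between numeric predictions and numeric rater labels can be computed directly from its definition. This is a textbook sketch with a hypothetical function name, not the library's code, which may delegate to a numerical package.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


perfect = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly aligned
inverse = pearson([1, 2, 3], [3, 2, 1])         # perfectly opposed
```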

class surveyequivalence.scoring_functions.CrossEntropyScore

Bases: surveyequivalence.scoring_functions.Scorer

Cross Entropy Scorer

static score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0)

Calculates the Cross Entropy of the two labels.

>>> CrossEntropyScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['b', 'b', 'b'])
0.594597099859
>>> CrossEntropyScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['a', 'b', 'b'])
0.87702971998
Parameters
  • classifier_predictions (numeric values) --

  • rater_labels (sequence of labels, which should be numeric values) --

  • verbosity --

Returns

Return type

Cross Entropy score
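The two example values above are reproduced by taking the negative mean natural log of the probability each prediction assigns to the realized label. The sketch below shows that calculation on plain dicts. It is an inference from the example outputs rather than the library's code, and it ignores any clipping of extreme probabilities that the library may apply.

```python
import math


def cross_entropy(predictions, rater_labels):
    """Negative mean natural log of the probability given to each realized label.

    predictions: list of dicts mapping label -> probability (hypothetical
    stand-in for DiscreteDistributionPrediction objects).
    """
    logs = [math.log(pred[label]) for pred, label in zip(predictions, rater_labels)]
    return -sum(logs) / len(logs)


preds = [{'a': .3, 'b': .7}, {'a': .4, 'b': .6}, {'a': .6, 'b': .4}]
first = cross_entropy(preds, ['b', 'b', 'b'])
second = cross_entropy(preds, ['a', 'b', 'b'])
```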

class surveyequivalence.scoring_functions.F1Score

Bases: surveyequivalence.scoring_functions.Scorer

static score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0, average: str = 'micro') -> float

F1 score. This function uses sklearn's F1 function.

>>> F1Score.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['b', 'b', 'b'], 'micro')
0.6666666666666666
>>> F1Score.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['b', 'b', 'b'], 'macro')
0.39759036144
>>> F1Score.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['a', 'b', 'b'], 'micro')
0.3333333333333333
>>> F1Score.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['a', 'b', 'b'], 'macro')
0.25
Parameters
  • classifier_predictions (numeric values) --

  • rater_labels (sequence of labels, which should be numeric values) --

  • verbosity --

  • average (macro or micro averaging) --

Returns

Return type

F1 Score

class surveyequivalence.scoring_functions.PrecisionScore

Bases: surveyequivalence.scoring_functions.Scorer

static score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0, average: str = 'micro') -> float

Precision score. This function uses sklearn's precision function.

>>> PrecisionScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['b', 'b', 'b'], 'micro')
0.6666666666666666
>>> PrecisionScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['a', 'b', 'b'], 'micro')
0.3333333333333333
Parameters
  • classifier_predictions (numeric values) --

  • rater_labels (sequence of labels, which should be numeric values) --

  • verbosity --

  • average (macro or micro averaging) --

Returns

Return type

Precision Score

class surveyequivalence.scoring_functions.RecallScore

Bases: surveyequivalence.scoring_functions.Scorer

static score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[str], verbosity=0, average: str = 'micro') -> float

Recall score. This function uses sklearn's recall function.

>>> RecallScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['b', 'b', 'b'], 'micro')
0.6666666666666666
>>> RecallScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['b', 'b', 'b'], 'macro')
0.5
>>> RecallScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['a', 'b', 'b'], 'micro')
0.3333333333333333
>>> RecallScore.score([DiscreteDistributionPrediction(['a', 'b'], prs) for prs in [[.3, .7], [.4, .6], [.6, .4]]],  ['a', 'b', 'b'], 'macro')
0.25
Parameters
  • classifier_predictions (numeric values) --

  • rater_labels (sequence of labels, which should be numeric values) --

  • verbosity --

  • average (macro or micro averaging) --

Returns

Return type

Recall Score

class surveyequivalence.scoring_functions.Scorer

Bases: abc.ABC

Abstract base class for scorers; each subclass implements a score() function. The scorer computes the goodness of a predictor against the average human rater.

abstract static score(classifier_predictions: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction], rater_labels: Sequence[surveyequivalence.combiners.DiscreteDistributionPrediction]) -> float

score_classifier(classifier_predictions: Sequence, raters: Sequence, W, verbosity=0)

Driver function that computes the mean score over all predictions

Parameters
  • classifier_predictions (Scoring predictions) --

  • raters (The reference ratings. Score will compare classifier predictions with each rater in turn.) --

  • W (The item and rating dataset) --

  • verbosity (verbosity value from 1 to 4 indicating increased verbosity.) --

Returns

Return type

Mean score over all predictions for all raters.

Synthetic Dataset Generation

States

class surveyequivalence.synthetic_datasets.DiscreteState(state_name: str, labels: Sequence[str], probabilities: Sequence[float])

Bases: surveyequivalence.synthetic_datasets.State

A discrete distribution over possible labels

Parameters
  • state_name --

  • labels -- A sequence of strings; the allowable labels

  • probabilities -- A sequence of the same length, with values adding to one, giving probabilities for each of the label strings

draw_labels(n: int)

Make n iid draws of discrete labels from the distribution

Parameters

n -- How many labels to draw from the distribution

Returns

Return type

a single item or a numpy array
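Drawing n iid labels from a discrete distribution can be sketched with the standard library. The function below is a hypothetical stand-in for draw_labels, not the library's implementation: it always returns a list and takes an optional random generator for reproducibility.

```python
import random


def draw_labels(labels, probabilities, n, rng=None):
    """Make n iid draws from the discrete distribution over labels."""
    rng = rng or random.Random()
    return rng.choices(labels, weights=probabilities, k=n)


rng = random.Random(42)
mixed = draw_labels(['pos', 'neg'], [0.8, 0.2], 5, rng)    # random mix
certain = draw_labels(['pos', 'neg'], [1.0, 0.0], 5, rng)  # always 'pos'
```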

Distributions Over States

class surveyequivalence.synthetic_datasets.DistributionOverStates

Bases: abc.ABC

Abstract base class

class surveyequivalence.synthetic_datasets.DiscreteDistributionOverStates(states: Sequence[surveyequivalence.synthetic_datasets.State], probabilities: Sequence[float])

Bases: surveyequivalence.synthetic_datasets.DistributionOverStates

Parameters
  • states -- a sequence of State objects

  • probabilities -- a same length sequence of floats representing probabilities of the item states

draw_states(n: int) -> Sequence[surveyequivalence.synthetic_datasets.DiscreteState]
Parameters

n --

Returns

Return type

a single item or numpy array of State instances, drawn iid from the probability distribution

class surveyequivalence.synthetic_datasets.FixedStateGenerator(states: Sequence[surveyequivalence.synthetic_datasets.State], probabilities: Sequence[float])

Bases: surveyequivalence.synthetic_datasets.DiscreteDistributionOverStates

draw_states(n: int)

Draw exactly in proportion to probabilities, rather than making each draw randomly according to the probabilities.

Parameters

n -- How many items to draw

Returns

Return type

list of State instances
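One way to draw "exactly in proportion" is to give each state floor(p * n) items and hand any leftover items to the states with the largest fractional remainders. The sketch below uses that rule; the function name and the treatment of leftovers are assumptions, and state names stand in for State instances.

```python
from collections import Counter


def draw_in_proportion(states, probabilities, n):
    """Return n states with counts as close to p * n as possible (no randomness)."""
    exact = [p * n for p in probabilities]
    counts = [int(e) for e in exact]  # floor of each expected count
    leftovers = n - sum(counts)
    # Give remaining items to the states with the largest fractional parts.
    order = sorted(range(len(states)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    drawn = []
    for state, count in zip(states, counts):
        drawn.extend([state] * count)
    return drawn


ten = draw_in_proportion(['high', 'med', 'low'], [0.7, 0.1, 0.2], 10)
tally = Counter(ten)
```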

Mock Classifiers

class surveyequivalence.synthetic_datasets.MockClassifier(name: str, label_predictors: Dict[str, surveyequivalence.combiners.Prediction])

Bases: object

A mock classifier has access to each item's state when generating a prediction, something that a real classifier would not have access to

Parameters
  • name --

  • label_predictors -- a dictionary mapping from item state names to Predictions

make_predictions(item_states: Sequence[surveyequivalence.synthetic_datasets.State]) -> Sequence[surveyequivalence.combiners.Prediction]
Parameters

item_states -- a sequence of State objects, representing the states of some items

Returns

Return type

a sequence of Prediction objects, one for each item

class surveyequivalence.synthetic_datasets.MappedDiscreteMockClassifier(name, label_predictors: Dict[str, surveyequivalence.combiners.Prediction], prediction_map: Dict[str, surveyequivalence.combiners.Prediction])

Bases: surveyequivalence.synthetic_datasets.MockClassifier

A mock classifier that maps an item state to a Prediction, draws a discrete label from that, and then maps that discrete label to another Prediction.

Parameters
  • name --

  • label_predictors -- a dictionary mapping from item state names to Predictions

make_predictions(item_states)
Parameters

item_states -- a sequence of State objects, representing the states of some items

Returns

Return type

a sequence of Prediction objects, one for each item

Dataset Generators

class surveyequivalence.synthetic_datasets.SyntheticDatasetGenerator(item_state_generator: surveyequivalence.synthetic_datasets.DistributionOverStates, num_items_per_dataset=1000, num_labels_per_item=10, mock_classifiers=None, name='')

Bases: object

Generator for a set of items with some raters per item. Items are defined by States, which are drawn from a DistributionOverStates. Each State is a distribution over labels. Each label is an i.i.d. draw from the State

Parameters
  • item_state_generator --

  • num_items_per_dataset --

  • num_labels_per_item -- How many raters to generate labels for, for each item

  • mock_classifiers -- A list of MockClassifier instances, which generate label predictions based on the item state

  • name -- A text string naming this dataset generator

generate_labels(item_states, num_labels_per_item=None, rater_prefix='e')

Normally called with item_states=self.reference_rater_item_states

Parameters
  • self --

  • item_states -- a list of States, one for each item

  • num_labels_per_item=None -- if None, use self.num_labels_per_item

  • rater_prefix="e" -- Rater columns are named as f"{rater_prefix}_{i}" where i is an integer

Returns

Return type

A pandas DataFrame with one row for each item and one column for each rater. Cells are labels.

class surveyequivalence.synthetic_datasets.SyntheticBinaryDatasetGenerator(item_state_generator, num_items_per_dataset=50, num_labels_per_item=3, mock_classifiers=None, name=None, pct_noise=0.0, k_other_raters_per_label=1)

Bases: surveyequivalence.synthetic_datasets.SyntheticDatasetGenerator

Dataset generator for binary labels

Only additional parameters for this subclass are documented here.

Parameters
  • pct_noise=0 -- In addition to the reference rater labels, this generator can generate labels from "other" raters. With probability pct_noise the binary labels will be drawn from a 50-50 coin flip, and otherwise from the item's State. If pct_noise==0, the other raters' labels will always be i.i.d. draws from the same distribution as the reference rater labels.

  • k_other_raters_per_label=1 -- The number of other raters to generate labels for.
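The noise model described for pct_noise can be sketched for a single label. This is an illustration only: the function name, the 'pos'/'neg' label strings, and the representation of the item's State as a single probability p_positive are assumptions.

```python
import random


def draw_other_rater_label(p_positive, pct_noise, rng):
    """Draw one binary label for an 'other' rater.

    With probability pct_noise the label is a 50-50 coin flip; otherwise it
    is drawn from the item's state, represented here by p_positive.
    """
    if rng.random() < pct_noise:
        return 'pos' if rng.random() < 0.5 else 'neg'
    return 'pos' if rng.random() < p_positive else 'neg'


rng = random.Random(0)
noisy = [draw_other_rater_label(0.8, 0.3, rng) for _ in range(20)]
clean = [draw_other_rater_label(1.0, 0.0, rng) for _ in range(20)]  # no noise
```

With pct_noise=0 every label comes from the item's state, so a state that is always positive yields only 'pos' labels.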

make_histogram(ax)
Parameters

ax -- A matplotlib Axes instance

plot_item_state_distribution()

Called if you are making a standalone graph; for insets, .make_histogram is called directly

Dataset

class surveyequivalence.synthetic_datasets.Dataset

Bases: object

A Dataset

class surveyequivalence.synthetic_datasets.SyntheticDataset(ds_generator: surveyequivalence.synthetic_datasets.SyntheticBinaryDatasetGenerator)

Bases: surveyequivalence.synthetic_datasets.Dataset

Parameters
  • ds_generator --

Sets all the attributes by running the SyntheticBinaryDatasetGenerator.

save(dirname='running_example')

Save ratings and predictions to csv files

Parameters
  • dirname -- A subdirectory name in which to store saved results

  • include_timestamp_in_dirname -- Whether to append the current timestamp to the directory name

surveyequivalence.synthetic_datasets.make_running_example_dataset(num_items_per_dataset=10, num_labels_per_item=10, minimal=False, include_hard_classifier=False, include_soft_classifier=False) -> surveyequivalence.synthetic_datasets.SyntheticDataset

This generates the running example dataset used in the original Survey Equivalence paper.

Three states: 70% high = 80/20; 10% med = 50/50; 20% low = 10/90

Parameters
  • num_items_per_dataset --

  • num_labels_per_item --

  • minimal -- If minimal, use FixedStateGenerator, which generates labels in exact proportion to probabilities specified in the state, rather than each label being an iid draw from the State.

  • include_hard_classifier -- Includes a hard classifier which draws labels 90/10 for high state; 50/50 for medium state; 5/95 for low state

  • include_soft_classifier -- Includes a soft classifier which runs the hard_classifier to generate a label and then maps it to a calibrated prediction (.7681 when the label is positive; .3226 when the label is negative). Also includes an ideal classifier that always predicts the probability given by the State of the item.
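The calibrated values .7681 and .3226 follow from Bayes' rule applied to the three states and the hard classifier's label rates. The sketch below reproduces them under two assumptions: each "x/y" split is read as x% positive, and, given the item state, the hard classifier's label is independent of the rater's label. The dictionary layout and function name are hypothetical.

```python
# state name -> (prior, P(rater label positive), P(hard classifier says positive))
STATES = {
    "high": (0.7, 0.8, 0.90),
    "med": (0.1, 0.5, 0.50),
    "low": (0.2, 0.1, 0.05),
}


def calibrated_prediction(hard_label_positive):
    """P(rater label positive | hard classifier label), by Bayes' rule."""
    joint = 0.0      # P(hard label and rater label positive)
    marginal = 0.0   # P(hard label)
    for prior, p_pos, p_hard_pos in STATES.values():
        p_hard = p_hard_pos if hard_label_positive else 1 - p_hard_pos
        marginal += prior * p_hard
        joint += prior * p_hard * p_pos
    return joint / marginal


when_positive = calibrated_prediction(True)    # 0.53 / 0.69
when_negative = calibrated_prediction(False)   # 0.10 / 0.31
```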