expan.core package

Submodules

expan.core.binning module

class expan.core.binning.Binning

Bases: object

The Binning class has two subclasses: CategoricalBinning and NumericalBinning.

label(data, format_str=None)

This returns the bin labels associated with each data point in the series, essentially ‘applying’ the binning to data.

Parameters:
  • data (array-like) – array of datapoints to be binned
  • format_str (str) – string defining the format of the label to apply
Returns:

array of the bin label corresponding to each data point.

Return type:

array-like

Note

Implemented in subclass.

class expan.core.binning.CategoricalBinning(data=None, nbins=None)

Bases: expan.core.binning.Binning

A CategoricalBinning is essentially a list of lists of categories. Each bin within a Binning is an ordered list of categories.

CategoricalBinning constructor

Parameters:
  • data (array-like) – array of datapoints to be binned
  • nbins (int) – number of bins
categories

Returns list of categories.

Returns:list of categories
Return type:array-like
label(data, format_str='{standard}')

This returns the bin labels associated with each data point in the series, essentially ‘applying’ the binning to data.

Parameters:
  • data (array-like) – array of datapoints to be binned
  • format_str (str) –

    string defining the format of the label to apply

    Options:

    • {iter.uppercase}, {iter.lowercase}, {iter.integer}
    • {set_notation} - all categories, comma-separated, surrounded by curly braces
    • {standard} - a shortcut for: {set_notation}
Returns:

array of the bin label corresponding to each data point

Return type:

array-like

labels(format_str='{standard}')

Returns the labels of the bins defined by this binning.

Parameters:format_str (str) –

string defining the format of the label to return

Options:

  • {iter.uppercase}, {iter.lowercase}, {iter.integer}
  • {set_notation} - all categories, comma-separated, surrounded by curly braces
  • {standard} - a shortcut for: {set_notation}
Returns:labels of the bins defined by this binning
Return type:array-like

Note

This is not the same as label (which applies the bins to data and returns the labels of the data).

mid(data)

Returns the middle category of every bin.

Parameters:data – data on which the binning is to be applied
Returns:the middle category of every bin
Return type:array-like
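A minimal usage sketch (not part of the original docstrings); how categories are grouped into bins is an implementation detail, so the labels in the comments are illustrative:

from expan.core.binning import CategoricalBinning

# categorical data with four distinct values, grouped into two bins
data = ['a', 'a', 'b', 'c', 'c', 'c', 'd']
binning = CategoricalBinning(data=data, nbins=2)

print(binning.labels())     # labels of the bins themselves, e.g. '{a,b}', '{c,d}'
print(binning.label(data))  # one bin label per input data point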
class expan.core.binning.NumericalBinning(data=None, nbins=None, uppers=None, lowers=None, up_closed=None, lo_closed=None)

Bases: expan.core.binning.Binning

The Binning class for numerical variables.

Todo

Think of a good way of exposing the _apply() method, because with the returned indices it can then get uppers/lowers/mids/labels (i.e. reformat) without doing the apply again. Also experiment with maintaining the lists with a single element tacked onto the end representing non-matching entries.

All access would then be through properties which drop this end element, except when using the indices returned by _apply.

This means that the -1 indices just work, so using the indices to get labels, bounds, etc., is straightforward and fast because it is just integer-based array slicing.

NumericalBinning constructor.

Parameters:
  • data (array-like) – array of datapoints to be binned
  • nbins (int) – number of bins
  • uppers (array-like) – a list of upper bounds
  • lowers (array-like) – a list of lower bounds
  • up_closed (array-like) – a list of booleans indicating whether the upper bounds are closed
  • lo_closed (array-like) – a list of booleans indicating whether the lower bounds are closed
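A sketch of the two construction paths the signature above allows, either deriving bins from data or passing explicit bounds; the bin edges chosen from data are an implementation detail:

from expan.core.binning import NumericalBinning

# 1) derive nbins bins from the data
from_data = NumericalBinning(data=[1, 2, 3, 4, 5, 6, 7, 8], nbins=4)

# 2) specify the bounds explicitly: two bins, [0, 5) and [5, 10]
explicit = NumericalBinning(uppers=[5, 10], lowers=[0, 5],
                            up_closed=[False, True], lo_closed=[True, True])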
label(data, format_str='{standard}')

Returns the bin labels associated with each data point in the series, essentially ‘applying’ the binning to data.

Parameters:
  • data (array-like) – array of datapoints to be binned
  • format_str (str) –

    string defining the format of the label to apply

    Options:

    • {iter.uppercase} and {iter.lowercase} = labels the bins with letters
    • {iter.integer} = labels the bins with integers
    • {up} and {lo} = the bounds themselves (can specify precision: {up:.1f})
    • {up_cond} and {lo_cond} = ‘<’, ‘<=’ etc.
    • {up_bracket} and {lo_bracket} = ‘(’, ‘[’ etc.
    • {mid} = the midpoint of the bin (can specify precision: {mid:.1f})
    • {conditions} = {lo:.1f}{lo_cond}x{up_cond}{up:.1f}
    • {set_notation} = {lo_bracket}{lo:.1f},{up:.1f}{up_bracket}
    • {standard} = {conditions}
    • {simple} = {lo:.1f}_{up:.1f}
    • {simplei} = {lo:.0f}_{up:.0f} (same as simple but for integers)
see:
  • Binning.label.__doc__
  • NumericalBinning._labels.__doc__

When format_str is None, the label is the midpoint of the bin. This may not be the most convenient default; it might be better to make the default format_str ‘{standard}’ and have the client use mid() directly if midpoints are desired.
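A hedged sketch of the format options above; the printed strings are illustrative, not verified output:

from expan.core.binning import NumericalBinning

# two explicit bins: [0, 5) and [5, 10]
binning = NumericalBinning(uppers=[5, 10], lowers=[0, 5],
                           up_closed=[False, True], lo_closed=[True, True])

print(binning.label([1.0, 7.5], format_str='{standard}'))  # e.g. 0.0<=x<5.0, 5.0<=x<=10.0
print(binning.label([1.0, 7.5], format_str='{simple}'))    # e.g. 0.0_5.0, 5.0_10.0
print(binning.mid([1.0, 7.5]))                             # bin midpoints, e.g. 2.5, 7.5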

labels(format_str='{standard}')

Returns the labels of the bins defined by this binning.

Parameters:format_str (str) –

string defining the format of the label to return

Options:

  • {iter.uppercase} and {iter.lowercase} = labels the bins with letters
  • {iter.integer} = labels the bins with integers
  • {up} and {lo} = the bounds themselves (can specify precision: {up:.1f})
  • {up_cond} and {lo_cond} = ‘<’, ‘<=’ etc.
  • {up_bracket} and {lo_bracket} = ‘(’, ‘[’ etc.
  • {mid} = the midpoint of the bin (can specify precision: {mid:.1f})
  • {conditions} = {lo:.1f}{lo_cond}x{up_cond}{up:.1f}
  • {set_notation} = {lo_bracket}{lo:.1f},{up:.1f}{up_bracket}
  • {standard} = {conditions}
  • {simple} = {lo:.1f}_{up:.1f}
  • {simplei} = {lo:.0f}_{up:.0f} (same as simple but for integers)
Returns:array of labels of the bins defined by this binning
Return type:array-like

Note

This is not the same as label (which applies the bins to data and returns the labels of the data)

lo_closed

Return a list of booleans indicating whether the lower bounds are closed.

lower(data)

Returns the lower bounds of the bins associated with the data; see upper().

lowers

Return a list of lower bounds.

mid(data)

Returns the midpoints of the bins associated with the data.

Parameters:data (array-like) – array of datapoints to be binned
Returns:array containing the midpoint of the bin corresponding to each data point.
Return type:array-like

Note

Currently doesn’t take into account whether bounds are closed or open.

up_closed

Return a list of booleans indicating whether the upper bounds are closed.

upper(data)

Returns the upper bounds of the bins associated with the data.

Parameters:data (array-like) – array of datapoints to be binned
Returns:array containing the upper bound of the bin corresponding to each data point.
Return type:array-like
uppers

Return a list of upper bounds.

expan.core.binning.create_binning(x, nbins=8)

Determines bins for the input values, suitable for subgroup analyses.

Parameters:
  • x (array_like) – input array
  • nbins (integer) – number of bins
Returns:

binning object
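A usage sketch, assuming only the signature above; numpy is used to generate example input:

import numpy as np
from expan.core.binning import create_binning

x = np.random.normal(size=1000)
binning = create_binning(x, nbins=8)
subgroups = binning.label(x)  # one bin label per data point, usable as subgroup keys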

expan.core.experiment module

class expan.core.experiment.Experiment(control_variant_name, data, metadata, report_kpi_names=None, derived_kpis=None)

Bases: object

Class which adds the analysis functions to experimental data.

delta(method='fixed_horizon', **worker_args)
filter(kpis, percentile=99.0, threshold_type='upper')

Filters out entities whose KPIs exceed the value at a given percentile. If any of the KPIs exceeds its threshold, the entity is filtered out.

Parameters:
  • kpis (list) – list of KPI names
  • percentile (float) – percentile considered as threshold
  • threshold_type (string) – type of threshold used (‘lower’ or ‘upper’)

Returns:

get_kpi_by_name_and_variant(name, variant)
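A hedged end-to-end sketch. It assumes generate_random_data() (documented under expan.core.util below) returns a (data, metadata) pair compatible with this constructor; the variant and KPI names are placeholders:

from expan.core.experiment import Experiment
from expan.core.util import generate_random_data

data, metadata = generate_random_data()     # assumed to return (data, metadata)
exp = Experiment(control_variant_name='B',  # placeholder variant name
                 data=data,
                 metadata=metadata,
                 report_kpi_names=['some_kpi'])  # placeholder KPI name

# drop outlier entities, then run the analysis
exp.filter(kpis=['some_kpi'], percentile=99.0, threshold_type='upper')
results = exp.delta(method='fixed_horizon')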

expan.core.experimentdata module

expan.core.results module

expan.core.statistics module

expan.core.statistics.alpha_to_percentiles(alpha)

Transforms alpha value to corresponding percentile.

Parameters:alpha (float) – alpha values to transform
Returns:list of percentiles corresponding to given alpha
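A sketch of the expected behaviour: per the bootstrap() docs below, an alpha of 0.05 corresponds to the 2.5th and 97.5th percentiles:

from expan.core.statistics import alpha_to_percentiles

print(alpha_to_percentiles(0.05))  # expected: [2.5, 97.5]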
expan.core.statistics.bootstrap(x, y, func=<function _delta_mean>, nruns=10000, percentiles=[2.5, 97.5], min_observations=20, return_bootstraps=False, relative=False)

Bootstraps the confidence intervals for a particular function comparing two samples. NaNs are ignored (discarded before calculation).

Parameters:
  • x (array-like) – sample of the treatment group
  • y (array-like) – sample of the control group
  • func (function) – function of which the distribution is to be computed. The default comparison metric is the difference of means. For bootstrapping a correlation: func=lambda x, y: scipy.stats.pearsonr(x, y)[0]
  • nruns (integer) – number of bootstrap runs to perform
  • percentiles (list) – The values corresponding to the given percentiles are returned. The default percentiles (2.5% and 97.5%) correspond to an alpha of 0.05.
  • min_observations (integer) – minimum number of observations necessary
  • return_bootstraps (boolean) – If this variable is set, the bootstrap sets are returned; otherwise the first return value is empty.
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

  • dict: percentile levels (index) and values
  • np.array (nruns): array containing the bootstrapping results per run

Return type:

tuple
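A usage sketch based only on the signature above; the tuple unpacking assumes the documented (dict, array) return order:

import numpy as np
from expan.core.statistics import bootstrap

x = np.random.normal(loc=0.1, size=1000)  # treatment sample
y = np.random.normal(loc=0.0, size=1000)  # control sample

# default func: difference of means; ci maps percentile levels to values
ci, _ = bootstrap(x, y, nruns=10000, percentiles=[2.5, 97.5])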

expan.core.statistics.chi_square(x, y, min_counts=5)

Performs the chi-square homogeneity test on categorical arrays x and y

Parameters:
  • x (array_like) – sample of the treatment variable to check
  • y (array_like) – sample of the control variable to check
  • min_counts (int) – drop categories where minimum number of observations or expected observations is below min_counts for x or y
Returns:

  • float: p-value
  • float: chi-square value
  • int: number of attributes used (after dropping)

Return type:

tuple
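A usage sketch based on the signature above; the unpacking assumes the documented return order (p-value, chi-square value, number of attributes):

import numpy as np
from expan.core.statistics import chi_square

x = np.random.choice(['a', 'b', 'c'], size=500)  # treatment sample
y = np.random.choice(['a', 'b', 'c'], size=500)  # control sample
p_value, chi2, n_attributes = chi_square(x, y, min_counts=5)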

expan.core.statistics.delta(x, y, assume_normal=True, percentiles=[2.5, 97.5], min_observations=20, nruns=10000, relative=False, x_weights=1, y_weights=1)

Calculates the difference of means between the samples (x-y) in a statistical sense, i.e. with confidence intervals.

NaNs are ignored: treated as if they weren’t included at all. This is done because at this level we cannot determine what a NaN means. In some cases, a NaN represents missing data that should be completely ignored, and in some cases it represents an inapplicable value (like PCII for non-ordering customers), in which case the NaNs should be replaced by zeros at a higher level. Replacing with zeros, however, would be completely incorrect for return rates.

Computation is done in form of treatment minus control, i.e. x-y

Parameters:
  • x (array_like) – sample of a treatment group
  • y (array_like) – sample of a control group
  • assume_normal (boolean) – specifies whether normal distribution assumptions can be made
  • percentiles (list) – list of percentile values for confidence bounds
  • min_observations (integer) – minimum number of observations needed
  • nruns (integer) – number of bootstrap runs; only used if assume_normal is False
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
  • x_weights (list) – weights for the x vector, in order to calculate the weighted mean and confidence intervals, which is equivalent to the overall metric. This weighted approach is only relevant for ratios.
  • y_weights (list) – weights for the y vector, in order to calculate the weighted mean and confidence intervals, which is equivalent to the overall metric. This weighted approach is only relevant for ratios.
Returns:

DeltaStatistics object
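A minimal sketch of the common case (normal assumption, default confidence bounds):

import numpy as np
from expan.core.statistics import delta

x = np.random.normal(loc=0.1, size=1000)  # treatment sample
y = np.random.normal(loc=0.0, size=1000)  # control sample

# returns a DeltaStatistics object holding the x-y mean difference and bounds
result = delta(x, y, assume_normal=True, percentiles=[2.5, 97.5])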

expan.core.statistics.estimate_std(x, mu, pctile)

Estimate the standard deviation from a given percentile, according to the z-score:

z = (x - mu) / sigma
Parameters:
  • x (float) – cumulated density at the given percentile
  • mu (float) – mean of the distribution
  • pctile (float) – percentile value (between 0 and 100)
Returns:

estimated standard deviation of the distribution

Return type:

float
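A worked sketch of the relation above: solving z = (x - mu) / sigma for sigma, with x = 1.96, mu = 0 and the 97.5th percentile (z of about 1.96) should give a sigma of about 1.0:

from expan.core.statistics import estimate_std

sigma = estimate_std(x=1.96, mu=0.0, pctile=97.5)  # expected: roughly 1.0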

expan.core.statistics.make_delta(assume_normal=True, percentiles=[2.5, 97.5], min_observations=20, nruns=10000, relative=False)

A closure over the delta function above, returning delta with the given arguments fixed.

expan.core.statistics.normal_difference(mean1, std1, n1, mean2, std2, n2, percentiles=[2.5, 97.5], relative=False)

Calculates the difference distribution of two normal distributions.

Computation is done in form of treatment minus control. It is assumed that the standard deviations of both distributions do not differ too much.

Parameters:
  • mean1 (float) – mean value of the treatment distribution
  • std1 (float) – standard deviation of the treatment distribution
  • n1 (integer) – number of samples of the treatment distribution
  • mean2 (float) – mean value of the control distribution
  • std2 (float) – standard deviation of the control distribution
  • n2 (integer) – number of samples of the control distribution
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – If relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

percentiles and corresponding values

Return type:

dict

For further information visit:
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals5.html
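A usage sketch from summary statistics only, based on the signature above:

from expan.core.statistics import normal_difference

# treatment-minus-control difference with 95% bounds
bounds = normal_difference(mean1=10.0, std1=2.0, n1=1000,
                           mean2=9.5, std2=2.1, n2=1000,
                           percentiles=[2.5, 97.5])
# bounds is a dict mapping each percentile to its value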
expan.core.statistics.normal_percentiles(mean, std, n, percentiles=[2.5, 97.5], relative=False)

Calculate the percentile values for a normal distribution with parameters estimated from samples.

Parameters:
  • mean (float) – mean value of the distribution
  • std (float) – standard deviation of the distribution
  • n (integer) – number of samples
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

percentiles and corresponding values

Return type:

dict

For more information visit:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda352.htm
http://www.boost.org/doc/libs/1_46_1/libs/math/doc/sf_and_dist/html/math_toolkit/dist/stat_tut/weg/st_eg/tut_mean_intervals.html
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
expan.core.statistics.normal_sample_difference(x, y, percentiles=[2.5, 97.5], relative=False)

Calculates the difference distribution of two normal distributions given by their samples.

Computation is done in form of treatment minus control. It is assumed that the standard deviations of both distributions do not differ too much.

Parameters:
  • x (array-like) – sample of a treatment group
  • y (array-like) – sample of a control group
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – If relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

percentiles and corresponding values

Return type:

dict
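The sample-based counterpart of normal_difference; a sketch using random data:

import numpy as np
from expan.core.statistics import normal_sample_difference

x = np.random.normal(loc=0.1, size=1000)  # treatment sample
y = np.random.normal(loc=0.0, size=1000)  # control sample
bounds = normal_sample_difference(x, y, percentiles=[2.5, 97.5])  # dict of percentiles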

expan.core.statistics.normal_sample_percentiles(values, percentiles=[2.5, 97.5], relative=False)

Calculate the percentile values for a sample assumed to be normally distributed. If normality cannot be assumed, use bootstrap_ci instead. NaNs are ignored (discarded before calculation).

Parameters:
  • values (array-like) – sample for which the normal distribution percentiles are computed.
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

percentiles and corresponding values

Return type:

dict

For further information visit:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda352.htm
http://www.boost.org/doc/libs/1_46_1/libs/math/doc/sf_and_dist/html/math_toolkit/dist/stat_tut/weg/st_eg/tut_mean_intervals.html
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
expan.core.statistics.pooled_std(std1, n1, std2, n2)

Returns the pooled estimate of standard deviation. Assumes that population variances are equal (std(v1)**2==std(v2)**2) - this assumption is checked for reasonableness and an exception is raised if this is strongly violated.

Parameters:
  • std1 (float) – standard deviation of first sample
  • n1 (integer) – size of first sample
  • std2 (float) – standard deviation of second sample
  • n2 (integer) – size of second sample
Returns:

Pooled standard deviation

Return type:

float

For further information visit:
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals5.html

Todo

Also implement a version for unequal variances.
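For reference, a sketch of the textbook pooled estimate described above (the library's own implementation additionally checks the equal-variance assumption):

import numpy as np

def pooled_std_reference(std1, n1, std2, n2):
    # pooled standard deviation under the equal-variance assumption
    return np.sqrt(((n1 - 1) * std1 ** 2 + (n2 - 1) * std2 ** 2) / (n1 + n2 - 2))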

expan.core.statistics.sample_size(x)

Calculates the sample size of a sample x.

Parameters:x (array_like) – sample to calculate sample size

Returns:sample size of the sample excluding nans
Return type:int

expan.core.util module

expan.core.util.drop_nan(np_array)
expan.core.util.find_list_of_dicts_element(items, key1, value, key2)
expan.core.util.generate_random_data()
expan.core.util.generate_random_data_n_variants(n_variants=3)
expan.core.util.get_column_names_by_type(df, dtype)
expan.core.util.is_number_and_nan(obj)
expan.core.util.scale_range(x, new_min=0.0, new_max=1.0, old_min=None, old_max=None, squash_outside_range=True, squash_inf=False)

Scales a sequence to fit within a new range.

If squash_inf is set, then infinite values will take on the extremes of the new range (as opposed to staying infinite).

Parameters:
  • x (array-like) – sequence of values to scale
  • new_min (float) – lower bound of the target range
  • new_max (float) – upper bound of the target range
  • old_min (float) – lower bound of the source range (inferred from the data if None)
  • old_max (float) – upper bound of the source range (inferred from the data if None)
  • squash_outside_range (boolean) – if True, values outside the source range are squashed to the bounds of the target range
  • squash_inf (boolean) – if True, infinite values take on the extremes of the new range
Note:
Infinity in the input is disregarded in the construction of the scale of the mapping.
>>> scale_range([1,3,5])
array([ 0. ,  0.5,  1. ])
>>> scale_range([1,2,3,4,5])
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
>>> scale_range([1,3,5, np.inf])
array([ 0. ,  0.5,  1. ,  inf])
>>> scale_range([1,3,5, -np.inf])
array([ 0. ,  0.5,  1. , -inf])
>>> scale_range([1,3,5, -np.inf], squash_inf=True)
array([ 0. ,  0.5,  1. ,  0. ])
>>> scale_range([1,3,5, np.inf], squash_inf=True)
array([ 0. ,  0.5,  1. ,  1. ])
>>> scale_range([1,3,5], new_min=0.5)
array([ 0.5 ,  0.75,  1.  ])
>>> scale_range([1,3,5], old_min=1, old_max=4)
array([ 0.        ,  0.66666667,  1.        ])
>>> scale_range([5], old_max=4)
array([ 1.])

expan.core.version module

expan.core.version.git_commit_count()

Returns the output of git rev-list --count HEAD as an int.

expan.core.version.git_latest_commit()

Returns the output of git rev-parse HEAD.

expan.core.version.version(format_str='{short}')

Returns current version number in specified format.

Parameters:format_str (str) – string defining the format of the version number

Returns:current version number in the specified format

expan.core.version.version_numbers()

Module contents

ExpAn core module.

expan.core.version(format_str='{short}')

Returns current version number in specified format.

Parameters:format_str (str) – string defining the format of the version number

Returns:current version number in the specified format