expan.core package

Submodules

expan.core.binning module

class expan.core.binning.Binning

Bases: object

The Binning class has two subclasses: CategoricalBinning and NumericalBinning.

label(data, format_str=None)

This returns the bin labels associated with each data point in the series, essentially ‘applying’ the binning to data.

Parameters:
  • data (array-like) – array of datapoints to be binned
  • format_str (str) – string defining the format of the label to apply
Returns:

array of the bin label corresponding to each data point.

Return type:

array-like

Note

Implemented in subclass.

class expan.core.binning.CategoricalBinning(data=None, nbins=None)

Bases: expan.core.binning.Binning

A CategoricalBinning is essentially a list of lists of categories. Each bin within a Binning is an ordered list of categories.

CategoricalBinning constructor

Parameters:
  • data (array-like) – array of datapoints to be binned
  • nbins (int) – number of bins
categories

Returns list of categories.

Returns:list of categories
Return type:array-like
label(data, format_str='{standard}')

This returns the bin labels associated with each data point in the series, essentially ‘applying’ the binning to data.

Parameters:
  • data (array-like) – array of datapoints to be binned
  • format_str (str) –

    string defining the format of the label to apply

    Options:

    • {iter.uppercase}, {iter.lowercase}, {iter.integer}
    • {set_notation} - all categories, comma-separated, surrounded by curly braces
    • {standard} - a shortcut for: {set_notation}
Returns:

array of the bin label corresponding to each data point

Return type:

array-like

labels(format_str='{standard}')

Returns the labels of the bins defined by this binning.

Parameters:format_str (str) –

string defining the format of the label to return

Options:

  • {iter.uppercase}, {iter.lowercase}, {iter.integer}
  • {set_notation} - all categories, comma-separated, surrounded by curly braces
  • {standard} - a shortcut for: {set_notation}
Returns:labels of the bins defined by this binning
Return type:array-like

Note

This is not the same as label (which applies the bins to data and returns the labels of the data).

mid(data)

Returns the middle category of every bin.

Parameters:data – data on which the binning is to be applied
Returns:the middle category of every bin
Return type:array-like
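A minimal usage sketch (not part of the original docstrings); how categories are grouped into bins is an implementation detail, so the labels in the comments are illustrative:

from expan.core.binning import CategoricalBinning

# categorical data with four distinct values, grouped into two bins
data = ['a', 'a', 'b', 'c', 'c', 'c', 'd']
binning = CategoricalBinning(data=data, nbins=2)

print(binning.labels())     # labels of the bins themselves, e.g. '{a,b}', '{c,d}'
print(binning.label(data))  # one bin label per input data point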
class expan.core.binning.NumericalBinning(data=None, nbins=None, uppers=None, lowers=None, up_closed=None, lo_closed=None)

Bases: expan.core.binning.Binning

The Binning class for numerical variables.

Todo

Think of a good way of exposing the _apply() method, because with the returned indices it can then get uppers/lowers/mids/labels (i.e. reformat) without doing the apply again. Also experiment with maintaining the lists with a single element tacked onto the end representing non-matching entries.

All access would then be through properties which drop this end element, except when using the indices returned by _apply.

This means that the -1 indices just work, so using the indices to get labels, bounds, etc., is straightforward and fast because it is just integer-based array slicing.

NumericalBinning constructor.

Parameters:
  • data (array-like) – array of datapoints to be binned
  • nbins (int) – number of bins
  • uppers (array-like) – a list of upper bounds
  • lowers (array-like) – a list of lower bounds
  • up_closed (array-like) – a list of booleans indicating whether the upper bounds are closed
  • lo_closed (array-like) – a list of booleans indicating whether the lower bounds are closed
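A sketch of the two construction paths the signature above allows, either deriving bins from data or passing explicit bounds; the bin edges chosen from data are an implementation detail:

from expan.core.binning import NumericalBinning

# 1) derive nbins bins from the data
from_data = NumericalBinning(data=[1, 2, 3, 4, 5, 6, 7, 8], nbins=4)

# 2) specify the bounds explicitly: two bins, [0, 5) and [5, 10]
explicit = NumericalBinning(uppers=[5, 10], lowers=[0, 5],
                            up_closed=[False, True], lo_closed=[True, True])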
label(data, format_str='{standard}')

Returns the bin labels associated with each data point in the series, essentially ‘applying’ the binning to data.

Parameters:
  • data (array-like) – array of datapoints to be binned
  • format_str (str) –

    string defining the format of the label to apply

    Options:

    • {iter.uppercase} and {iter.lowercase} = labels the bins with letters
    • {iter.integer} = labels the bins with integers
    • {up} and {lo} = the bounds themselves (can specify precision: {up:.1f})
    • {up_cond} and {lo_cond} = ‘<’, ‘<=’ etc.
    • {up_bracket} and {lo_bracket} = ‘(’, ‘[’ etc.
    • {mid} = the midpoint of the bin (can specify precision: {mid:.1f})
    • {conditions} = {lo:.1f}{lo_cond}x{up_cond}{up:.1f}
    • {set_notation} = {lo_bracket}{lo:.1f},{up:.1f}{up_bracket}
    • {standard} = {conditions}
    • {simple} = {lo:.1f}_{up:.1f}
    • {simplei} = {lo:.0f}_{up:.0f} (same as simple but for integers)
see:
  • Binning.label.__doc__
  • NumericalBinning._labels.__doc__

When format_str is None, the label is the midpoint of the bin. This may not be the most convenient default; it might be better to make the default format_str ‘{standard}’ and have the client use mid() directly if midpoints are desired.
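A hedged sketch of the format options above; the printed strings are illustrative, not verified output:

from expan.core.binning import NumericalBinning

# two explicit bins: [0, 5) and [5, 10]
binning = NumericalBinning(uppers=[5, 10], lowers=[0, 5],
                           up_closed=[False, True], lo_closed=[True, True])

print(binning.label([1.0, 7.5], format_str='{standard}'))  # e.g. 0.0<=x<5.0, 5.0<=x<=10.0
print(binning.label([1.0, 7.5], format_str='{simple}'))    # e.g. 0.0_5.0, 5.0_10.0
print(binning.mid([1.0, 7.5]))                             # bin midpoints, e.g. 2.5, 7.5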

labels(format_str='{standard}')

Returns the labels of the bins defined by this binning.

Parameters:format_str (str) –

string defining the format of the label to return

Options:

  • {iter.uppercase} and {iter.lowercase} = labels the bins with letters
  • {iter.integer} = labels the bins with integers
  • {up} and {lo} = the bounds themselves (can specify precision: {up:.1f})
  • {up_cond} and {lo_cond} = ‘<’, ‘<=’ etc.
  • {up_bracket} and {lo_bracket} = ‘(’, ‘[’ etc.
  • {mid} = the midpoint of the bin (can specify precision: {mid:.1f})
  • {conditions} = {lo:.1f}{lo_cond}x{up_cond}{up:.1f}
  • {set_notation} = {lo_bracket}{lo:.1f},{up:.1f}{up_bracket}
  • {standard} = {conditions}
  • {simple} = {lo:.1f}_{up:.1f}
  • {simplei} = {lo:.0f}_{up:.0f} (same as simple but for integers)
Returns:array of labels of the bins defined by this binning
Return type:array-like

Note

This is not the same as label (which applies the bins to data and returns the labels of the data)

lo_closed

Return a list of booleans indicating whether the lower bounds are closed.

lower(data)

Returns the lower bounds of the bins associated with the data; see upper().

lowers

Return a list of lower bounds.

mid(data)

Returns the midpoints of the bins associated with the data.

Parameters:data (array-like) – array of datapoints to be binned
Returns:array containing the midpoint of the bin corresponding to each data point.
Return type:array-like

Note

Currently doesn’t take into account whether bounds are closed or open.

up_closed

Return a list of booleans indicating whether the upper bounds are closed.

upper(data)

Returns the upper bounds of the bins associated with the data.

Parameters:data (array-like) – array of datapoints to be binned
Returns:array containing the upper bound of the bin corresponding to each data point.
Return type:array-like
uppers

Return a list of upper bounds.

expan.core.binning.create_binning(x, nbins=8)

Determines bins for the input values, suitable for subgroup analyses.

Parameters:
  • x (array_like) – input array
  • nbins (integer) – number of bins
Returns:

binning object
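A usage sketch, assuming only the signature above; numpy is used to generate example input:

import numpy as np
from expan.core.binning import create_binning

x = np.random.normal(size=1000)
binning = create_binning(x, nbins=8)
subgroups = binning.label(x)  # one bin label per data point, usable as subgroup keys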

expan.core.experiment module

class expan.core.experiment.Experiment(control_variant_name, data, metadata, report_kpi_names=None, derived_kpis=None)

Bases: object

Class which adds the analysis functions to experimental data.

delta(method='fixed_horizon', **worker_args)
filter(kpis, percentile=99.0, threshold_type='upper')

Filters out entities whose KPIs exceed the value at a given percentile. If any of the KPIs exceeds its threshold, the entity is filtered out.

Parameters:
  • kpis (list) – list of KPI names
  • percentile (float) – percentile considered as threshold
  • threshold_type (string) – type of threshold used (‘lower’ or ‘upper’)

Returns:

get_kpi_by_name_and_variant(name, variant)
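A hedged end-to-end sketch. It assumes generate_random_data() (documented under expan.core.util below) returns a (data, metadata) pair compatible with this constructor; the variant and KPI names are placeholders:

from expan.core.experiment import Experiment
from expan.core.util import generate_random_data

data, metadata = generate_random_data()     # assumed to return (data, metadata)
exp = Experiment(control_variant_name='B',  # placeholder variant name
                 data=data,
                 metadata=metadata,
                 report_kpi_names=['some_kpi'])  # placeholder KPI name

# drop outlier entities, then run the analysis
exp.filter(kpis=['some_kpi'], percentile=99.0, threshold_type='upper')
results = exp.delta(method='fixed_horizon')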

expan.core.experimentdata module

expan.core.results module

expan.core.statistics module

expan.core.statistics.alpha_to_percentiles(alpha)

Transforms alpha value to corresponding percentile.

Parameters:alpha (float) – alpha values to transform
Returns:list of percentiles corresponding to given alpha
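A sketch of the expected behaviour: per the bootstrap() docs below, an alpha of 0.05 corresponds to the 2.5th and 97.5th percentiles:

from expan.core.statistics import alpha_to_percentiles

print(alpha_to_percentiles(0.05))  # expected: [2.5, 97.5]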
expan.core.statistics.bootstrap(x, y, func=<function _delta_mean>, nruns=10000, percentiles=[2.5, 97.5], min_observations=20, return_bootstraps=False, relative=False)

Bootstraps the confidence intervals for a particular function comparing two samples. NaNs are ignored (discarded before calculation).

Parameters:
  • x (array-like) – sample of the treatment group
  • y (array-like) – sample of the control group
  • func (function) – function of which the distribution is to be computed. The default comparison metric is the difference of means. For bootstrapping a correlation: func=lambda x, y: scipy.stats.pearsonr(x, y)[0]
  • nruns (integer) – number of bootstrap runs to perform
  • percentiles (list) – The values corresponding to the given percentiles are returned. The default percentiles (2.5% and 97.5%) correspond to an alpha of 0.05.
  • min_observations (integer) – minimum number of observations necessary
  • return_bootstraps (boolean) – If this variable is set, the bootstrap sets are returned; otherwise the first return value is empty.
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

  • dict: percentile levels (index) and values
  • np.array (nruns): array containing the bootstrapping results per run

Return type:

tuple
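A usage sketch based only on the signature above; the tuple unpacking assumes the documented (dict, array) return order:

import numpy as np
from expan.core.statistics import bootstrap

x = np.random.normal(loc=0.1, size=1000)  # treatment sample
y = np.random.normal(loc=0.0, size=1000)  # control sample

# default func: difference of means; ci maps percentile levels to values
ci, _ = bootstrap(x, y, nruns=10000, percentiles=[2.5, 97.5])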

expan.core.statistics.chi_square(x, y, min_counts=5)

Performs the chi-square homogeneity test on categorical arrays x and y

Parameters:
  • x (array_like) – sample of the treatment variable to check
  • y (array_like) – sample of the control variable to check
  • min_counts (int) – drop categories where minimum number of observations or expected observations is below min_counts for x or y
Returns:

  • float: p-value
  • float: chi-square value
  • int: number of attributes used (after dropping)

Return type:

tuple
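A usage sketch based on the signature above; the unpacking assumes the documented return order (p-value, chi-square value, number of attributes):

import numpy as np
from expan.core.statistics import chi_square

x = np.random.choice(['a', 'b', 'c'], size=500)  # treatment sample
y = np.random.choice(['a', 'b', 'c'], size=500)  # control sample
p_value, chi2, n_attributes = chi_square(x, y, min_counts=5)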

expan.core.statistics.delta(x, y, assume_normal=True, percentiles=[2.5, 97.5], min_observations=20, nruns=10000, relative=False, x_weights=1, y_weights=1)

Calculates the difference of means between the samples (x-y) in a statistical sense, i.e. with confidence intervals.

NaNs are ignored: treated as if they weren’t included at all. This is done because at this level we cannot determine what a NaN means. In some cases, a NaN represents missing data that should be completely ignored, and in some cases it represents an inapplicable value (like PCII for non-ordering customers), in which case the NaNs should be replaced by zeros at a higher level. Replacing with zeros, however, would be completely incorrect for return rates.

Computation is done in form of treatment minus control, i.e. x-y

Parameters:
  • x (array_like) – sample of a treatment group
  • y (array_like) – sample of a control group
  • assume_normal (boolean) – specifies whether normal distribution assumptions can be made
  • percentiles (list) – list of percentile values for confidence bounds
  • min_observations (integer) – minimum number of observations needed
  • nruns (integer) – number of bootstrap runs; only used if assume_normal is False
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
  • x_weights (list) – weights for the x vector, in order to calculate the weighted mean and confidence intervals, which is equivalent to the overall metric. This weighted approach is only relevant for ratios.
  • y_weights (list) – weights for the y vector, in order to calculate the weighted mean and confidence intervals, which is equivalent to the overall metric. This weighted approach is only relevant for ratios.
Returns:

DeltaStatistics object
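A minimal sketch of the common case (normal assumption, default confidence bounds):

import numpy as np
from expan.core.statistics import delta

x = np.random.normal(loc=0.1, size=1000)  # treatment sample
y = np.random.normal(loc=0.0, size=1000)  # control sample

# returns a DeltaStatistics object holding the x-y mean difference and bounds
result = delta(x, y, assume_normal=True, percentiles=[2.5, 97.5])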

expan.core.statistics.estimate_std(x, mu, pctile)

Estimate the standard deviation from a given percentile, according to the z-score:

z = (x - mu) / sigma
Parameters:
  • x (float) – cumulated density at the given percentile
  • mu (float) – mean of the distribution
  • pctile (float) – percentile value (between 0 and 100)
Returns:

estimated standard deviation of the distribution

Return type:

float
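A worked sketch of the relation above: solving z = (x - mu) / sigma for sigma, with x = 1.96, mu = 0 and the 97.5th percentile (z of about 1.96) should give a sigma of about 1.0:

from expan.core.statistics import estimate_std

sigma = estimate_std(x=1.96, mu=0.0, pctile=97.5)  # expected: roughly 1.0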

expan.core.statistics.make_delta(assume_normal=True, percentiles=[2.5, 97.5], min_observations=20, nruns=10000, relative=False)

A closure over the delta function above, returning delta with the given arguments fixed.

expan.core.statistics.normal_difference(mean1, std1, n1, mean2, std2, n2, percentiles=[2.5, 97.5], relative=False)

Calculates the difference distribution of two normal distributions.

Computation is done in form of treatment minus control. It is assumed that the standard deviations of both distributions do not differ too much.

Parameters:
  • mean1 (float) – mean value of the treatment distribution
  • std1 (float) – standard deviation of the treatment distribution
  • n1 (integer) – number of samples of the treatment distribution
  • mean2 (float) – mean value of the control distribution
  • std2 (float) – standard deviation of the control distribution
  • n2 (integer) – number of samples of the control distribution
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – If relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

percentiles and corresponding values

Return type:

dict

For further information visit:
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals5.html
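A usage sketch from summary statistics only, based on the signature above:

from expan.core.statistics import normal_difference

# treatment-minus-control difference with 95% bounds
bounds = normal_difference(mean1=10.0, std1=2.0, n1=1000,
                           mean2=9.5, std2=2.1, n2=1000,
                           percentiles=[2.5, 97.5])
# bounds is a dict mapping each percentile to its value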
expan.core.statistics.normal_percentiles(mean, std, n, percentiles=[2.5, 97.5], relative=False)

Calculate the percentile values for a normal distribution with parameters estimated from samples.

Parameters:
  • mean (float) – mean value of the distribution
  • std (float) – standard deviation of the distribution
  • n (integer) – number of samples
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

percentiles and corresponding values

Return type:

dict

For more information visit:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda352.htm
http://www.boost.org/doc/libs/1_46_1/libs/math/doc/sf_and_dist/html/math_toolkit/dist/stat_tut/weg/st_eg/tut_mean_intervals.html
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
expan.core.statistics.normal_sample_difference(x, y, percentiles=[2.5, 97.5], relative=False)

Calculates the difference distribution of two normal distributions given by their samples.

Computation is done in form of treatment minus control. It is assumed that the standard deviations of both distributions do not differ too much.

Parameters:
  • x (array-like) – sample of a treatment group
  • y (array-like) – sample of a control group
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – If relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

percentiles and corresponding values

Return type:

dict
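The sample-based counterpart of normal_difference; a sketch using random data:

import numpy as np
from expan.core.statistics import normal_sample_difference

x = np.random.normal(loc=0.1, size=1000)  # treatment sample
y = np.random.normal(loc=0.0, size=1000)  # control sample
bounds = normal_sample_difference(x, y, percentiles=[2.5, 97.5])  # dict of percentiles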

expan.core.statistics.normal_sample_percentiles(values, percentiles=[2.5, 97.5], relative=False)

Calculate the percentile values for a sample assumed to be normally distributed. If normality cannot be assumed, use bootstrap_ci instead. NaNs are ignored (discarded before calculation).

Parameters:
  • values (array-like) – sample for which the normal distribution percentiles are computed.
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

percentiles and corresponding values

Return type:

dict

For further information visit:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda352.htm
http://www.boost.org/doc/libs/1_46_1/libs/math/doc/sf_and_dist/html/math_toolkit/dist/stat_tut/weg/st_eg/tut_mean_intervals.html
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
expan.core.statistics.pooled_std(std1, n1, std2, n2)

Returns the pooled estimate of standard deviation. Assumes that population variances are equal (std(v1)**2==std(v2)**2) - this assumption is checked for reasonableness and an exception is raised if this is strongly violated.

Parameters:
  • std1 (float) – standard deviation of first sample
  • n1 (integer) – size of first sample
  • std2 (float) – standard deviation of second sample
  • n2 (integer) – size of second sample
Returns:

Pooled standard deviation

Return type:

float

For further information visit:
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals5.html

Todo

Also implement a version for unequal variances.
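For reference, a sketch of the textbook pooled estimate described above (the library's own implementation additionally checks the equal-variance assumption):

import numpy as np

def pooled_std_reference(std1, n1, std2, n2):
    # pooled standard deviation under the equal-variance assumption
    return np.sqrt(((n1 - 1) * std1 ** 2 + (n2 - 1) * std2 ** 2) / (n1 + n2 - 2))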

expan.core.statistics.sample_size(x)

Calculates the sample size of a sample x.

Parameters:x (array_like) – sample to calculate sample size

Returns:sample size of the sample excluding nans
Return type:int

expan.core.util module

expan.core.util.drop_nan(np_array)
expan.core.util.find_list_of_dicts_element(items, key1, value, key2)
expan.core.util.generate_random_data()
expan.core.util.generate_random_data_n_variants(n_variants=3)
expan.core.util.get_column_names_by_type(df, dtype)
expan.core.util.is_number_and_nan(obj)
expan.core.util.scale_range(x, new_min=0.0, new_max=1.0, old_min=None, old_max=None, squash_outside_range=True, squash_inf=False)

Scales a sequence to fit within a new range.

If squash_inf is set, then infinite values will take on the extremes of the new range (as opposed to staying infinite).

Parameters:
  • x (array-like) – sequence of values to scale
  • new_min (float) – lower bound of the target range
  • new_max (float) – upper bound of the target range
  • old_min (float) – lower bound of the source range (inferred from the data if None)
  • old_max (float) – upper bound of the source range (inferred from the data if None)
  • squash_outside_range (boolean) – if True, values outside the source range are squashed to the bounds of the target range
  • squash_inf (boolean) – if True, infinite values take on the extremes of the new range
Note:
Infinity in the input is disregarded in the construction of the scale of the mapping.
>>> scale_range([1,3,5])
array([ 0. ,  0.5,  1. ])
>>> scale_range([1,2,3,4,5])
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
>>> scale_range([1,3,5, np.inf])
array([ 0. ,  0.5,  1. ,  inf])
>>> scale_range([1,3,5, -np.inf])
array([ 0. ,  0.5,  1. , -inf])
>>> scale_range([1,3,5, -np.inf], squash_inf=True)
array([ 0. ,  0.5,  1. ,  0. ])
>>> scale_range([1,3,5, np.inf], squash_inf=True)
array([ 0. ,  0.5,  1. ,  1. ])
>>> scale_range([1,3,5], new_min=0.5)
array([ 0.5 ,  0.75,  1.  ])
>>> scale_range([1,3,5], old_min=1, old_max=4)
array([ 0.        ,  0.66666667,  1.        ])
>>> scale_range([5], old_max=4)
array([ 1.])

expan.core.version module

expan.core.version.git_commit_count()

Returns the output of git rev-list --count HEAD as an int.

expan.core.version.git_latest_commit()

Returns the output of git rev-parse HEAD.

expan.core.version.version(format_str='{short}')

Returns current version number in specified format.

Parameters:format_str (str) – string defining the format of the version number

Returns:current version number in the specified format

expan.core.version.version_numbers()

Module contents

ExpAn core module.

expan.core.version(format_str='{short}')

Returns current version number in specified format.

Parameters:format_str (str) – string defining the format of the version number

Returns:current version number in the specified format