expan.core package

Submodules

expan.core.binning module

class expan.core.binning.Bin(bin_type, *repr_args)

Bases: object

Constructor for a bin object.

Parameters:
  • bin_type (string) – “numerical” or “categorical”
  • repr_args – arguments to represent this bin

For a numerical bin, repr_args are lower, upper, lower_closed, upper_closed; for a categorical bin, repr_args is the list of categories belonging to the bin.

class expan.core.binning.CategoricalRepresentation(categories)

Bases: object

Constructor for a representation of a categorical bin.

Parameters:categories (list) – list of categorical values that belong to this bin

apply_to_data(data, feature)

Apply the bin to data.

Parameters:
  • data (pandas.DataFrame) – data frame to apply the bin to
  • feature (string) – feature name on which this bin is defined
Returns:subset of the input data frame which belongs to this bin

class expan.core.binning.NumericalRepresentation(lower, upper, lower_closed, upper_closed)

Bases: object

Constructor for a representation of a numerical bin.

Parameters:
  • lower (float) – lower bound of the bin
  • upper (float) – upper bound of the bin
  • lower_closed (boolean) – whether the lower bound is closed
  • upper_closed (boolean) – whether the upper bound is closed

apply_to_data(data, feature)

Apply the bin to data.

Parameters:
  • data (pandas.DataFrame) – data frame to apply the bin to
  • feature (string) – feature name on which this bin is defined
Returns:subset of the input data frame which belongs to this bin

expan.core.binning.create_bins(data, n_bins)

Create bins from the data values.

Parameters:
  • data (list or 1-dim array) – data from which to determine the bins
  • n_bins (int) – number of bins to create
Returns:a list of Bin objects
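
For illustration, a minimal sketch of creating numerical bins from a DataFrame column (the column name 'revenue' is hypothetical):

>>> import numpy as np
>>> import pandas as pd
>>> from expan.core.binning import create_bins
>>> df = pd.DataFrame({'revenue': np.random.normal(size=1000)})
>>> bins = create_bins(df['revenue'].values, n_bins=4)  # a list of Bin objects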

expan.core.binning.toBinObject(bins)

expan.core.experiment module

class expan.core.experiment.Experiment(control_variant_name, data, metadata, report_kpi_names=None, derived_kpis=None)

Bases: object

Class which adds the analysis functions to experimental data.
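
A minimal end-to-end sketch, assuming generate_random_data() from expan.core.util returns a (data, metadata) pair whose KPI columns include 'normal_same':

>>> from expan.core.experiment import Experiment
>>> from expan.core.util import generate_random_data
>>> data, metadata = generate_random_data()
>>> exp = Experiment(control_variant_name='B', data=data, metadata=metadata,
...                  report_kpi_names=['normal_same'])
>>> results = exp.delta(method='fixed_horizon')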

delta(method='fixed_horizon', **worker_args)
filter(kpis, percentile=99.0, threshold_type='upper')

Method that filters out entities whose KPIs exceed the value at a given percentile. If any of the KPIs exceeds its threshold, the entity is filtered out.

Parameters:
  • kpis (list) – list of KPI names
  • percentile (float) – percentile considered as threshold
  • threshold_type (string) – type of threshold used (‘lower’ or ‘upper’)
Returns:

No return value; outliers are filtered out of self.data in place.
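
Continuing the Experiment sketch above, entities whose 'normal_same' KPI lies above the 99th percentile could be removed with:

>>> exp.filter(kpis=['normal_same'], percentile=99.0, threshold_type='upper')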

get_kpi_by_name_and_variant(data, name, variant)
sga(feature_name_to_bins, multi_test_correction=False)

Perform subgroup analysis.

Parameters:
  • feature_name_to_bins (dict) – dict of feature name (key) to list of Bin objects (value). This dict defines how, and on which column, to perform the subgroup split.
  • multi_test_correction (boolean) – whether a correction for multiple testing should be applied.
Returns:Analysis results per subgroup.
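
A sketch combining create_bins with sga; 'age' is a hypothetical numerical feature column in exp.data:

>>> from expan.core.binning import create_bins
>>> bins = create_bins(exp.data['age'].values, n_bins=3)
>>> subgroup_results = exp.sga({'age': bins})
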
sga_date(multi_test_correction=False)

Perform subgroup analysis partitioned by date, for each day from the start date to the end date. Produces a non-cumulative delta and confidence intervals for each subgroup.

Parameters:multi_test_correction (boolean) – whether a correction for multiple testing should be applied.
Returns:Analysis results per date.

expan.core.experimentdata module

expan.core.results module

expan.core.statistics module

expan.core.statistics.alpha_to_percentiles(alpha)

Transforms alpha value to corresponding percentile.

Parameters:alpha (float) – alpha value to transform
Returns:list of percentiles corresponding to the given alpha
expan.core.statistics.bootstrap(x, y, func=<function _delta_mean>, nruns=10000, percentiles=[2.5, 97.5], min_observations=20, return_bootstraps=False, relative=False, multi_test_correction=False, num_tests=1)

Bootstraps the confidence intervals for a particular function comparing two samples. NaNs are ignored (discarded before calculation).

Parameters:
  • x (array like) – sample of treatment group
  • y (array like) – sample of control group
  • func (function) – function of which the distribution is to be computed. The default comparison metric is the difference of means. For bootstrapping correlation: func=lambda x, y: scipy.stats.pearsonr(x, y)[0]
  • nruns (integer) – number of bootstrap runs to perform
  • percentiles (list) – The values corresponding to the given percentiles are returned. The default percentiles (2.5% and 97.5%) correspond to an alpha of 0.05.
  • min_observations (integer) – minimum number of observations necessary
  • return_bootstraps (boolean) – if set, the bootstrap samples are returned; otherwise the first return value is empty.
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
  • multi_test_correction (boolean) – whether a correction for multiple testing should be applied.
  • num_tests (integer) – number of tests or reported KPIs used for the multiple-testing correction.
Returns:

  • dict: percentile levels (index) and values
  • np.array (nruns): array containing the bootstrapping results per run

Return type:

tuple
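
A minimal sketch of the default use (difference of means between two samples; per the description above, the first return value maps the requested percentiles to values):

>>> import numpy as np
>>> from expan.core.statistics import bootstrap
>>> x = np.random.normal(0.1, 1.0, size=1000)   # treatment
>>> y = np.random.normal(0.0, 1.0, size=1000)   # control
>>> ci, bootstraps = bootstrap(x, y, nruns=1000)  # ci: e.g. {2.5: ..., 97.5: ...}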

expan.core.statistics.chi_square(x, y, min_counts=5)

Performs the chi-square homogeneity test on categorical arrays x and y.

Parameters:
  • x (array_like) – sample of the treatment variable to check
  • y (array_like) – sample of the control variable to check
  • min_counts (int) – drop categories where the minimum number of observations or expected observations is below min_counts for x or y
Returns:

  • float: p-value
  • float: chi-square value
  • int: number of attributes used (after dropping)

Return type:

tuple
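
For illustration (the categories 'a'/'b'/'c' are arbitrary):

>>> import numpy as np
>>> from expan.core.statistics import chi_square
>>> x = np.random.choice(['a', 'b', 'c'], size=500)
>>> y = np.random.choice(['a', 'b', 'c'], size=500)
>>> p_value, chisq, n_categories = chi_square(x, y, min_counts=5)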

expan.core.statistics.compute_statistical_power(x, y, alpha=0.05)

Compute statistical power.

Parameters:
  • x (array-like) – sample of a treatment group
  • y (array-like) – sample of a control group
  • alpha (float) – Type I error (false positive rate)
Returns:statistical power, i.e. the probability of a test detecting an effect if the effect actually exists.
Return type:float
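
For example, the power to detect a 0.1 shift between two unit-variance samples of 1000 observations each:

>>> import numpy as np
>>> from expan.core.statistics import compute_statistical_power
>>> x = np.random.normal(0.1, 1.0, size=1000)
>>> y = np.random.normal(0.0, 1.0, size=1000)
>>> power = compute_statistical_power(x, y, alpha=0.05)
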
expan.core.statistics.delta(x, y, assume_normal=True, percentiles=[2.5, 97.5], min_observations=20, nruns=10000, relative=False, x_weights=1, y_weights=1, multi_test_correction=False, num_tests=1)

Calculates the difference of means between the samples (x-y) in a statistical sense, i.e. with confidence intervals.

NaNs are ignored: treated as if they weren’t included at all. This is done because at this level we cannot determine what a NaN means. In some cases, a NaN represents missing data that should be completely ignored, and in some cases it represents inapplicable (like PCII for non-ordering customers) - in which case the NaNs should be replaced by zeros at a higher level. Replacing with zeros, however, would be completely incorrect for return rates.

Computation is done in the form of treatment minus control, i.e. x - y.

Parameters:
  • x (array_like) – sample of a treatment group
  • y (array_like) – sample of a control group
  • assume_normal (boolean) – specifies whether normal distribution assumptions can be made
  • percentiles (list) – list of percentile values for confidence bounds
  • min_observations (integer) – minimum number of observations needed
  • nruns (integer) – number of bootstrap runs to perform; only used if assume_normal is False
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
  • x_weights (list) – weights for the x vector, in order to calculate the weighted mean and confidence intervals, which is equivalent to the overall metric. This weighted approach is only relevant for ratios.
  • y_weights (list) – weights for the y vector, in order to calculate the weighted mean and confidence intervals, which is equivalent to the overall metric. This weighted approach is only relevant for ratios.
  • multi_test_correction (boolean) – whether a correction for multiple testing should be applied.
  • num_tests (integer) – number of tests or reported KPIs used for the multiple-testing correction.
Returns:

DeltaStatistics object
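
A minimal sketch (the attributes of DeltaStatistics are not listed here, so only the call is shown):

>>> import numpy as np
>>> from expan.core.statistics import delta
>>> x = np.random.normal(0.1, 1.0, size=1000)   # treatment
>>> y = np.random.normal(0.0, 1.0, size=1000)   # control
>>> stats = delta(x, y, assume_normal=True)     # DeltaStatistics object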

expan.core.statistics.estimate_sample_size(x, mde, n, r, alpha=0.05, beta=0.2)

Estimates the sample size based on the sample mean and variance, given the MDE (minimum detectable effect), the number of variants, and the variant split ratio.

Parameters:
  • x (pd.Series or pd.DataFrame) – sample to base estimation on
  • mde (float) – minimum detectable effect
  • n (int) – number of variants
  • r (float) – variant split ratio
  • alpha (float) – significance level
  • beta (float) – type II error
Returns:

estimated sample size

Return type:

float or pd.Series
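
A sketch of a two-variant setup; mde=0.01 and r=1.0 are illustrative values, not recommendations:

>>> import numpy as np
>>> import pandas as pd
>>> from expan.core.statistics import estimate_sample_size
>>> x = pd.Series(np.random.normal(0.0, 1.0, size=1000))
>>> n_required = estimate_sample_size(x, mde=0.01, n=2, r=1.0)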

expan.core.statistics.estimate_std(x, mu, pctile)

Estimate the standard deviation from a given percentile, according to the z-score:

z = (x - mu) / sigma
Parameters:
  • x (float) – value of the distribution at the given percentile
  • mu (float) – mean of the distribution
  • pctile (float) – percentile value (between 0 and 100)
Returns:

estimated standard deviation of the distribution

Return type:

float
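
As a worked example of the z-score relation above: for a standard normal distribution, the value at the 97.5th percentile is about 1.96, so the estimate should come out near 1:

>>> from expan.core.statistics import estimate_std
>>> sigma = estimate_std(x=1.96, mu=0.0, pctile=97.5)   # expected close to 1.0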

expan.core.statistics.make_delta(assume_normal=True, percentiles=[2.5, 97.5], min_observations=20, nruns=10000, relative=False, multi_test_correction=False, num_tests=1)

A closure over the delta function: returns a version of delta with the given arguments fixed.
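
Since make_delta is documented only as a closure over delta, the sketch below assumes the returned callable accepts the two samples like delta itself:

>>> import numpy as np
>>> from expan.core.statistics import make_delta
>>> my_delta = make_delta(assume_normal=True, percentiles=[2.5, 97.5])
>>> x = np.random.normal(0.1, 1.0, size=1000)
>>> y = np.random.normal(0.0, 1.0, size=1000)
>>> stats = my_delta(x, y)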

expan.core.statistics.normal_difference(mean1, std1, n1, mean2, std2, n2, percentiles=[2.5, 97.5], relative=False, multi_test_correction=False, num_tests=1)

Calculates the difference distribution of two normal distributions.

Computation is done in the form of treatment minus control. It is assumed that the standard deviations of both distributions do not differ too much.

Parameters:
  • mean1 (float) – mean value of the treatment distribution
  • std1 (float) – standard deviation of the treatment distribution
  • n1 (integer) – number of samples of the treatment distribution
  • mean2 (float) – mean value of the control distribution
  • std2 (float) – standard deviation of the control distribution
  • n2 (integer) – number of samples of the control distribution
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – If relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
  • multi_test_correction (boolean) – whether a correction for multiple testing should be applied.
  • num_tests (integer) – number of tests or reported KPIs used for the multiple-testing correction.
Returns:

percentiles and corresponding values

Return type:

dict

For further information visit:
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals5.html
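
For illustration, with summary statistics of two equal-sized, unit-variance samples (all numbers are arbitrary):

>>> from expan.core.statistics import normal_difference
>>> ci = normal_difference(mean1=0.12, std1=1.0, n1=1000,
...                        mean2=0.10, std2=1.0, n2=1000)  # {2.5: ..., 97.5: ...}
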
expan.core.statistics.normal_percentiles(mean, std, n, percentiles=[2.5, 97.5], relative=False)

Calculate the percentile values for a normal distribution with parameters estimated from samples.

Parameters:
  • mean (float) – mean value of the distribution
  • std (float) – standard deviation of the distribution
  • n (integer) – number of samples
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

percentiles and corresponding values

Return type:

dict

For more information visit:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda352.htm
http://www.boost.org/doc/libs/1_46_1/libs/math/doc/sf_and_dist/html/math_toolkit/dist/stat_tut/weg/st_eg/tut_mean_intervals.html
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
expan.core.statistics.normal_sample_difference(x, y, percentiles=[2.5, 97.5], relative=False, multi_test_correction=False, num_tests=1)

Calculates the difference distribution of two normal distributions given by their samples.

Computation is done in the form of treatment minus control. It is assumed that the standard deviations of both distributions do not differ too much.

Parameters:
  • x (array-like) – sample of a treatment group
  • y (array-like) – sample of a control group
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – If relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
  • multi_test_correction (boolean) – whether a correction for multiple testing should be applied.
  • num_tests (integer) – number of tests or reported KPIs used for the multiple-testing correction.
Returns:

percentiles and corresponding values

Return type:

dict

expan.core.statistics.normal_sample_percentiles(values, percentiles=[2.5, 97.5], relative=False)

Calculate the percentile values for a sample assumed to be normally distributed. If normality cannot be assumed, use bootstrap instead. NaNs are ignored (discarded before calculation).

Parameters:
  • values (array-like) – sample for which the normal distribution percentiles are computed.
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

percentiles and corresponding values

Return type:

dict

For further information visit:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda352.htm
http://www.boost.org/doc/libs/1_46_1/libs/math/doc/sf_and_dist/html/math_toolkit/dist/stat_tut/weg/st_eg/tut_mean_intervals.html
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
expan.core.statistics.pooled_std(std1, n1, std2, n2)

Returns the pooled estimate of the standard deviation. Assumes that the population variances are equal (std(v1)**2 == std(v2)**2); this assumption is checked for reasonableness and an exception is raised if it is strongly violated.

Parameters:
  • std1 (float) – standard deviation of first sample
  • n1 (integer) – size of first sample
  • std2 (float) – standard deviation of second sample
  • n2 (integer) – size of second sample
Returns:

Pooled standard deviation

Return type:

float

For further information visit:
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals5.html
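
For reference, a sketch of the standard equal-variance pooled estimate (the textbook formula, not extracted from the ExpAn source):

>>> import numpy as np
>>> def pooled_std_reference(std1, n1, std2, n2):
...     return np.sqrt(((n1 - 1) * std1**2 + (n2 - 1) * std2**2)
...                    / (n1 + n2 - 2))
>>> s = pooled_std_reference(1.0, 100, 1.0, 100)   # equals 1.0 here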

Todo

Also implement a version for unequal variances.

expan.core.statistics.sample_size(x)

Calculates the sample size of a sample x.

Parameters:x (array_like) – sample for which to calculate the sample size
Returns:sample size of the sample, excluding NaNs
Return type:int

expan.core.util module

expan.core.util.drop_nan(np_array)
expan.core.util.find_list_of_dicts_element(items, key1, value, key2)
expan.core.util.generate_random_data()
expan.core.util.generate_random_data_n_variants(n_variants=3)
expan.core.util.get_column_names_by_type(df, dtype)
expan.core.util.is_number_and_nan(obj)
expan.core.util.scale_range(x, new_min=0.0, new_max=1.0, old_min=None, old_max=None, squash_outside_range=True, squash_inf=False)

Scales a sequence to fit within a new range.

If squash_inf is set, then infinite values will take on the extremes of the new range (as opposed to staying infinite).

Args:
x: sequence of values to scale
new_min: lower bound of the new range (default 0.0)
new_max: upper bound of the new range (default 1.0)
old_min: lower bound of the original range; inferred from x if not given
old_max: upper bound of the original range; inferred from x if not given
squash_outside_range: if True, values outside the old range are squashed to the boundaries of the new range
squash_inf: if True, infinite values take on the extremes of the new range
Note:
Infinity in the input is disregarded in the construction of the scale of the mapping.
>>> scale_range([1,3,5])
array([ 0. ,  0.5,  1. ])
>>> scale_range([1,2,3,4,5])
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
>>> scale_range([1,3,5, np.inf])
array([ 0. ,  0.5,  1. ,  inf])
>>> scale_range([1,3,5, -np.inf])
array([ 0. ,  0.5,  1. , -inf])
>>> scale_range([1,3,5, -np.inf], squash_inf=True)
array([ 0. ,  0.5,  1. ,  0. ])
>>> scale_range([1,3,5, np.inf], squash_inf=True)
array([ 0. ,  0.5,  1. ,  1. ])
>>> scale_range([1,3,5], new_min=0.5)
array([ 0.5 ,  0.75,  1.  ])
>>> scale_range([1,3,5], old_min=1, old_max=4)
array([ 0.        ,  0.66666667,  1.        ])
>>> scale_range([5], old_max=4)
array([ 1.])

expan.core.version module

expan.core.version.git_commit_count()

Returns the output of git rev-list --count HEAD as an int.

expan.core.version.git_latest_commit()

Returns the output of git rev-parse HEAD.

expan.core.version.version(format_str='{short}')

Returns current version number in specified format.

Parameters:format_str (str) – format string for the version (default '{short}')

Returns:the current version, formatted according to format_str

expan.core.version.version_numbers()

Module contents

ExpAn core module.

expan.core.version(format_str='{short}')

Returns current version number in specified format.

Parameters:format_str (str) – format string for the version (default '{short}')

Returns:the current version, formatted according to format_str