expan.core package

Submodules

expan.core.binning module

class expan.core.binning.Bin(bin_type, *repr_args)

Bases: object

Constructor for a bin object.

Parameters:
  • bin_type (string) – “numerical” or “categorical”
  • repr_args – arguments to represent this bin

For a numerical bin, repr_args are lower, upper, lower_closed, upper_closed; for a categorical bin, repr_args is the list of categories belonging to the bin.

class expan.core.binning.CategoricalRepresentation(categories)

Bases: object

Constructor for a representation of a categorical bin.

Parameters:categories (list) – list of categorical values that belong to this bin

apply_to_data(data, feature)

Apply the bin to data.

Parameters:
  • data (pandas.DataFrame) – data frame to apply the bin to
  • feature (string) – feature name on which this bin is defined
Returns:subset of the input data frame which belongs to this bin

class expan.core.binning.NumericalRepresentation(lower, upper, lower_closed, upper_closed)

Bases: object

Constructor for a representation of a numerical bin.

Parameters:
  • lower (float) – lower bound of the bin
  • upper (float) – upper bound of the bin
  • lower_closed (boolean) – whether the lower bound is closed
  • upper_closed (boolean) – whether the upper bound is closed

apply_to_data(data, feature)

Apply the bin to data.

Parameters:
  • data (pandas.DataFrame) – data frame to apply the bin to
  • feature (string) – feature name on which this bin is defined
Returns:subset of the input data frame which belongs to this bin

expan.core.binning.create_bins(data, n_bins)

Create bins from the data values.

Parameters:
  • data (list or 1-dim array) – data from which to determine the bins
  • n_bins (int) – number of bins to create
Returns:a list of Bin objects
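
For illustration, a minimal sketch of creating numerical bins from a DataFrame column (the column name 'revenue' is hypothetical):

>>> import numpy as np
>>> import pandas as pd
>>> from expan.core.binning import create_bins
>>> df = pd.DataFrame({'revenue': np.random.normal(size=1000)})
>>> bins = create_bins(df['revenue'].values, n_bins=4)  # a list of Bin objects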

expan.core.binning.toBinObject(bins)

expan.core.experiment module

class expan.core.experiment.Experiment(control_variant_name, data, metadata, report_kpi_names=None, derived_kpis=None)

Bases: object

Class which adds the analysis functions to experimental data.
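
A minimal end-to-end sketch, assuming generate_random_data() from expan.core.util returns a (data, metadata) pair whose KPI columns include 'normal_same':

>>> from expan.core.experiment import Experiment
>>> from expan.core.util import generate_random_data
>>> data, metadata = generate_random_data()
>>> exp = Experiment(control_variant_name='B', data=data, metadata=metadata,
...                  report_kpi_names=['normal_same'])
>>> results = exp.delta(method='fixed_horizon')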

delta(method='fixed_horizon', **worker_args)
filter(kpis, percentile=99.0, threshold_type='upper')

Method that filters out entities whose KPIs exceed the value at a given percentile. If any of the KPIs exceeds its threshold, the entity is filtered out.

Parameters:
  • kpis (list) – list of KPI names
  • percentile (float) – percentile considered as threshold
  • threshold_type (string) – type of threshold used (‘lower’ or ‘upper’)
Returns:

No return value; outliers are filtered out of self.data in place.
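
Continuing the Experiment sketch above, entities whose 'normal_same' KPI lies above the 99th percentile could be removed with:

>>> exp.filter(kpis=['normal_same'], percentile=99.0, threshold_type='upper')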

get_kpi_by_name_and_variant(data, name, variant)
sga(feature_name_to_bins, multi_test_correction=False)

Perform subgroup analysis.

Parameters:
  • feature_name_to_bins (dict) – dict of feature name (key) to list of Bin objects (value). This dict defines how, and on which column, to perform the subgroup split.
  • multi_test_correction (boolean) – whether a correction for multiple testing should be applied.
Returns:Analysis results per subgroup.
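
A sketch combining create_bins with sga; 'age' is a hypothetical numerical feature column in exp.data:

>>> from expan.core.binning import create_bins
>>> bins = create_bins(exp.data['age'].values, n_bins=3)
>>> subgroup_results = exp.sga({'age': bins})
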
sga_date(multi_test_correction=False)

Perform subgroup analysis partitioned by date, for each day from the start date to the end date. Produces a non-cumulative delta and confidence intervals for each subgroup.

Parameters:multi_test_correction (boolean) – whether a correction for multiple testing should be applied.
Returns:Analysis results per date.

expan.core.experimentdata module

expan.core.results module

expan.core.statistics module

expan.core.statistics.alpha_to_percentiles(alpha)

Transforms alpha value to corresponding percentile.

Parameters:alpha (float) – alpha value to transform
Returns:list of percentiles corresponding to the given alpha
expan.core.statistics.bootstrap(x, y, func=<function _delta_mean>, nruns=10000, percentiles=[2.5, 97.5], min_observations=20, return_bootstraps=False, relative=False, multi_test_correction=False, num_tests=1)

Bootstraps the confidence intervals for a particular function comparing two samples. NaNs are ignored (discarded before calculation).

Parameters:
  • x (array like) – sample of treatment group
  • y (array like) – sample of control group
  • func (function) – function of which the distribution is to be computed. The default comparison metric is the difference of means. For bootstrapping correlation: func=lambda x, y: scipy.stats.pearsonr(x, y)[0]
  • nruns (integer) – number of bootstrap runs to perform
  • percentiles (list) – The values corresponding to the given percentiles are returned. The default percentiles (2.5% and 97.5%) correspond to an alpha of 0.05.
  • min_observations (integer) – minimum number of observations necessary
  • return_bootstraps (boolean) – if set, the bootstrap samples are returned; otherwise the first return value is empty.
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
  • multi_test_correction (boolean) – whether a correction for multiple testing should be applied.
  • num_tests (integer) – number of tests or reported KPIs used for the multiple-testing correction.
Returns:

  • dict: percentile levels (index) and values
  • np.array (nruns): array containing the bootstrapping results per run

Return type:

tuple
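
A minimal sketch of the default use (difference of means between two samples; per the description above, the first return value maps the requested percentiles to values):

>>> import numpy as np
>>> from expan.core.statistics import bootstrap
>>> x = np.random.normal(0.1, 1.0, size=1000)   # treatment
>>> y = np.random.normal(0.0, 1.0, size=1000)   # control
>>> ci, bootstraps = bootstrap(x, y, nruns=1000)  # ci: e.g. {2.5: ..., 97.5: ...}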

expan.core.statistics.chi_square(x, y, min_counts=5)

Performs the chi-square homogeneity test on categorical arrays x and y.

Parameters:
  • x (array_like) – sample of the treatment variable to check
  • y (array_like) – sample of the control variable to check
  • min_counts (int) – drop categories where the minimum number of observations or expected observations is below min_counts for x or y
Returns:

  • float: p-value
  • float: chi-square value
  • int: number of attributes used (after dropping)

Return type:

tuple
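
For illustration (the categories 'a'/'b'/'c' are arbitrary):

>>> import numpy as np
>>> from expan.core.statistics import chi_square
>>> x = np.random.choice(['a', 'b', 'c'], size=500)
>>> y = np.random.choice(['a', 'b', 'c'], size=500)
>>> p_value, chisq, n_categories = chi_square(x, y, min_counts=5)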

expan.core.statistics.compute_statistical_power(x, y, alpha=0.05)

Compute statistical power.

Parameters:
  • x (array-like) – sample of a treatment group
  • y (array-like) – sample of a control group
  • alpha (float) – Type I error (false positive rate)
Returns:statistical power, i.e. the probability of a test detecting an effect if the effect actually exists.
Return type:float
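
For example, the power to detect a 0.1 shift between two unit-variance samples of 1000 observations each:

>>> import numpy as np
>>> from expan.core.statistics import compute_statistical_power
>>> x = np.random.normal(0.1, 1.0, size=1000)
>>> y = np.random.normal(0.0, 1.0, size=1000)
>>> power = compute_statistical_power(x, y, alpha=0.05)
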
expan.core.statistics.delta(x, y, assume_normal=True, percentiles=[2.5, 97.5], min_observations=20, nruns=10000, relative=False, x_weights=1, y_weights=1, multi_test_correction=False, num_tests=1)

Calculates the difference of means between the samples (x-y) in a statistical sense, i.e. with confidence intervals.

NaNs are ignored: treated as if they weren’t included at all. This is done because at this level we cannot determine what a NaN means. In some cases, a NaN represents missing data that should be completely ignored, and in some cases it represents inapplicable (like PCII for non-ordering customers) - in which case the NaNs should be replaced by zeros at a higher level. Replacing with zeros, however, would be completely incorrect for return rates.

Computation is done in the form of treatment minus control, i.e. x - y.

Parameters:
  • x (array_like) – sample of a treatment group
  • y (array_like) – sample of a control group
  • assume_normal (boolean) – specifies whether normal distribution assumptions can be made
  • percentiles (list) – list of percentile values for confidence bounds
  • min_observations (integer) – minimum number of observations needed
  • nruns (integer) – number of bootstrap runs to perform; only used if assume_normal is False
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
  • x_weights (list) – weights for the x vector, in order to calculate the weighted mean and confidence intervals, which is equivalent to the overall metric. This weighted approach is only relevant for ratios.
  • y_weights (list) – weights for the y vector, in order to calculate the weighted mean and confidence intervals, which is equivalent to the overall metric. This weighted approach is only relevant for ratios.
  • multi_test_correction (boolean) – whether a correction for multiple testing should be applied.
  • num_tests (integer) – number of tests or reported KPIs used for the multiple-testing correction.
Returns:

DeltaStatistics object
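
A minimal sketch (the attributes of DeltaStatistics are not listed here, so only the call is shown):

>>> import numpy as np
>>> from expan.core.statistics import delta
>>> x = np.random.normal(0.1, 1.0, size=1000)   # treatment
>>> y = np.random.normal(0.0, 1.0, size=1000)   # control
>>> stats = delta(x, y, assume_normal=True)     # DeltaStatistics object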

expan.core.statistics.estimate_sample_size(x, mde, n, r, alpha=0.05, beta=0.2)

Estimates the sample size based on the sample mean and variance, given the MDE (minimum detectable effect), the number of variants, and the variant split ratio.

Parameters:
  • x (pd.Series or pd.DataFrame) – sample to base estimation on
  • mde (float) – minimum detectable effect
  • n (int) – number of variants
  • r (float) – variant split ratio
  • alpha (float) – significance level
  • beta (float) – type II error
Returns:

estimated sample size

Return type:

float or pd.Series
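
A sketch of a two-variant setup; mde=0.01 and r=1.0 are illustrative values, not recommendations:

>>> import numpy as np
>>> import pandas as pd
>>> from expan.core.statistics import estimate_sample_size
>>> x = pd.Series(np.random.normal(0.0, 1.0, size=1000))
>>> n_required = estimate_sample_size(x, mde=0.01, n=2, r=1.0)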

expan.core.statistics.estimate_std(x, mu, pctile)

Estimate the standard deviation from a given percentile, according to the z-score:

z = (x - mu) / sigma
Parameters:
  • x (float) – value of the distribution at the given percentile
  • mu (float) – mean of the distribution
  • pctile (float) – percentile value (between 0 and 100)
Returns:

estimated standard deviation of the distribution

Return type:

float
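
As a worked example of the z-score relation above: for a standard normal distribution, the value at the 97.5th percentile is about 1.96, so the estimate should come out near 1:

>>> from expan.core.statistics import estimate_std
>>> sigma = estimate_std(x=1.96, mu=0.0, pctile=97.5)   # expected close to 1.0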

expan.core.statistics.make_delta(assume_normal=True, percentiles=[2.5, 97.5], min_observations=20, nruns=10000, relative=False, multi_test_correction=False, num_tests=1)

A closure over the delta function: returns a version of delta with the given arguments fixed.
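
Since make_delta is documented only as a closure over delta, the sketch below assumes the returned callable accepts the two samples like delta itself:

>>> import numpy as np
>>> from expan.core.statistics import make_delta
>>> my_delta = make_delta(assume_normal=True, percentiles=[2.5, 97.5])
>>> x = np.random.normal(0.1, 1.0, size=1000)
>>> y = np.random.normal(0.0, 1.0, size=1000)
>>> stats = my_delta(x, y)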

expan.core.statistics.normal_difference(mean1, std1, n1, mean2, std2, n2, percentiles=[2.5, 97.5], relative=False, multi_test_correction=False, num_tests=1)

Calculates the difference distribution of two normal distributions.

Computation is done in the form of treatment minus control. It is assumed that the standard deviations of both distributions do not differ too much.

Parameters:
  • mean1 (float) – mean value of the treatment distribution
  • std1 (float) – standard deviation of the treatment distribution
  • n1 (integer) – number of samples of the treatment distribution
  • mean2 (float) – mean value of the control distribution
  • std2 (float) – standard deviation of the control distribution
  • n2 (integer) – number of samples of the control distribution
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – If relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
  • multi_test_correction (boolean) – whether a correction for multiple testing should be applied.
  • num_tests (integer) – number of tests or reported KPIs used for the multiple-testing correction.
Returns:

percentiles and corresponding values

Return type:

dict

For further information visit:
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals5.html
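
For illustration, with summary statistics of two equal-sized, unit-variance samples (all numbers are arbitrary):

>>> from expan.core.statistics import normal_difference
>>> ci = normal_difference(mean1=0.12, std1=1.0, n1=1000,
...                        mean2=0.10, std2=1.0, n2=1000)  # {2.5: ..., 97.5: ...}
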
expan.core.statistics.normal_percentiles(mean, std, n, percentiles=[2.5, 97.5], relative=False)

Calculate the percentile values for a normal distribution with parameters estimated from samples.

Parameters:
  • mean (float) – mean value of the distribution
  • std (float) – standard deviation of the distribution
  • n (integer) – number of samples
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

percentiles and corresponding values

Return type:

dict

For more information visit:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda352.htm
http://www.boost.org/doc/libs/1_46_1/libs/math/doc/sf_and_dist/html/math_toolkit/dist/stat_tut/weg/st_eg/tut_mean_intervals.html
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
expan.core.statistics.normal_sample_difference(x, y, percentiles=[2.5, 97.5], relative=False, multi_test_correction=False, num_tests=1)

Calculates the difference distribution of two normal distributions given by their samples.

Computation is done in the form of treatment minus control. It is assumed that the standard deviations of both distributions do not differ too much.

Parameters:
  • x (array-like) – sample of a treatment group
  • y (array-like) – sample of a control group
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – If relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
  • multi_test_correction (boolean) – whether a correction for multiple testing should be applied.
  • num_tests (integer) – number of tests or reported KPIs used for the multiple-testing correction.
Returns:

percentiles and corresponding values

Return type:

dict

expan.core.statistics.normal_sample_percentiles(values, percentiles=[2.5, 97.5], relative=False)

Calculate the percentile values for a sample assumed to be normally distributed. If normality cannot be assumed, use bootstrap instead. NaNs are ignored (discarded before calculation).

Parameters:
  • values (array-like) – sample for which the normal distribution percentiles are computed.
  • percentiles (list) – list of percentile values to compute
  • relative (boolean) – if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. In this case, the interval is mean-ret_val[0] to mean+ret_val[1]. This is more useful in many situations because it corresponds with the sem() and std() functions.
Returns:

percentiles and corresponding values

Return type:

dict

For further information visit:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda352.htm
http://www.boost.org/doc/libs/1_46_1/libs/math/doc/sf_and_dist/html/math_toolkit/dist/stat_tut/weg/st_eg/tut_mean_intervals.html
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
expan.core.statistics.pooled_std(std1, n1, std2, n2)

Returns the pooled estimate of the standard deviation. Assumes that the population variances are equal (std(v1)**2 == std(v2)**2); this assumption is checked for reasonableness and an exception is raised if it is strongly violated.

Parameters:
  • std1 (float) – standard deviation of first sample
  • n1 (integer) – size of first sample
  • std2 (float) – standard deviation of second sample
  • n2 (integer) – size of second sample
Returns:

Pooled standard deviation

Return type:

float

For further information visit:
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals5.html
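
For reference, a sketch of the standard equal-variance pooled estimate (the textbook formula, not extracted from the ExpAn source):

>>> import numpy as np
>>> def pooled_std_reference(std1, n1, std2, n2):
...     return np.sqrt(((n1 - 1) * std1**2 + (n2 - 1) * std2**2)
...                    / (n1 + n2 - 2))
>>> s = pooled_std_reference(1.0, 100, 1.0, 100)   # equals 1.0 here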

Todo

Also implement a version for unequal variances.

expan.core.statistics.sample_size(x)

Calculates the sample size of a sample x.

Parameters:x (array_like) – sample for which to calculate the sample size
Returns:sample size of the sample, excluding NaNs
Return type:int

expan.core.util module

expan.core.util.drop_nan(np_array)
expan.core.util.find_list_of_dicts_element(items, key1, value, key2)
expan.core.util.generate_random_data()
expan.core.util.generate_random_data_n_variants(n_variants=3)
expan.core.util.get_column_names_by_type(df, dtype)
expan.core.util.is_number_and_nan(obj)
expan.core.util.scale_range(x, new_min=0.0, new_max=1.0, old_min=None, old_max=None, squash_outside_range=True, squash_inf=False)

Scales a sequence to fit within a new range.

If squash_inf is set, then infinite values will take on the extremes of the new range (as opposed to staying infinite).

Args:
x: sequence of values to scale
new_min: lower bound of the new range (default 0.0)
new_max: upper bound of the new range (default 1.0)
old_min: lower bound of the original range; inferred from x if not given
old_max: upper bound of the original range; inferred from x if not given
squash_outside_range: if True, values outside the old range are squashed to the boundaries of the new range
squash_inf: if True, infinite values take on the extremes of the new range
Note:
Infinity in the input is disregarded in the construction of the scale of the mapping.
>>> scale_range([1,3,5])
array([ 0. ,  0.5,  1. ])
>>> scale_range([1,2,3,4,5])
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
>>> scale_range([1,3,5, np.inf])
array([ 0. ,  0.5,  1. ,  inf])
>>> scale_range([1,3,5, -np.inf])
array([ 0. ,  0.5,  1. , -inf])
>>> scale_range([1,3,5, -np.inf], squash_inf=True)
array([ 0. ,  0.5,  1. ,  0. ])
>>> scale_range([1,3,5, np.inf], squash_inf=True)
array([ 0. ,  0.5,  1. ,  1. ])
>>> scale_range([1,3,5], new_min=0.5)
array([ 0.5 ,  0.75,  1.  ])
>>> scale_range([1,3,5], old_min=1, old_max=4)
array([ 0.        ,  0.66666667,  1.        ])
>>> scale_range([5], old_max=4)
array([ 1.])

expan.core.version module

expan.core.version.git_commit_count()

Returns the output of git rev-list --count HEAD as an int.

expan.core.version.git_latest_commit()

Returns the output of git rev-parse HEAD.

expan.core.version.version(format_str='{short}')

Returns current version number in specified format.

Parameters:format_str (str) – format string for the version (default '{short}')

Returns:the current version, formatted according to format_str

expan.core.version.version_numbers()

Module contents

ExpAn core module.

expan.core.version(format_str='{short}')

Returns current version number in specified format.

Parameters:format_str (str) – format string for the version (default '{short}')

Returns:the current version, formatted according to format_str