mars.tensor.stats.ks_1samp#

mars.tensor.stats.ks_1samp(x: Union[ndarray, list, TileableType], cdf: Callable, args: Tuple = (), alternative: str = 'two-sided', mode: str = 'auto')[source]#

Performs the one-sample Kolmogorov-Smirnov test for goodness of fit.

This test compares the underlying distribution F(x) of a sample against a given continuous distribution G(x). See Notes for a description of the available null and alternative hypotheses.

Parameters
  • x (array_like) – a 1-D array of observations of iid random variables.

  • cdf (callable) – callable used to calculate the cdf.

  • args (tuple, sequence, optional) – Distribution parameters, used with cdf.

  • alternative ({'two-sided', 'less', 'greater'}, optional) – Defines the null and alternative hypotheses. Default is ‘two-sided’. Please see explanations in the Notes below.

  • mode ({'auto', 'exact', 'approx', 'asymp'}, optional) –

    Defines the distribution used for calculating the p-value. The following options are available (default is 'auto'); a brief usage sketch follows this parameter list:

    • 'auto' : selects one of the other options.

    • 'exact' : uses the exact distribution of the test statistic.

    • 'approx' : approximates the two-sided probability with twice the one-sided probability.

    • 'asymp' : uses the asymptotic distribution of the test statistic.
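A minimal sketch of passing mode explicitly (outputs omitted because they depend on the random draw; SciPy is assumed to be available for the reference CDF, as in the examples below):

>>> from scipy import stats
>>> from mars.tensor.stats import ks_1samp
>>> sample = stats.norm.rvs(size=100)   # iid draws from the reference distribution
>>> res_exact = ks_1samp(sample, stats.norm.cdf, mode='exact').execute()
>>> res_asymp = ks_1samp(sample, stats.norm.cdf, mode='asymp').execute()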

Returns

  • statistic (float) – KS test statistic, either D, D+ or D- (depending on the value of ‘alternative’)

  • pvalue (float) – One-tailed or two-tailed p-value.
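A short sketch of consuming the return value, assuming the executed result exposes the statistic and pvalue fields the same way SciPy's KstestResult does:

>>> from scipy import stats
>>> from mars.tensor.stats import ks_1samp
>>> result = ks_1samp(stats.norm.rvs(size=100), stats.norm.cdf).execute()
>>> d, p = result.statistic, result.pvalue   # assumed KstestResult-style attributes
>>> reject = bool(p < 0.05)                  # True if the null is rejected at the 5% level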

See also

ks_2samp, kstest

Notes

There are three options for the null and corresponding alternative hypothesis that can be selected using the alternative parameter.

  • two-sided: The null hypothesis is that the two distributions are identical, F(x)=G(x) for all x; the alternative is that they are not identical.

  • less: The null hypothesis is that F(x) >= G(x) for all x; the alternative is that F(x) < G(x) for at least one x.

  • greater: The null hypothesis is that F(x) <= G(x) for all x; the alternative is that F(x) > G(x) for at least one x.

Note that the alternative hypotheses describe the CDFs of the underlying distributions, not the observed values. For example, suppose x1 ~ F and x2 ~ G. If F(x) > G(x) for all x, the values in x1 tend to be less than those in x2.
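For instance (a sketch mirroring the shifted-sample example below, with outputs omitted because they depend on the random draw): shifting the sample toward smaller values puts its CDF F(x) above the reference G(x), which is the situation the 'greater' alternative is designed to detect.

>>> from scipy import stats
>>> from mars.tensor.stats import ks_1samp
>>> x_small = stats.norm.rvs(loc=-0.2, size=100)   # shifted down, so F(x) > norm.cdf(x) for all x
>>> res = ks_1samp(x_small, stats.norm.cdf, alternative='greater').execute()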

Examples

>>> import numpy as np
>>> from scipy import stats
>>> import mars.tensor as mt
>>> from mars.tensor.stats import ks_1samp
>>> np.random.seed(12345678)  # fix random seed to get the same result
>>> x = mt.linspace(-15, 15, 9, chunk_size=5)
>>> ks_1samp(x, stats.norm.cdf).execute()
(0.44435602715924361, 0.038850142705171065)
>>> ks_1samp(stats.norm.rvs(size=100), stats.norm.cdf).execute()
KstestResult(statistic=0.165471391799..., pvalue=0.007331283245...)

Test against one-sided alternative hypothesis

Shift the distribution to larger values, so that CDF(x) < norm.cdf(x):

>>> x = stats.norm.rvs(loc=0.2, size=100)
>>> ks_1samp(x, stats.norm.cdf, alternative='less').execute()
KstestResult(statistic=0.235488541678..., pvalue=1.158315030683...)

Reject the null hypothesis in favor of the alternative hypothesis: less

>>> ks_1samp(x, stats.norm.cdf, alternative='greater').execute()
KstestResult(statistic=0.010167165616..., pvalue=0.972494973653...)

Don't reject the null hypothesis in favor of the alternative hypothesis: greater

>>> ks_1samp(x, stats.norm.cdf).execute()
KstestResult(statistic=0.235488541678..., pvalue=2.316630061366...)

Reject the null hypothesis in favor of the alternative hypothesis: two-sided

Testing t-distributed random variables against the normal distribution

With 100 degrees of freedom the t distribution looks close to the normal distribution, and the K-S test does not reject the hypothesis that the sample came from the normal distribution:

>>> ks_1samp(stats.t.rvs(100, size=100), stats.norm.cdf).execute()
KstestResult(statistic=0.077844250253..., pvalue=0.553155412513...)

With 3 degrees of freedom the t distribution differs more clearly from the normal distribution and the test statistic is correspondingly larger, although for this particular sample the p-value of roughly 0.11 falls just above the 10% threshold, so the null hypothesis is not quite rejected at that level:

>>> ks_1samp(stats.t.rvs(3, size=100), stats.norm.cdf).execute()
KstestResult(statistic=0.118967105356..., pvalue=0.108627114578...)