## How to calculate a Confidence Interval using numpy.percentile() in Python - python

### Pythonic way of detecting outliers in one dimensional observation data

```For the given data, I want to set the outlier values (defined by 95% confidense level or 95% quantile function or anything that is required) as nan values. Following is the my data and code that I am using right now. I would be glad if someone could explain me further.
import numpy as np, matplotlib.pyplot as plt
data = np.random.rand(1000)+5.0
plt.plot(data)
plt.xlabel('observation number')
plt.ylabel('recorded value')
plt.show()
```
```The problem with using percentile is that the points identified as outliers is a function of your sample size.
There are a huge number of ways to test for outliers, and you should give some thought to how you classify them. Ideally, you should use a-priori information (e.g. "anything above/below this value is unrealistic because...")
However, a common, not-too-unreasonable outlier test is to remove points based on their "median absolute deviation".
Here's an implementation for the N-dimensional case (from some code for a paper here: https://github.com/joferkington/oost_paper_code/blob/master/utilities.py):
def is_outlier(points, thresh=3.5):
"""
Returns a boolean array with True if points are outliers and False
otherwise.
Parameters:
-----------
points : An numobservations by numdimensions array of observations
thresh : The modified z-score to use as a threshold. Observations with
a modified z-score (based on the median absolute deviation) greater
than this value will be classified as outliers.
Returns:
--------
mask : A numobservations-length boolean array.
References:
----------
Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
Handle Outliers", The ASQC Basic References in Quality Control:
Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
"""
if len(points.shape) == 1:
points = points[:,None]
median = np.median(points, axis=0)
diff = np.sum((points - median)**2, axis=-1)
diff = np.sqrt(diff)
med_abs_deviation = np.median(diff)
modified_z_score = 0.6745 * diff / med_abs_deviation
return modified_z_score > thresh
This is very similar to one of my previous answers, but I wanted to illustrate the sample size effect in detail.
Let's compare a percentile-based outlier test (similar to #CTZhu's answer) with a median-absolute-deviation (MAD) test for a variety of different sample sizes:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def main():
for num in [10, 50, 100, 1000]:
# Generate some data
x = np.random.normal(0, 0.5, num-3)
x = np.r_[x, -3, -10, 12]
plot(x)
plt.show()
if len(points.shape) == 1:
points = points[:,None]
median = np.median(points, axis=0)
diff = np.sum((points - median)**2, axis=-1)
diff = np.sqrt(diff)
med_abs_deviation = np.median(diff)
modified_z_score = 0.6745 * diff / med_abs_deviation
return modified_z_score > thresh
def percentile_based_outlier(data, threshold=95):
diff = (100 - threshold) / 2.0
minval, maxval = np.percentile(data, [diff, 100 - diff])
return (data < minval) | (data > maxval)
def plot(x):
fig, axes = plt.subplots(nrows=2)
for ax, func in zip(axes, [percentile_based_outlier, mad_based_outlier]):
sns.distplot(x, ax=ax, rug=True, hist=False)
outliers = x[func(x)]
ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)
kwargs = dict(y=0.95, x=0.05, ha='left', va='top')
axes.set_title('Percentile-based Outliers', **kwargs)
fig.suptitle('Comparing Outlier Tests with n={}'.format(len(x)), size=14)
main()
Notice that the MAD-based classifier works correctly regardless of sample-size, while the percentile based classifier classifies more points the larger the sample size is, regardless of whether or not they are actually outliers.
```
```Detection of outliers in one dimensional data depends on its distribution
1- Normal Distribution :
Data values are almost equally distributed over the expected range :
In this case you easily use all the methods that include mean ,like the confidence interval of 3 or 2 standard deviations(95% or 99.7%) accordingly for a normally distributed data (central limit theorem and sampling distribution of sample mean).I is a highly effective method.
Explained in Khan Academy statistics and Probability - sampling distribution library.
One other way is prediction interval if you want confidence interval of data points rather than mean.
Data values are are randomly distributed over a range:
mean may not be a fair representation of the data, because the average is easily influenced by outliers (very small or large values in the data set that are not typical)
The median is another way to measure the center of a numerical data set.
Median Absolute deviation - a method which measures the distance of all points from the median in terms of median distance
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm - has a good explanation as explained in Joe Kington's answer above
2 - Symmetric Distribution : Again Median Absolute Deviation is a good method if the z-score calculation and threshold is changed accordingly
Explanation :
http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/
3 - Asymmetric Distribution : Double MAD - Double Median Absolute Deviation
Explanation in the above attached link
Attaching my python code for reference :
"""
FOR ASSYMMETRIC DISTRIBUTION
Returns : filtered array excluding the outliers
Parameters : the actual data Points array
Calculates median to divide data into 2 halves.(skew conditions handled)
Then those two halves are treated as separate data with calculation same as for symmetric distribution.(first answer)
Only difference being , the thresholds are now the median distance of the right and left median with the actual data median
"""
if len(points.shape) == 1:
points = points[:,None]
median = np.median(points, axis=0)
medianIndex = (points.size/2)
leftData = np.copy(points[0:medianIndex])
rightData = np.copy(points[medianIndex:points.size])
median1 = np.median(leftData, axis=0)
diff1 = np.sum((leftData - median1)**2, axis=-1)
diff1 = np.sqrt(diff1)
median2 = np.median(rightData, axis=0)
diff2 = np.sum((rightData - median2)**2, axis=-1)
diff2 = np.sqrt(diff2)
med_abs_deviation1 = max(np.median(diff1),0.000001)
med_abs_deviation2 = max(np.median(diff2),0.000001)
threshold1 = ((median-median1)/med_abs_deviation1)*3
threshold2 = ((median2-median)/med_abs_deviation2)*3
#if any threshold is 0 -> no outliers
if threshold1==0:
threshold1 = sys.maxint
if threshold2==0:
threshold2 = sys.maxint
#multiplied by a factor so that only the outermost points are removed
modified_z_score1 = 0.6745 * diff1 / med_abs_deviation1
modified_z_score2 = 0.6745 * diff2 / med_abs_deviation2
filtered1 = []
i = 0
for data in modified_z_score1:
if data < threshold1:
filtered1.append(leftData[i])
i += 1
i = 0
filtered2 = []
for data in modified_z_score2:
if data < threshold2:
filtered2.append(rightData[i])
i += 1
filtered = filtered1 + filtered2
return filtered
```
```I've adapted the code from http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers and it gives the same results as Joe Kington's, but uses L1 distance instead of L2 distance, and has support for asymmetric distributions. The original R code did not have Joe's 0.6745 multiplier, so I also added that in for consistency within this thread. Not 100% sure if it's necessary, but makes the comparison apples-to-apples.
# warning: this function does not check for NAs
# nor does it address issues when
# more than 50% of your data have identical values
m = np.median(y)
abs_dev = np.abs(y - m)
modified_z_score = 0.6745 * abs_dev / y_mad
modified_z_score[y == m] = 0
return modified_z_score > thresh
```
```Use np.percentile as #Martin suggested:
percentiles = np.percentile(data, [2.5, 97.5])
# or =>, <= for within 95%
data[(percentiles<data) & (percentiles>data)]
# set the outliners to np.nan
data[(percentiles>data) | (percentiles<data)] = np.nan
```
```Well a simple solution can also be, removing something which outside 2 standard deviations(or 1.96):
import random
def outliers(tmp):
"""tmp is a list of numbers"""
outs = []
mean = sum(tmp)/(1.0*len(tmp))
var = sum((tmp[i] - mean)**2 for i in range(0, len(tmp)))/(1.0*len(tmp))
std = var**0.5
outs = [tmp[i] for i in range(0, len(tmp)) if abs(tmp[i]-mean) > 1.96*std]
return outs
lst = [random.randrange(-10, 55) for _ in range(40)]
print lst
print outliers(lst)```

### R / Python confidence interval

```Just trying to figure out a R -> Python thing: Why are these two items not giving the same results?
Calculating 95% confidence interval for sample data with mean = 65, s = 22, n = 121.
R:
tsum.test(n.x=121, mean.x=65, s.x=22)
gives 95% confidence interval of
61.04014 68.95986
Python:
stats.norm.interval(alpha=0.95, loc=65, scale=22/np.sqrt(121))
gives 95% confidence interval of
(61.080072030919894, 68.9199279690801)
I thought that these should be identical results, or am I not using the appropriate equivalent Python function for R's tsum.test ?
```
```Upon further investigation I can see that I was wrong to assume to use stats.norm for this.
scipy.stats.t allows for the degree of freedom calculation that R's tsum.test is doing automatically:
stats.t.interval(alpha = 0.95, df = 121-1, loc = 65, scale= 22/np.sqrt(121))
returns
(61.04013918989445, 68.95986081010555)
which round at 5 decimal points to the answers given by tsum.test in R.
The general function I am using, if helpful, is this:
def get_conf_interval_from_sample(n, mean, sigma, alpha = 0.95) :
"""Get confidence interval from sample data with sample of n, mean, sigma, where df = n-1
Equivalent to getting confidence interval using t.test / tsum.test in R"""
df = n-1
scale = sigma / np.sqrt(n)
return stats.t.interval(alpha=alpha, df=df, loc=mean, scale=scale)```````

### DBSCAN Silhouette Coefficients: does this for-loop work?

```I'm trying to compare the results of my classmates Silhouette Score calculations to mine, and am having some trouble wrapping my head around their for-loop. I'm not looking for freebies, we've already submitted the below for grading, just trying to understand what's going on here for future reference.
The question:
Using DBSCAN iterate (for-loop) through different values of min_samples (1 to 10) and epsilon (.05 to .5, in steps of .01) to find clusters in the road-data used in the Lesson and calculate the Silohouette Coeff for min_samples and epsilon.
osm lat lon alt
0 144552912 9.349849 56.740876 17.052772
1 144552912 9.350188 56.740679 17.614840
2 144552912 9.350549 56.740544 18.083536
...
434873 93323209 9.943451 57.496270 24.635285
434874 rows × 4 columns
(Updated Edit) Normalized:
#Normalize sample from dataset
XX = X.copy()
XX['alt'] = (X.alt - X.alt.mean())/X.alt.std()
XX['lat'] = (X.lat - X.lat.mean())/X.lat.std()
XX['lon'] = (X.lon - X.lon.mean())/X.lon.std()
Classmates loop:
start = 0.0
stop = 0.45
step = 0.01
my_list = np.arange(start, stop+step, step)
startb = 1
stopb = 10
stepb = .2 # To scale proportionately with epsilon increments
my_listb = np.arange(startb, stopb+stepb, stepb)
my_range = range(45)
one = []
for i in tqdm(my_range):
dbscan = DBSCAN(eps = .05 + my_list[i] , min_samples = 1 + my_listb[i])
XX.cluster = dbscan.fit_predict(XX[['lat','lon']])
one.append(metrics.silhouette_score(XX[['lat', 'lon']], XX.cluster))
My Loop(s):
(I broke my solution up into 10 loops, one for each min_sample (1-10). Examples below.)
#eps loop 0.05 to 0.5 (steps 0.01) min_samples=1
eps_range = [x / 100.0 for x in range(5,51,1)]
eps_scores_1 = []
for e in tqdm(eps_range):
dbscan = DBSCAN(eps=e, min_samples=1)
labels = dbscan.fit_predict(XX[['lon', 'lat', 'alt']])
eps_scores_1.append(metrics.silhouette_score(XX[['lon', 'lat', 'alt']],labels))
-
#eps loop 0.05 to 0.5 (steps 0.01) min_samples=2
eps_range = [x / 100.0 for x in range(5,51,1)]
eps_scores_2 = []
for e in tqdm(eps_range):
dbscan = DBSCAN(eps=e, min_samples=2)
labels = dbscan.fit_predict(XX[['lon', 'lat', 'alt']])
eps_scores_2.append(metrics.silhouette_score(XX[['lon', 'lat', 'alt']],labels))
What I observe, as far as differences:
Classmate did not include 'alt' in their for-loop.
Classmate attempted some kind of nested loop?
Classmate's range is 45, not sure that's right.
Classmate's my_list is not in the correct notation?
Classmate's max Silhouette Scores are much higher than mine.
(not shown) Classmate used 10,000 random samples, I used 30,000 random samples.
```
```The question asks for both minors and epsilon to be varied - it called for a nested loop. Your classmate used a single loop, and did not consider combinations. You did the outer loop by copy and paste.
Your classmate uses a very misleading way of managing the range, because he adds 0.05 respectively 1 later!
You cannot just mix latitude, longitude, and altitude. They have different units. In fact, you shouldn't even mix latitude and longitude because of distortion - use Haversine distance instead!
Silhouette assumes convex clusters, but DBSCAN does not generate convex clusters.
The sklearn implementation likely treats noise just like a cluster, which will usually give worse results. But Silhouette is not really meant to be used with noise labels...```

### How to properly sample truncated distributions?

```I am trying to learn how to sample truncated distributions. To begin with I decided to try a simple example I found here example
I didn't really understand the division by the CDF, therefore I decided to tweak the algorithm a bit. Being sampled is an exponential distribution for values x>0 Here is an example python code:
# Sample exponential distribution for the case x>0
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
def pdf(x):
return x*np.exp(-x)
xvec=np.zeros(1000000)
x=1.
for i in range(1000000):
a=x+np.random.normal()
xs=x
if a > 0. :
xs=a
A=pdf(xs)/pdf(x)
if np.random.uniform()<A :
x=xs
xvec[i]=x
x=np.linspace(0,15,1000)
plt.plot(x,pdf(x))
plt.hist([x for x in xvec if x != 0],bins=150,normed=True)
plt.show()
Ant the output is:
The code above seems to work fine only for when using the condition if a > 0. :, i.e. positive x, choosing another condition (e.g. if a > 0.5 :) produces wrong results.
Since my final goal was to sample a 2D-Gaussian - pdf on a truncated interval I tried extending the simple example using the exponential distribution (see the code below). Unfortunately, since the simple case didn't work, I assume that the code given below would yield wrong results.
I assume that all this can be done using the advanced tools of python. However, since my primary idea was to understand the principle behind, I would greatly appreciate your help to understand my mistake.
EDIT:
# code updated according to the answer of CrazyIvan
from scipy.stats import multivariate_normal
RANGE=100000
a=2.06072E-02
b=1.10011E+00
a_range=[0.001,0.5]
b_range=[0.01, 2.5]
cov=[[3.1313994E-05, 1.8013737E-03],[ 1.8013737E-03, 1.0421529E-01]]
x=a
y=b
j=0
for i in range(RANGE):
a_t,b_t=np.random.multivariate_normal([a,b],cov)
# accept if within bounds - all that is neded to truncate
if a_range<a_t and a_t<a_range and b_range<b_t and b_t<b_range:
print(dx,dy)
EDIT:
I changed the code by norming the analytic pdf according to this scheme, and according to the answers given by, #Crazy Ivan and #Leandro Caniglia , for the case where the bottom of the pdf is removed. That is dividing by (1-CDF(0.5)) since my accept condition is x>0.5. This seems again to show some discrepancies. Again the mystery prevails ..
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
def pdf(x):
return x*np.exp(-x)
# included the corresponding cdf
def cdf(x):
return 1. -np.exp(-x)-x*np.exp(-x)
xvec=np.zeros(1000000)
x=1.
for i in range(1000000):
a=x+np.random.normal()
xs=x
if a > 0.5 :
xs=a
A=pdf(xs)/pdf(x)
if np.random.uniform()<A :
x=xs
xvec[i]=x
x=np.linspace(0,15,1000)
# new part norm the analytic pdf to fix the area
plt.plot(x,pdf(x)/(1.-cdf(0.5)))
plt.hist([x for x in xvec if x != 0],bins=200,normed=True)
plt.savefig("test_exp.png")
plt.show()
It seems that this can be cured by choosing larger shift size
shift=15.
a=x+np.random.normal()*shift.
which is in general an issue of the Metropolis - Hastings. See the graph below:
I also checked shift=150
Bottom line is that changing the shift size definitely improves the convergence. The misery is why, since the Gaussian is unbounded.
```
```You say you want to learn the basic idea of sampling a truncated distribution, but your source is a blog post about
Metropolis–Hastings algorithm? Do you actually need this "method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult"? Taking this as your starting point is like learning English by reading Shakespeare.
Truncated normal
For truncated normal, basic rejection sampling is all you need: generate samples for original distribution, reject those outside of bounds. As Leandro Caniglia noted, you should not expect truncated distribution to have the same PDF except on a shorter interval — this is plain impossible because the area under the graph of a PDF is always 1. If you cut off stuff from sides, there has to be more in the middle; the PDF gets rescaled.
It's quite inefficient to gather samples one by one, when you need 100000. I would grab 100000 normal samples at once, accept only those that fit; then repeat until I have enough. Example of sampling truncated normal between amin and amax:
import numpy as np
n_samples = 100000
amin, amax = -1, 2
samples = np.zeros((0,)) # empty for now
while samples.shape < n_samples:
s = np.random.normal(0, 1, size=(n_samples,))
accepted = s[(s >= amin) & (s <= amax)]
samples = np.concatenate((samples, accepted), axis=0)
samples = samples[:n_samples] # we probably got more than needed, so discard extra ones
And here is the comparison with the PDF curve, rescaled by division by cdf(amax) - cdf(amin) as explained above.
from scipy.stats import norm
_ = plt.hist(samples, bins=50, density=True)
t = np.linspace(-2, 3, 500)
plt.plot(t, norm.pdf(t)/(norm.cdf(amax) - norm.cdf(amin)), 'r')
plt.show()
Truncated multivariate normal
Now we want to keep the first coordinate between amin and amax, and the second between bmin and bmax. Same story, except there will be a 2-column array and the comparison with bounds is done in a relatively sneaky way:
(np.min(s - [amin, bmin], axis=1) >= 0) & (np.max(s - [amax, bmax], axis=1) <= 0)
This means: subtract amin, bmin from each row and keep only the rows where both results are nonnegative (meaning we had a >= amin and b >= bmin). Also do a similar thing with amax, bmax. Accept only the rows that meet both criteria.
n_samples = 10
amin, amax = -1, 2
bmin, bmax = 0.2, 2.4
mean = [0.3, 0.5]
cov = [[2, 1.1], [1.1, 2]]
samples = np.zeros((0, 2)) # 2 columns now
while samples.shape < n_samples:
s = np.random.multivariate_normal(mean, cov, size=(n_samples,))
accepted = s[(np.min(s - [amin, bmin], axis=1) >= 0) & (np.max(s - [amax, bmax], axis=1) <= 0)]
samples = np.concatenate((samples, accepted), axis=0)
samples = samples[:n_samples, :]
Not going to plot, but here are some values: naturally, within bounds.
array([[ 0.43150033, 1.55775629],
[ 0.62339265, 1.63506963],
[-0.6723598 , 1.58053835],
[-0.53347361, 0.53513105],
[ 1.70524439, 2.08226558],
[ 0.37474842, 0.2512812 ],
[-0.40986396, 0.58783193],
[ 0.65967087, 0.59755193],
[ 0.33383214, 2.37651975],
[ 1.7513789 , 1.24469918]])
```
```To compute the truncated density function pdf_t from the entire density function pdf, do the following:
Let [a, b] be the truncation interval; (x axis)
Let A := cdf(a) and B := cdf(b); (cdf = non-truncated cumulative distribution function)
Then pdf_t(x) := pdf(x) / (B - A) if x in [a, b] and 0 elsewhere.
In cases where a = -infinity (resp. b = +infinity), take A := 0 (resp. B := 1).
As for the "mistery" you see
please note that your blue curve is wrong. It is not the pdf of your truncated distribution, it is just the pdf of the non-truncated one, scaled by the correct amount (division by 1-cdf(0.5)). The actual truncated pdf curve starts with a vertical line on x = 0.5 which goes up until it reaches your current blue curve. In other words, you only scaled the curve but forgot to truncate it, in this case to the left. Such a truncation corresponds to the "0 elsewhere" part of step 3 in the algorithm above.```

### statistics for histogram of periodic data

```For a series of angle values in (-pi, pi) range, I make a histogram. Is there an effective way to calculate a mean and modal (post probable) value? Consider following examples:
import numpy as N, cmath
deg = N.pi/180.
d = N.array([-175., 170, 175, 179, -179])*deg
i = N.sum(N.exp(1j*d))
ave = cmath.phase(i)
i /= float(d.size)
stdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))
print ave/deg, stdev/deg
Now, let's have a histogram:
counts, bins = N.histogram(data, N.linspace(-N.pi, N.pi, 360))
Is it possible to calculate mean, mode having counts and bins? For non-periodic data, calculation of a mean is straightforward:
ave = sum(counts*bins[:-1])
Calculations of a modal value requires more effort. Actually, I'm not sure my code below is correct: firstly, I identify bins which occur most frequently and then I calculate an arithmetic mean:
cmax = bins[N.argmax(counts)]
mode = N.mean(N.take(bins, N.nonzero(counts == cmax)))
I have no idea, how to calculate standard deviation from such data, though. One obvious solution to all my problems (at least those described above) is to convert histogram data to a data series and then use it in calculations. This is not elegant, however, and inefficient.
Any hints will be very appreciated.
This is the partial solution I wrote.
import numpy as N, cmath
import scipy.stats as ST
d = [-175, 170.2, 175.57, 179, -179, 170.2, 175.57, 170.2]
deg = N.pi/180.
data = N.array(d)*deg
i = N.sum(N.exp(1j*data))
ave = cmath.phase(i) # correct and exact mean for periodic data
wrong_ave = N.mean(d)
i /= float(data.size)
stdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))
wrong_stdev = N.std(d)
bins = N.linspace(-N.pi, N.pi, 360)
counts, bins = N.histogram(data, bins, normed=False)
# consider it weighted vector addition
nz = N.nonzero(counts)
weight = counts[nz]
i = N.sum(weight * N.exp(1j*bins[nz])/len(nz))
pave = cmath.phase(i) # correct and approximated mean for periodic data
i /= sum(weight)/float(len(nz))
pstdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))
print
print 'scipy: %12.3f (mean) %12.3f (stdev)' % (ST.circmean(data)/deg, \
ST.circstd(data)/deg)
When run, it gives following results:
mean: 175.840 85.843 175.360
stdev: 0.472 151.785 0.430
scipy: 175.840 (mean) 3.673 (stdev)
A few comments now: the first column gives mean/stdev calculated. As can be seen, the mean agrees well with scipy.stats.circmean (thanks JoeKington for pointing it out). Unfortunately stdev differs. I will look at it later. The second column gives completely wrong results (non-periodic mean/std from numpy obviously does not work here). The 3rd column gives sth I wanted to obtain from the histogram data (#JoeKington: my raw data won't fit memory of my computer.., #dmytro: thanks for your input: of course, bin size will influence result but in my application I don't have much choice, i.e. I have to reduce data somehow). As can be seen, the mean (3rd column) is properly calculated, stdev needs further attention :)
```
```Have a look at scipy.stats.circmean and scipy.stats.circstd.
Or do you only have the histogram counts, and not the "raw" data? If so, you could fit a Von Mises distribution to your histogram counts and approximate the mean and stddev in that way.
```
```Here's how to get an approximation.
Since Var(x) = <x^2> - <x>^2, we have:
meanX = N.sum(counts * bins[:-1]) / N.sum(counts)
meanX2 = N.sum(counts * bins[:-1]**2) / N.sum(counts)
std = N.sqrt(meanX2 - meanX**2)```