Fitting a lognormal distribution to already-binned data in Python

Problem Description


I would like to make a lognormal fit to my already binned data. The bar plot looks like this:

Unfortunately, when I try to use the standard lognorm.pdf() the shape of the fitted distribution is very different. I guess it's because my data is already binned. Here's the code:

times, data, bin_points = ReadHistogramFile(filename)

xmin = 200
xmax = 800
x = np.linspace(xmin, xmax, 1000)
shape, loc, scale = stats.lognorm.fit(data, floc=0)
pdf = stats.lognorm.pdf(x, shape, loc=loc, scale=scale)

area=data.sum()
plt.bar(bin_points, data, width=10, color='b')
plt.plot(x*area, pdf, 'k' )

Here's what the fitted distribution looks like:

Obviously there's something wrong with the scaling as well, but I'm less concerned about that. My main issue is the shape of the distribution. This might be a duplicate of this question, but I could not find a correct solution there; I tried it and still got a very similar shape as above. Thanks for any help!

Update: By using curve_fit() I was able to get somewhat of a fit, but I'm not satisfied yet. I'd like to use the original bins rather than unity bins. I'm also not sure what exactly is happening, or whether there isn't a better fit. Here's the code:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def normalize_integral(data, bin_size):
    """Scale counts so that the histogram integrates to 1."""
    normalized_data = np.zeros(len(data))
    integral = bin_size * data.sum()
    for i in range(len(data)):  # original range(0, size(data)-1) skipped the last bin
        normalized_data[i] = data[i] / integral
    print('integral:', normalized_data.sum() * bin_size)
    return normalized_data



def pdf(x, mu, sigma):
    """pdf of the lognormal distribution"""
    return (np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))
            / (x * sigma * np.sqrt(2 * np.pi)))


data = np.array([9.782e3, 1.1512e4, 1.18e4, 1.7962e4, 2.7698e4, 2.7826e4,
                 3.3546e4, 3.2426e4, 3.165e4, 3.3082e4, 4.8456e4, 5.865e4,
                 6.3422e4, 5.1188e4, 5.1318e4, 4.7432e4, 4.3542e4, 4.134e4,
                 3.6088e4, 2.969e4, 2.6664e4, 2.5872e4, 2.5756e4, 2.2096e4,
                 1.4688e4, 9.972e3, 5.742e3, 3.52e3, 2.746e3, 2.618e3,
                 1.5e3, 7.96e2, 5.4e2, 2.98e2, 2.9e2, 2.22e2, 2.26e2,
                 1.88e2, 1.2e2, 5.0e1, 5.4e1, 5.8e1, 5.2e1, 2.0e1,
                 2.8e1, 6.0e0] + [0.0] * 28)  # 28 trailing empty bins
bin_points = np.linspace(280.5, 1099.55994, len(data))
normalized_data_unitybins = normalize_integral(data, 1)


plt.figure(figsize=(9, 4))
ax1 = plt.subplot(121)
ax2 = plt.subplot(122)

# unity_bins was not defined in the original post; unit-width bin indices
# (starting at 1, so log(x) is defined) match the description
unity_bins = np.arange(1, len(data) + 1)
ax2.bar(unity_bins, normalized_data_unitybins, width=1, color='b')
fitParams, fitCov = curve_fit(pdf, unity_bins, normalized_data_unitybins, p0=[1, 1], maxfev=1000000)
fitData = pdf(unity_bins, *fitParams)
ax2.plot(unity_bins, fitData, 'g-')

ax1.bar(bin_points, normalized_data_unitybins, width=10, color='b')
fitParams, fitCov = curve_fit(pdf, bin_points, normalized_data_unitybins, p0=[1, 1], maxfev=1000000)
fitData = pdf(bin_points, *fitParams)
ax1.plot(bin_points, fitData, 'g-')
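For the curve_fit approach with the original bins, the key point is to normalize the histogram by the *actual* bin width so the bar areas integrate to 1; then pdf can be fitted directly against the bin centers. A minimal sketch with idealized synthetic counts (the bin edges, sample size, and true parameters here are made up for illustration, not taken from the question's data):

```python
import numpy as np
from scipy.optimize import curve_fit

def pdf(x, mu, sigma):
    """pdf of the lognormal distribution"""
    return (np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))
            / (x * sigma * np.sqrt(2 * np.pi)))

# idealized histogram counts from a known lognormal (mu=log(100), sigma=0.4)
edges = np.linspace(1, 400, 51)
centers = (edges[1:] + edges[:-1]) / 2
bin_size = edges[1] - edges[0]
counts = 2000 * bin_size * pdf(centers, np.log(100), 0.4)

# normalize by the actual bin width so the bar areas sum to 1
normalized = counts / (counts.sum() * bin_size)

params, _ = curve_fit(pdf, centers, normalized, p0=[5, 0.5])
mu_hat, sigma_hat = params
```

With the width-aware normalization the fitted parameters recover the generating ones, which is exactly what the unity-bin workaround was approximating.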
Solution

As you mention, you cannot use lognorm.fit on the binned data. So all you need to do is to restore the raw data from the histogram. Obviously this is not 'lossless'; the more bins, the better.

Sample code with some generated data:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt


# generate some data
ln = stats.lognorm(0.4, scale=100)
data = ln.rvs(size=2000)

counts, bins, _ = plt.hist(data, bins=50)
# note that the len of bins is 51, since it contains upper and lower limit of every bin

# restore data from histogram: counts multiplied bin centers
restored = [[d]*int(counts[n]) for n,d in enumerate((bins[1:]+bins[:-1])/2)]
# flatten the result
restored = [item for sublist in restored for item in sublist]

print(stats.lognorm.fit(restored, floc=0))

dist = stats.lognorm(*stats.lognorm.fit(restored, floc=0))
x = np.arange(1,400)
y = dist.pdf(x)

# the pdf is normalized, so we need to scale it to match the histogram
y = y/y.max()
y = y*counts.max()

plt.plot(x,y,'r',linewidth=2)
plt.show()
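Rather than reconstructing pseudo-samples, the binned data can also be fitted directly by maximizing the likelihood of the bin counts, which avoids the information loss of the restore step. A sketch under the same generated-data setup (the starting values and the Nelder-Mead choice are arbitrary):

```python
import numpy as np
from scipy import stats, optimize

# synthetic binned data from a known lognormal (shape 0.4, scale 100)
rng = np.random.default_rng(0)
sample = stats.lognorm(0.4, scale=100).rvs(size=2000, random_state=rng)
counts, edges = np.histogram(sample, bins=50)

def neg_log_likelihood(params):
    shape, scale = params
    # probability mass of each bin under the candidate distribution
    p = np.diff(stats.lognorm.cdf(edges, shape, loc=0, scale=scale))
    p = np.clip(p, 1e-12, None)  # guard empty bins against log(0)
    return -np.sum(counts * np.log(p))

result = optimize.minimize(neg_log_likelihood, x0=[0.5, 80],
                           method='Nelder-Mead')
shape_hat, scale_hat = result.x
```

This treats the counts as a multinomial draw over the bins, so it uses exactly the information the histogram contains, no matter how coarse the binning is.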
