How to subsample an array according to its density? (Remove frequent values, keep rare ones)

Question

I want to plot a data distribution in which some values occur frequently while others are quite rare. There are around 30,000 points in total. Rendering such a plot as PNG or (god forbid) PDF takes forever, and the PDF is much too large to display.

So I want to subsample the data just for the plots. What I would like to achieve is to remove many of the points where they overlap (where the density is high), but keep the ones where the density is low with a probability of almost 1.

Now, numpy.random.choice allows one to specify a vector of probabilities, which I've computed from the data histogram with a few tweaks. But I can't seem to tune the choice so that the rare points are really kept.
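One way to read the requirement above is as independent per-point thinning instead of a fixed-size draw. The sketch below is not the asker's code (the histogram tweaks mentioned are not shown in the question). It assumes a one-dimensional array of values, and the names thin_by_density, n_bins and max_per_bin are made up for illustration. Each point is kept with probability min(1, max_per_bin / count in its bin), so points in sparse bins are always kept.

```python
import numpy as np

def thin_by_density(values, n_bins=100, max_per_bin=5, rng=None):
    """Return a boolean mask keeping about max_per_bin points per bin."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Bin index of each value in 0..n_bins-1; the maximum goes into the last bin.
    idx = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    # Keep probability: 1 in sparse bins, max_per_bin / count in dense bins.
    p_keep = np.minimum(1.0, max_per_bin / counts[idx])
    return rng.random(len(values)) < p_keep

# Hypothetical data: a dense bulk plus two isolated points far out in the tail.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.0, 1.0, 30000), [50.0, 60.0]])
mask = thin_by_density(values, rng=rng)
print(mask.sum(), "of", len(values), "points kept")
print(mask[-2:].all())  # True: the two isolated points have keep probability 1
```

Unlike a draw of fixed size, the number of surviving points here is random, and it is controlled only indirectly through n_bins and max_per_bin.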

I've attached an image of the data; the right tail of the distribution has orders of magnitude fewer points, so I'd like to keep those. The data is 3D, but the density comes from only one dimension, so I can use that as a measure of how many points are in a given location.

Recommended answer

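Consider the following function. It bins the data in equal bins along the axis and

if there are one or two points in a bin, keeps those points,
if there are more points in a bin, keeps only the points with the minimum and maximum y value,
prepends the first and appends the last point, so that the same data range is used.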

This keeps the original data in regions of low density but significantly reduces the amount of data to plot in regions of high density. At the same time, all features are preserved as long as the binning is sufficiently dense.
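Why keeping the minimum and maximum of each bin preserves features can be seen in a small standalone sketch. It is a simplification that assumes evenly spaced samples and fixed-size blocks of 10 samples instead of bins along x, and the signal with a single spike is made up for illustration.

```python
import numpy as np

y = np.zeros(1000)
y[503] = 10.0                  # a single narrow spike

strided = y[::10]              # naive subsampling: keep every 10th sample
blocks = y.reshape(100, 10)    # 100 blocks of 10 consecutive samples
# Keep the minimum and maximum of every block (2 values per block).
envelope = np.column_stack([blocks.min(axis=1), blocks.max(axis=1)]).ravel()

print(strided.max())     # 0.0, the spike is lost
print(envelope.max())    # 10.0, the spike survives
```

Both reduced signals are much shorter than the original, but only the min/max version is guaranteed to contain the extreme values of every block.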

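import numpy as np

np.random.seed(42)  # seed for the random example data generated further below

def filt(x, y, bins):
    """Thin (x, y) for plotting: keep sparse bins whole, dense bins as min/max of y."""
    x, y = np.asarray(x), np.asarray(y)
    d = np.digitize(x, bins)  # index of the bin each x value falls into
    xfilt = []
    yfilt = []
    for i in np.unique(d):  # only visits bins that contain data
        xi = x[d == i]
        yi = y[d == i]
        if len(xi) <= 2:
            # sparse bin: keep every point
            xfilt.extend(list(xi))
            yfilt.extend(list(yi))
        else:
            # dense bin: keep only the points with the largest and smallest y
            xfilt.extend([xi[np.argmax(yi)], xi[np.argmin(yi)]])
            yfilt.extend([yi.max(), yi.min()])
    # prepend/append first/last point if necessary
    if x[0] != xfilt[0]:
        xfilt = [x[0]] + xfilt
        yfilt = [y[0]] + yfilt
    if x[-1] != xfilt[-1]:
        xfilt.append(x[-1])
        yfilt.append(y[-1])
    sort = np.argsort(xfilt)  # restore ascending x order
    return np.array(xfilt)[sort], np.array(yfilt)[sort]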

To illustrate the concept, let's use some toy data:

x = np.array([1,2,3,4, 6,7,8,9, 11,14, 17, 26,28,29])
y = np.array([4,2,5,3, 7,3,5,5, 2, 4,  5,  2,5,3])
bins = np.linspace(0,30,7)

Then calling xf, yf = filt(x,y,bins) and plotting both the original data and the filtered data shows the effect of the filter.
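The effect on the toy data can also be checked without a plot. The standalone sketch below repeats the toy x values and bin edges from above and only inspects the binning step; the variable names labels, counts and kept are chosen for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 6, 7, 8, 9, 11, 14, 17, 26, 28, 29])
bins = np.linspace(0, 30, 7)   # bin edges 0, 5, 10, 15, 20, 25, 30

d = np.digitize(x, bins)       # bin index of every point
print(d)                       # [1 1 1 1 2 2 2 2 3 3 4 6 6 6]

labels, counts = np.unique(d, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))  # {1: 4, 2: 4, 3: 2, 4: 1, 6: 3}

kept = np.minimum(counts, 2)   # at most two points survive per bin
print(kept.sum())              # 9
```

Bins 1, 2 and 6 are reduced to their minimum and maximum, while bins 3 and 4 are kept whole. In this toy data neither the first point (x=1) nor the last point (x=29) is an extreme of its bin, so filt adds both back and returns 11 of the 14 points.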

The use case from the question, with some 30,000 data points, is shown below. Using the presented technique reduces the number of plotted points from 30,000 to about 500. This number of course depends on the binning in use, here 300 bins. In this case the function takes ~10 ms to compute. This is not super fast, but still a large improvement compared to plotting all the points.

import matplotlib.pyplot as plt

# Generate some data
x = np.sort(np.random.rayleigh(3, size=30000))
y = np.cumsum(np.random.randn(len(x)))+250
# Decide for a number of bins
bins = np.linspace(x.min(),x.max(),301)
# Filter data
xf, yf = filt(x,y,bins) 

# Plot results
fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, figsize=(7,8), 
                                    gridspec_kw=dict(height_ratios=[1,2,2]))

ax1.hist(x, bins=bins)
ax1.set_yscale("log")
ax1.set_yticks([1,10,100,1000])

ax2.plot(x,y, linewidth=1, label="original data, {} points".format(len(x)))

ax3.plot(xf, yf, linewidth=1, label="binned min/max, {} points".format(len(xf)))

for ax in [ax2, ax3]:
    ax.legend()
plt.show()
