给定两个 2d 点列表，如何为第一个列表中的每个点在第二个列表中找到最近的点?

本文介绍了给定两个 2d 点列表，如何为第一个列表中的每个点在第二个列表中找到最近的点?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有两个随机排序的 2d 点的大型 numpy 数组，假设它们是 A 和 B.我需要做的是找到两个数组之间匹配"的数量，其中匹配是 A 中的一个点(称为 A')在某个给定半径 R 内，在 B 中有一个点(称为 B').这意味着 A 中的每个点都必须与 B 中的 1 个点或没有点匹配.返回两个数组之间匹配项的列表索引也很好，但这不是必需的.由于这个半径 R 内可以有很多点，所以最好在 B 中找到最接近 A' 的点，然后检查它是否在半径 R 内. 这个简单地用距离公式 dx^ 进行测试2 + dy^2.显然有循环遍历两个数组的蛮力 O(n^2) 解决方案，但我需要更快的东西，希望 O(n log n).

I have two large numpy arrays of randomly sorted 2d points, let's say they're A and B. What I need to do is find the number of "matches" between the two arrays, where a match is a point in A (call it A') being within some given radius R with a point in B (call it B'). This means that every point in A must match with either 1 or no points in B. It would also be nice to return the list indices of the matches between the two arrays, however this isn't necessary. Since there can be many points in this radius R, it seems better to find the point which is closest to A' in B, and then checking if it's within the radius R. This is tested simply with the distance formula dx^2 + dy^2. Obviously there's the brute force O(n^2) solution of looping through both arrays, but I need something faster, hopefully O(n log n).

我所看到的是，Voronoi 图可用于解决此类问题，但我不确定如何实现.我不熟悉 Voronoi 图，所以我用 scipy.spatial.Voronoi 生成它.使用这些图是否有解决此问题的快速算法或其他算法?

What I've seen is that a Voronoi diagram can be used for a problem like this, however I'm not sure how this would be implemented. I'm unfamiliar with Voronoi diagrams, so I'm generating it with scipy.spatial.Voronoi. Is there a fast algorithm for this problem by using these diagrams or is there another?

推荐答案

我认为有几种选择.我开始了一个小的比较测试来探索一些.前几个只是找出彼此半径内相互有多少点，以确保我在问题的主要部分得到一致的结果.它没有回答您关于找到最接近的问题的邮件，我认为这只是其中一些的更多工作 - 为最后一个选项做了它，请参阅帖子底部.问题的驱动因素是进行所有比较，我认为您可以通过某种排序(这里是最后一个概念)来限制比较.

I think there are several options. I ginned up a small comparison test to explore a few. The first couple of these only go as far as finding how many points are mutually within radius of each other to make sure I was getting consistent results on the main part of the problem. It does not answer the mail on the part of your problem about finding the closest, which I think would be just a bit more work on a few of them--did it for the last option, see bottom of post. The driver of the problem is doing all of the comparisons, and I think you can make some hay by some sorting (last notion here) to limit comparisons.

使用蛮力点对点比较.显然是 O(n^2).

Use brute force point-to-point comparison. Clearly O(n^2).

效果很好&小"数据最快.对于大数据，由于内存中矩阵输出的大小，这开始爆炸.对于 1M x 1M 的应用程序可能不可行.

Works great & fastest for "small" data. With large data, this starts to blow up because of size of matrix output in memory. Probably infeasible for 1M x 1M application.

来自其他解决方案.快，但不如 cdist 或分段"(如下)快.也许有一种不同的方式来使用 KDTree 来完成这项任务......我对它不是很有经验.这种方法(如下)似乎合乎逻辑.

From other solution. Fast, but not as fast as cdist or "sectioning" (below). Perhaps there is a different way to employ KDTree for this task... I'm not very experienced with it. This approach (below) seemed logical.

这很有效，因为您对所有的距离不感兴趣，您只想要半径内的距离.因此，通过对目标数组进行排序并仅在其周围的矩形窗口内查找竞争者"，您可以获得非常快的性能，使用本机 python 并且没有内存爆炸".可能仍然有点留在桌子上"以进行增强，可能是通过在此实现中嵌入 cdist 或 (gulp) 尝试对其进行多线程处理.

This works very well because you are not interested in all of the distances, you just want ones that are within a radius. So, by sorting the target array and only looking within a rectangular window around it for "contenders" you can get very fast performance w/ native python and no "memory explosion." Probably still a bit "left on the table" here for enhancement maybe by embedding cdist within this implementation or (gulp) trying to multithread it.

这是一个紧密的数学"循环，因此在 cython 中尝试某些东西或拆分其中一个数组并进行多线程处理将是新颖的.并且对结果进行酸洗以便您不必运行它通常看起来很谨慎.

This is a tight "mathy" loop so trying something in cython or splitting up one of the arrays and multi-threading it would be novel. And pickling the result so you don't have to run this often seems prudent.

我认为你可以很容易地用数组中的索引来扩充元组以获得匹配列表.

I think any of these you could augment the tuples with the index within the array pretty easily to get a list of the matches.

我的旧 iMac 通过切片在 90 秒内完成了 100K x 100K，所以这对于 1M x 1M 来说并不是一个好兆头

My older iMac does 100K x 100K in 90 seconds via sectioning, so that does not bode well for 1M x 1M

对比:

# distance checker

from random import uniform
import time
import numpy as np
from scipy.spatial import distance, KDTree
from bisect import bisect
from operator import itemgetter
import sys
from matplotlib import pyplot as plt
sizes = [100, 500, 1000, 2000, 5000, 10000, 20000]
#sizes = [20_000, 30_000, 40_000, 50_000, 60_000]   # for the playoffs.  :)
naive_times = []
cdist_times = []
kdtree_times = []
sectioned_times = []
delta = 0.1

for size in sizes:
    print(f'\n *** running test with vectors of size {size} ***')
    r = 20  # radius to match
    r_squared = r**2

    A = [(uniform(-1000,1000), uniform(-1000,1000)) for t in range(size)]
    B = [(uniform(-1000,1000), uniform(-1000,1000)) for t in range(size)]

    # naive python
    print('naive python')
    tic = time.time()
    matches = [(p1, p2) for p1 in A
                        for p2 in B
                        if (p1[0] - p2[0])**2 + (p1[1] - p2[1])**2 <= r_squared]

    toc = time.time()
    print(f'found: {len(matches)}')
    naive_times.append(toc-tic)
    print(toc-tic)
    print()

    # using cdist module
    print('cdist')
    tic = time.time()
    dist_matrix = distance.cdist(A, B, 'euclidean')
    result = np.count_nonzero(dist_matrix<=r)
    toc = time.time()
    print(f'found: {result}')
    cdist_times.append(toc-tic)
    print(toc-tic)
    print()

    # KDTree
    print('KDTree')
    tic = time.time()
    my_tree = KDTree(A)
    results = my_tree.query_ball_point(B, r=r)
    # for count, r in enumerate(results):
    #   for t in r:
    #       print(count, A[t])

    result = sum(len(lis) for lis in results)
    toc = time.time()
    print(f'found: {result}')
    kdtree_times.append(toc-tic)
    print(toc-tic)
    print()

    # python with sort and sectioning
    print('with sort and sectioning')
    result = 0
    tic = time.time()
    B.sort()
    for point in A:
        # gather the neighborhood in x-dimension within x-r <= x <= x+r+1
        # if this has any merit, we could "do it again" for y-coord....
        contenders = B[bisect(B,(point[0]-r-delta, 0)) : bisect(B,(point[0]+r+delta, 0))]
        # further chop down to the y-neighborhood
        # flip the coordinate to support bisection by y-value
        contenders = list(map(lambda p: (p[1], p[0]), contenders))
        contenders.sort()
        contenders = contenders[bisect(contenders,(point[1]-r-delta, 0)) : 
                                bisect(contenders,(point[1]+r+delta, 0))]
        # note (x, y) in contenders is still inverted, so need to index properly
        matches = [(point, p2) for p2 in contenders if (point[0] - p2[1])**2 + (point[1] - p2[0])**2 <= r_squared]
        result += len(matches)
    toc = time.time()
    print(f'found: {result}')
    sectioned_times.append(toc-tic)
    print(toc-tic)
print('complete.')

plt.plot(sizes, naive_times, label = 'naive')
plt.plot(sizes, cdist_times, label = 'cdist')
plt.plot(sizes, kdtree_times, label = 'kdtree')
plt.plot(sizes, sectioned_times, label = 'sectioning')
plt.legend()
plt.show()

尺寸和地块之一的结果:

Results for one of the sizes and plots:

 *** running test with vectors of size 20000 ***
naive python
found: 124425
101.40657806396484

cdist
found: 124425
2.9293079376220703

KDTree
found: 124425
18.166933059692383

with sort and sectioning
found: 124425
2.3414530754089355
complete.

注意:在第一个图中，cdist 覆盖了 sectioning.季后赛显示在第二个图中.

Note: In first plot, cdist overlays the sectioning. Playoffs are shown in second plot.

此代码在半径内的点内找到最小值.运行时相当于上面的分段代码.

This code finds the minimum within the points within radius. Runtime is equivalent to the sectioning code above.

print('with sort and sectioning, and min finding')
result = 0
pairings = {}  
tic = time.time()
B.sort()
def dist_squared(a, b): 
    # note (x, y) in point b will be inverted (below), so need to index properly
    return (a[0] - b[1])**2 + (a[1] - b[0])**2
for idx, point in enumerate(A):
    # gather the neighborhood in x-dimension within x-r <= x <= x+r+1
    # if this has any merit, we could "do it again" for y-coord....
    contenders = B[bisect(B,(point[0]-r-delta, 0)) : bisect(B,(point[0]+r+delta, 0))]
    # further chop down to the y-neighborhood
    # flip the coordinate to support bisection by y-value
    contenders = list(map(lambda p: (p[1], p[0]), contenders))
    contenders.sort()
    contenders = contenders[bisect(contenders,(point[1]-r-delta, 0)) : 
                            bisect(contenders,(point[1]+r+delta, 0))]
    matches = [(dist_squared(point, p2), point, p2) for p2 in contenders 
        if dist_squared(point, p2) <= r_squared]
    if matches:
        pairings[idx] = min(matches)[1]  # pair the closest point in B with the point in A
toc = time.time()
print(toc-tic)

这篇关于给定两个 2d 点列表，如何为第一个列表中的每个点在第二个列表中找到最近的点?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！