How to determine character similarity?

Problem description

I am using the Levenshtein distance to find similar strings after OCR. However, for some strings the edit distance is the same even though their visual appearance is obviously different.

For example, the string Co will return these matches (edit distance in parentheses):

CY (1)
CZ (1)
Ca (1)

Considering that Co is the result from an OCR engine, Ca would be a more likely match than the other two. Therefore, after calculating the Levenshtein distance, I'd like to refine the query result by ordering the matches by visual similarity. To calculate this similarity, I'd like to use a standard sans-serif font, like Arial.
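To make the tie concrete, here is a minimal sketch of the unit-cost Levenshtein distance applied to the example above. The function name `levenshtein` is illustrative only and is not taken from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insert, delete, and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


candidates = ["CY", "CZ", "Ca"]
distances = {c: levenshtein("Co", c) for c in candidates}
print(distances)  # every candidate is one substitution away from "Co"
```

Because every substitution costs the same, the distance alone cannot prefer Ca over CY or CZ.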

Is there a library I can use for this purpose, or how could I implement this myself? Alternatively, are there any string similarity algorithms more accurate than the Levenshtein distance that I could use in addition?

Recommended answer

If you're looking for a table that will allow you to calculate a "replacement cost" of sorts based on visual similarity, I've been searching for such a thing for a while with little success, so I started looking at it as a new problem. I'm not working with OCR, but I am looking for a way to limit the search parameters in a probabilistic search for mistyped characters. Since they are mistyped because a human has confused the characters visually, the same principle should apply to you.
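One way such a replacement cost could be plugged into the edit distance is sketched below. This is an assumption about how the idea might be used, not code from the answer: `weighted_levenshtein`, `toy_cost`, and the values in `TOY_COSTS` are hypothetical and chosen only for illustration.

```python
from typing import Callable


def weighted_levenshtein(a: str, b: str,
                         sub_cost: Callable[[str, str], float]) -> float:
    """Edit distance where substituting x by y costs sub_cost(x, y).

    Insertions and deletions keep a fixed cost of 1.0.
    """
    previous = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [float(i)]
        for j, cb in enumerate(b, start=1):
            substitution = 0.0 if ca == cb else sub_cost(ca, cb)
            current.append(min(
                previous[j] + 1.0,              # delete ca
                current[j - 1] + 1.0,           # insert cb
                previous[j - 1] + substitution  # substitute ca by cb
            ))
        previous = current
    return previous[-1]


# Hypothetical costs: visually close pairs are cheaper than the default 1.0.
TOY_COSTS = {frozenset("oa"): 0.2}


def toy_cost(x: str, y: str) -> float:
    return TOY_COSTS.get(frozenset((x, y)), 1.0)


for candidate in ["CY", "CZ", "Ca"]:
    print(candidate, weighted_levenshtein("Co", candidate, toy_cost))
```

With these toy values, Ca scores lower than CY and CZ, so the ordering reflects the assumed visual similarity instead of a three-way tie.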

My approach was to categorize letters based on their stroke components in an 8-bit field. The bits are, from left to right:

7: Left Vertical
6: Center Vertical
5: Right Vertical
4: Top Horizontal
3: Middle Horizontal
2: Bottom Horizontal
1: Top-left to bottom-right stroke
0: Bottom-left to top-right stroke

For lower-case characters, descenders on the left are recorded in bit 1 and descenders on the right in bit 0, as diagonals.
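The bit layout above can be written down as a flag type. The sketch below is one possible encoding; the class name `Stroke` and its member names are invented here for illustration, and only the bit positions come from the list above.

```python
from enum import IntFlag


class Stroke(IntFlag):
    """Stroke components, one per bit, bit 7 being the leftmost."""
    LEFT_VERTICAL = 1 << 7
    CENTER_VERTICAL = 1 << 6
    RIGHT_VERTICAL = 1 << 5
    TOP_HORIZONTAL = 1 << 4
    MIDDLE_HORIZONTAL = 1 << 3
    BOTTOM_HORIZONTAL = 1 << 2
    DIAGONAL_DOWN = 1 << 1  # top-left to bottom-right (or a left descender)
    DIAGONAL_UP = 1 << 0    # bottom-left to top-right (or a right descender)


# "E": a left vertical plus top, middle, and bottom horizontals.
E = (Stroke.LEFT_VERTICAL | Stroke.TOP_HORIZONTAL
     | Stroke.MIDDLE_HORIZONTAL | Stroke.BOTTOM_HORIZONTAL)

# "T": a center vertical plus a top horizontal.
T = Stroke.CENTER_VERTICAL | Stroke.TOP_HORIZONTAL

for name, code in [("E", E), ("T", T)]:
    print(f"{name}: {int(code):08b}: {int(code):02X}")
```

Composing a letter from named strokes makes it easier to check or rework individual entries than editing raw binary strings.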

With that scheme, I came up with the following values, which attempt to rank the characters according to visual similarity.

m:               11110000: F0
g:               10111101: BD
S,B,G,a,e,s:     10111100: BC
R,p:             10111010: BA
q:               10111001: B9
P:               10111000: B8
Q:               10110110: B6
D,O,o:           10110100: B4
n:               10110000: B0
b,h,d:           10101100: AC
H:               10101000: A8
U,u:             10100100: A4
M,W,w:           10100011: A3
N:               10100010: A2
E:               10011100: 9C
F,f:             10011000: 98
C,c:             10010100: 94
r:               10010000: 90
L:               10000100: 84
K,k:             10000011: 83
T:               01010000: 50
t:               01001000: 48
J,j:             01000100: 44
Y:               01000011: 43
I,l,i:           01000000: 40
Z,z:             00010101: 15
A:               00001011: 0B
y:               00000101: 05
V,v,X,x:         00000011: 03
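As a sketch of how the table might be used to re-rank the OCR candidates from the question, the code below compares two characters by counting the stroke bits in which they differ (a Hamming distance). That comparison is an assumption made here, not something the answer prescribes, and the names `STROKE_TABLE`, `stroke_distance`, and `word_distance` are hypothetical. It only handles the 52 letters in the table and candidates of equal length.

```python
# Values copied from the table above: (characters, stroke code).
STROKE_TABLE = [
    ("m", 0xF0), ("g", 0xBD), ("SBGaes", 0xBC), ("Rp", 0xBA), ("q", 0xB9),
    ("P", 0xB8), ("Q", 0xB6), ("DOo", 0xB4), ("n", 0xB0), ("bhd", 0xAC),
    ("H", 0xA8), ("Uu", 0xA4), ("MWw", 0xA3), ("N", 0xA2), ("E", 0x9C),
    ("Ff", 0x98), ("Cc", 0x94), ("r", 0x90), ("L", 0x84), ("Kk", 0x83),
    ("T", 0x50), ("t", 0x48), ("Jj", 0x44), ("Y", 0x43), ("Ili", 0x40),
    ("Zz", 0x15), ("A", 0x0B), ("y", 0x05), ("VvXx", 0x03),
]
STROKE_CODES = {ch: code for chars, code in STROKE_TABLE for ch in chars}


def stroke_distance(x: str, y: str) -> int:
    """Number of stroke bits in which two characters differ."""
    return bin(STROKE_CODES[x] ^ STROKE_CODES[y]).count("1")


def word_distance(a: str, b: str) -> int:
    """Sum of per-position stroke distances; assumes len(a) == len(b)."""
    return sum(stroke_distance(x, y) for x, y in zip(a, b))


ranked = sorted(["CY", "CZ", "Ca"], key=lambda c: word_distance("Co", c))
print(ranked)
```

For the query Co, this puts Ca first: o and a differ only in the middle-horizontal bit, while o and Y differ in seven bits.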

As it stands, this is too primitive for my purposes and requires more work. You may be able to use it, however, or adapt it to suit your purposes. The scheme is fairly simple. This ranking is for a monospace font. If you are using a sans-serif font, you will likely have to rework the values.

This table is a hybrid that includes all characters, lower- and upper-case. Splitting it into an upper-case-only table and a lower-case-only table might prove more effective, and would also make it possible to apply specific casing penalties.
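A minimal sketch of that split, assuming a fixed penalty whenever the two characters differ in case. The tables below are short excerpts of the values given earlier, and `CASE_PENALTY` and `split_distance` are hypothetical names with an arbitrary penalty value.

```python
# Excerpts only: a real split would cover every letter.
UPPER = {"O": 0xB4, "D": 0xB4, "Q": 0xB6, "C": 0x94}
LOWER = {"o": 0xB4, "c": 0x94, "a": 0xBC, "e": 0xBC}

CASE_PENALTY = 2  # hypothetical extra cost for an upper/lower mismatch


def split_distance(x: str, y: str) -> int:
    """Differing stroke bits, plus a penalty if the cases differ."""
    code_x = (UPPER if x.isupper() else LOWER)[x]
    code_y = (UPPER if y.isupper() else LOWER)[y]
    bits = bin(code_x ^ code_y).count("1")
    penalty = CASE_PENALTY if x.isupper() != y.isupper() else 0
    return bits + penalty


print(split_distance("O", "o"))  # same strokes, different case
print(split_distance("o", "a"))  # same case, one stroke bit apart
```

Keeping the two tables separate also lets each one be tuned independently, for example when upper-case and lower-case shapes need different values in a given font.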

Keep in mind that this is early experimentation. If you see a way to improve it (for example, by changing the bit sequencing), by all means feel free to do so.
