
Problem description

I'm having some serious issues using Zend_Lucene with foreign characters like åäö. The issues appear both when the index is created and when it is queried. I've tried both ISO-8859-1 and UTF-8.

The query that doesn't work looks like "+_area:skåne". With Zend_Lucene I get no matches, but if I run the same query in Luke I get many matching documents.

The index contains 20 fields. The "_area" field is added with the following syntax:

$doc->addField(Zend_Search_Lucene_Field::keyword('_area', strtolower($item['area']), 'iso-8859-1')); 

I'm using the Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive analyzer.

While indexing, the error message below sometimes appeared (the indexed documents were randomly selected from a database that uses ISO-8859-1 encoding).

This was "solved" by checking whether $this->_input is empty, as that seemed to be what caused the notices. Note: the weird query results were a pre-existing condition.

When I search keyword fields using foreign characters I receive the error above, but when I search text fields it behaves differently: it generates about a hundred instances of the error below.

But it produces what looks like a correct result set! On a side note, this second query doesn't return any results in Luke.

I've also tried UTF-8 because, to my knowledge, Zend_Lucene uses it internally. Since the data set is ISO-8859-1, I convert it using utf8_encode. But then indexing produces the following errors.
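The conversion step described above can be sketched in plain PHP. This is only an illustration, assuming the mbstring extension is available and that the database values really are ISO-8859-1; normalizeForIndex is a hypothetical helper name, not part of Zend_Lucene. It uses mb_convert_encoding in place of utf8_encode (deprecated as of PHP 8.2) and mb_strtolower in place of strtolower, which works on single bytes and so cannot be relied on for multibyte UTF-8 characters such as Å.

```php
<?php
// Hypothetical helper (not part of Zend_Lucene): turn a database value
// that is assumed to be ISO-8859-1 into lowercase UTF-8 before indexing.
function normalizeForIndex(string $latin1Value): string
{
    // Convert the raw ISO-8859-1 bytes to UTF-8.
    $utf8 = mb_convert_encoding($latin1Value, 'UTF-8', 'ISO-8859-1');

    // Lowercase with multibyte awareness so that "Å" becomes "å".
    return mb_strtolower($utf8, 'UTF-8');
}

// "SKÅNE" written as ISO-8859-1 bytes (Å is the single byte 0xC5).
$raw = "SK\xC5NE";
$normalized = normalizeForIndex($raw);

// Show the resulting bytes: å is now the two-byte UTF-8 sequence c3 a5.
echo bin2hex($normalized), "\n"; // prints 736bc3a56e65
```

Checking the bytes with bin2hex, rather than echoing the string, avoids being misled by the terminal's or browser's own encoding.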

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentMerger.php on line 196

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentMerger.php on line 200

Notice: Undefined index: in \Zend\Search\Lucene\Index\SegmentWriter.php on line 231

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentWriter.php on line 231

Notice: Undefined offset: 250595 in \Zend\Search\Lucene\Index\SegmentInfo.php on line 2020

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentInfo.php on line 2020

Notice: Undefined index: in \Zend\Search\Lucene\Index\SegmentWriter.php on line 465 ...


So, can someone please shed some light on this? :) I believe (after days of googling) that I'm not the only one experiencing this.

Recommended answer

I suggest you try using a UTF-8 compatible text analyzer. It looks like the analyzer you are using destroys the non-ASCII characters. You should also make sure that the text is read in correctly and that it reaches Lucene in the proper encoding.
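As a rough sketch of what this could look like with Zend Framework 1: the analyzer class, setDefault, setDefaultEncoding and Zend_Search_Lucene_Field::keyword are part of Zend_Search_Lucene, while configureUtf8Search and buildAreaField are hypothetical helper names introduced here. It assumes the framework is autoloadable, that PCRE was built with UTF-8 support (which the Utf8 analyzers require), and that the value passed in has already been converted to UTF-8.

```php
<?php
// Sketch only: assumes Zend Framework 1 (Zend_Search_Lucene) is autoloadable.

/**
 * Hypothetical helper: make UTF-8 the default for analysis and query parsing.
 */
function configureUtf8Search(): void
{
    // UTF-8 aware, case-insensitive analyzer that keeps letters and digits.
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(
        new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
    );

    // Tell the query parser that query strings arrive as UTF-8.
    Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');
}

/**
 * Hypothetical helper: build the "_area" keyword field from a UTF-8 value.
 */
function buildAreaField(string $utf8Area)
{
    // Lowercase with multibyte awareness, then declare the value as UTF-8.
    return Zend_Search_Lucene_Field::keyword(
        '_area',
        mb_strtolower($utf8Area, 'UTF-8'),
        'utf-8'
    );
}
```

Calling configureUtf8Search() before both indexing and searching keeps the two sides consistent. An index that was built with the previous analyzer would presumably need to be rebuilt for the change to take effect.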

