How to optimize MySQL's Boolean Full-Text Search? (Or what to replace it with?) - C#

Question

I have a table that contains approximately 22,000 rows, and I use a Boolean Full-Text Search to find what I'm interested in. My problem is that I created a 'dynamic search' feel: a DataGridView that is refreshed after every TextChanged event. As you might have figured out, it takes a lot of time to search for the entered string after every event.

What could I do to improve the search speed?

Any suggestions are welcome!

Recommended answer

First, you should realize that RDBMS support for full-text indexing is a hack: it forces a technology designed for efficient access to structured data to deal with unstructured text. (Yes, that's just my opinion. If required, I can defend it, as I understand both technologies extremely well. ;)

So, what can be done to improve search performance?

The best way to handle full-text search within a corpus of documents is to use technology specifically designed to do so, such as SOLR (Lucene) from Apache or Sphinx from, err, Sphinx.

For reasons that will become clear below, I strongly recommend this approach.

When constructing text-based search solutions, the usual approach is to index all documents into a single searchable index. While this might be the most expedient, it is not the only approach.

Assuming what you're searching for can be easily quantified into a set of known rules, you could offer more of a "guided" style of search than simply unqualified full-text. What I mean by this is, if your application might benefit from guiding users to results, you can preload various sets of results, based on a known set of rules, into their own tables, and thus reduce the bulk of data to be searched.

If you expect a majority of your users will benefit from a known set of search terms in a known order, you can construct your search UI to favor those terms.

So, assuming a majority of users are looking for a variety of automobile, you might offer predefined searches based on model, year, condition, etc. Your search UI would be crafted as a series of dropdown menus to "guide" users to specific results.

Or, if a majority of searches will be for a specific main topic (say 'automobiles'), you could predefine a table of only those records you've previously identified as being related to automobiles.

Both of these approaches would reduce the number of records to be searched and so reduce response times.
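As a rough illustration of the preloading idea, the sketch below copies the records for one topic into their own, smaller table. It uses Python with the standard-library sqlite3 module as a stand-in for MySQL so that it runs without a server. The `records` table, its `topic` and `body` columns, and the `records_<topic>` naming scheme are all assumptions made for the example, not part of the original question.

```python
import sqlite3

def preload_topic(conn, topic):
    """Copy every record tagged with `topic` into its own, smaller table."""
    if not topic.isidentifier():  # the topic becomes part of a table name
        raise ValueError(f"unsafe topic name: {topic!r}")
    table = f"records_{topic}"  # hypothetical naming scheme
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute(
        f"INSERT INTO {table} (id, body) "
        "SELECT id, body FROM records WHERE topic = ?",
        (topic,),
    )
    return table

# Hypothetical sample data: three records, two of them about automobiles.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, topic TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO records (topic, body) VALUES (?, ?)",
    [
        ("automobiles", "red convertible, 2004, good condition"),
        ("boats", "small sailing boat"),
        ("automobiles", "blue pickup truck, 1998"),
    ],
)

auto_table = preload_topic(conn, "automobiles")
auto_rows = conn.execute(
    f"SELECT id, body FROM {auto_table} ORDER BY id"
).fetchall()
```

Searches that are known to be about automobiles can then run against the smaller table instead of the full one.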

If you cannot integrate an external search technology into your project and preloading isn't an option, there are still ways to vastly improve search query response times, but they differ based on what you need to accomplish and how you expect searches to be carried out.

If you expect users to search using single keywords or phrases and boolean relationships between them, you might consider constructing your own 'inverted index' of your corpus. (This is what MySQL's Boolean Full-Text Search already does, but doing it yourself allows greater control over both the speed and accuracy of search.)

To build an inverted index from your existing data:


    -- dict: a dictionary containing one row per unique word in the corpus
    -- (the varchar length of 64 is an arbitrary choice)
    create table dict (
      id int primary key,
      word varchar(64)
    );

    -- invert: an inverted index that maps words to records in the corpus
    create table invert (
      id int primary key,
      rec_id int,
      word_id int
    );

    -- stopwords: words to ignore when indexing (like a, an, the, etc.)
    create table stopwords (
      id int primary key,
      word varchar(64)
    );

Note: This is just a sketch. You'll want to add indexes, constraints, etc. when you actually create these tables.

The stopwords table is used to reduce the size of your index to only those words that matter to users' expected queries. For example, it's rarely useful to index English articles, like 'a', 'an', 'the', since they do not contribute useful meaning to keyword searches.

Typically, you'll require a stopword list specifically crafted to the needs of your application. If you never expect users to include the terms 'red', 'white' or 'blue' in their queries, or if these terms appear in every searchable record, you would want to add them to your stopword list.

See the note at the end of this answer for instructions on using your own stopword list in MySQL.



To build an inverted index from your existing records, you'll need to (pseudo-code):


    foreach( word(w) in record(r) ) {
      if(w is not in stopwords) {
        if( w does not exist in dictionary) {
          insert w to dictionary at w.id
        }
        insert (r.id, w.id) into inverted_index
      }
    }
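The pseudo-code above could be fleshed out roughly as follows. This is only a sketch under stated assumptions: it uses Python with the standard-library sqlite3 module in place of MySQL so that it runs without a server, it splits words with a simple regular expression, and it takes the stopwords as a Python set rather than reading them from the `stopwords` table. `INSERT OR IGNORE` is SQLite syntax (MySQL spells it `INSERT IGNORE`), and the sample records are made up.

```python
import re
import sqlite3

def build_index(conn, records, stopwords):
    """Fill dict and invert from an iterable of (rec_id, text) pairs."""
    conn.executescript("""
        CREATE TABLE dict   (id INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL);
        CREATE TABLE invert (id INTEGER PRIMARY KEY,
                             rec_id INTEGER NOT NULL,
                             word_id INTEGER NOT NULL);
    """)
    for rec_id, text in records:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word in stopwords:
                continue  # ignored words never reach the index
            # Add the word to the dictionary only if it is not there yet.
            conn.execute("INSERT OR IGNORE INTO dict (word) VALUES (?)", (word,))
            word_id = conn.execute(
                "SELECT id FROM dict WHERE word = ?", (word,)
            ).fetchone()[0]
            # One posting per occurrence of the word in the record.
            conn.execute(
                "INSERT INTO invert (rec_id, word_id) VALUES (?, ?)",
                (rec_id, word_id),
            )

conn = sqlite3.connect(":memory:")
build_index(
    conn,
    records=[(1, "The red car"), (2, "A blue car and a red boat")],
    stopwords={"a", "an", "the", "and"},
)

postings = conn.execute(
    "SELECT d.word, i.rec_id FROM invert AS i "
    "JOIN dict AS d ON d.id = i.word_id ORDER BY d.word, i.rec_id"
).fetchall()
```

For a table of about 22,000 rows, a one-off batch job like this is likely to be fast enough; wrapping the loop in a single transaction would be the first thing to try if it is not.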



More on stopwords:

Instead of using a specific stopword list, the 'if(w is not in stopwords)' test could make other decisions, either instead of or in addition to checking your list of unacceptable words.

Your application might wish to filter out all words fewer than 4 characters long, or to include only words from a predefined set.
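One way to keep that decision flexible is to hide it behind a single predicate. The sketch below is illustrative only: `make_word_filter` and its parameters are hypothetical names, and the rules it combines (a stopword set, a minimum length, an optional set of allowed words) are just the ones mentioned above.

```python
def make_word_filter(stopwords=frozenset(), min_length=1, allowed=None):
    """Return a function that decides whether a word should be indexed."""
    def keep(word):
        if len(word) < min_length:
            return False  # too short to be useful
        if word in stopwords:
            return False  # explicitly excluded
        if allowed is not None and word not in allowed:
            return False  # not in the predefined vocabulary
        return True
    return keep

words = "a red car with leather seats".split()

# Drop short words and one explicit stopword.
keep_long = make_word_filter(stopwords={"with"}, min_length=4)
long_words = [w for w in words if keep_long(w)]

# Index only words from a predefined set.
keep_known = make_word_filter(allowed={"red", "car"})
known_words = [w for w in words if keep_known(w)]
```

The indexing loop then calls `keep(word)` in place of the plain stopword test.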

By creating your own inverted index, you gain far greater and finer-grained control over search.

How you query the index really depends on how you expect queries to be submitted to it.

If queries are to be 'hard-coded', you can simply create the SELECT statement yourself. If you need to support user-entered queries, you'll need to convert whatever query language you choose into an SQL statement (typically done using a simple parser).

Assuming you wish to retrieve all documents matching the logical query '(word1 AND word2) OR word3', a possible approach might be:

    -- hits counts distinct matching words, so a record that merely
    -- repeats word1 twice does not satisfy 'word1 AND word2'
    CREATE TEMPORARY TABLE temp_results ( rec_id int, hits int ) AS
    ( SELECT I.rec_id, COUNT(DISTINCT I.word_id) AS hits
      FROM invert AS I JOIN dict AS D ON I.word_id = D.id
      WHERE D.word = 'word1' OR D.word = 'word2'
      GROUP BY I.rec_id
      HAVING hits = 2
    )
    UNION (
      SELECT I.rec_id, 1 AS hits
      FROM invert AS I JOIN dict AS D ON I.word_id = D.id
      WHERE D.word = 'word3'
    );

    SELECT DISTINCT rec_id FROM temp_results;

    DROP TABLE temp_results;

NOTE: This is just a first pass off the top of my head. I am confident there are better ways of converting a boolean query expression into an efficient SQL statement, and I welcome any and all suggestions for improvement.
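For user-entered queries, the "simple parser" mentioned earlier might look something like the following. It is a minimal sketch, not the answer's own method: it uses a different technique from the temporary table, treating each word as a set of `rec_id` values and combining the sets with `INTERSECT` (for AND) and `UNION` (for OR). The grammar (words, `AND`, `OR`, parentheses) and the function name `to_sql` are assumptions. The demo runs against Python's built-in sqlite3 module; in MySQL, `INTERSECT` is only available from version 8.0.31.

```python
import re
import sqlite3

# One sub-select per search word; AND/OR combine these as sets of rec_id.
TERM_SQL = ("SELECT i.rec_id AS rec_id FROM invert AS i "
            "JOIN dict AS d ON d.id = i.word_id WHERE d.word = ?")

def to_sql(query):
    """Translate e.g. '(word1 AND word2) OR word3' into (sql, params)."""
    tokens = re.findall(r"[()]|[^\s()]+", query)
    params = []
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        pos += 1
        return tokens[pos - 1]

    def combine(op, left, right):
        return (f"SELECT rec_id FROM ({left}) AS l {op} "
                f"SELECT rec_id FROM ({right}) AS r")

    def factor():  # a single word or a parenthesised sub-expression
        tok = take()
        if tok == "(":
            sql = expr()
            if take() != ")":
                raise ValueError("expected ')'")
            return sql
        if tok in (")", "AND", "OR"):
            raise ValueError(f"unexpected token {tok!r}")
        params.append(tok.lower())  # bound as a parameter, never inlined
        return TERM_SQL

    def term():  # factors joined by AND
        sql = factor()
        while peek() == "AND":
            take()
            sql = combine("INTERSECT", sql, factor())
        return sql

    def expr():  # terms joined by OR
        sql = term()
        while peek() == "OR":
            take()
            sql = combine("UNION", sql, term())
        return sql

    sql = expr()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return sql, params

# Hypothetical index: record 10 has word1+word2, 11 has word1 only,
# 12 has word3, and 13 has word2+word3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dict (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE invert (id INTEGER PRIMARY KEY, rec_id INTEGER, word_id INTEGER);
    INSERT INTO dict (id, word) VALUES (1, 'word1'), (2, 'word2'), (3, 'word3');
    INSERT INTO invert (rec_id, word_id) VALUES
        (10, 1), (10, 2), (11, 1), (12, 3), (13, 2), (13, 3);
""")

sql, params = to_sql("(word1 AND word2) OR word3")
matches = sorted(row[0] for row in conn.execute(sql, params))
```

Because the search words are passed as bound parameters, user input never becomes part of the SQL text itself.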

To search for phrases, you'll need to add a field to the inverted index to represent the position at which the word appeared within its record, and factor that into your SELECT.
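A sketch of how a position field could be used, assuming a `pos` column is added to `invert` to hold each word's zero-based offset in its record. A phrase of n words then becomes an n-way self-join in which every word must appear in the same record at consecutive positions. The function names and sample records are hypothetical, and sqlite3 again stands in for MySQL.

```python
import sqlite3

def build_positional_index(conn, records):
    """Index {rec_id: text}, storing each word's position in its record."""
    conn.executescript("""
        CREATE TABLE dict (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE invert (id INTEGER PRIMARY KEY, rec_id INTEGER,
                             word_id INTEGER, pos INTEGER);
    """)
    for rec_id, text in records.items():
        for pos, word in enumerate(text.lower().split()):
            conn.execute("INSERT OR IGNORE INTO dict (word) VALUES (?)", (word,))
            word_id = conn.execute(
                "SELECT id FROM dict WHERE word = ?", (word,)
            ).fetchone()[0]
            conn.execute(
                "INSERT INTO invert (rec_id, word_id, pos) VALUES (?, ?, ?)",
                (rec_id, word_id, pos),
            )

def find_phrase(conn, phrase):
    """Return the ids of records that contain the words of `phrase` in order."""
    words = phrase.lower().split()
    if not words:
        return []
    sql = ("SELECT DISTINCT i0.rec_id FROM invert AS i0 "
           "JOIN dict AS d0 ON d0.id = i0.word_id AND d0.word = ?")
    for k in range(1, len(words)):
        # The k-th word must sit exactly k positions after the first word.
        sql += (f" JOIN invert AS i{k} ON i{k}.rec_id = i0.rec_id"
                f" AND i{k}.pos = i0.pos + {k}"
                f" JOIN dict AS d{k} ON d{k}.id = i{k}.word_id"
                f" AND d{k}.word = ?")
    return sorted(row[0] for row in conn.execute(sql, words))

conn = sqlite3.connect(":memory:")
build_positional_index(conn, {
    1: "red sports car",
    2: "sports red car",
    3: "a red sports car for sale",
})
phrase_hits = find_phrase(conn, "red sports car")
```

Note that if stopwords are dropped during indexing, positions should still count them; otherwise phrases that span a stopword will appear closer together than they really are.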

And finally, you'll need to update your inverted index as you add new records or delete old ones.
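Keeping the index in step with the data can be as simple as the sketch below: index a record when it is added, remove its postings when it is deleted, and do both when it is edited. The helper names are hypothetical, sqlite3 stands in for MySQL, and unused words are simply left behind in `dict`, which is harmless for searching.

```python
import re
import sqlite3

def index_record(conn, rec_id, text):
    """Add postings for a new record."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        conn.execute("INSERT OR IGNORE INTO dict (word) VALUES (?)", (word,))
        word_id = conn.execute(
            "SELECT id FROM dict WHERE word = ?", (word,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO invert (rec_id, word_id) VALUES (?, ?)",
            (rec_id, word_id),
        )

def unindex_record(conn, rec_id):
    """Remove all postings of a deleted record."""
    conn.execute("DELETE FROM invert WHERE rec_id = ?", (rec_id,))

def reindex_record(conn, rec_id, text):
    """Refresh the postings of an edited record."""
    unindex_record(conn, rec_id)
    index_record(conn, rec_id, text)

def search(conn, word):
    rows = conn.execute(
        "SELECT DISTINCT i.rec_id FROM invert AS i "
        "JOIN dict AS d ON d.id = i.word_id WHERE d.word = ?",
        (word,),
    )
    return sorted(row[0] for row in rows)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dict (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE invert (id INTEGER PRIMARY KEY, rec_id INTEGER, word_id INTEGER);
""")
index_record(conn, 1, "red car")
index_record(conn, 2, "blue car")
before = search(conn, "car")

unindex_record(conn, 1)              # record 1 was deleted
after_delete = search(conn, "car")

reindex_record(conn, 2, "blue boat")  # record 2 was edited
after_edit = search(conn, "car")
```

In a real application these calls would run in the same transaction as the insert, update or delete of the record itself, so that the index and the data cannot drift apart.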

"Full-text search" falls under a very large area of research known as "Information Retrieval" (IR), and there are many books on the subject, including:


  • Information Retrieval: Implementing and Evaluating Search Engines by Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack (Jul 23, 2010)

  • Search Engines: Information Retrieval in Practice by Bruce Croft, Donald Metzler and Trevor Strohman (Feb 16, 2009)

  • Building Search Applications: Lucene, LingPipe, and Gate by Manu Konchady (Jun 2008)

Check Amazon for more.

To use your own stopword list in MySQL:

  1. Create your own list of stopwords, one word per line, and save it to a known location on your server, say: /usr/local/lib/IR/stopwords.txt

  2. Edit my.cnf to add or update the following lines:

    [mysqld]  
    ft_min_word_len=1    
    ft_max_word_len=40  
    ft_stopword_file=/usr/local/lib/IR/stopwords.txt

which will set the minimum and maximum length of legal words to 1 and 40, respectively, and tell mysqld where to find your custom list of stopwords. These settings only take effect after the server is restarted and the existing FULLTEXT indexes are rebuilt.

(Note: the default ft_max_word_len is 84, which I believe is pretty excessive and can cause runs of characters that are not real words to be indexed.)
