本文介绍了NLP(自然语言处理)如何用任何方法检测问题?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我搜索发现一些问题的机器学习方法.

I search a machine learning method detecting some question.

示例

 User: Please tell me your name ?     
 AI  : (AI find User want to know his name)   
       My name is [AI's name]. 

我的数据集如下.

[label], [question]    
   1   , What's your name?    
   1   , Tell me your name.
   ...

但是问题是要在输入中包含不是问题的内容.

But the problem is to include something that is not a question in the input.

示例

User: Hello, my name is [User name]
AI  : (this is not a question)    
      (throw another process)
      (->) Nice to meet you.

Question的类别数为10〜20,但不是问题的句子数太多.

The number of Question's categories is 10~20, but the number of sentences which is not a question is too many.

您知道如何解决此问题或与此有关的任何任务吗?

Do you know how to solve this question Or any task related to this?

推荐答案

您可能希望将问题分为三部分.

You'll probably want to factor the problem into three parts.

  • 首先,您需要将任意长度的文本序列映射到固定长度的向量上.为此,您可以查看 Le和Mikolov的句子和文档的分布式表示形式"
  • 有了这些,问题就会减少到一个简单的分类任务.您有一组向量和每个向量映射到的类别.具有一个隐藏层和一个softmax输出层的网络应该足够了.这将使您可以按类别进行分配.
  • 最后,您需要确定每个预测的置信度.我想到了两种广泛的方法.
  • First, you'll want to map the arbitrary-length sequence of text onto a fixed-length vector. For this, you might look at Le and Mikolov's "Distributed Representations of Sentences and Documents"
  • Once you have that, the problem reduces to a simple classification task. You have a set of vectors and the categories each maps to. A network with one hidden layer and a softmax output layer should probably be sufficient. This will give you a distribution over categories.
  • Finally, you need to determine a confidence level for each prediction. There are two broad approaches that come to mind.
  1. 首先,您可以为杂项"引入新的类别,并向其中添加任何不属于真实"类别之一的句子.这种方法的弱点在于,该类实际上是许多不相关类的并集,因此可能很难学习,因为这些点将散布在整个段落矢量空间中,而不是存在于一个好的簇中.
  2. 或者,您可以在使用softmax进行归一化之前,先查看输出神经元的值 ,如果最可能类别的值不超过某个阈值,则输出未知类别".您可能需要调整阈值,以最大程度地提高验证集的准确性.
  1. First, you can introduce a new category for "miscellaneous", and add any sentences that don't fit into one of the "real" categories to it. The weakness of this approach is that this class is actually the union of many unrelated classes and thus might be difficult to learn, since the points will be scattered all over the paragraph vector space rather than existing in a nice cluster.
  2. Alternately, you might look at the values of the output neurons before you normalize using softmax, and output "unknown category" if the value for the most likely category doesn't exceed some threshold. You'll probably want to tune the threshold value to maximize accuracy over a validation set.

我不能保证这会成功,但这是我首先尝试的方法.

I can't guarantee that this will work, but it's the approach I would try first.

这篇关于NLP(自然语言处理)如何用任何方法检测问题?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

10-29 13:04