Question

I have to do a final project for my computational linguistics class. We've been using OCaml the entire time, but I also have familiarity with Java. We've studied morphology, FSMs, collecting parse trees, CYK parsing, tries, pushdown automata, regular expressions, formal language theory, some semantics, etc.

Here are some ideas I've come up with. Do you have anything you think would be cool?

  1. A script that scans Facebook threads for obnoxious* comments and silently hides them with JS (this would be run with the user's consent, obviously)

  2. An analysis of a piece of writing using semantics, syntax, punctuation usage, and other metrics, to try to "fingerprint" the author. It could be used to determine if two works are likely written by the same author. Or, someone could put in a bunch of writing he's done over time, and get a sense of how his style has changed.

  3. A chat bot (less interesting/original)

I may be permitted to use pre-existing libraries to do this. Do any exist for OCaml? Without a library/toolkit, the above three ideas are probably infeasible, unless I limit it to a very specific domain.

Lower-level ideas:

  1. Operations on finite state machines - minimizing, composing transducers, proving that an FSM has the minimal possible number of states. I am very interested in graph theory, so any overlap with FSMs could be a good avenue to explore. (What else can I do with FSMs?)

  2. Something cool with regex?

  3. Something cool with CYK?

Does anyone else have any cool ideas?

*"Obnoxious" is defined as following certain patterns typical of junior high schoolers. The vagueness of this term is not an issue; for the credit I could define whatever I want and target that.

Recommended answer

  1. Obnoxious language filtering - I think this will reduce down to a process very similar to spam email filtering. That is, counting the frequency of a set of more-or-less 'obnoxious' words. It doesn't sound like you will get the scope to do anything particularly clever, unless you also use other sources of information (e.g. the structure of the social links shared between the sender and recipient, perhaps). On the other hand, online bullying is a very serious thing and you can bet Facebook/Myspace and the other social networking sites care a lot about tackling it.

  2. Stylistic Analysis - There has been some work done on this in various forms, often under the name authorship analysis. Shlomo Argamon does a lot of work in this area and you could probably discover a lot more from the references in his papers. One of the best ways to profile an author is to learn the distribution of their usage of a set of stopwords (a.k.a. function words), such as 'and', 'but', 'if', etc. I think there's a lot more scope to do something new and interesting in this area - authorship analysis on internet data is a hard problem - but also a lot more scope to fail.

  3. Chat bot - You're right, this is a pretty standard project. It's also quite hard to measure success/failure. I think the project would be more compelling if it were a chat bot with some kind of purpose, like answering questions in a limited domain, but that's something that's very difficult to do well.
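To make the stopword idea from the stylistic analysis point a bit more concrete, here is a rough sketch in Python (not OCaml, purely for brevity). It builds a relative-frequency profile over a small set of function words and compares two texts with cosine similarity. The word list, the tokenizer, and the names `function_word_profile` and `cosine_similarity` are my own assumptions for illustration, not anything taken from Argamon's papers; a real study would use a much longer word list and a proper classifier.

```python
import math
import re
from collections import Counter

# Illustrative function-word list (an assumption); real work uses hundreds of words.
FUNCTION_WORDS = ["and", "but", "if", "the", "of", "to", "in", "that"]


def function_word_profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text with no function words has no usable profile
    return dot / (norm_a * norm_b)
```

You would compute one profile per document and treat a high similarity as weak evidence of shared authorship. On short texts the profiles are very noisy, which is part of why internet data is hard.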

The rest are really too vague to make any comments on, sorry.

There aren't any NLP libraries that I know of in OCaml; it's just not a particularly popular programming language. However, I do know of a machine learning library in OCaml called MEGAM, written by Hal Daumé, a very good NLP researcher, which has been used for NLP tasks. I get the feeling that figuring out MEGAM and using it to do some NLP task might be too big a project to take on, though.

Some other ideas:

  • Sentiment Analysis - A very trendy area of research. You could make this task as easy or hard as you like, from scoring a document as positive/negative to extracting specific topics and generating a sentiment score for each one.
  • Coreference/Anaphora resolution - A difficult task but a very important one. Some approaches use a graph representation (each mention is a node with edges between them if they co-refer) to enforce things like transitivity.
  • Document Classification - You could try to train a system on the StackOverflow data set to suggest tags for a given question. It's a fairly well-known problem with some established techniques, but it's an interesting data set and has an obvious and useful real-world application. You could also see if you can find specific features of a question (word choice, length, formatting, punctuation, etc.) that cause it to be voted highly.
  • Haiku Generation - Kind of a silly one, but I always thought it was an interesting idea. Syllable counting could be done with the CMU pronouncing dictionary. Should be a lot of fun, if not particularly useful.
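
For the haiku idea, the syllable counting could look roughly like the sketch below (again Python rather than OCaml, for brevity). It relies on the fact that in the CMU pronouncing dictionary every vowel phoneme carries a stress digit (0, 1, or 2), so the number of phonemes ending in a digit is the number of syllables. The helper names `parse_cmudict`, `count_syllables`, and `is_haiku` are hypothetical, and it assumes the plain-text format with one `WORD PH1 PH2 ...` entry per line and `;;;` comment lines.

```python
def parse_cmudict(lines):
    """Map each lowercase word to its syllable count (first pronunciation wins)."""
    syllables = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        word, *phones = line.split()
        word = word.split("(")[0].lower()  # drop variant markers such as "(1)"
        # Vowel phonemes end in a stress digit; one vowel per syllable.
        count = sum(1 for phone in phones if phone[-1].isdigit())
        syllables.setdefault(word, count)
    return syllables


def count_syllables(text, syllables):
    """Total syllables in `text`, or None if any word is not in the dictionary."""
    total = 0
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'")
        if word not in syllables:
            return None
        total += syllables[word]
    return total


def is_haiku(lines, syllables):
    """True if the three lines have 5, 7, and 5 syllables."""
    return [count_syllables(line, syllables) for line in lines] == [5, 7, 5]
```

A generator could then sample candidate lines from any language model or grammar and keep only those that pass `is_haiku`. Words missing from the dictionary are the main practical problem, which is why `count_syllables` returns `None` instead of guessing.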

08-30 22:47