This post is about how to understand the output of the topic model classes in MALLET; the question and the recommended answer below may be a useful reference for anyone facing the same problem.

Problem Description

As I'm trying out the example code from the topic modeling developer's guide, I really want to understand the meaning of that code's output.

First, during the run, it prints:

Coded LDA: 10 topics, 4 topic bits, 1111 topic mask
max tokens: 148
total tokens: 1333
<10> LL/token: -9,24097
<20> LL/token: -9,1026
<30> LL/token: -8,95386
<40> LL/token: -8,75353

0   0,5 battle union confederate tennessee american states 
1   0,5 hawes sunderland echo war paper commonwealth 
2   0,5 test including cricket australian hill career 
3   0,5 average equipartition theorem law energy system 
4   0,5 kentucky army grant gen confederates buell 
5   0,5 years yard national thylacine wilderness parks 
6   0,5 gunnhild norway life extinct gilbert thespis 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings south ring dust 2 uranus 
9   0,5 tasmanian back time sullivan london century 

<50> LL/token: -8,59033
<60> LL/token: -8,63711
<70> LL/token: -8,56168
<80> LL/token: -8,57189
<90> LL/token: -8,46669

0   0,5 battle union confederate tennessee united numerous 
1   0,5 hawes sunderland echo paper commonwealth early 
2   0,5 test cricket south australian hill england 
3   0,5 average equipartition theorem law energy system 
4   0,5 kentucky army grant gen war time 
5   0,5 yard national thylacine years wilderness tasmanian 
6   0,5 including gunnhild norway life time thespis 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus survived 
9   0,5 back london modern sullivan gilbert needham 

<100> LL/token: -8,49005
<110> LL/token: -8,57995
<120> LL/token: -8,55601
<130> LL/token: -8,50673
<140> LL/token: -8,46388

0   0,5 battle union confederate tennessee war united 
1   0,5 sunderland echo paper edward england world 
2   0,5 test cricket south australian hill record 
3   0,5 average equipartition theorem energy system kinetic 
4   0,5 hawes kentucky army gen grant confederates 
5   0,5 years yard national thylacine wilderness tasmanian 
6   0,5 gunnhild norway including king life devil 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus number 
9   0,5 london sullivan gilbert thespis back mother 

<150> LL/token: -8,51129
<160> LL/token: -8,50269
<170> LL/token: -8,44308
<180> LL/token: -8,47441
<190> LL/token: -8,62186

0   0,5 battle union confederate grant tennessee numerous 
1   0,5 sunderland echo survived paper edward england 
2   0,5 test cricket south australian hill park 
3   0,5 average equipartition theorem energy system law 
4   0,5 hawes kentucky army gen time confederates 
5   0,5 yard national thylacine years wilderness tasmanian 
6   0,5 gunnhild including norway life king time 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus number 
9   0,5 back london sullivan gilbert thespis 3 

<200> LL/token: -8,54771

Total time: 6 seconds

Question 1: What does "Coded LDA: 10 topics, 4 topic bits, 1111 topic mask" in the first line mean? I only know what "10 topics" is about.

Question 2: What does LL/token mean in "<10> LL/token: -9,24097 <20> LL/token: -9,1026 <30> LL/token: -8,95386 <40> LL/token: -8,75353"? It seems to be a metric for the Gibbs sampling, but shouldn't it be monotonically increasing?

After that, the following is printed:

elizabeth-9 needham-9 died-7 3-9 1731-6 mother-6 needham-9 english-7 procuress-6 brothel-4 keeper-9 18th-8.......
0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4) 
1   0.008   sunderland (6) years (6) echo (5) survived (3) paper (3) 
2   0.040   test (6) cricket (5) hill (4) park (3) career (3) 
3   0.008   average (6) equipartition (6) system (5) theorem (5) law (4) 
4   0.073   hawes (7) kentucky (6) army (5) gen (4) war (4) 
5   0.008   yard (6) national (6) thylacine (5) wilderness (4) tasmanian (4) 
6   0.202   gunnhild (5) norway (4) life (4) including (3) king (3) 
7   0.202   zinta (4) role (3) hindi (3) actress (3) film (3) 
8   0.040   rings (10) ring (3) dust (3) 2 (3) uranus (3) 
9   0.411   london (4) sullivan (3) gilbert (3) thespis (3) back (3) 
0   0.55

The first line in this part is presumably the token-topic assignments, right?

Question 3: For the first topic,

0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4)   

The 0.008 is said to be the "topic distribution". Is it the distribution of this topic over the whole corpus? Then there seems to be a conflict: topic 0, as shown above, has its tokens appear in the corpus 8+7+6+4+4+... times, while topic 7 is recognized only 4+3+3+3+3+... times in the corpus. As a result, topic 7 should have a lower distribution than topic 0. This is what I can't understand. Furthermore, what is the "0 0.55" at the end?

Thank you very much for reading this long post. I hope you can answer it, and that it will be helpful for others interested in MALLET.

Best,

Recommended Answer

I don't think I know enough to give a very complete answer, but here's a shot at some of it... For Q1, you can inspect the source code to see how those values are calculated (a sketch follows below). For Q2, LL is the model's log-likelihood divided by the total number of tokens, which is a measure of how likely the data are given the model; increasing values mean the model is improving. Because Gibbs sampling is stochastic, though, the trace fluctuates rather than increasing monotonically. These values are also available in the R packages for topic modeling. As for the token-topic assignments, yes, I think that's right for the first line. Q3, good question; it's not immediately clear to me. Perhaps the (x) values are some kind of index, as token frequency seems unlikely... Presumably most of these are diagnostics of some kind.
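For Q1 specifically, here is a minimal sketch of that arithmetic, based on a reading of MALLET's cc.mallet.topics.ParallelTopicModel source; treat the details as an assumption rather than a specification. MALLET packs each token's type and topic together into a single int, so it needs enough low-order bits to hold any topic ID: 10 topics round up to the next power of two (16 = 2^4), giving 4 topic bits and the binary mask 1111 (decimal 15).

// Sketch (assumption): how MALLET appears to derive "topic bits" and
// "topic mask" for its packed type/topic encoding, mirroring logic
// in cc.mallet.topics.ParallelTopicModel.
public class TopicMaskDemo {
    public static void main(String[] args) {
        int numTopics = 10;
        int topicMask;
        if (Integer.bitCount(numTopics) == 1) {
            // numTopics is an exact power of two: the mask covers it directly
            topicMask = numTopics - 1;
        } else {
            // otherwise round up to the next power of two
            topicMask = Integer.highestOneBit(numTopics) * 2 - 1;
        }
        int topicBits = Integer.bitCount(topicMask);
        // Prints: Coded LDA: 10 topics, 4 topic bits, 1111 topic mask
        System.out.println("Coded LDA: " + numTopics + " topics, "
                + topicBits + " topic bits, "
                + Integer.toBinaryString(topicMask) + " topic mask");
    }
}

Run as-is, this reproduces the first line of the log above; the mask is what lets the sampler pull a topic ID out of the low bits of a packed value. For Q2, going by the same source, the printed figure appears to be the value of the model's modelLogLikelihood() method divided by the total token count (1333 here).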

A more useful set of diagnostics can be obtained with bin\mallet run cc.mallet.topics.tui.TopicTrainer ...your various options... --diagnostics-file diagnostics.xml, which will produce a large number of measures of topic quality. They're definitely worth checking out; an example invocation is sketched below.
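As an illustration, a hypothetical invocation might look like the following; --input, --num-topics, and --num-iterations are standard MALLET topic-training options, but the file name and values here are placeholders for your own setup.

bin\mallet run cc.mallet.topics.tui.TopicTrainer --input topic-input.mallet --num-topics 10 --num-iterations 200 --diagnostics-file diagnostics.xml

The resulting diagnostics.xml reports per-topic quality measures (coherence and exclusivity among them) that go well beyond the LL/token trace shown above.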

For the full story about all of this, I'd suggest writing an email to David Mimno at Princeton, who is the (main?) maintainer of MALLET, or writing to him via the list at http://blog.gmane.org/gmane.comp.ai.mallet.devel and then posting the answers back here for those of us curious about the inner workings of MALLET...

