Problem description
What should the format of the input dataset be for Google AutoML Natural Language multi-label text classification? I know that for multi-class classification I need one column of text and another column of labels, where the labels column contains one label per row.
I have multiple labels for each text and I want to do multi-label classification. I tried using one column per label with one-hot encoding, but I got this error message: Max 1000 labels supported. Found 9823 labels.
Recommended answer
It was very confusing at first, but I later managed to find the format in the documentation. It is a CSV file in which each row holds the text followed by only the labels that apply to it, like:
text1, label1, label2
text2, label2
text3, label3, label2, label1
The parser does not understand a table with NULL cells saved as a standard CSV file, where shorter rows are padded with trailing commas, like:
text1, label1, label2,
text2, label2,,
text3, label3, label2, label1
I had to manually remove the extra commas from the CSV file generated by Pandas.
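The manual cleanup could probably be avoided by writing the rows without padding in the first place. The following is a minimal sketch using only Python's standard `csv` module, assuming each row arrives as the text followed by label cells that may be empty, `None`, or NaN (as Pandas uses for missing values). The helper names `is_missing` and `write_ragged_csv` are hypothetical and not part of any library.

```python
import csv
import io
import math


def is_missing(cell):
    """Return True for None, blank strings, and float NaN (pandas' missing marker)."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    return isinstance(cell, str) and cell.strip() == ""


def write_ragged_csv(rows, out):
    """Write each row as text followed by only its non-missing labels."""
    writer = csv.writer(out, lineterminator="\n")
    for text, *labels in rows:
        kept = [str(label) for label in labels if not is_missing(label)]
        writer.writerow([text, *kept])


# Padded rows, as they would look in a table with NULL cells.
rows = [
    ("text1", "label1", "label2", None),
    ("text2", "label2", None, None),
    ("text3", "label3", "label2", "label1"),
]

buf = io.StringIO()
write_ragged_csv(rows, buf)
ragged = buf.getvalue()
print(ragged)
```

With a DataFrame whose first column is the text and whose remaining columns are labels, the rows could be supplied by `df.itertuples(index=False)`, and `out` could be a file opened with `newline=""` instead of the in-memory buffer.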