Input dataset format for Google AutoML Natural Language multi-label text classification

Question

What should the format of the input dataset be for Google AutoML Natural Language multi-label text classification? I know that for multi-class classification I need one column of text and another column of labels, where the labels column contains one label per row.

I have multiple labels for each text and I want to do multi-label classification. I tried one-hot encoding with one column per label, but I got this error message: "Max 1000 labels supported. Found 9823 labels."

Recommended answer

It was very confusing at first, but I eventually found the format in the documentation. It is a CSV file in which each row holds the text followed by that text's labels, so rows can have different lengths:

text1, label1, label2
text2, label2
text3, label3, label2, label1
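As a minimal sketch of producing this layout, the snippet below writes one row per example with Python's standard `csv` module, which also quotes any text that itself contains commas. The function name `write_multilabel_csv` and the sample data are hypothetical, chosen only to mirror the example rows above. It is not taken from the AutoML documentation.

```python
import csv
import io


def write_multilabel_csv(examples, out):
    """Write one row per example: the text first, then each of its labels."""
    writer = csv.writer(out, lineterminator="\n")
    for text, labels in examples:
        # Rows are deliberately ragged: no padding for texts with fewer labels.
        writer.writerow([text, *labels])


# Hypothetical sample data matching the rows shown above.
examples = [
    ("text1", ["label1", "label2"]),
    ("text2", ["label2"]),
    ("text3", ["label3", "label2", "label1"]),
]

buffer = io.StringIO()
write_multilabel_csv(examples, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function can be given a file opened with `open(path, "w", newline="")`.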

The parser does not understand a table with NULL cells saved as a standard CSV file, where shorter rows are padded with empty cells and therefore end in extra commas:

text1, label1, label2,
text2, label2,,
text3, label3, label2, label1

I had to manually remove the extra commas from the CSV file generated by pandas.
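The manual cleanup could probably be scripted. The sketch below reads a padded CSV and rewrites it with the empty cells dropped, using only the standard `csv` module. The function name `drop_empty_cells` and the sample input are hypothetical, and the sketch assumes that every row has a non-empty text in its first cell, so that removing empty cells never shifts a label into the text position.

```python
import csv
import io


def drop_empty_cells(src, dst):
    """Copy CSV rows from src to dst, omitting empty or whitespace-only cells."""
    reader = csv.reader(src, skipinitialspace=True)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        cells = [cell for cell in row if cell.strip()]
        if cells:  # skip rows that are entirely empty
            writer.writerow(cells)


# Hypothetical padded input, shaped like the problematic file above.
padded = "text1,label1,label2,\ntext2,label2,,\ntext3,label3,label2,label1\n"

cleaned = io.StringIO()
drop_empty_cells(io.StringIO(padded), cleaned)
print(cleaned.getvalue())
```

For real files, both handles can be opened with `newline=""`, as the `csv` module documentation recommends.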


08-13 19:14