|
@@ -0,0 +1,309 @@ |
|
|
|
|
|
{ |
|
|
|
|
|
"cells": [ |
|
|
|
|
|
{ |
|
|
|
|
|
"cell_type": "markdown", |
|
|
|
|
|
"metadata": {}, |
|
|
|
|
|
"source": [ |
|
|
|
|
|
"# 使用Loader和Pipe加载并处理数据集\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"这一部分是关于如何加载数据集的教程\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"## Part I: 数据集容器DataBundle\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"而由于对于同一个任务,训练集,验证集和测试集会共用同一个词表以及具有相同的目标值,所以在fastNLP中我们使用了 DataBundle 来承载同一个任务的多个数据集 DataSet 以及它们的词表 Vocabulary 。下面会有例子介绍 DataBundle 的相关使用。\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"DataBundle 在fastNLP中主要在各个 Loader 和 Pipe 中被使用。 下面我们先介绍一下 Loader 和 Pipe 。\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"## Part II: 加载的各种数据集的Loader\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"在fastNLP中,所有的 Loader 都可以通过其文档判断其支持读取的数据格式,以及读取之后返回的 DataSet 的格式, 例如 ChnSentiCorpLoader \n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"- download() 函数:自动将该数据集下载到缓存地址,默认缓存地址为~/.fastNLP/datasets/。由于版权等原因,不是所有的Loader都实现了该方法。该方法会返回下载后文件所处的缓存地址。\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"- _load() 函数:从一个数据文件中读取数据,返回一个 DataSet 。返回的DataSet的格式可从Loader文档判断。\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"- load() 函数:从文件或者文件夹中读取数据为 DataSet 并将它们组装成 DataBundle。支持接受的参数类型有以下的几种\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
" - None, 将尝试读取自动缓存的数据,仅支持提供了自动下载数据的Loader\n", |
|
|
|
|
|
" - 文件夹路径, 默认将尝试在该文件夹下匹配文件名中含有 train , test , dev 的文件,如果有多个文件含有相同的关键字,将无法通过该方式读取\n", |
|
|
|
|
|
" - dict, 例如{'train':\"/path/to/tr.conll\", 'dev':\"/to/validate.conll\", \"test\":\"/to/te.conll\"}。" |
|
|
|
|
|
] |
|
|
|
|
|
}, |
|
|
|
|
|
{ |
|
|
|
|
|
"cell_type": "code", |
|
|
|
|
|
"execution_count": 1, |
|
|
|
|
|
"metadata": {}, |
|
|
|
|
|
"outputs": [ |
|
|
|
|
|
{ |
|
|
|
|
|
"name": "stdout", |
|
|
|
|
|
"output_type": "stream", |
|
|
|
|
|
"text": [ |
|
|
|
|
|
"In total 3 datasets:\n", |
|
|
|
|
|
"\ttest has 1944 instances.\n", |
|
|
|
|
|
"\ttrain has 17196 instances.\n", |
|
|
|
|
|
"\tdev has 1858 instances.\n", |
|
|
|
|
|
"\n" |
|
|
|
|
|
] |
|
|
|
|
|
} |
|
|
|
|
|
], |
|
|
|
|
|
"source": [ |
|
|
|
|
|
"from fastNLP.io import CWSLoader\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"loader = CWSLoader(dataset_name='pku')\n", |
|
|
|
|
|
"data_bundle = loader.load()\n", |
|
|
|
|
|
"print(data_bundle)" |
|
|
|
|
|
] |
|
|
|
|
|
}, |
|
|
|
|
|
{ |
|
|
|
|
|
"cell_type": "markdown", |
|
|
|
|
|
"metadata": {}, |
|
|
|
|
|
"source": [ |
|
|
|
|
|
"这里表示一共有3个数据集。其中:\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
" 3个数据集的名称分别为train、dev、test,分别有17223、1831、1944个instance\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"也可以取出DataSet,并打印DataSet中的具体内容" |
|
|
|
|
|
] |
|
|
|
|
|
}, |
|
|
|
|
|
{ |
|
|
|
|
|
"cell_type": "code", |
|
|
|
|
|
"execution_count": 2, |
|
|
|
|
|
"metadata": {}, |
|
|
|
|
|
"outputs": [ |
|
|
|
|
|
{ |
|
|
|
|
|
"name": "stdout", |
|
|
|
|
|
"output_type": "stream", |
|
|
|
|
|
"text": [ |
|
|
|
|
|
"+----------------------------------------------------------------+\n", |
|
|
|
|
|
"| raw_words |\n", |
|
|
|
|
|
"+----------------------------------------------------------------+\n", |
|
|
|
|
|
"| 迈向 充满 希望 的 新 世纪 —— 一九九八年 新年 讲话 ... |\n", |
|
|
|
|
|
"| 中共中央 总书记 、 国家 主席 江 泽民 |\n", |
|
|
|
|
|
"+----------------------------------------------------------------+\n" |
|
|
|
|
|
] |
|
|
|
|
|
} |
|
|
|
|
|
], |
|
|
|
|
|
"source": [ |
|
|
|
|
|
"tr_data = data_bundle.get_dataset('train')\n", |
|
|
|
|
|
"print(tr_data[:2])" |
|
|
|
|
|
] |
|
|
|
|
|
}, |
|
|
|
|
|
{ |
|
|
|
|
|
"cell_type": "markdown", |
|
|
|
|
|
"metadata": {}, |
|
|
|
|
|
"source": [ |
|
|
|
|
|
"## Part III: 使用Pipe对数据集进行预处理\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"通过 Loader 可以将文本数据读入,但并不能直接被神经网络使用,还需要进行一定的预处理。\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"在fastNLP中,我们使用 Pipe 的子类作为数据预处理的类, Loader 和 Pipe 一般具备一一对应的关系,该关系可以从其名称判断, 例如 CWSLoader 与 CWSPipe 是一一对应的。一般情况下Pipe处理包含以下的几个过程,\n", |
|
|
|
|
|
"1. 将raw_words或 raw_chars进行tokenize以切分成不同的词或字; \n", |
|
|
|
|
|
"2. 再建立词或字的 Vocabulary , 并将词或字转换为index; \n", |
|
|
|
|
|
"3. 将target 列建立词表并将target列转为index;\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"所有的Pipe都可通过其文档查看该Pipe支持处理的 DataSet 以及返回的 DataBundle 中的Vocabulary的情况; 如 OntoNotesNERPipe\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"各种数据集的Pipe当中,都包含了以下的两个函数:\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"- process() 函数:对输入的 DataBundle 进行处理, 然后返回处理之后的 DataBundle 。process函数的文档中包含了该Pipe支持处理的DataSet的格式。\n", |
|
|
|
|
|
"- process_from_file() 函数:输入数据集所在文件夹,使用对应的Loader读取数据(所以该函数支持的参数类型是由于其对应的Loader的load函数决定的),然后调用相对应的process函数对数据进行预处理。相当于是把Load和process放在一个函数中执行。\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"接着上面 CWSLoader 的例子,我们展示一下 CWSPipe 的功能:" |
|
|
|
|
|
] |
|
|
|
|
|
}, |
|
|
|
|
|
{ |
|
|
|
|
|
"cell_type": "code", |
|
|
|
|
|
"execution_count": 3, |
|
|
|
|
|
"metadata": {}, |
|
|
|
|
|
"outputs": [ |
|
|
|
|
|
{ |
|
|
|
|
|
"name": "stdout", |
|
|
|
|
|
"output_type": "stream", |
|
|
|
|
|
"text": [ |
|
|
|
|
|
"In total 3 datasets:\n", |
|
|
|
|
|
"\ttest has 1944 instances.\n", |
|
|
|
|
|
"\ttrain has 17196 instances.\n", |
|
|
|
|
|
"\tdev has 1858 instances.\n", |
|
|
|
|
|
"In total 2 vocabs:\n", |
|
|
|
|
|
"\tchars has 4777 entries.\n", |
|
|
|
|
|
"\ttarget has 4 entries.\n", |
|
|
|
|
|
"\n" |
|
|
|
|
|
] |
|
|
|
|
|
} |
|
|
|
|
|
], |
|
|
|
|
|
"source": [ |
|
|
|
|
|
"from fastNLP.io import CWSPipe\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"data_bundle = CWSPipe().process(data_bundle)\n", |
|
|
|
|
|
"print(data_bundle)" |
|
|
|
|
|
] |
|
|
|
|
|
}, |
|
|
|
|
|
{ |
|
|
|
|
|
"cell_type": "markdown", |
|
|
|
|
|
"metadata": {}, |
|
|
|
|
|
"source": [ |
|
|
|
|
|
"表示一共有3个数据集和2个词表。其中:\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"- 3个数据集的名称分别为train、dev、test,分别有17223、1831、1944个instance\n", |
|
|
|
|
|
"- 2个词表分别为chars词表与target词表。其中chars词表为句子文本所构建的词表,一共有4777个不同的字;target词表为目标标签所构建的词表,一共有4种标签。\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"相较于之前CWSLoader读取的DataBundle,新增了两个Vocabulary。 我们可以打印一下处理之后的DataSet" |
|
|
|
|
|
] |
|
|
|
|
|
}, |
|
|
|
|
|
{ |
|
|
|
|
|
"cell_type": "code", |
|
|
|
|
|
"execution_count": 4, |
|
|
|
|
|
"metadata": {}, |
|
|
|
|
|
"outputs": [ |
|
|
|
|
|
{ |
|
|
|
|
|
"name": "stdout", |
|
|
|
|
|
"output_type": "stream", |
|
|
|
|
|
"text": [ |
|
|
|
|
|
"+---------------------+---------------------+---------------------+---------+\n", |
|
|
|
|
|
"| raw_words | chars | target | seq_len |\n", |
|
|
|
|
|
"+---------------------+---------------------+---------------------+---------+\n", |
|
|
|
|
|
"| 迈向 充满 希望... | [1224, 178, 674,... | [0, 1, 0, 1, 0, ... | 29 |\n", |
|
|
|
|
|
"| 中共中央 总书记... | [11, 212, 11, 33... | [0, 3, 3, 1, 0, ... | 15 |\n", |
|
|
|
|
|
"+---------------------+---------------------+---------------------+---------+\n" |
|
|
|
|
|
] |
|
|
|
|
|
} |
|
|
|
|
|
], |
|
|
|
|
|
"source": [ |
|
|
|
|
|
"tr_data = data_bundle.get_dataset('train')\n", |
|
|
|
|
|
"print(tr_data[:2])" |
|
|
|
|
|
] |
|
|
|
|
|
}, |
|
|
|
|
|
{ |
|
|
|
|
|
"cell_type": "markdown", |
|
|
|
|
|
"metadata": {}, |
|
|
|
|
|
"source": [ |
|
|
|
|
|
"可以看到有两列为int的field: chars和target。这两列的名称同时也是DataBundle中的Vocabulary的名称。可以通过下列的代码获取并查看Vocabulary的 信息" |
|
|
|
|
|
] |
|
|
|
|
|
}, |
|
|
|
|
|
{ |
|
|
|
|
|
"cell_type": "code", |
|
|
|
|
|
"execution_count": 5, |
|
|
|
|
|
"metadata": {}, |
|
|
|
|
|
"outputs": [ |
|
|
|
|
|
{ |
|
|
|
|
|
"name": "stdout", |
|
|
|
|
|
"output_type": "stream", |
|
|
|
|
|
"text": [ |
|
|
|
|
|
"Vocabulary(['B', 'E', 'S', 'M']...)\n" |
|
|
|
|
|
] |
|
|
|
|
|
} |
|
|
|
|
|
], |
|
|
|
|
|
"source": [ |
|
|
|
|
|
"vocab = data_bundle.get_vocab('target')\n", |
|
|
|
|
|
"print(vocab)" |
|
|
|
|
|
] |
|
|
|
|
|
}, |
|
|
|
|
|
{ |
|
|
|
|
|
"cell_type": "markdown", |
|
|
|
|
|
"metadata": {}, |
|
|
|
|
|
"source": [ |
|
|
|
|
|
"## Part IV: fastNLP封装好的Loader和Pipe\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"fastNLP封装了多种任务/数据集的 Loader 和 Pipe 并提供自动下载功能,具体参见文档 [数据集](https://docs.qq.com/sheet/DVnpkTnF6VW9UeXdh?c=A1A0A0)\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"## Part V: 不同格式类型的基础Loader\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"除了上面提到的针对具体任务的Loader,我们还提供了CSV格式和JSON格式的Loader\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"**CSVLoader** 读取CSV类型的数据集文件。例子如下:\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"```python\n", |
|
|
|
|
|
"from fastNLP.io.loader import CSVLoader\n", |
|
|
|
|
|
"data_set_loader = CSVLoader(\n", |
|
|
|
|
|
" headers=('raw_words', 'target'), sep='\\t'\n", |
|
|
|
|
|
")\n", |
|
|
|
|
|
"```\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"表示将CSV文件中每一行的第一项将填入'raw_words' field,第二项填入'target' field。其中项之间由'\\t'分割开来\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"```python\n", |
|
|
|
|
|
"data_set = data_set_loader._load('path/to/your/file')\n", |
|
|
|
|
|
"```\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"文件内容样例如下\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"```csv\n", |
|
|
|
|
|
"But it does not leave you with much . 1\n", |
|
|
|
|
|
"You could hate it for the same reason . 1\n", |
|
|
|
|
|
"The performances are an absolute joy . 4\n", |
|
|
|
|
|
"```\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"读取之后的DataSet具有以下的field\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"| raw_words | target |\n", |
|
|
|
|
|
"| --------------------------------------- | ------ |\n", |
|
|
|
|
|
"| But it does not leave you with much . | 1 |\n", |
|
|
|
|
|
"| You could hate it for the same reason . | 1 |\n", |
|
|
|
|
|
"| The performances are an absolute joy . | 4 |\n" |
|
|
|
|
|
] |
|
|
|
|
|
}, |
|
|
|
|
|
{ |
|
|
|
|
|
"cell_type": "markdown", |
|
|
|
|
|
"metadata": {}, |
|
|
|
|
|
"source": [ |
|
|
|
|
|
"**JsonLoader** 读取Json类型的数据集文件,数据必须按行存储,每行是一个包含各类属性的Json对象。例子如下\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"```python\n", |
|
|
|
|
|
"from fastNLP.io.loader import JsonLoader\n", |
|
|
|
|
|
"loader = JsonLoader(\n", |
|
|
|
|
|
" fields={'sentence1': 'raw_words1', 'sentence2': 'raw_words2', 'gold_label': 'target'}\n", |
|
|
|
|
|
")\n", |
|
|
|
|
|
"```\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"表示将Json对象中'sentence1'、'sentence2'和'gold_label'对应的值赋给'raw_words1'、'raw_words2'、'target'这三个fields\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"```python\n", |
|
|
|
|
|
"data_set = loader._load('path/to/your/file')\n", |
|
|
|
|
|
"```\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"数据集内容样例如下\n", |
|
|
|
|
|
"```\n", |
|
|
|
|
|
"{\"annotator_labels\": [\"neutral\"], \"captionID\": \"3416050480.jpg#4\", \"gold_label\": \"neutral\", ... }\n", |
|
|
|
|
|
"{\"annotator_labels\": [\"contradiction\"], \"captionID\": \"3416050480.jpg#4\", \"gold_label\": \"contradiction\", ... }\n", |
|
|
|
|
|
"{\"annotator_labels\": [\"entailment\"], \"captionID\": \"3416050480.jpg#4\", \"gold_label\": \"entailment\", ... }\n", |
|
|
|
|
|
"```\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"读取之后的DataSet具有以下的field\n", |
|
|
|
|
|
"\n", |
|
|
|
|
|
"| raw_words0 | raw_words1 | target |\n", |
|
|
|
|
|
"| ------------------------------------------------------ | ------------------------------------------------- | ------------- |\n", |
|
|
|
|
|
"| A person on a horse jumps over a broken down airplane. | A person is training his horse for a competition. | neutral |\n", |
|
|
|
|
|
"| A person on a horse jumps over a broken down airplane. | A person is at a diner, ordering an omelette. | contradiction |\n", |
|
|
|
|
|
"| A person on a horse jumps over a broken down airplane. | A person is outdoors, on a horse. | entailment |" |
|
|
|
|
|
] |
|
|
|
|
|
}, |
|
|
|
|
|
{ |
|
|
|
|
|
"cell_type": "code", |
|
|
|
|
|
"execution_count": null, |
|
|
|
|
|
"metadata": {}, |
|
|
|
|
|
"outputs": [], |
|
|
|
|
|
"source": [] |
|
|
|
|
|
} |
|
|
|
|
|
], |
|
|
|
|
|
"metadata": { |
|
|
|
|
|
"kernelspec": { |
|
|
|
|
|
"display_name": "Python Now", |
|
|
|
|
|
"language": "python", |
|
|
|
|
|
"name": "now" |
|
|
|
|
|
}, |
|
|
|
|
|
"language_info": { |
|
|
|
|
|
"codemirror_mode": { |
|
|
|
|
|
"name": "ipython", |
|
|
|
|
|
"version": 3 |
|
|
|
|
|
}, |
|
|
|
|
|
"file_extension": ".py", |
|
|
|
|
|
"mimetype": "text/x-python", |
|
|
|
|
|
"name": "python", |
|
|
|
|
|
"nbconvert_exporter": "python", |
|
|
|
|
|
"pygments_lexer": "ipython3", |
|
|
|
|
|
"version": "3.8.0" |
|
|
|
|
|
} |
|
|
|
|
|
}, |
|
|
|
|
|
"nbformat": 4, |
|
|
|
|
|
"nbformat_minor": 2 |
|
|
|
|
|
} |