From f4057d20c91af9f9a4cb70b57c4799fb9a13dcec Mon Sep 17 00:00:00 2001
From: ChenXin
Date: Thu, 27 Feb 2020 21:04:05 +0800
Subject: [PATCH] =?UTF-8?q?=E6=B7=BB=E5=8A=A0=E4=BA=86=20tutorial=5F4=20?=
 =?UTF-8?q?=E7=9A=84=20ipynb?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/source/tutorials/文本分类.rst      |   2 -
 tutorials/tutorial_4_load_dataset.ipynb | 309 ++++++++++++++++++++++++
 2 files changed, 309 insertions(+), 2 deletions(-)
 create mode 100644 tutorials/tutorial_4_load_dataset.ipynb

diff --git a/docs/source/tutorials/文本分类.rst b/docs/source/tutorials/文本分类.rst
index 29e96c20..2f675115 100644
--- a/docs/source/tutorials/文本分类.rst
+++ b/docs/source/tutorials/文本分类.rst
@@ -19,8 +19,6 @@
 .. figure:: ./cn_cls_example.png
    :alt: jupyter
 
-   jupyter
-
 步骤
 ----
 
diff --git a/tutorials/tutorial_4_load_dataset.ipynb b/tutorials/tutorial_4_load_dataset.ipynb
new file mode 100644
index 00000000..f6de83bc
--- /dev/null
+++ b/tutorials/tutorial_4_load_dataset.ipynb
@@ -0,0 +1,309 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Loading and Processing Datasets with Loader and Pipe\n",
+    "\n",
+    "This section is a tutorial on how to load datasets.\n",
+    "\n",
+    "## Part I: The dataset container DataBundle\n",
+    "\n",
+    "For a given task, the training, development, and test sets share the same vocabulary and the same set of target values, so fastNLP uses a DataBundle to hold a task's DataSet objects together with their Vocabulary objects. Examples of how to use a DataBundle follow below.\n",
+    "\n",
+    "Within fastNLP, DataBundle is mainly used by the various Loader and Pipe classes, which we introduce next.\n",
+    "\n",
+    "## Part II: Loaders for the various datasets\n",
+    "\n",
+    "In fastNLP, every Loader documents the data formats it can read and the format of the DataSet it returns; see, for example, ChnSentiCorpLoader.\n",
+    "\n",
+    "- download(): automatically downloads the dataset to a cache directory, ~/.fastNLP/datasets/ by default. For copyright and similar reasons, not every Loader implements this method. It returns the cache path of the downloaded files.\n",
+    "\n",
+    "- _load(): reads one data file and returns a DataSet. The format of the returned DataSet is described in the Loader's documentation.\n",
+    "\n",
+    "- load(): reads data from a file or folder into DataSet objects and assembles them into a DataBundle. It accepts the following argument types (a sketch of the dict form follows this cell):\n",
+    "\n",
+    "  - None: tries to read the automatically cached data; only supported by Loaders that offer automatic downloading.\n",
+    "  - A folder path: by default, looks for files in that folder whose names contain train, test, or dev; if several files contain the same keyword, the data cannot be read this way.\n",
+    "  - A dict, e.g. {'train': \"/path/to/tr.conll\", 'dev': \"/to/validate.conll\", \"test\": \"/to/te.conll\"}."
+   ]
+  },
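+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A minimal sketch of the dict form, using the CWSLoader introduced below (the file paths are hypothetical placeholders, not files shipped with fastNLP):\n",
+    "\n",
+    "```python\n",
+    "from fastNLP.io import CWSLoader\n",
+    "\n",
+    "# hypothetical paths -- point these at your own train/dev/test files\n",
+    "paths = {'train': '/path/to/train.txt',\n",
+    "         'dev': '/path/to/dev.txt',\n",
+    "         'test': '/path/to/test.txt'}\n",
+    "data_bundle = CWSLoader(dataset_name='pku').load(paths)\n",
+    "```"
+   ]
+  },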
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "In total 3 datasets:\n",
+      "\ttest has 1944 instances.\n",
+      "\ttrain has 17196 instances.\n",
+      "\tdev has 1858 instances.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from fastNLP.io import CWSLoader\n",
+    "\n",
+    "loader = CWSLoader(dataset_name='pku')\n",
+    "data_bundle = loader.load()\n",
+    "print(data_bundle)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This shows that the bundle holds 3 datasets, named train, dev, and test, with 17196, 1858, and 1944 instances respectively.\n",
+    "\n",
+    "We can also retrieve a DataSet and print its contents:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+----------------------------------------------------------------+\n",
+      "|                            raw_words                           |\n",
+      "+----------------------------------------------------------------+\n",
+      "| 迈向 充满 希望 的 新 世纪 —— 一九九八年 新年 讲话 ...          |\n",
+      "| 中共中央 总书记 、 国家 主席 江 泽民                           |\n",
+      "+----------------------------------------------------------------+\n"
+     ]
+    }
+   ],
+   "source": [
+    "tr_data = data_bundle.get_dataset('train')\n",
+    "print(tr_data[:2])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Part III: Preprocessing datasets with Pipe\n",
+    "\n",
+    "A Loader reads the text in, but that text cannot be fed to a neural network directly; some preprocessing is still needed.\n",
+    "\n",
+    "In fastNLP, subclasses of Pipe perform this preprocessing. Loaders and Pipes generally come in one-to-one pairs, recognizable from their names; for example, CWSLoader pairs with CWSPipe. A Pipe usually performs the following steps:\n",
+    "\n",
+    "1. tokenize raw_words or raw_chars into separate words or characters;\n",
+    "2. build a Vocabulary over the words or characters and convert them to indices;\n",
+    "3. build a vocabulary over the target column and convert it to indices.\n",
+    "\n",
+    "Every Pipe's documentation describes the DataSet formats it can process and the Vocabulary objects contained in the DataBundle it returns; see, for example, OntoNotesNERPipe.\n",
+    "\n",
+    "Every dataset Pipe provides the following two functions:\n",
+    "\n",
+    "- process(): processes an input DataBundle and returns the processed DataBundle. Its documentation describes the DataSet formats the Pipe can handle.\n",
+    "- process_from_file(): takes the folder containing the dataset, reads it with the corresponding Loader (so the argument types it supports are determined by that Loader's load() function), and then calls the matching process() function. In effect it runs load and process in a single call; see the sketch after this cell."
+   ]
+  },
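+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A minimal sketch of process_from_file(), assuming a hypothetical folder holding train/dev/test files in the CWS format:\n",
+    "\n",
+    "```python\n",
+    "from fastNLP.io import CWSPipe\n",
+    "\n",
+    "# load and process in one call; the folder path is a hypothetical placeholder\n",
+    "data_bundle = CWSPipe().process_from_file('/path/to/cws_data')\n",
+    "```"
+   ]
+  },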
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Continuing the CWSLoader example above, we demonstrate what CWSPipe does:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "In total 3 datasets:\n",
+      "\ttest has 1944 instances.\n",
+      "\ttrain has 17196 instances.\n",
+      "\tdev has 1858 instances.\n",
+      "In total 2 vocabs:\n",
+      "\tchars has 4777 entries.\n",
+      "\ttarget has 4 entries.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from fastNLP.io import CWSPipe\n",
+    "\n",
+    "data_bundle = CWSPipe().process(data_bundle)\n",
+    "print(data_bundle)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The bundle now holds 3 datasets and 2 vocabularies:\n",
+    "\n",
+    "- the 3 datasets are named train, dev, and test, with 17196, 1858, and 1944 instances respectively;\n",
+    "- the 2 vocabularies are the chars vocabulary and the target vocabulary. The chars vocabulary is built from the sentence text and contains 4777 distinct characters; the target vocabulary is built from the target labels and contains 4 labels.\n",
+    "\n",
+    "Compared with the DataBundle read by CWSLoader earlier, two Vocabulary objects have been added. We can print the processed DataSet:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+---------------------+---------------------+---------------------+---------+\n",
+      "|      raw_words      |        chars        |        target       | seq_len |\n",
+      "+---------------------+---------------------+---------------------+---------+\n",
+      "| 迈向 充满 希望...   | [1224, 178, 674,... | [0, 1, 0, 1, 0, ... |    29   |\n",
+      "| 中共中央 总书记...  | [11, 212, 11, 33... | [0, 3, 3, 1, 0, ... |    15   |\n",
+      "+---------------------+---------------------+---------------------+---------+\n"
+     ]
+    }
+   ],
+   "source": [
+    "tr_data = data_bundle.get_dataset('train')\n",
+    "print(tr_data[:2])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Two fields, chars and target, now hold ints. Their names are also the names of the Vocabulary objects in the DataBundle, which can be fetched and inspected as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Vocabulary(['B', 'E', 'S', 'M']...)\n"
+     ]
+    }
+   ],
+   "source": [
+    "vocab = data_bundle.get_vocab('target')\n",
+    "print(vocab)"
+   ]
+  },
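+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A Vocabulary maps between entries and indices in both directions. A small sketch of looking entries up (this assumes the standard fastNLP Vocabulary methods to_index() and to_word()):\n",
+    "\n",
+    "```python\n",
+    "# map a label to its index and back again\n",
+    "idx = vocab.to_index('B')\n",
+    "assert vocab.to_word(idx) == 'B'\n",
+    "print(idx, len(vocab))  # index of 'B' and the number of entries\n",
+    "```"
+   ]
+  },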
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Part IV: Loaders and Pipes bundled with fastNLP\n",
+    "\n",
+    "fastNLP ships Loader and Pipe classes for many tasks and datasets, with automatic downloading; see the documentation [supported datasets](https://docs.qq.com/sheet/DVnpkTnF6VW9UeXdh?c=A1A0A0) for details.\n",
+    "\n",
+    "## Part V: Basic Loaders for different file formats\n",
+    "\n",
+    "Besides the task-specific Loaders above, fastNLP also provides Loaders for CSV and JSON files.\n",
+    "\n",
+    "**CSVLoader** reads CSV dataset files. For example:\n",
+    "\n",
+    "```python\n",
+    "from fastNLP.io.loader import CSVLoader\n",
+    "data_set_loader = CSVLoader(\n",
+    "    headers=('raw_words', 'target'), sep='\\\\t'\n",
+    ")\n",
+    "```\n",
+    "\n",
+    "This puts the first item of each CSV row into the 'raw_words' field and the second into the 'target' field, with items separated by '\\\\t'.\n",
+    "\n",
+    "```python\n",
+    "data_set = data_set_loader._load('path/to/your/file')\n",
+    "```\n",
+    "\n",
+    "A sample of the file contents:\n",
+    "\n",
+    "```csv\n",
+    "But it does not leave you with much . 1\n",
+    "You could hate it for the same reason . 1\n",
+    "The performances are an absolute joy . 4\n",
+    "```\n",
+    "\n",
+    "The resulting DataSet has the following fields:\n",
+    "\n",
+    "| raw_words                                | target |\n",
+    "| ---------------------------------------- | ------ |\n",
+    "| But it does not leave you with much .    | 1      |\n",
+    "| You could hate it for the same reason .  | 1      |\n",
+    "| The performances are an absolute joy .   | 4      |\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**JsonLoader** reads JSON dataset files. The data must be stored one record per line, each line being a JSON object with the relevant attributes. For example:\n",
+    "\n",
+    "```python\n",
+    "from fastNLP.io.loader import JsonLoader\n",
+    "loader = JsonLoader(\n",
+    "    fields={'sentence1': 'raw_words1', 'sentence2': 'raw_words2', 'gold_label': 'target'}\n",
+    ")\n",
+    "```\n",
+    "\n",
+    "This assigns the values of 'sentence1', 'sentence2', and 'gold_label' in each JSON object to the fields 'raw_words1', 'raw_words2', and 'target'.\n",
+    "\n",
+    "```python\n",
+    "data_set = loader._load('path/to/your/file')\n",
+    "```\n",
+    "\n",
+    "A sample of the dataset contents:\n",
+    "\n",
+    "```\n",
+    "{\"annotator_labels\": [\"neutral\"], \"captionID\": \"3416050480.jpg#4\", \"gold_label\": \"neutral\", ... }\n",
+    "{\"annotator_labels\": [\"contradiction\"], \"captionID\": \"3416050480.jpg#4\", \"gold_label\": \"contradiction\", ... }\n",
+    "{\"annotator_labels\": [\"entailment\"], \"captionID\": \"3416050480.jpg#4\", \"gold_label\": \"entailment\", ... }\n",
+    "```\n",
+    "\n",
+    "The resulting DataSet has the following fields:\n",
+    "\n",
+    "| raw_words1                                             | raw_words2                                        | target        |\n",
+    "| ------------------------------------------------------ | ------------------------------------------------- | ------------- |\n",
+    "| A person on a horse jumps over a broken down airplane. | A person is training his horse for a competition. | neutral       |\n",
+    "| A person on a horse jumps over a broken down airplane. | A person is at a diner, ordering an omelette.     | contradiction |\n",
+    "| A person on a horse jumps over a broken down airplane. | A person is outdoors, on a horse.                 | entailment    |"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python Now",
+   "language": "python",
+   "name": "now"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}