Revise parts of the tutorials

5 years ago · 3d726b887e
--- a/docs/source/tutorials/tutorial_1_data_preprocess.rst
+++ b/docs/source/tutorials/tutorial_1_data_preprocess.rst
@@ -133,7 +133,7 @@ FastNLP also provides several methods for deleting data :func:`~fastNLP.DataSet.drop`
        return words
    dataset.apply(get_words, new_field_name='words')

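 Besides processing the dataset by hand, you can also process data with the various :class:`~fastNLP.io.Loader`and:class:`~fastNLP.io.Pipe` provided by fastNLP.
 Besides processing the dataset by hand, you can also process data with the various :class:`~fastNLP.io.Loader` and :class:`~fastNLP.io.Pipe` provided by fastNLP.
 For details, see the tutorial :doc:`Processing data with Loader and Pipe </tutorials/tutorial_4_load_dataset>` .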

 -----------------------------
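@@ -142,27 +142,28 @@ Field naming conventions in fastNLP

 Common field names used by fastNLP in English tasks:

    - raw_words: the original str, e.g. "This is a demo sentence .". When there are several raw_words columns, as in matching tasks, they are named raw_words0, raw_words1. In conll format, however, the raw_words column may also take the form ["This", "is", "a", "demo", "sentence", "."].
    - words: the tokenized words, e.g. ["This", "is", "a", "demo", "sentence"]. Because a neural network cannot consume str directly, the contents of words are usually converted to int, e.g. [3, 10, 4, 2, 7, ...]. Multiple words columns are named words0, words1
    - target: the target value. A single value for classification; a sequence for sequence labeling.
    - seq_len: usually the length of the words column
    - **raw_words**: the original str, e.g. "This is a demo sentence .". When there are several raw_words columns, as in matching tasks, they are named raw_words0, raw_words1. In conll format, however, the raw_words column may also take the form ["This", "is", "a", "demo", "sentence", "."].
    - **words**: the tokenized words, e.g. ["This", "is", "a", "demo", "sentence"]. Because a neural network cannot consume str directly, the contents of words are usually converted to int, e.g. [3, 10, 4, 2, 7, ...]. Multiple words columns are named words0, words1
    - **target**: the target value. A single value for classification; a sequence for sequence labeling.
    - **seq_len**: usually the length of the words column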

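 Common field names used by fastNLP in Chinese tasks:

    - raw_chars: the original continuous sequence of Chinese characters, e.g. "这是一个示例。"
    - chars: the sequence split into individual Chinese characters, e.g. ["这", "是", "一", "个", "示", "例", "。"]. Because a neural network cannot consume Chinese characters directly, this column is usually converted to int, e.g. [3, 4, 5, 6, ...].
    - raw_words: if the original Chinese character sequence already contains word boundaries, the column is called raw_words, e.g. "上海 浦东 开发 与 法制 建设 同步".
    - words: the sequence of individual Chinese words, e.g. ["上海", "", "浦东", "开发", "与", "法制", "建设", ...] or [2, 3, 4, ...]
    - target: the target value. A single value for classification; a sequence for sequence labeling.
    - seq_len: the length of the input sequence

 .. todo::
    Move this section to the datasetiter part
    - **raw_words**: if the original Chinese character sequence already contains word boundaries, the column is called raw_words, e.g. "上海 浦东 开发 与 法制 建设 同步".
    - **words**: the sequence of individual Chinese words, e.g. ["上海", "", "浦东", "开发", "与", "法制", "建设", ...] or [2, 3, 4, ...]
    - **raw_chars**: the original continuous sequence of Chinese characters, e.g. "这是一个示例。"
    - **chars**: the sequence split into individual Chinese characters, e.g. ["这", "是", "一", "个", "示", "例", "。"]. Because a neural network cannot consume Chinese characters directly, this column is usually converted to int, e.g. [3, 4, 5, 6, ...].
    - **target**: the target value. A single value for classification; a sequence for sequence labeling
    - **seq_len**: the length of the input sequence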

 -----------------------------
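 DataSet and pad
 -----------------------------


 .. todo::
    Move this section to the datasetiter part

 In fastNLP, pad is bound to a :mod:`~fastNLP.core.field` . That is, different :mod:`~fastNLP.core.field` can be padded in different ways; for example, in English tasks the padding needed by words and by
 characters is usually different. fastNLP does this through subclasses of :class:`~fastNLP.Padder` .
 By default, all fields use :class:`~fastNLP.AutoPadder`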
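
As a rough illustration of what padding a field means, the sketch below right-pads a batch of index sequences to a common length. It is plain Python and not fastNLP's ``Padder`` implementation; the helper name ``pad_batch`` and the pad value 0 are assumptions made for this example.

```python
from typing import List


def pad_batch(sequences: List[List[int]], pad_value: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one (hypothetical helper)."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


# A "words" field for two instances of different lengths, already converted to indices.
batch = [[3, 10, 4], [2, 7, 5, 6, 9]]
padded = pad_batch(batch)
print(padded)
```

A different field (for example a character-level one) could call the same helper with another pad value or a different layout, which is the point of binding the padding to the field.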
--- a/docs/source/tutorials/tutorial_2_vocabulary.rst
+++ b/docs/source/tutorials/tutorial_2_vocabulary.rst
@@ -19,10 +19,10 @@ Vocabulary
    vocab.to_index('我')  # outputs 1; by default in Vocabulary the pad index is 0 and the unk (word not found) index is 1

    #  When building the Vocabulary for targets, pad and unk should not be needed; initialize it as follows
    vocab = Vocabulary(unknown=None, pad=None)
    vocab = Vocabulary(unknown=None, padding=None)
    vocab.add_word_lst(['positive', 'negative'])
    vocab.to_index('positive')  # 输出0
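    vocab.to_index('neutral')  # raises an error
    vocab.to_index('neutral')  # raises an error, because there is no unk to fall back on

 Besides building a vocabulary as above, Vocabulary can also use the functions below to build a vocabulary directly from a column of a :class:`~fastNLP.DataSet` and to convert that column to indices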

@@ -86,7 +86,7 @@ Vocabulary
    vocab.from_dataset(tr_data, field_name='chars', no_create_entry_dataset=[dev_data])


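 For `no_create_entry` in :class:`~fastNLP.Vocabulary` , we recommend setting it to True when adding words that come from the test and dev sets, or passing the dev and test sets
 For `no_create_entry` in :class:`~fastNLP.Vocabulary` , we recommend setting it to True when adding words that come from the test and dev sets, or passing the dev and test sets
 through the `no_create_entry_dataset` parameter. The reason is this: when the model that follows uses pretrained embeddings (including glove, word2vec, elmo and bert) and fine-tunes them,
 building the vocabulary only from the train data means that words appearing only in test and dev cannot make full use of the information in the pretrained embedding (because they
 are treated as unk), so taking test and dev into account when building the vocabulary gives better final results. Used together with the various Embeddings in fastNLP, this has the following effect:
@@ -96,7 +96,7 @@ Vocabulary
 if it is found, that representation is used; if it is not found, the word is given the representation of unk.

 Below we use some :class:`~fastNLP.embeddings.StaticEmbedding` examples to illustrate the effect of this value. If you are not familiar with
 :class:`~fastNLP.embeddings.StaticEmbedding` , you can read :doc:`tutorial_3_embedding` first and then come back to this part
 :class:`~fastNLP.embeddings.StaticEmbedding` , you can read :doc:`Converting text to vectors with the Embedding module </tutorials/tutorial_3_embedding>` first and then come back to this part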

 .. code-block:: python

@@ -108,7 +108,7 @@ Vocabulary
    vocab.add_word('train')
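    vocab.add_word('only_in_train')  # appears only in train, and is certainly absent from the pretrained vocabulary
    vocab.add_word('test', no_create_entry=True)  # this word appears only in dev or test
    vocab.add_word('only_in_test', no_create_entry=True)  # this word certainly cannot be found in the pretraining
    vocab.add_word('only_in_test', no_create_entry=True)  # this word cannot be found in the pretrained vocabulary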

    embed = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50d')
    print(embed(torch.LongTensor([vocab.to_index('train')])))
@@ -119,12 +119,12 @@ Vocabulary

 Output (only part of each vector is shown)::

    tensor([[ 0.9497,  0.3433,  0.8450, -0.8852, ...]], grad_fn=<EmbeddingBackward>)  # train
    tensor([[ 0.0540, -0.0557, -0.0514, -0.1688, ...]], grad_fn=<EmbeddingBackward>)  # only_in_train
    tensor([[ 0.1318, -0.2552, -0.0679,  0.2619, ...]], grad_fn=<EmbeddingBackward>)  # test
    tensor([[0., 0., 0., 0., 0., ...]], grad_fn=<EmbeddingBackward>)   # only_in_test
    tensor([[0., 0., 0., 0., 0., ...]], grad_fn=<EmbeddingBackward>)   # unk
    tensor([[ 0.9497,  0.3433,  0.8450, -0.8852, ...]], grad_fn=<EmbeddingBackward>)  # train, found in en-glove-6b-50d
    tensor([[ 0.0540, -0.0557, -0.0514, -0.1688, ...]], grad_fn=<EmbeddingBackward>)  # only_in_train, not in en-glove-6b-50d, randomly initialized
    tensor([[ 0.1318, -0.2552, -0.0679,  0.2619, ...]], grad_fn=<EmbeddingBackward>)  # test, found in en-glove-6b-50d
    tensor([[0., 0., 0., 0., 0., ...]], grad_fn=<EmbeddingBackward>)   # only_in_test, not found in en-glove-6b-50d, so the unk vector is used
    tensor([[0., 0., 0., 0., 0., ...]], grad_fn=<EmbeddingBackward>)   # unk, zero-initialized

 First, both train and test can be found in the pretrained vectors, so each has its own vector; only_in_train cannot be found in the pretrained vectors, so StaticEmbedding
 creates a new entry for it, giving it a vector of its own; only_in_dev cannot be found in the pretrained vectors and is pointed to the unk value (fastNLP initializes unk with a zero vector), identical to the unk in the last
 creates a new entry for it, giving it a vector of its own; only_in_test cannot be found in the pretrained vectors, so it is pointed to the unk value (fastNLP initializes unk with a zero vector), identical to the unk in the last
 row.
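
The behaviour described above can be mimicked with a small lookup table. This is a minimal sketch in plain Python, not how :class:`~fastNLP.embeddings.StaticEmbedding` is implemented; the two-dimensional vectors in ``PRETRAINED``, the helper ``build_table`` and the fixed stand-in for random initialization are all made up for illustration.

```python
# Hypothetical 2-d "pretrained" vectors; a real file would hold many more words.
PRETRAINED = {"train": [0.9, 0.3], "test": [0.1, -0.2]}
UNK = "<unk>"


def build_table(words_with_flag):
    """Map each word to a vector following the rules described in the text.

    words_with_flag: list of (word, no_create_entry) pairs.
    """
    table = {UNK: [0.0, 0.0]}  # unk starts as a zero vector
    for word, no_create_entry in words_with_flag:
        if word in PRETRAINED:
            table[word] = PRETRAINED[word]   # found: use the pretrained vector
        elif no_create_entry:
            table[word] = table[UNK]         # not found, no new entry: share unk
        else:
            table[word] = [0.05, -0.05]      # not found: stand-in for a random init
    return table


table = build_table([
    ("train", False),
    ("only_in_train", False),
    ("test", True),
    ("only_in_test", True),
])
for word, vector in table.items():
    print(word, vector)
```

Only ``only_in_test`` ends up sharing the unk vector, because it is both missing from the pretrained table and flagged with ``no_create_entry``.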
--- a/docs/source/tutorials/tutorial_3_embedding.rst
+++ b/docs/source/tutorials/tutorial_3_embedding.rst
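@@ -24,7 +24,7 @@ Part I: Introduction to embedding
 Embedding is a word-embedding technique that converts characters or words into real-valued vectors. Widely used pretrained embeddings include word2vec, fasttext, glove, character embedding,
 elmo and bert.
 Using any of them requires some work at loading time, however. Pretrained word2vec, fasttext and glove each hold representations for several hundred thousand words or more, while a typical task
 uses only a few tens of thousands of them. Loading the whole vocabulary increases memory use and slows execution, so the words used in the experiment need to be extracted from the pretrained file; for English
 uses only a few tens of thousands of them. Loading the whole vocabulary increases memory use and slows training, so the words used in the experiment need to be extracted from the pretrained file; for English
 elmo and character embedding, each word must be split into characters before use; and using Bert further involves Byte pair encoding (BPE). To make things
 easier, fastNLP unifies the use of the different embeddings through :class:`~fastNLP.Vocabulary` . The examples below illustrate this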

@@ -35,7 +35,7 @@ Part II: Using pretrained static embeddings

 In fastNLP, pretrained word2vec, glove and fasttext are all loaded with :class:`~fastNLP.embeddings.StaticEmbedding` . For convenience,
 fastNLP can also download and cache many static word vectors automatically (cached in ~/.fastNLP/embeddings by default); the pretrained vectors that support automatic download are listed
 `here <https://docs.qq.com/sheet/DVnpkTnF6VW9UeXdh?c=A1A0A0>`_ .
 in the `download document <https://docs.qq.com/sheet/DVnpkTnF6VW9UeXdh?c=A1A0A0>`_ .

 .. code-block:: python

@@ -46,10 +46,10 @@ Part II: Using pretrained static embeddings
    vocab = Vocabulary()
    vocab.add_word_lst("this is a demo .".split())

    embed = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50d', requires_grad=True)
    embed = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50d')

    words = torch.LongTensor([[vocab.to_index(word) for word in "this is a demo .".split()]])
    print(embed(words).size())
    words = torch.LongTensor([[vocab.to_index(word) for word in "this is a demo .".split()]])  # convert the text to indices
    print(embed(words).size())  # StaticEmbedding is used much like pytorch's nn.Embedding

 The output is::

@@ -92,7 +92,7 @@ Part IV: ELMo Embedding

 fastNLP provides embeddings for ELMo and BERT: :class:`~fastNLP.embeddings.ElmoEmbedding`
 and :class:`~fastNLP.embeddings.BertEmbedding` . The ElmoEmbedding models available for automatic download can be
 found `here <https://docs.qq.com/sheet/DVnpkTnF6VW9UeXdh?c=A1A0A0>`_ .
 found in the `download document <https://docs.qq.com/sheet/DVnpkTnF6VW9UeXdh?c=A1A0A0>`_ .

 As with static embeddings, ELMo is used as follows:

@@ -123,13 +123,13 @@ Part IV: ELMo Embedding

    torch.Size([1, 5, 512])

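 In addition, according to `this paper <https://arxiv.org/abs/1802.05365>`_ , combining the layers with learnable weights improves ELMo's results. In fastNLP, the following initialization
 In addition, according to `Deep contextualized word representations <https://arxiv.org/abs/1802.05365>`_ , combining the layers with learnable weights improves ELMo's results. In fastNLP, the following initialization
 fuses the outputs of the 3 layers by a weighted sum with learnable weights.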

 .. code-block:: python

    embed = ElmoEmbedding(vocab, model_dir_or_name='en-small', requires_grad=True, layers='mix')
    print(embed(words).size())
    print(embed(words).size())  # the three layer outputs are summed element-wise according to their weights

 The output is::

@@ -141,7 +141,7 @@ Part V: Bert Embedding
 -----------------------------------------------------------

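 Although Bert is not an Embedding in the strict sense, wrapping it as one greatly reduces the complexity of using it. The Bert Embedding models available for automatic download can be
 found `here <https://docs.qq.com/sheet/DVnpkTnF6VW9UeXdh?c=A1A0A0>`_ . The example below shows how to use
 found in the `download document <https://docs.qq.com/sheet/DVnpkTnF6VW9UeXdh?c=A1A0A0>`_ . The example below shows how to use
 BertEmbedding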

 .. code-block:: python
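@@ -187,7 +187,7 @@ Using BertEmbedding
    torch.Size([1, 7, 768])

 In English Bert models, an English word may be split into several subwords; for example, "fairness" is split into ``["fair", "##ness"]`` , so that one word corresponds to two outputs.
 :class:`~fastNLP.embeddings.BertEmbedding` uses a pooling method to merge the subword representations of a word into one vector. The pooling method is controlled through pool_method;
 :class:`~fastNLP.embeddings.BertEmbedding` uses a pooling method to merge the subword representations of a word into one vector. The pooling method is controlled through pool_method;
 the supported values are "first" (use the representation of fair as that of fairness), "last" (use the representation of ##ness as that of fairness), "max" (take the max of fair and
 ##ness in every dimension) and "avg" (average fair and ##ness in every dimension).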
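
The four pooling choices can be written out on plain lists. The sketch below is illustrative only and is not BertEmbedding's code; the function ``pool_subwords`` and the two-dimensional vectors standing in for the outputs of "fair" and "##ness" are assumptions for this example.

```python
def pool_subwords(vectors, method="first"):
    """Merge the vectors of one word's subwords into a single vector (hypothetical helper)."""
    if method == "first":
        return list(vectors[0])
    if method == "last":
        return list(vectors[-1])
    columns = list(zip(*vectors))  # regroup the values by dimension
    if method == "max":
        return [max(column) for column in columns]
    if method == "avg":
        return [sum(column) / len(column) for column in columns]
    raise ValueError(f"unknown pooling method: {method}")


fair = [1.0, 4.0]   # stand-in for the output of "fair"
ness = [3.0, 2.0]   # stand-in for the output of "##ness"
for method in ("first", "last", "max", "avg"):
    print(method, pool_subwords([fair, ness], method))
```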

@@ -200,7 +200,8 @@ Using BertEmbedding

    torch.Size([1, 5, 768])

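 In addition, according to `the paper <https://arxiv.org/abs/1810.04805>`_ , Bert has another usage: sentences are joined with [SEP], the token type embedding index of the first sentence is 0,
 In addition, according to `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 <https://arxiv.org/abs/1810.04805>`_ , for tasks that involve two sentences (such as matching and Q&A), the sentences are joined with [SEP], the token type embedding index of the first sentence is 0,
 and that of the second sentence is 1. BertEmbedding automatically detects the [SEP] inside the sentence and sets the corresponding token_type_id correctly.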
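
To make the 0/1 assignment concrete, here is a simplified sketch of deriving token type ids from the position of [SEP]. It handles only the two-sentence case, adds no [CLS] token, and is not the logic BertEmbedding uses internally; the function name ``token_type_ids`` is chosen for this example.

```python
def token_type_ids(tokens, sep="[SEP]"):
    """Return 0 for tokens up to and including the first [SEP], 1 afterwards."""
    ids = []
    current = 0
    for token in tokens:
        ids.append(current)
        if token == sep:
            current = 1  # everything after the first separator is sentence two
    return ids


tokens = ["this", "is", "a", "[SEP]", "demo", "."]
print(token_type_ids(tokens))
```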

 .. code-block:: python
@@ -229,7 +230,7 @@ Part VI: Using character-level embeddings
 -----------------------------------------------------

 Besides pretrained embeddings, fastNLP also provides two kinds of Character Embedding: :class:`~fastNLP.embeddings.CNNCharEmbedding` and
 :class:`~fastNLP.embeddings.LSTMCharEmbedding` . Normally, using a character embedding requires splitting words into characters during preprocessing, which
 :class:`~fastNLP.embeddings.LSTMCharEmbedding` . Normally, using a character embedding requires splitting words into characters during preprocessing, which
 makes preprocessing very tedious. In fastNLP, using a character embedding only requires passing in a :class:`~fastNLP.Vocabulary` , and that
 Vocabulary is the same one used by the other Embeddings. Let us look at two examples.

@@ -297,9 +298,9 @@ Part VII: Stacking multiple embeddings

    torch.Size([1, 5, 114])

 :class:`~fastNLP.embeddings.StaticEmbedding` , :class:`~fastNLP.embeddings.ElmoEmbedding` ,
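 :class:`~fastNLP.embeddings.CNNCharEmbedding` , :class:`~fastNLP.embeddings.BertEmbedding` and others can all be concatenated with one another.
 :class:`~fastNLP.embeddings.StackEmbedding` is used in the same way as the other Embeddings: given input indices, it returns the corresponding representations. However, the Embeddings being concatenated
 :class:`~fastNLP.embeddings.StaticEmbedding` , :class:`~fastNLP.embeddings.ElmoEmbedding` ,
 :class:`~fastNLP.embeddings.CNNCharEmbedding` , :class:`~fastNLP.embeddings.BertEmbedding` and others can all be concatenated with one another.
 :class:`~fastNLP.embeddings.StackEmbedding` is used in the same way as the other Embeddings: given input indices, it returns the corresponding representations. However, the Embeddings being concatenated
 must use the same :class:`~fastNLP.Vocabulary` , because only then does the same index refer to the same word or character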

 -----------------------------------------------------------
@@ -339,27 +340,27 @@ Part VIII: Other notes on Embedding
    vocab = Vocabulary()
    vocab.add_word_lst("this is a demo .".split())

    embed = BertEmbedding(vocab, model_dir_or_name='en-base-cased')
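    embed.requires_grad = False  # BertEmbedding is not updated
    embed = BertEmbedding(vocab, model_dir_or_name='en-base-cased', requires_grad=True)  # set to require updates at initialization
    embed.requires_grad = False  # change the BertEmbedding weights so they are not updated

 (3) Notes on word_dropout and dropout in the Embeddings

 All Embeddings in fastNLP accept word_dropout and dropout parameters. word_dropout is the probability with which an input word is replaced by the index of unk, which both
 lets unk receive training and has some regularization effect; dropout is the probability with which some dimensions of a word's representation are set to 0 after the representation has been obtained.

 If you use :class:`~fastNLP.embeddings.StackEmbedding` and need word_dropout, we recommend setting word_dropout in :class:`~fastNLP.embeddings.StackEmbedding` .
 If you use :class:`~fastNLP.embeddings.StackEmbedding` and need word_dropout, we recommend setting word_dropout on the :class:`~fastNLP.embeddings.StackEmbedding` itself.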
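
The idea behind word_dropout can be sketched on a list of indices. This is a minimal plain-Python illustration, not fastNLP's implementation; the helper ``word_dropout``, the indices 0 for pad and 1 for unk, and the choice to leave pad positions untouched are assumptions made for this example.

```python
import random


def word_dropout(indices, p, unk_index=1, pad_index=0, rng=None):
    """Replace each non-pad index with unk_index with probability p (hypothetical helper)."""
    rng = rng or random.Random()
    return [
        unk_index if idx != pad_index and rng.random() < p else idx
        for idx in indices
    ]


words = [5, 6, 7, 0]  # the trailing 0 is padding
print(word_dropout(words, p=0.0))                        # nothing is replaced
print(word_dropout(words, p=1.0))                        # every real word becomes unk
print(word_dropout(words, p=0.5, rng=random.Random(0)))  # a random subset is replaced
```

Applying the replacement once, before the indices reach the individual embeddings, keeps every stacked embedding looking at the same masked input, which is one way to read the recommendation above.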


 -----------------------------------------------------------
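 Part IX: Recommendations for using StaticEmbedding
 -----------------------------------------------------------

 For English named entity recognition (NER), `the paper <http://xxx.itp.ac.cn/pdf/1511.08308.pdf>`_ points out that using a cnn character embedding together with a word embedding
 For English named entity recognition (NER), `Named Entity Recognition with Bidirectional LSTM-CNNs <http://xxx.itp.ac.cn/pdf/1511.08308.pdf>`_ points out that using a cnn character embedding together with a word embedding
 considerably improves NER results. As you saw in the previous section, fastNLP supports combining :class:`~fastNLP.embeddings.CNNCharEmbedding`
 with :class:`~fastNLP.embeddings.StaticEmbedding` into a :class:`~fastNLP.embeddings.StackEmbedding` . When they are used this way,
 the text should not be lowercased during preprocessing (because the Character Embedding needs the case information in words), and words whose frequency falls below some threshold should not be set to unk (because the
 Character embedding needs the character-shape information); yet the vocabularies of some pretrained embeddings used by :class:`~fastNLP.embeddings.StaticEmbedding` contain only lowercase
 words, and some low-frequency words that do not appear in the pretrained vectors need to be removed. That is, (1) the character embedding needs case to be preserved, while some static embeddings do not. (2)
 words, and some low-frequency words that do not appear in the pretrained vectors need to be removed. That is, (1) the character embedding needs case to be preserved, while the pretrained word vectors do not. (2)
 the character embedding needs all word forms to be kept, while the static embedding needs a minimum frequency threshold in order to learn better representations.

 (1) How fastNLP handles the case problem
@@ -372,7 +373,7 @@ fastNLP handles this by adding to :class:`~fastNLP.embeddings.StaticEmbedding` a low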
    from fastNLP import Vocabulary

    vocab = Vocabulary().add_word_lst("The the a A".split())
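    #  demonstrated below with a randomly initialized StaticEmbedding; the effect is the same as when using pretraining
    #  demonstrated below with a randomly initialized StaticEmbedding; the effect is the same as when using pretrained word vectors
    embed = StaticEmbedding(vocab, model_dir_or_name=None, embedding_dim=5)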
    print(embed(torch.LongTensor([vocab.to_index('The')])))
    print(embed(torch.LongTensor([vocab.to_index('the')])))
--- a/docs/source/tutorials/tutorial_4_load_dataset.rst
+++ b/docs/source/tutorials/tutorial_4_load_dataset.rst
@@ -2,7 +2,7 @@
 Load/process data with Loader and Pipe
 =======================================

 This part is a tutorial about how to load datasets
 This part is a tutorial on loading datasets

 Contents:

@@ -18,26 +18,26 @@ Part I: DataBundle, the dataset container
 ------------------------------------

 Because the training, dev and test sets of the same task share one vocabulary and have the same target values, fastNLP uses :class:`~fastNLP.io.DataBundle`
 to hold the multiple :class:`~fastNLP.DataSet` of a task together with their :class:`~fastNLP.Vocabulary`. Examples of using :class:`~fastNLP.io.DataBundle`
 to hold the multiple :class:`~fastNLP.DataSet` of a task together with their :class:`~fastNLP.Vocabulary` . Examples of using :class:`~fastNLP.io.DataBundle`
 are given below.

 In fastNLP, :class:`~fastNLP.io.DataBundle` is used mainly by the various :class:`~fastNLP.io.Loader` and :class:`~fastNLP.io.Pipe` .
 Below we first introduce :class:`~fastNLP.io.Loader` and :class:`~fastNLP.io.Pipe` , and then give corresponding examples.
 In fastNLP, :class:`~fastNLP.io.DataBundle` is used mainly by the various :class:`~fastNLP.io.Loader` and :class:`~fastNLP.io.Pipe` .
 Below we first introduce :class:`~fastNLP.io.Loader` and :class:`~fastNLP.io.Pipe` .

 -------------------------------------
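 Part II: Loaders for the datasets
 -------------------------------------

 In fastNLP, the documentation of every data Loader tells you which data format it can read and the format of the :class:`~fastNLP.DataSet` it returns. For example
 \ref add a reference.
 In fastNLP, the documentation of every :class:`~fastNLP.io.Loader` tells you which data format it can read and the format of the :class:`~fastNLP.DataSet` it returns,
 for example :class:`~fastNLP.io.ChnSentiCorpLoader` .

    - download function: automatically downloads the dataset to the cache directory, ~/.fastNLP/datasets/ by default. For copyright and other reasons, not every Loader implements this method. It returns the cache path of the downloaded files. You can check the documentation of the Loader's download method to determine which data the Loader loads.
    - _load function: reads data from one data file and returns a :class:`~fastNLP.DataSet` . The format of the returned DataSet can be determined from the Loader documentation.
    - load function: reads data from a file or folder and assembles it into a :class:`~fastNLP.io.DataBundle`. The accepted parameter types are as follows
    - **download()** function: automatically downloads the dataset to the cache directory, ~/.fastNLP/datasets/ by default. For copyright and other reasons, not every Loader implements this method. It returns the cache path of the downloaded files.
    - **_load()** function: reads data from one data file and returns a :class:`~fastNLP.DataSet` . The format of the returned DataSet can be determined from the Loader documentation.
    - **load()** function: reads data from a file or folder into :class:`~fastNLP.DataSet` objects and assembles them into a :class:`~fastNLP.io.DataBundle`. The accepted parameter types are as follows

        - None: tries to read the automatically cached data; only supported by Loaders that provide automatic download
        - folder path: by default tries to match python files under that path whose names contain `train` , `test` , `dev` ; if several files contain the same keyword, the data cannot be read this way
        - dict, e.g. {'train':"/path/to/tr.conll", 'dev':"/to/validate.conll", "test":"/to/te.conll"}
        - folder path: by default tries to match files in that folder whose names contain `train` , `test` , `dev` ; if several files contain the same keyword, the data cannot be read this way
        - dict, e.g. {'train':"/path/to/tr.conll", 'dev':"/to/validate.conll", "test":"/to/te.conll"}.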
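
The folder-path case can be pictured as a keyword match over file names. The sketch below uses only the standard library and is not the code fastNLP runs; the function ``match_split_files`` and the choice of ``ValueError`` for ambiguous matches are assumptions for illustration.

```python
import os
import tempfile


def match_split_files(folder, keys=("train", "dev", "test")):
    """Map each keyword to the single file in `folder` whose name contains it."""
    paths = {}
    for name in sorted(os.listdir(folder)):
        for key in keys:
            if key in name:
                if key in paths:
                    # Two files share a keyword, so the split is ambiguous.
                    raise ValueError(f"several files contain {key!r}")
                paths[key] = os.path.join(folder, name)
    return paths


with tempfile.TemporaryDirectory() as folder:
    for name in ("train.txt", "dev.txt", "test.txt", "other.txt"):
        open(os.path.join(folder, name), "w").close()
    found = match_split_files(folder)
    found_names = {key: os.path.basename(path) for key, path in found.items()}

print(found_names)  # other.txt is ignored
```

Passing a dict instead sidesteps the matching entirely, since each split is then named explicitly.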

 .. code-block:: python

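@@ -56,9 +56,9 @@ Part II: Loaders for the datasets

 This shows that there are 3 datasets in total, where:

    - the 3 datasets are the train, dev and test datasets, with 17223, 1831 and 1944 instances respectively
    - the 3 datasets are named train, dev and test, and have 17223, 1831 and 1944 instances respectively

 You can also take out a DataSet and the concrete contents of the DataSet
 You can also take out a DataSet and print its contents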

 .. code-block:: python

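@@ -77,21 +77,22 @@ Part II: Loaders for the datasets
 ------------------------------------------
 Part III: Preprocessing data with Pipe
 ------------------------------------------
 With:class:`~fastNLP.io.Loader` text data can be read in, but it cannot be used by a neural network directly; some preprocessing is still needed.
 With :class:`~fastNLP.io.Loader` text data can be read in, but it cannot be used by a neural network directly; some preprocessing is still needed.

 In fastNLP, subclasses of :class:`~fastNLP.io.Pipe`are used for data preprocessing. Pipe and Loader generally correspond one to one, and the correspondence can be told from their names;
 In fastNLP, subclasses of :class:`~fastNLP.io.Pipe` are used for data preprocessing. :class:`~fastNLP.io.Loader` and :class:`~fastNLP.io.Pipe` generally correspond one to one, and the correspondence can be told from their names;
 for example, :class:`~fastNLP.io.CWSLoader` corresponds to :class:`~fastNLP.io.CWSPipe` . In general, Pipe processing consists of the following steps: (1) tokenize raw_words or
 raw_chars into words or characters; (2) build a :class:`~fastNLP.Vocabulary` of the words or characters and convert them to indices; (3) build a vocabulary for the target
 column and convert the target column to indices.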
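
The three steps can be walked through on a toy dataset. This is a plain-Python sketch under stated assumptions, not a fastNLP Pipe: the sample sentences, the helper ``build_vocab`` and the special tokens ``<pad>`` and ``<unk>`` are invented for this example.

```python
def build_vocab(sequences, specials=("<pad>", "<unk>")):
    """Assign an index to every distinct token, after the special tokens."""
    vocab = {token: index for index, token in enumerate(specials)}
    for sequence in sequences:
        for token in sequence:
            vocab.setdefault(token, len(vocab))
    return vocab


samples = [
    {"raw_words": "this is a demo", "target": "positive"},
    {"raw_words": "this is bad", "target": "negative"},
]

# (1) tokenize raw_words into words
for sample in samples:
    sample["words"] = sample["raw_words"].split()

# (2) build the word vocabulary and convert words to indices
word_vocab = build_vocab(sample["words"] for sample in samples)
for sample in samples:
    sample["words"] = [word_vocab[word] for word in sample["words"]]
    sample["seq_len"] = len(sample["words"])

# (3) build the target vocabulary (no pad/unk needed) and convert targets
target_vocab = build_vocab(([sample["target"]] for sample in samples), specials=())
for sample in samples:
    sample["target"] = target_vocab[sample["target"]]

for sample in samples:
    print(sample)
```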

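 The documentation of every Pipe shows the fields of the DataSet after it has passed through that Pipe; e.g. \ref{TODO add a reference to an example}
 The documentation of every Pipe shows which :class:`~fastNLP.DataSet` the Pipe can process and the fields of the returned :class:`~fastNLP.io.DataSet` ;
 e.g. :class:`~fastNLP.io.`

 The Pipe of every dataset contains the following two functions:

    - process function: processes the input :class:`~fastNLP.io.DataBundle` and returns the processed :class:`~fastNLP.io.DataBundle` . The documentation of the process function describes the DataSet format the Pipe can handle.
    - process_from_file function: takes the folder containing the dataset, reads the data with the corresponding Loader (so the parameter types this function supports are determined by the load function of its Loader), and then calls the corresponding process function to preprocess the data. This amounts to running load and process in one function.
    - process() function: processes the input :class:`~fastNLP.io.DataBundle` and returns the processed :class:`~fastNLP.io.DataBundle` . The documentation of the process function describes the DataSet format the Pipe can handle.
    - process_from_file() function: takes the folder containing the dataset, reads the data with the corresponding Loader (so the parameter types this function supports are determined by the load function of its Loader), and then calls the corresponding process function to preprocess the data. This amounts to running load and process in one function.

 Continuing the CWSLoader example above, let us show what CWSPipe does:
 Continuing the :class:`~fastNLP.io.CWSLoader` example above, let us show what :class:`~fastNLP.io.CWSPipe` does: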

 .. code-block:: python

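@@ -112,8 +113,8 @@ raw_chars into words or characters; (2) build a

 This shows that there are 3 datasets and 2 vocabularies in total, where:

    - the 3 datasets are the train, dev and test datasets, with 17223, 1831 and 1944 instances respectively
    - the 2 vocabularies are the chars vocabulary and the target vocabulary. The chars vocabulary is built from the sentence text and contains 4777 characters; the target vocabulary is built from the target labels and contains 4 labels.
    - the 3 datasets are named train, dev and test, and have 17223, 1831 and 1944 instances respectively
    - the 2 vocabularies are the chars vocabulary and the target vocabulary. The chars vocabulary is built from the sentence text and contains 4777 distinct characters; the target vocabulary is built from the target labels and contains 4 labels.

 Compared with the DataBundle read by CWSLoader earlier, two Vocabulary objects have been added. We can print the processed DataSet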

@@ -147,9 +148,8 @@ raw_chars into words or characters; (2) build a
 Part IV: Built-in Loaders and Pipes
 ------------------------------------------

 fastNLP wraps Loaders and Pipes for many tasks/datasets and provides automatic download; for details see the document

 `Datasets loadable by fastNLP <https://docs.qq.com/sheet/DVnpkTnF6VW9UeXdh?c=A1A0A0>`_
 fastNLP wraps :class:`~fastNLP.io.Loader` and :class:`~fastNLP.io.Pipe` for many tasks/datasets and provides automatic download; for details see the document
 `Datasets <https://docs.qq.com/sheet/DVnpkTnF6VW9UeXdh?c=A1A0A0>`_

 --------------------------------------------------------
 Part V: Basic Loaders for different file formats
@@ -165,12 +165,12 @@ Part V: Basic Loaders for different file formats
        data_set_loader = CSVLoader(
            headers=('raw_words', 'target'), sep='\t'
        )
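        # The first item of each line in the CSV file goes into the 'words' field, and the second into the 'target' field.
        # The first item of each line in the CSV file goes into the 'raw_words' field, and the second into the 'target' field.
        # Items are separated by '\t'

        data_set = data_set_loader._load('path/to/your/file')

    Sample dataset contents ::
    Sample file contents ::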

        But it does not leave you with much .	1
        You could hate it for the same reason .	1
--- a/fastNLP/core/dataset.py
+++ b/fastNLP/core/dataset.py
@@ -480,6 +480,7 @@ class DataSet(object):
            for field in fields:
                table.add_row(field)
            logger.info(table)
            return table

    def append(self, instance):
        """
--- a/fastNLP/io/loader/init.py
+++ b/fastNLP/io/loader/init.py
@@ -8,7 +8,7 @@ Loader is used to read data, and reads the contents into :class:`~fastNLP.DataSet` or
    It will try to download and cache the dataset automatically. Not all data can be downloaded directly, however.

 1. Pass the path of a file
    The returned `data_bundle` contains a dataset named `train` , which can be obtained via ``data_bundle.datasets['train']``
    The returned `data_bundle` contains a dataset named `train` , which can be obtained via ``data_bundle.get_dataset('train')``

 2. Pass a folder
    The files read are those in the folder whose names contain `train` , `test` , `dev` ; other files are ignored. Suppose a directory contains the following files::
@@ -19,23 +19,24 @@ Loader is used to read data, and reads the contents into :class:`~fastNLP.DataSet` or
        +-test.txt
        +-other.txt

    In the `data_bundle` returned by Loader().load('/path/to/dir') you can use ``data_bundle.datasets['train']`` , ``data_bundle.datasets['dev']`` ,
    ``data_bundle.datasets['test']`` to get the corresponding `dataset` ; the contents of `other.txt` are ignored. Suppose a directory contains the following files::
    In the `data_bundle` returned by Loader().load('/path/to/dir') you can use ``data_bundle.get_dataset('train')`` ,
    ``data_bundle.get_dataset('dev')`` ,
    ``data_bundle.get_dataset('test')`` to get the corresponding `dataset` ; the contents of `other.txt` are ignored. Suppose a directory contains the following files::

        |
        +-train.txt
        +-dev.txt

    In the `data_bundle` returned by Loader().load('/path/to/dir') you can use ``data_bundle.datasets['train']`` ,
    ``data_bundle.datasets['dev']`` to get the corresponding dataset.
    In the `data_bundle` returned by Loader().load('/path/to/dir') you can use ``data_bundle.get_dataset('train')`` ,
    ``data_bundle.get_dataset('dev')`` to get the corresponding dataset.

 3. Pass a dict
    Each key of the dict is the name of a `dataset` , and each value is the file path of that `dataset`::

        paths = {'train':'/path/to/train', 'dev': '/path/to/dev', 'test':'/path/to/test'}
    
    In the `data_bundle` returned by Loader().load(paths) you can use ``data_bundle.datasets['train']`` , ``data_bundle.datasets['dev']`` ,
    ``data_bundle.datasets['test']`` to get the corresponding `dataset`
    In the `data_bundle` returned by Loader().load(paths) you can use ``data_bundle.get_dataset('train')`` , ``data_bundle.get_dataset('dev')`` ,
    ``data_bundle.get_dataset('test')`` to get the corresponding `dataset`

 fastNLP currently provides the following Loaders

--- a/fastNLP/io/loader/classification.py
+++ b/fastNLP/io/loader/classification.py
@@ -287,7 +287,7 @@ class SST2Loader(Loader):
    Loader for the SST2 data
    After reading, the DataSet looks as follows

    .. csv-table:: Fields of the DataSet read with SSTLoader
    .. csv-table::
        :header: "raw_words", "target"

        "it 's a charming and often affecting...", "1"
@@ -345,7 +345,7 @@ class SST2Loader(Loader):
 class ChnSentiCorpLoader(Loader):
    """
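    The supported data format is: the first line is a header (its content is ignored), and each following line is one sample; the text before the first tab is treated as the label, and
    the first tab and what follows it is treated as the sentence
    what follows the first tab is treated as the sentence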

    Example::

--- a/fastNLP/io/loader/loader.py
+++ b/fastNLP/io/loader/loader.py
@@ -15,6 +15,11 @@ from ...core.dataset import DataSet
 class Loader:
    """
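    Base class of the various data Loaders; provides a reference for the API.
    Loader supports the following three functions

    - download() function: automatically downloads the dataset to the cache directory, ~/.fastNLP/datasets/ by default. For copyright and other reasons, not every Loader implements this method. It returns the cache path of the downloaded files.
    - _load() function: reads data from one data file and returns a :class:`~fastNLP.DataSet` . The contents of the returned DataSet can be determined from the documentation of each Loader.
    - load() function: reads each file into a DataSet, then puts the DataSets into one DataBundle and returns it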
    
    """
    
--- a/fastNLP/io/pipe/cws.py
+++ b/fastNLP/io/pipe/cws.py
@@ -144,7 +144,7 @@ class CWSPipe(Pipe):
       "2001年  新年  钟声...", "[8, 9, 9, 7, ...]", "[0, 1, 1, 1, 2...]", "[11, 12, ...]","[3, 9, ...]", 20
       "...", "[...]","[...]", "[...]","[...]", .

    bigrams is true only when the bigrams column is True
    bigrams exists only when the bigrams column is True

    """
    
--- a/fastNLP/io/pipe/pipe.py
+++ b/fastNLP/io/pipe/pipe.py
@@ -9,8 +9,16 @@ from .. import DataBundle

 class Pipe:
    """
    .. todo::
        doc
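    Pipe is the class fastNLP uses to process a DataBundle, although what it actually processes is the DataSets inside the DataBundle. Every Pipe states in the documentation of its process() function what format a DataSet must have for the Pipe to process it; the Pipe
    documentation describes the format of the DataSet the Pipe returns and its fields, as well as the newly added Vocabulary.

    In general, Pipe processing consists of the following steps: (1) tokenize raw_words or raw_chars into words or characters;
    (2) build a :class:`~fastNLP.Vocabulary` of the words or characters and convert them to indices; (3) build a vocabulary for the target column and convert the target column to indices.

    Pipe provides two methods

    - process() function: takes a DataBundle as input
    - process_from_file() function: takes as input any type accepted by the load function of the corresponding Loader.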

    """