
Other adjustments

Tag: v1.0.0alpha
Author: x54-729, 2 years ago
Parent commit: 2a31cf831f
29 changed files with 14671 additions and 20 deletions
  1. docs/source/tutorials/fastnlp_tutorial_0.ipynb (+1352 / -0)
  2. docs/source/tutorials/fastnlp_tutorial_1.ipynb (+1333 / -0)
  3. docs/source/tutorials/fastnlp_tutorial_2.ipynb (+884 / -0)
  4. docs/source/tutorials/fastnlp_tutorial_3.ipynb (+621 / -0)
  5. docs/source/tutorials/fastnlp_tutorial_4.ipynb (+2614 / -0)
  6. docs/source/tutorials/fastnlp_tutorial_5.ipynb (+1242 / -0)
  7. docs/source/tutorials/fastnlp_tutorial_6.ipynb (+1646 / -0)
  8. docs/source/tutorials/fastnlp_tutorial_e1.ipynb (+1280 / -0)
  9. docs/source/tutorials/fastnlp_tutorial_e2.ipynb (+1082 / -0)
  10. docs/source/tutorials/fastnlp_tutorial_paddle_e1.ipynb (+1086 / -0)
  11. docs/source/tutorials/fastnlp_tutorial_paddle_e2.ipynb (+1510 / -0)
  12. docs/source/tutorials/figures/E1-fig-glue-benchmark.png (BIN)
  13. docs/source/tutorials/figures/E2-fig-p-tuning-v2-model.png (BIN)
  14. docs/source/tutorials/figures/E2-fig-pet-model.png (BIN)
  15. docs/source/tutorials/figures/T0-fig-parameter-matching.png (BIN)
  16. docs/source/tutorials/figures/T0-fig-trainer-and-evaluator.png (BIN)
  17. docs/source/tutorials/figures/T0-fig-training-structure.png (BIN)
  18. docs/source/tutorials/figures/T1-fig-dataset-and-vocabulary.png (BIN)
  19. docs/source/tutorials/figures/paddle-ernie-1.0-masking-levels.png (BIN)
  20. docs/source/tutorials/figures/paddle-ernie-1.0-masking.png (BIN)
  21. docs/source/tutorials/figures/paddle-ernie-2.0-continual-pretrain.png (BIN)
  22. docs/source/tutorials/figures/paddle-ernie-3.0-framework.png (BIN)
  23. fastNLP/__init__.py (+1 / -1)
  24. fastNLP/core/collators/padders/oneflow_padder.py (+1 / -0)
  25. fastNLP/core/controllers/trainer.py (+7 / -7)
  26. fastNLP/core/dataset/dataset.py (+5 / -6)
  27. fastNLP/core/drivers/torch_driver/torch_fsdp.py (+6 / -4)
  28. fastNLP/embeddings/torch/static_embedding.py (+1 / -1)
  29. fastNLP/transformers/__init__.py (+0 / -1)

+1352 / -0   docs/source/tutorials/fastnlp_tutorial_0.ipynb   (file diff suppressed because it is too large)

+1333 / -0   docs/source/tutorials/fastnlp_tutorial_1.ipynb   (file diff suppressed because it is too large)

+884 / -0   docs/source/tutorials/fastnlp_tutorial_2.ipynb

@@ -0,0 +1,884 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# T2. Basic usage of databundle and tokenizer\n",
"\n",
"&emsp; 1 &ensp; Extensions of dataset in fastNLP\n",
"\n",
"&emsp; &emsp; 1.1 &ensp; The concept and usage of databundle\n",
"\n",
"&emsp; 2 &ensp; The tokenizer in fastNLP\n",
"\n",
"&emsp; &emsp; 2.1 &ensp; The concept of PreTrainedTokenizer\n",
"\n",
"&emsp; &emsp; 2.2 &ensp; Basic usage of BertTokenizer\n",
"<!-- \n",
"&emsp; &emsp; 2.3 &ensp; Extra: using GloVe word embeddings -->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Extensions of dataset in fastNLP\n",
"\n",
"### 1.1 The concept and usage of databundle\n",
"\n",
"In `fastNLP 1.0`, between the commonly used data-loading module `DataLoader` and the dataset module `DataSet`, there is\n",
"\n",
"&emsp; an intermediate module, the **data bundle module DataBundle**, which can be imported from `fastNLP.io`\n",
"\n",
"In `fastNLP 1.0`, **one databundle holds several dataset objects and vocabulary objects**,\n",
"\n",
"&emsp; stored in the two attributes `datasets` and `vocabs`; so before looking at `databundle` itself,\n",
"\n",
"we first **review dataset and vocabulary**: **can you roughly tell what the following code does?**\n",
"<!-- background: `NG20`, short for [`News Group 20`](http://qwone.com/~jason/20Newsgroups/), is a news text-classification dataset with 20 classes\n",
"\n",
"&emsp; it consists of a training split `'ng20_train.csv'` and a test split `'ng20_test.csv'`; each record\n",
"\n",
"&emsp; has a `'label'` field and a `'text'` field; `sample(frac=1)[:6]` shuffles the data and reads the first 6 records -->"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
"</pre>\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/6 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------------------------------+----------+\n",
"| text | label |\n",
"+------------------------------------------+----------+\n",
"| ['a', 'series', 'of', 'escapades', 'd... | negative |\n",
"| ['this', 'quiet', ',', 'introspective... | positive |\n",
"| ['even', 'fans', 'of', 'ismail', 'mer... | negative |\n",
"| ['the', 'importance', 'of', 'being', ... | neutral |\n",
"+------------------------------------------+----------+\n",
"+------------------------------------------+----------+\n",
"| text | label |\n",
"+------------------------------------------+----------+\n",
"| ['a', 'comedy-drama', 'of', 'nearly',... | positive |\n",
"| ['a', 'positively', 'thrilling', 'com... | neutral |\n",
"+------------------------------------------+----------+\n",
"{'<pad>': 0, '<unk>': 1, 'negative': 2, 'positive': 3, 'neutral': 4}\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"from fastNLP import DataSet\n",
"from fastNLP import Vocabulary\n",
"from fastNLP.io import DataBundle\n",
"\n",
"datasets = DataSet.from_pandas(pd.read_csv('./data/test4dataset.tsv', sep='\\t'))\n",
"datasets.rename_field('Sentence', 'text')\n",
"datasets.rename_field('Sentiment', 'label')\n",
"datasets.apply_more(lambda ins:{'label': ins['label'].lower(), \n",
" 'text': ins['text'].lower().split()},\n",
" progress_bar='tqdm')\n",
"datasets.delete_field('SentenceId')\n",
"train_ds, test_ds = datasets.split(ratio=0.7)\n",
"datasets = {'train': train_ds, 'test': test_ds}\n",
"print(datasets['train'])\n",
"print(datasets['test'])\n",
"\n",
"vocabs = {}\n",
"vocabs['label'] = Vocabulary().from_dataset(datasets['train'].concat(datasets['test'], inplace=False), field_name='label')\n",
"vocabs['text'] = Vocabulary().from_dataset(datasets['train'].concat(datasets['test'], inplace=False), field_name='text')\n",
"print(vocabs['label'].word2idx)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!-- the code above reads ten training and ten test records at random from the `NG20` dataset, lower-cases the labels and tokenizes the text\n",
" -->\n",
"The code above takes the 6 records in `test4dataset` and splits them into 4 training records (`int(6*0.7) = 4`) and 2 test records,\n",
"\n",
"&emsp; &emsp; renames the relevant fields, deletes the id field, lower-cases the labels and tokenizes the text\n",
"\n",
"&emsp; It then concatenates the training and test sets with `concat`; note `inplace=False`, which produces a temporary new dataset\n",
"\n",
"&emsp; and builds vocabularies from the concatenated dataset with `from_dataset`, in preparation for replacing words with indices\n",
"\n",
"This yields the **dataset dict datasets** (**train and test splits**) and the **vocabulary dict vocabs** (**one per field**)\n",
"\n",
"&emsp; With these we can initialise the `databundle`; `print` shows its rough structure, as follows"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In total 2 datasets:\n",
"\ttrain has 4 instances.\n",
"\ttest has 2 instances.\n",
"In total 2 vocabs:\n",
"\tlabel has 5 entries.\n",
"\ttext has 96 entries.\n",
"\n",
"['train', 'test']\n",
"['label', 'text']\n"
]
}
],
"source": [
"data_bundle = DataBundle(datasets=datasets, vocabs=vocabs)\n",
"print(data_bundle)\n",
"print(data_bundle.get_dataset_names())\n",
"print(data_bundle.get_vocab_names())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition, `data_bundle.num_dataset` and `data_bundle.num_vocab` return the number of datasets and vocabularies,\n",
"\n",
"&emsp; while `data_bundle.iter_datasets` and `data_bundle.iter_vocabs` iterate over the datasets and vocabularies"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In total 2 datasets:\n",
"\ttrain has 4 instances.\n",
"\ttest has 2 instances.\n",
"In total 2 vocabs:\n",
"\tlabel has 5 entries.\n",
"\ttext has 96 entries.\n"
]
}
],
"source": [
"print(\"In total %d datasets:\" % data_bundle.num_dataset)\n",
"for name, dataset in data_bundle.iter_datasets():\n",
" print(\"\\t%s has %d instances.\" % (name, len(dataset)))\n",
"print(\"In total %d vocabs:\" % data_bundle.num_vocab)\n",
"for name, vocab in data_bundle.iter_vocabs():\n",
" print(\"\\t%s has %d entries.\" % (name, len(vocab)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `databundle` also provides the same four `apply` functions as a `dataset`, namely\n",
"\n",
"&emsp; `apply`, `apply_field`, `apply_field_more` and `apply_more`,\n",
"\n",
"&emsp; which preprocess every dataset it contains; below is an example with `apply_more`, the other functions work alike\n",
"\n",
"In addition, `get_dataset` looks up a dataset by its name `name`,\n",
"\n",
"&emsp; and `get_vocab` looks up a vocabulary by its field name `field_name` (a short sketch of this follows the demo below)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/4 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------------------+----------+-----+\n",
"| text | label | len |\n",
"+------------------------------+----------+-----+\n",
"| ['a', 'series', 'of', 'es... | negative | 37 |\n",
"| ['this', 'quiet', ',', 'i... | positive | 11 |\n",
"| ['even', 'fans', 'of', 'i... | negative | 21 |\n",
"| ['the', 'importance', 'of... | neutral | 20 |\n",
"+------------------------------+----------+-----+\n"
]
}
],
"source": [
"data_bundle.apply_more(lambda ins:{'len': len(ins['text'])}, progress_bar='tqdm')\n",
"print(data_bundle.get_dataset('train'))"
]
},
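{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, non-authoritative sketch of the other accessor mentioned above (reusing the `data_bundle` built in this tutorial; output omitted):\n",
"\n",
"```python\n",
"label_vocab = data_bundle.get_vocab('label')   # look up a vocabulary by its field name\n",
"print(label_vocab.word2idx)                    # e.g. {'<pad>': 0, '<unk>': 1, 'negative': 2, ...}\n",
"print(data_bundle.get_dataset('test'))         # look up a dataset by its name\n",
"```"
]
},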
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. The tokenizer in fastNLP\n",
"\n",
"### 2.1 The concept of PreTrainedTokenizer\n",
"<!-- \n",
"*what word embeddings are, and why they are no longer used directly*\n",
"\n",
"*what byte-pair encoding is, and how BPE was proposed*\n",
"\n",
"*taking BERT as an example, how WordPiece was proposed*\n",
" -->\n",
"In `fastNLP 1.0`, **the PreTrainedTokenizer module is used to tokenize the text in a dataset and map it to vocabulary ids**\n",
"\n",
"&emsp; Note that downloading and importing `PreTrainedTokenizer` **requires the transformers package to be available in the environment**,\n",
"\n",
"&emsp; because the `PreTrainedTokenizer` module in `fastNLP 1.0` is implemented on top of the `Huggingface Transformers` library\n",
"\n",
"**Huggingface Transformers is an open-source library of pretrained language models** built on the transformer architecture\n",
"\n",
"&emsp; It contains many classic transformer-based pretrained models such as `BERT`, `BART`, `RoBERTa`, `GPT2` and `CPT`\n",
"\n",
"&emsp; For more details see the `Huggingface Transformers` [paper](https://arxiv.org/pdf/1910.03771.pdf), [official documentation](https://huggingface.co/transformers/) and [code repository](https://github.com/huggingface/transformers)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Basic usage of BertTokenizer\n",
"\n",
"In `fastNLP 1.0`, several subclasses are derived from the base class `PreTrainedTokenizer` to implement tokenization for `BERT` and other models\n",
"\n",
"&emsp; This section uses the `BertTokenizer` module as an example of how `PreTrainedTokenizer` is used in practice\n",
"\n",
"**Initialising a BertTokenizer takes two steps: importing the module and loading the pretrained data.** First import\n",
"\n",
"&emsp; `BertTokenizer` from `fastNLP.transformers.torch`, then **download a specific tokenizer via the from_pretrained method**\n",
"\n",
"&emsp; Here, **'bert-base-uncased' selects the pretrained BERT variant the tokenizer belongs to**: uncased words,\n",
"\n",
"&emsp; &emsp; **L=12 layers**, **hidden size H=768**, **A=12 attention heads**, **110M parameters in total**\n",
"\n",
"&emsp; The pretrained files are downloaded automatically into `~/.cache/huggingface/transformers` under the home directory"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"from fastNLP.transformers.torch import BertTokenizer\n",
"\n",
"tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The attributes `vocab_size` and `vocab_files_names` give the size of the `BertTokenizer` vocabulary and the corresponding files,\n",
"\n",
"&emsp; and the attribute `vocab` exposes the pretrained vocabulary itself (too large to print in full here; a small peek is sketched after the next cell)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"30522 {'vocab_file': 'vocab.txt'}\n"
]
}
],
"source": [
"print(tokenizer.vocab_size, tokenizer.vocab_files_names)"
]
},
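{
"cell_type": "markdown",
"metadata": {},
"source": [
"If a quick peek is wanted anyway, a minimal sketch (assuming `tokenizer.vocab` behaves like an ordinary token-to-id mapping, as it does for the Hugging Face `BertTokenizer`):\n",
"\n",
"```python\n",
"from itertools import islice\n",
"\n",
"print(dict(islice(tokenizer.vocab.items(), 5)))   # only the first few entries, e.g. {'[PAD]': 0, ...}\n",
"```"
]
},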
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The attributes `all_special_tokens` and `special_tokens_map` **list the special tokens built into BertTokenizer**,\n",
"\n",
"&emsp; namely the **unknown token '[UNK]'**, **separator '[SEP]'**, **padding token '[PAD]'**, **classification token '[CLS]'** and **mask '[MASK]'**\n",
"\n",
"The attribute `all_special_ids` **lists the vocabulary ids of these built-in special tokens**; the same information\n",
"\n",
"&emsp; is also available through attributes such as `pad_token` (value `'[PAD]'`) and `pad_token_id` (value `0`)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"pad_token [PAD] 0\n",
"unk_token [UNK] 100\n",
"cls_token [CLS] 101\n",
"sep_token [SEP] 102\n",
"msk_token [MASK] 103\n",
"all_tokens ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'] [100, 102, 0, 101, 103]\n",
"{'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}\n"
]
}
],
"source": [
"print('pad_token', tokenizer.pad_token, tokenizer.pad_token_id) \n",
"print('unk_token', tokenizer.unk_token, tokenizer.unk_token_id) \n",
"print('cls_token', tokenizer.cls_token, tokenizer.cls_token_id) \n",
"print('sep_token', tokenizer.sep_token, tokenizer.sep_token_id)\n",
"print('msk_token', tokenizer.mask_token, tokenizer.mask_token_id)\n",
"print('all_tokens', tokenizer.all_special_tokens, tokenizer.all_special_ids)\n",
"print(tokenizer.special_tokens_map)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other special tokens, such as a start token `[BOS]` and an end token `[EOS]`, can also be set; the corresponding id attributes change accordingly\n",
"\n",
"&emsp; *But how do we add tokens other than these two, and how do we give them ids other than the [UNK] id? (one possible answer is sketched after the next cell)*"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"bos_token [BOS] 100\n",
"eos_token [EOS] 100\n",
"all_tokens ['[BOS]', '[EOS]', '[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'] [100, 100, 100, 102, 0, 101, 103]\n",
"{'bos_token': '[BOS]', 'eos_token': '[EOS]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}\n"
]
}
],
"source": [
"tokenizer.bos_token = '[BOS]'\n",
"tokenizer.eos_token = '[EOS]'\n",
"# tokenizer.bos_token_id = 104\n",
"# tokenizer.eos_token_id = 105\n",
"print('bos_token', tokenizer.bos_token, tokenizer.bos_token_id)\n",
"print('eos_token', tokenizer.eos_token, tokenizer.eos_token_id)\n",
"print('all_tokens', tokenizer.all_special_tokens, tokenizer.all_special_ids)\n",
"print(tokenizer.special_tokens_map)"
]
},
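{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible answer to the question above, offered as a hedged sketch on the assumption that the tokenizer vendored in `fastNLP.transformers` mirrors the Hugging Face `add_special_tokens` / `add_tokens` API: registering the tokens appends them to the vocabulary and gives them fresh ids instead of the `[UNK]` id (any downstream model would then need its embedding matrix resized):\n",
"\n",
"```python\n",
"num_added = tokenizer.add_special_tokens({'bos_token': '[BOS]', 'eos_token': '[EOS]'})\n",
"print(num_added)                                        # 2 entries appended to the vocabulary\n",
"print(tokenizer.bos_token_id, tokenizer.eos_token_id)   # new ids, e.g. 30522 and 30523\n",
"```"
]
},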
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In `BertTokenizer`, **the tokenize and convert_tokens_to_string functions convert between raw text and token lists**,\n",
"\n",
"&emsp; while **convert_tokens_to_ids and convert_ids_to_tokens convert between tokens and token ids**\n",
"\n",
"&emsp; The effect of these four functions is shown below; note how subword tokenization differs from conventional word splitting, e.g. `'##cap'`"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262, 3351, 2008, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036, 2204, 2005, 1996, 25957, 4063, 1010, 2070, 1997, 2029, 5681, 2572, 25581, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997, 1037, 2466, 1012]\n",
"['a', 'series', 'of', 'es', '##cap', '##ades', 'demonstrating', 'the', 'ada', '##ge', 'that', 'what', 'is', 'good', 'for', 'the', 'goose', 'is', 'also', 'good', 'for', 'the', 'gan', '##der', ',', 'some', 'of', 'which', 'occasionally', 'am', '##uses', 'but', 'none', 'of', 'which', 'amounts', 'to', 'much', 'of', 'a', 'story', '.']\n",
"a series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .\n"
]
}
],
"source": [
"text = \"a series of escapades demonstrating the adage that what is \" \\\n",
" \"good for the goose is also good for the gander , some of which \" \\\n",
" \"occasionally amuses but none of which amounts to much of a story .\" \n",
"tks = ['a', 'series', 'of', 'es', '##cap', '##ades', 'demonstrating', 'the', \n",
" 'ada', '##ge', 'that', 'what', 'is', 'good', 'for', 'the', 'goose', \n",
" 'is', 'also', 'good', 'for', 'the', 'gan', '##der', ',', 'some', 'of', \n",
" 'which', 'occasionally', 'am', '##uses', 'but', 'none', 'of', 'which', \n",
" 'amounts', 'to', 'much', 'of', 'a', 'story', '.']\n",
"ids = [ 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262, 3351, \n",
" 2008, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036, 2204,\n",
" 2005, 1996, 25957, 4063, 1010, 2070, 1997, 2029, 5681, 2572,\n",
" 25581, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997, 1037,\n",
" 2466, 1012]\n",
"\n",
"tokens = tokenizer.tokenize(text)\n",
"print(tokenizer.convert_tokens_to_ids(tokens))\n",
"\n",
"ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"print(tokenizer.convert_ids_to_tokens(ids))\n",
"\n",
"print(tokenizer.convert_tokens_to_string(tokens))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`BertTokenizer` offers two more tokenization functions, **encode and decode**, which **directly convert between**\n",
"\n",
"&emsp; **a text string and a list of token ids**; during encoding, **[CLS] and [SEP] are added at the start and end of the sentence** following the `BERT` convention"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[101, 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262, 3351, 2008, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036, 2204, 2005, 1996, 25957, 4063, 1010, 2070, 1997, 2029, 5681, 2572, 25581, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997, 1037, 2466, 1012, 102]\n",
"[CLS] a series of escapades demonstrating the adage that what is good for the goose is also good for the gander, some of which occasionally amuses but none of which amounts to much of a story. [SEP]\n"
]
}
],
"source": [
"enc = tokenizer.encode(text)\n",
"print(tokenizer.encode(text))\n",
"dec = tokenizer.decode(enc)\n",
"print(tokenizer.decode(enc))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On top of `encode` there is `encode_plus`, the `BertTokenizer` function used most often in data preprocessing\n",
"\n",
"&emsp; **encode_plus accepts every parameter that encode does**; **encode returns a list of token ids**, while **encode_plus returns a dict**\n",
"\n",
"In the return value of `encode_plus`, the field `input_ids` holds the token ids; the other two fields are explained in detail later:\n",
"\n",
"&emsp; **token_type_ids is illustrated in the text_pairs example**, **attention_mask in the batch_text example**\n",
"\n",
"Among the parameters of `encode_plus`, `add_special_tokens` controls whether the `BERT` special tokens are inserted,\n",
"\n",
"&emsp; `max_length` is the maximum sentence length (special tokens included), applied automatically when `truncation=True`,\n",
"\n",
"&emsp; and `return_attention_mask` controls whether the returned dict includes the `attention_mask` field; an example follows"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'input_ids': [101, 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262, 3351, 2008, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036, 2204, 2005, 1996, 25957, 4063, 1010, 2070, 1997, 2029, 5681, 2572, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}\n"
]
}
],
"source": [
"text = \"a series of escapades demonstrating the adage that what is good for the goose is also good for \"\\\n",
" \"the gander , some of which occasionally amuses but none of which amounts to much of a story .\" \n",
"\n",
"encoded = tokenizer.encode_plus(text=text, add_special_tokens=True, max_length=32, \n",
" truncation=True, return_attention_mask=True)\n",
"print(encoded)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On top of `encode_plus` there is `batch_encode_plus` (and similarly, on top of `decode`, there is `batch_decode`)\n",
"\n",
"&emsp; Their parameters are alike; **batch_encode_plus handles either a batch of texts batch_text or a batch of sentence pairs text_pairs**\n",
"\n",
"In the `batch_text` example below, look at the `attention_mask` field of the dict returned by `batch_encode_plus`:\n",
"\n",
"&emsp; **attention_mask marks with 0/1 which positions of the token sequence are padding**, so it can serve as a self-attention mask"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'input_ids': [[101, 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262, 3351, 2008, 102, 0, 0], [101, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036, 2204, 2005, 1996, 25957, 4063, 102], [101, 2070, 1997, 2029, 5681, 2572, 25581, 102, 0, 0, 0, 0, 0, 0, 0], [101, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997, 1037, 2466, 102, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]}\n"
]
}
],
"source": [
"batch_text = [\"a series of escapades demonstrating the adage that\",\n",
" \"what is good for the goose is also good for the gander\",\n",
" \"some of which occasionally amuses\",\n",
" \"but none of which amounts to much of a story\" ]\n",
"\n",
"encoded = tokenizer.batch_encode_plus(batch_text_or_text_pairs=batch_text, padding=True,\n",
" add_special_tokens=True, max_length=16, truncation=True, \n",
" return_attention_mask=True)\n",
"print(encoded)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the `text_pairs` example, look instead at the `token_type_ids` field of the dict returned by `batch_encode_plus`:\n",
"\n",
"&emsp; **token_type_ids marks with 0/1 which sentence of the pair each position belongs to**; the two sentences are separated by [SEP]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'input_ids': [[101, 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262, 3351, 2008, 102, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036, 2204, 2005, 1996, 25957, 4063, 102], [101, 2070, 1997, 2029, 5681, 2572, 25581, 102, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997, 1037, 2466, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}\n"
]
}
],
"source": [
"text_pairs = [(\"a series of escapades demonstrating the adage that\",\n",
" \"what is good for the goose is also good for the gander\"),\n",
" (\"some of which occasionally amuses\",\n",
" \"but none of which amounts to much of a story\")]\n",
"\n",
"encoded = tokenizer.batch_encode_plus(batch_text_or_text_pairs=text_pairs, padding=True,\n",
" add_special_tokens=True, max_length=32, truncation=True, \n",
" return_attention_mask=True)\n",
"print(encoded)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Back to `encode_plus`: in the next example, **functools.partial from the Python standard library is used to build an encode function**,\n",
"\n",
"&emsp; which is then **applied to the databundle for preprocessing**; since `tokenizer.encode_plus` returns a dict\n",
"\n",
"&emsp; while reading a single field, `apply_field_more` is used here, and its results are merged into the `databundle` automatically, as shown below"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"functools.partial(<bound method PreTrainedTokenizerBase.encode_plus of PreTrainedTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=False, padding_side='right', special_tokens={'bos_token': '[BOS]', 'eos_token': '[EOS]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})>, max_length=32, truncation=True, return_attention_mask=True)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/4 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------+----------+-----+------------------+--------------------+--------------------+\n",
"| text | label | len | input_ids | token_type_ids | attention_mask |\n",
"+------------------+----------+-----+------------------+--------------------+--------------------+\n",
"| ['a', 'series... | negative | 37 | [101, 1037, 2... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... |\n",
"| ['this', 'qui... | positive | 11 | [101, 2023, 4... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... |\n",
"| ['even', 'fan... | negative | 21 | [101, 2130, 4... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... |\n",
"| ['the', 'impo... | neutral | 20 | [101, 1996, 5... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... |\n",
"+------------------+----------+-----+------------------+--------------------+--------------------+\n"
]
}
],
"source": [
"from functools import partial\n",
"\n",
"encode = partial(tokenizer.encode_plus, max_length=32, truncation=True,\n",
" return_attention_mask=True)\n",
"print(encode)\n",
"\n",
"data_bundle.apply_field_more(encode, field_name='text', progress_bar='tqdm')\n",
"print(data_bundle.datasets['train'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"After the `tokenizer` step, the raw text in the datasets has been replaced by lists of token ids; at this point the\n",
"\n",
"&emsp; **set_pad function** of the `databundle` is called **to make the databundle padding id pad_val equal to the tokenizer padding id pad_token_id**\n",
"\n",
"&emsp; The call also registers the `'input_ids'` field with the `collator` of each dataset (see `tutorial 3.`)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{}\n",
"{}\n",
"{'input_ids': {'pad_val': 0, 'dtype': None, 'backend': 'auto', 'pad_fn': None}}\n",
"{'input_ids': {'pad_val': 0, 'dtype': None, 'backend': 'auto', 'pad_fn': None}}\n"
]
}
],
"source": [
"print(data_bundle.get_dataset('train').collator.input_fields)\n",
"print(data_bundle.get_dataset('test').collator.input_fields)\n",
"data_bundle.set_pad('input_ids', pad_val=tokenizer.pad_token_id)\n",
"print(data_bundle.get_dataset('train').collator.input_fields)\n",
"print(data_bundle.get_dataset('test').collator.input_fields)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, `from_dataset`, `index_dataset` and `iter_datasets` are used to encode the `'label'` field of the datasets,\n",
"\n",
"&emsp; and **set_ignore marks some databundle fields**, such as `'text'`, **so that they no longer appear when batches are built**"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----------------+----------+-----+----------------+--------------------+--------------------+--------+\n",
"| text | label | len | input_ids | token_type_ids | attention_mask | target |\n",
"+----------------+----------+-----+----------------+--------------------+--------------------+--------+\n",
"| ['a', 'seri... | negative | 37 | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 0 |\n",
"| ['this', 'q... | positive | 11 | [101, 2023,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 1 |\n",
"| ['even', 'f... | negative | 21 | [101, 2130,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 0 |\n",
"| ['the', 'im... | neutral | 20 | [101, 1996,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 2 |\n",
"+----------------+----------+-----+----------------+--------------------+--------------------+--------+\n"
]
}
],
"source": [
"target_vocab = Vocabulary(padding=None, unknown=None)\n",
"\n",
"target_vocab.from_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='label')\n",
"target_vocab.index_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='label',\n",
" new_field_name='target')\n",
"\n",
"data_bundle.set_ignore('text', 'len', 'label') \n",
"print(data_bundle.datasets['train'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This completes the preprocessing pipeline: reading input text with `dataset`, `vocabulary`, `databundle` and `tokenizer`,\n",
"\n",
"&emsp; tokenizing it and serialising it to ids; the recap below should make the whole flow clearer\n",
"\n",
"```python\n",
"# first, load the pretrained BertTokenizer, here the 'bert-base-uncased' variant\n",
"tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n",
"\n",
"# next, read the data as a dataset, turn it into a dataset dict, and wrap it in a databundle\n",
"datasets = DataSet.from_pandas(pd.read_csv('./data/test4dataset.tsv', sep='\\t'))\n",
"train_ds, test_ds = datasets.split(ratio=0.7)\n",
"data_bundle = DataBundle(datasets={'train': train_ds, 'test': test_ds})\n",
"\n",
"# then, tokenize the text with tokenizer.encode_plus and merge the result into the databundle\n",
"encode = partial(tokenizer.encode_plus, max_length=100, truncation=True,\n",
"                 return_attention_mask=True)\n",
"data_bundle.apply_field_more(encode, field_name='Sentence', progress_bar='tqdm')\n",
"\n",
"# once the 'Sentence' field is tokenized, encode the 'Sentiment' label field\n",
"target_vocab = Vocabulary(padding=None, unknown=None)\n",
"target_vocab.from_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='Sentiment')\n",
"target_vocab.index_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='Sentiment',\n",
"                           new_field_name='target')\n",
"\n",
"# finally, tidy up with the remaining data_bundle helpers\n",
"data_bundle.set_pad('input_ids', pad_val=tokenizer.pad_token_id)\n",
"data_bundle.set_ignore('SentenceId', 'Sentiment', 'Sentence')\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"<!-- ### 2.3 Extra: using GloVe word embeddings\n",
"\n",
"how to use traditional GloVe word embeddings\n",
"\n",
"from utils import get_from_cache\n",
"\n",
"filepath = get_from_cache(\"http://download.fastnlp.top/embedding/glove.6B.50d.zip\") -->\n",
"\n",
"The upcoming `tutorial 3.` introduces the `dataloader` module of `fastNLP v1.0`; it builds on the\n",
"\n",
"&emsp; `collator` module mentioned in this chapter, covers the multi-framework support of `fastNLP`, and walks through the complete data-loading pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}

+621 / -0   docs/source/tutorials/fastnlp_tutorial_3.ipynb

@@ -0,0 +1,621 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "213d538c",
"metadata": {},
"source": [
"# T3. The internal structure and basic usage of dataloader\n",
"\n",
"&emsp; 1 &ensp; The dataloader in fastNLP\n",
"\n",
"&emsp; &emsp; 1.1 &ensp; An overview of dataloader\n",
"\n",
"&emsp; &emsp; 1.2 &ensp; Creating a dataloader with the prepare functions\n",
"\n",
"&emsp; 2 &ensp; Extensions of dataloader in fastNLP\n",
"\n",
"&emsp; &emsp; 2.1 &ensp; The concept and usage of collator\n",
"\n",
"&emsp; &emsp; 2.2 &ensp; Combining fastNLP with the datasets library"
]
},
{
"cell_type": "markdown",
"id": "85857115",
"metadata": {},
"source": [
"## 1. The dataloader in fastNLP\n",
"\n",
"### 1.1 An overview of dataloader\n",
"\n",
"A key goal in developing `fastNLP 1.0` was **compatibility with the mainstream machine-learning frameworks**:\n",
"\n",
"&emsp; **the widely used pytorch** as well as **the domestically developed paddle, jittor and oneflow**, broadening the audience while supporting home-grown frameworks\n",
"\n",
"Following a divide-and-conquer approach, the compatibility of `fastNLP 1.0` with `pytorch`, `paddle`, `jittor` and `oneflow` can be split into\n",
"\n",
"&emsp; &emsp; **four parts: data preprocessing**, **batch splitting and padding**, **model training** and **model evaluation**\n",
"\n",
"&emsp; For data preprocessing, `tutorial-1` already introduced `dataset` and `vocabulary`,\n",
"\n",
"&emsp; &emsp; and together with `tutorial-0` we can see that **the preprocessing stage is essentially framework-agnostic**,\n",
"\n",
"&emsp; &emsp; because the raw data formats differ little across frameworks and are easy to convert\n",
"\n",
"Only tensors and models expose framework-specific traits: **what pytorch and oneflow call tensor and nn.Module**\n",
"\n",
"&emsp; &emsp; **is called tensor and nn.Layer in paddle**, **and Var and Module in jittor**\n",
"\n",
"&emsp; &emsp; Hence **model training and evaluation are the hard part of compatibility**, covered in detail in `tutorial-5`\n",
"\n",
"&emsp; Handling batches, the transition from the framework-agnostic part of `fastNLP 1.0` towards the framework-specific part,\n",
"\n",
"&emsp; &emsp; is the job of the `dataloader` module, and it is the focus of this tutorial, `tutorial-3`\n",
"\n",
"**The responsibilities of the dataloader module** can be divided into three parts: **sampling and splitting, padding and alignment, framework matching**\n",
"\n",
"&emsp; &emsp; First, fix the `batch` size and the sampling strategy; iterating then yields the sequence of batches\n",
"\n",
"&emsp; &emsp; Second, align the sequences within each `batch`, which is the case `fastNLP` mainly targets\n",
"\n",
"&emsp; &emsp; Third, **the data format inside a batch must match the framework** while **the batch structure stays the same**, enabling the parameter-matching mechanism\n",
"\n",
"&emsp; To this end `fastNLP 1.0` provides **TorchDataLoader, PaddleDataLoader, JittorDataLoader and OneflowDataLoader**,\n",
"\n",
"&emsp; &emsp; one per framework, with largely matching parameter names, attributes and methods; the first two are roughly as follows (a minimal construction sketch follows the table)\n",
"\n",
"Name|Init param|Attribute|Purpose|Notes\n",
"----|----|----|----|----|\n",
" `dataset` | √ | √ | the data the `dataloader` iterates over | |\n",
" `batch_size` | √ | √ | the `batch` size of the `dataloader` | default `16` |\n",
" `shuffle` | √ | √ | whether the data are shuffled | default `False` |\n",
" `collate_fn` | √ | √ | how a `batch` is packed together | framework dependent |\n",
" `sampler` | √ | √ | implementation behind `__len__` and `__iter__` | default `None` |\n",
" `batch_sampler` | √ | √ | implementation behind `__len__` and `__iter__` | default `None` |\n",
" `drop_last` | √ | √ | whether an incomplete last `batch` is dropped | default `False` |\n",
" `cur_batch_indices` | | √ | indices of the `batch` currently being iterated | |\n",
" `num_workers` | √ | √ | number of worker subprocesses | default `0` |\n",
" `worker_init_fn` | √ | √ | initialisation function of the worker subprocesses | default `None` |\n",
" `generator` | √ | √ | random-seed generator for the subprocesses | default `None` |\n",
" `prefetch_factor` | | √ | number of batches prefetched by each `worker` | default `2` |"
]
},
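{
"cell_type": "markdown",
"id": "3f1c2b7a",
"metadata": {},
"source": [
"A minimal construction sketch for the table above, hedged: it assumes `TorchDataLoader` is exported at the top level of `fastNLP` (its full class path is printed later in this tutorial) and that `ds` is any `fastNLP` `DataSet`:\n",
"\n",
"```python\n",
"from fastNLP import TorchDataLoader\n",
"\n",
"dl = TorchDataLoader(dataset=ds, batch_size=16, shuffle=True, drop_last=False)\n",
"for batch in dl:\n",
"    print(dl.cur_batch_indices)   # dataset indices that make up the current batch\n",
"    break\n",
"```"
]
},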
{
"cell_type": "markdown",
"id": "60a8a224",
"metadata": {},
"source": [
"&emsp; As for the methods of `dataloader`: `get_batch_indices` returns the indices of the `batch` currently being iterated, while the others,\n",
"\n",
"&emsp; &emsp; including `set_ignore` and `set_pad`, behave like their `databundle` counterparts; see `tutorial-2` for details\n",
"\n",
"&emsp; &emsp; The next cell repeats the preprocessing pipeline introduced in `tutorial-2`; after that, the data are handed to a `dataloader`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "aca72b49",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[38;5;2m[i 0604 15:44:29.773860 92 log.cc:351] Load log_sync: 1\u001b[m\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
"</pre>\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/4 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+----------------+-----------+----------------+--------------------+--------------------+--------+\n",
"| SentenceId | Sentence | Sentiment | input_ids | token_type_ids | attention_mask | target |\n",
"+------------+----------------+-----------+----------------+--------------------+--------------------+--------+\n",
"| 1 | A series of... | negative | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 1 |\n",
"| 4 | A positivel... | neutral | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 2 |\n",
"| 3 | Even fans o... | negative | [101, 2130,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 1 |\n",
"| 5 | A comedy-dr... | positive | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 0 |\n",
"+------------+----------------+-----------+----------------+--------------------+--------------------+--------+\n"
]
}
],
"source": [
"import sys\n",
"sys.path.append('..')\n",
"\n",
"import pandas as pd\n",
"from functools import partial\n",
"from fastNLP.transformers.torch import BertTokenizer\n",
"\n",
"from fastNLP import DataSet\n",
"from fastNLP import Vocabulary\n",
"from fastNLP.io import DataBundle\n",
"\n",
"\n",
"class PipeDemo:\n",
" def __init__(self, tokenizer='bert-base-uncased'):\n",
" self.tokenizer = BertTokenizer.from_pretrained(tokenizer)\n",
"\n",
" def process_from_file(self, path='./data/test4dataset.tsv'):\n",
" datasets = DataSet.from_pandas(pd.read_csv(path, sep='\\t'))\n",
" train_ds, test_ds = datasets.split(ratio=0.7)\n",
" train_ds, dev_ds = datasets.split(ratio=0.8)\n",
" data_bundle = DataBundle(datasets={'train': train_ds, 'dev': dev_ds, 'test': test_ds})\n",
"\n",
" encode = partial(self.tokenizer.encode_plus, max_length=100, truncation=True,\n",
" return_attention_mask=True)\n",
" data_bundle.apply_field_more(encode, field_name='Sentence', progress_bar='tqdm')\n",
" \n",
" target_vocab = Vocabulary(padding=None, unknown=None)\n",
"\n",
" target_vocab.from_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='Sentiment')\n",
" target_vocab.index_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='Sentiment',\n",
" new_field_name='target')\n",
"\n",
" data_bundle.set_pad('input_ids', pad_val=self.tokenizer.pad_token_id)\n",
" data_bundle.set_ignore('SentenceId', 'Sentence', 'Sentiment') \n",
" return data_bundle\n",
"\n",
" \n",
"pipe = PipeDemo(tokenizer='bert-base-uncased')\n",
"\n",
"data_bundle = pipe.process_from_file('./data/test4dataset.tsv')\n",
"\n",
"print(data_bundle.get_dataset('train'))"
]
},
{
"cell_type": "markdown",
"id": "76e6b8ab",
"metadata": {},
"source": [
"### 1.2 Creating a dataloader with the prepare functions\n",
"\n",
"In `fastNLP 1.0`, **the more convenient, and probably more common, way to create a dataloader is through the prepare_xx_dataloader functions**\n",
"\n",
"&emsp; For example, `prepare_torch_dataloader` below reads a dataset and builds the corresponding `dataloader` from a few required parameters\n",
"\n",
"&emsp; The result is a `TorchDataLoader`, which only works with the `pytorch` framework, i.e. with `driver='torch'` when the `trainer` is initialised\n",
"\n",
"We can also see that in `fastNLP 1.0` **a batch is a dict** whose **keys are the original dataset fields**,\n",
"\n",
"&emsp; **minus those hidden via DataBundle.set_ignore**, and whose `value`s are `torch.Tensor` objects of the `pytorch` framework"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5fd60e42",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'fastNLP.core.dataloaders.torch_dataloader.fdl.TorchDataLoader'>\n",
"<class 'dict'> <class 'torch.Tensor'> ['input_ids', 'token_type_ids', 'attention_mask', 'target']\n",
"{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
" [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),\n",
" 'input_ids': tensor([[ 101, 1037, 4038, 1011, 3689, 1997, 3053, 8680, 19173, 15685,\n",
" 1999, 1037, 18006, 2836, 2011, 1996, 2516, 2839, 14996, 3054,\n",
" 15509, 5325, 1012, 102, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0],\n",
" [ 101, 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262,\n",
" 3351, 2008, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036,\n",
" 2204, 2005, 1996, 25957, 4063, 1010, 2070, 1997, 2029, 5681,\n",
" 2572, 25581, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997,\n",
" 1037, 2466, 1012, 102],\n",
" [ 101, 2130, 4599, 1997, 19214, 6432, 1005, 1055, 2147, 1010,\n",
" 1045, 8343, 1010, 2052, 2031, 1037, 2524, 2051, 3564, 2083,\n",
" 2023, 2028, 1012, 102, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0],\n",
" [ 101, 1037, 13567, 26162, 5257, 1997, 3802, 7295, 9888, 1998,\n",
" 2035, 1996, 20014, 27611, 1010, 14583, 1010, 11703, 20175, 1998,\n",
" 4028, 1997, 1037, 8101, 2319, 10576, 2030, 1037, 28900, 7815,\n",
" 3850, 1012, 102, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0]]),\n",
" 'target': tensor([0, 1, 1, 2]),\n",
" 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}\n"
]
}
],
"source": [
"from fastNLP import prepare_torch_dataloader\n",
"\n",
"train_dataset = data_bundle.get_dataset('train')\n",
"evaluate_dataset = data_bundle.get_dataset('dev')\n",
"\n",
"train_dataloader = prepare_torch_dataloader(train_dataset, batch_size=16, shuffle=True)\n",
"evaluate_dataloader = prepare_torch_dataloader(evaluate_dataset, batch_size=16)\n",
"\n",
"print(type(train_dataloader))\n",
"\n",
"import pprint\n",
"\n",
"for batch in train_dataloader:\n",
" print(type(batch), type(batch['input_ids']), list(batch))\n",
" pprint.pprint(batch, width=1)"
]
},
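{
"cell_type": "markdown",
"id": "7b9e5c21",
"metadata": {},
"source": [
"The dataloader methods mentioned at the start of this tutorial can be called on the object just created; a small, hedged sketch, assuming they behave like their `DataBundle` counterparts from `tutorial-2`:\n",
"\n",
"```python\n",
"train_dataloader.set_pad('input_ids', pad_val=0)     # per-field pad value / dtype, as in tutorial-2\n",
"train_dataloader.set_ignore('token_type_ids')        # drop a field from the produced batches\n",
"for batch in train_dataloader:\n",
"    print(train_dataloader.get_batch_indices())      # dataset indices behind the current batch\n",
"    break\n",
"```"
]
},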
{
"cell_type": "markdown",
"id": "9f457a6e",
"metadata": {},
"source": [
"The `prepare_xx_dataloader` functions are convenient because **the object passed in may be either a DataSet or**\n",
"\n",
"&emsp; **a DataBundle**; in the latter case the dataset names must be `'train'`, `'dev'`, `'test'` for `fastNLP` to recognise them\n",
"\n",
"For example, the cell below **builds a dict of PaddleDataLoader objects directly with prepare_paddle_dataloader**\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7827557d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'fastNLP.core.dataloaders.paddle_dataloader.fdl.PaddleDataLoader'>\n"
]
}
],
"source": [
"from fastNLP import prepare_paddle_dataloader\n",
"\n",
"dl_bundle = prepare_paddle_dataloader(data_bundle, batch_size=16, shuffle=True)\n",
"\n",
"print(type(dl_bundle['train']))"
]
},
{
"cell_type": "markdown",
"id": "d898cf40",
"metadata": {},
"source": [
"&emsp; The resulting dict is then used as follows when the `trainer` is initialised; apart from `driver='paddle'` everything is as usual\n",
"\n",
"&emsp; It also shows that, in the `trainer` module, **evaluate_dataloaders is designed so that evaluation can target several datasets**\n",
"\n",
"```python\n",
"trainer = Trainer(\n",
"    model=model,\n",
"    train_dataloader=dl_bundle['train'],\n",
"    optimizers=optimizer,\n",
"    ...\n",
"    driver='paddle',\n",
"    device='gpu',\n",
"    ...\n",
"    evaluate_dataloaders={'dev': dl_bundle['dev'], 'test': dl_bundle['test']}, \n",
"    metrics={'acc': Accuracy()},\n",
"    ...\n",
")\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "d74d0523",
"metadata": {},
"source": [
"## 2. Extensions of dataloader in fastNLP\n",
"\n",
"### 2.1 The concept and usage of collator\n",
"\n",
"Inside the data-loading module `dataloader` of `fastNLP 1.0` there are further components, as the earlier table suggested\n",
"\n",
"&emsp; For instance, **the collator module, which pads and aligns sequences**; note: `collate vt. to collect and arrange; to check against an original`\n",
"\n",
"Although the `dataloader` differs per framework, the `collator` module in `fastNLP 1.0` is shared; its main attributes and methods are listed below\n",
"\n",
"Name|Attribute|Method|Purpose|Notes\n",
" ----|----|----|----|----|\n",
" `backend` | √ | | framework the `collator` targets | string, e.g. `'torch'` |\n",
" `padders` | √ | | the `padder` of each field, responsible for the actual padding&emsp; | dict |\n",
" `ignore_fields` | √ | | fields ignored when the `dataloader` samples a `batch` | set |\n",
" `input_fields` | √ | | pad value, dtype, etc. of each field handled by the `collator` | dict |\n",
" `set_backend` | | √ | set the framework the `collator` targets | string, e.g. `'torch'` |\n",
" `set_ignore` | | √ | set the fields ignored when the `dataloader` samples a `batch` | strings, i.e. `field_name`&emsp; |\n",
" `set_pad` | | √ | set the pad value, dtype, etc. of each field | |"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "d0795b3e",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'function'>\n"
]
}
],
"source": [
"train_dataloader.collate_fn\n",
"\n",
"print(type(train_dataloader.collate_fn))"
]
},
{
"cell_type": "markdown",
"id": "5f816ef5",
"metadata": {},
"source": [
"It is also possible to **define the collate_fn of a dataloader by hand** instead of using the built-in `collator` module of `fastNLP 1.0`\n",
"\n",
"&emsp; Such a function can look roughly like the cell below; note that **writing a collate_fn requires knowing the dict format of a batch**\n",
"\n",
"&emsp; The function is passed to the `dataloader` via the `collate_fn` parameter and is **called when a batch is assembled** (**not when batches are split**)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ff8e405e",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"def collate_fn(batch):\n",
" input_ids, atten_mask, labels = [], [], []\n",
" max_length = [0] * 3\n",
" for each_item in batch:\n",
" input_ids.append(each_item['input_ids'])\n",
" max_length[0] = max(len(each_item['input_ids']), max_length[0])\n",
" atten_mask.append(each_item['token_type_ids'])\n",
" max_length[1] = max(len(each_item['token_type_ids']), max_length[1])\n",
" labels.append(each_item['attention_mask'])\n",
" max_length[2] = max(len(each_item['attention_mask']), max_length[2])\n",
"\n",
" for i in range(3):\n",
" each = (input_ids, atten_mask, labels)[i]\n",
" for item in each:\n",
" item.extend([0] * (max_length[i] - len(item)))\n",
" return {'input_ids': torch.cat([torch.tensor([item]) for item in input_ids], dim=0),\n",
" 'token_type_ids': torch.cat([torch.tensor([item]) for item in atten_mask], dim=0),\n",
" 'attention_mask': torch.cat([torch.tensor(item) for item in labels], dim=0)}"
]
},
{
"cell_type": "markdown",
"id": "487b75fb",
"metadata": {},
"source": [
"Note: when a custom `collate_fn` is used, the `collate_fn` attribute of the `dataloader` is likewise of type `function`"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e916d1ac",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'fastNLP.core.dataloaders.torch_dataloader.fdl.TorchDataLoader'>\n",
"<class 'function'>\n",
"{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0]),\n",
" 'input_ids': tensor([[ 101, 1037, 4038, 1011, 3689, 1997, 3053, 8680, 19173, 15685,\n",
" 1999, 1037, 18006, 2836, 2011, 1996, 2516, 2839, 14996, 3054,\n",
" 15509, 5325, 1012, 102, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0],\n",
" [ 101, 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262,\n",
" 3351, 2008, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036,\n",
" 2204, 2005, 1996, 25957, 4063, 1010, 2070, 1997, 2029, 5681,\n",
" 2572, 25581, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997,\n",
" 1037, 2466, 1012, 102],\n",
" [ 101, 2130, 4599, 1997, 19214, 6432, 1005, 1055, 2147, 1010,\n",
" 1045, 8343, 1010, 2052, 2031, 1037, 2524, 2051, 3564, 2083,\n",
" 2023, 2028, 1012, 102, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0],\n",
" [ 101, 1037, 13567, 26162, 5257, 1997, 3802, 7295, 9888, 1998,\n",
" 2035, 1996, 20014, 27611, 1010, 14583, 1010, 11703, 20175, 1998,\n",
" 4028, 1997, 1037, 8101, 2319, 10576, 2030, 1037, 28900, 7815,\n",
" 3850, 1012, 102, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0]]),\n",
" 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}\n"
]
}
],
"source": [
"train_dataloader = prepare_torch_dataloader(train_dataset, collate_fn=collate_fn, shuffle=True)\n",
"evaluate_dataloader = prepare_torch_dataloader(evaluate_dataset, collate_fn=collate_fn, shuffle=True)\n",
"\n",
"print(type(train_dataloader))\n",
"print(type(train_dataloader.collate_fn))\n",
"\n",
"for batch in train_dataloader:\n",
" pprint.pprint(batch, width=1)"
]
},
{
"cell_type": "markdown",
"id": "0bd98365",
"metadata": {},
"source": [
"### 2.2 Combining fastNLP with the datasets library\n",
"\n",
"From `tutorial-1` to `tutorial-3` we have walked through reading, preprocessing and loading data in `fastNLP v1.0`\n",
"\n",
"&emsp; In practice, data are often read in an even simpler way, for instance with the `datasets` module from `huggingface`\n",
"\n",
"**Using load_dataset from the datasets module** and the two-level dataset name, here **SST-2 from the GLUE benchmark**,\n",
"\n",
"&emsp; the dataset is downloaded and read in directly, then converted into a `fastNLP.DataSet` with `pandas.DataFrame` as the intermediate step\n",
"\n",
"&emsp; From there the pipeline is the same as described for `dataset`, `databundle`, `vocabulary` and `dataloader` (a hedged continuation sketch closes this notebook)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "91879c30",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Reusing dataset glue (/remote-home/xrliu/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "639a0ad3c63944c6abef4e8ee1f7bf7c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from datasets import load_dataset\n",
"\n",
"sst2data = load_dataset('glue', 'sst2')\n",
"\n",
"dataset = DataSet.from_pandas(sst2data['train'].to_pandas())"
]
}
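,
{
"cell_type": "markdown",
"id": "5d2a8f60",
"metadata": {},
"source": [
"A hedged continuation sketch of the last step (it assumes the SST-2 columns are named `sentence` / `label` and that a `BertTokenizer` instance named `tokenizer` is available, as in `tutorial-2`):\n",
"\n",
"```python\n",
"from functools import partial\n",
"\n",
"encode = partial(tokenizer.encode_plus, max_length=100, truncation=True,\n",
"                 return_attention_mask=True)\n",
"dataset.apply_field_more(encode, field_name='sentence', progress_bar='tqdm')\n",
"dataloader = prepare_torch_dataloader(dataset, batch_size=16, shuffle=True)\n",
"```"
]
}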
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}

+2614 / -0   docs/source/tutorials/fastnlp_tutorial_4.ipynb   (file diff suppressed because it is too large)

+1242 / -0   docs/source/tutorials/fastnlp_tutorial_5.ipynb   (file diff suppressed because it is too large)

+1646 / -0   docs/source/tutorials/fastnlp_tutorial_6.ipynb   (file diff suppressed because it is too large)

+1280 / -0   docs/source/tutorials/fastnlp_tutorial_e1.ipynb   (file diff suppressed because it is too large)

+1082 / -0   docs/source/tutorials/fastnlp_tutorial_e2.ipynb   (file diff suppressed because it is too large)

+1086 / -0   docs/source/tutorials/fastnlp_tutorial_paddle_e1.ipynb   (file diff suppressed because it is too large)

+1510 / -0   docs/source/tutorials/fastnlp_tutorial_paddle_e2.ipynb   (file diff suppressed because it is too large)


BIN   docs/source/tutorials/figures/E1-fig-glue-benchmark.png   (Width: 1040 | Height: 545 | Size: 159 kB)

BIN   docs/source/tutorials/figures/E2-fig-p-tuning-v2-model.png   (Width: 938 | Height: 359 | Size: 50 kB)

BIN   docs/source/tutorials/figures/E2-fig-pet-model.png   (Width: 582 | Height: 521 | Size: 57 kB)

BIN   docs/source/tutorials/figures/T0-fig-parameter-matching.png   (Width: 1265 | Height: 736 | Size: 96 kB)

BIN   docs/source/tutorials/figures/T0-fig-trainer-and-evaluator.png   (Width: 1290 | Height: 821 | Size: 71 kB)

BIN   docs/source/tutorials/figures/T0-fig-training-structure.png   (Width: 1160 | Height: 732 | Size: 80 kB)

BIN   docs/source/tutorials/figures/T1-fig-dataset-and-vocabulary.png   (Width: 1326 | Height: 701 | Size: 139 kB)

BIN   docs/source/tutorials/figures/paddle-ernie-1.0-masking-levels.png   (Width: 917 | Height: 173 | Size: 59 kB)

BIN   docs/source/tutorials/figures/paddle-ernie-1.0-masking.png   (Width: 779 | Height: 452 | Size: 47 kB)

BIN   docs/source/tutorials/figures/paddle-ernie-2.0-continual-pretrain.png   (Width: 1033 | Height: 464 | Size: 129 kB)

BIN   docs/source/tutorials/figures/paddle-ernie-3.0-framework.png   (Width: 1161 | Height: 720 | Size: 202 kB)

+1 / -1   fastNLP/__init__.py

@@ -2,4 +2,4 @@
from fastNLP.envs import *
from fastNLP.core import *

__version__ = '0.8.0beta'
__version__ = '1.0.0alpha'

+1 / -0   fastNLP/core/collators/padders/oneflow_padder.py

@@ -7,6 +7,7 @@ from inspect import isclass
import numpy as np

from fastNLP.envs.imports import _NEED_IMPORT_ONEFLOW
from fastNLP.envs.utils import _module_available

if _NEED_IMPORT_ONEFLOW:
import oneflow


+7 / -7   fastNLP/core/controllers/trainer.py

@@ -83,13 +83,13 @@ class Trainer(TrainerEventTrigger):
.. warning::

当使用分布式训练时, **fastNLP** 会默认将 ``dataloader`` 中的 ``Sampler`` 进行处理,以使得在一个 epoch 中,不同卡
用以训练的数据是不重叠的。如果对 sampler 有特殊处理,那么请将 ``use_dist_sampler`` 参数设置为 ``False`` ,此刻需要由
自身保证每张卡上所使用的数据是不同的。
用以训练的数据是不重叠的。如果对 sampler 有特殊处理,那么请将 ``use_dist_sampler`` 参数设置为 ``False`` ,此刻需要由
自身保证每张卡上所使用的数据是不同的。

:param optimizers: 训练所需要的优化器;可以是单独的一个优化器实例,也可以是多个优化器组成的 List;
:param device: 该参数用来指定具体训练时使用的机器;注意当该参数仅当您通过 ``torch.distributed.launch/run`` 启动时可以为 ``None``,
此时 fastNLP 不会对模型和数据进行设备之间的移动处理,但是可以通过参数 ``input_mapping`` 和 ``output_mapping`` 来实现设备之间
数据迁移的工作(通过这两个参数传入两个处理数据的函数);同时也可以通过在 kwargs 添加参数 ``data_device`` 来让我们帮助您将数据
此时 fastNLP 不会对模型和数据进行设备之间的移动处理,但是可以通过参数 ``input_mapping`` 和 ``output_mapping`` 来实现设备之间
数据迁移的工作(通过这两个参数传入两个处理数据的函数);同时也可以通过在 kwargs 添加参数 ``data_device`` 来让我们帮助您将数据
迁移到指定的机器上(注意这种情况理应只出现在用户在 Trainer 实例化前自己构造 DDP 的场景);

device 的可选输入如下所示:
@@ -195,7 +195,7 @@ class Trainer(TrainerEventTrigger):
3. 如果此时 batch 此时是其它类型,那么我们将会直接报错;
2. 如果 ``input_mapping`` 是一个函数,那么对于取出的 batch,我们将不会做任何处理,而是直接将其传入该函数里;

注意该参数会被传进 ``Evaluator`` 中;因此可以通过该参数来实现将训练数据 batch 移到对应机器上的工作(例如当参数 ``device`` 为 ``None`` 时);
注意该参数会被传进 ``Evaluator`` 中;因此可以通过该参数来实现将训练数据 batch 移到对应机器上的工作(例如当参数 ``device`` 为 ``None`` 时);
如果 ``Trainer`` 和 ``Evaluator`` 需要使用不同的 ``input_mapping``, 请使用 ``train_input_mapping`` 与 ``evaluate_input_mapping`` 分别进行设置。

:param output_mapping: 应当为一个字典或者函数。作用和 ``input_mapping`` 类似,区别在于其用于转换输出:
@@ -366,7 +366,7 @@ class Trainer(TrainerEventTrigger):

.. note::
``Trainer`` 是通过在内部直接初始化一个 ``Evaluator`` 来进行验证;
``Trainer`` 内部的 ``Evaluator`` 默认是 None,如果您需要在训练过程中进行验证,需要保证这几个参数得到正确的传入:
``Trainer`` 内部的 ``Evaluator`` 默认是 None,如果您需要在训练过程中进行验证,需要保证这几个参数得到正确的传入:

必须的参数:``metrics`` 与 ``evaluate_dataloaders``;

@@ -896,7 +896,7 @@ class Trainer(TrainerEventTrigger):

这段代码意味着 ``fn1`` 和 ``fn2`` 会被加入到 ``trainer1``,``fn3`` 会被加入到 ``trainer2``;

注意如果你使用该函数修饰器来为你的训练添加 callback,请务必保证你加入 callback 函数的代码在实例化 `Trainer` 之前;
注意如果您使用该函数修饰器来为您的训练添加 callback,请务必保证您加入 callback 函数的代码在实例化 `Trainer` 之前;

补充性的解释见 :meth:`~fastNLP.core.controllers.Trainer.add_callback_fn`;
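
For illustration only, a hedged sketch of the two forms of ``input_mapping`` described in this docstring (the field and argument names are hypothetical, not part of the commit, and the dict form is assumed to map old field names to new ones):

```python
# dict form: rename the keys of the batch so they match the signature of model.forward
input_mapping = {'input_ids': 'words', 'target': 'target'}

# function form: receives the raw batch untouched and returns whatever the model expects
def input_mapping_fn(batch):
    return {'words': batch['input_ids'], 'target': batch['target']}

# either object is passed as Trainer(..., input_mapping=...); unless train_input_mapping /
# evaluate_input_mapping are given separately, the same mapping is reused by the Evaluator
```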



+5 / -6   fastNLP/core/dataset/dataset.py

@@ -584,7 +584,7 @@ class DataSet:
将 :class:`DataSet` 每个 ``instance`` 中为 ``field_name`` 的 field 传给函数 ``func``,并写入到 ``new_field_name``
中。

:param func: 对指定 fiel` 进行处理的函数,注意其输入应为 ``instance`` 中名为 ``field_name`` 的 field 的内容;
:param func: 对指定 field 进行处理的函数,注意其输入应为 ``instance`` 中名为 ``field_name`` 的 field 的内容;
:param field_name: 传入 ``func`` 的 field 名称;
:param new_field_name: 函数执行结果写入的 ``field`` 名称。该函数会将 ``func`` 返回的内容放入到 ``new_field_name`` 对
应的 ``field`` 中,注意如果名称与已有的 field 相同则会进行覆盖。如果为 ``None`` 则不会覆盖和创建 field ;
@@ -624,10 +624,9 @@ class DataSet:
``apply_field_more`` 与 ``apply_field`` 的区别参考 :meth:`~fastNLP.core.dataset.DataSet.apply_more` 中关于 ``apply_more`` 与
``apply`` 区别的介绍。

:param func: 对指定 fiel` 进行处理的函数,注意其输入应为 ``instance`` 中名为 ``field_name`` 的 field 的内容;
:param field_name: 传入 ``func`` 的 fiel` 名称;
:param new_field_name: 函数执行结果写入的 ``field`` 名称。该函数会将 ``func`` 返回的内容放入到 ``new_field_name`` 对
应的 ``field`` 中,注意如果名称与已有的 field 相同则会进行覆盖。如果为 ``None`` 则不会覆盖和创建 field ;
:param func: 对指定 field 进行处理的函数,注意其输入应为 ``instance`` 中名为 ``field_name`` 的 field 的内容;
:param field_name: 传入 ``func`` 的 field 名称;
:param modify_fields: 是否用结果修改 ``DataSet`` 中的 ``Field`` , 默认为 ``True``
:param num_proc: 使用进程的数量。
.. note::
@@ -751,8 +750,8 @@ class DataSet:

3. ``apply_more`` 默认修改 ``DataSet`` 中的 field ,``apply`` 默认不修改。

:param modify_fields: 是否用结果修改 ``DataSet`` 中的 ``Field`` , 默认为 True
:param func: 参数是 ``DataSet`` 中的 ``Instance`` ,返回值是一个字典,key 是field 的名字,value 是对应的结果
:param modify_fields: 是否用结果修改 ``DataSet`` 中的 ``Field`` , 默认为 ``True``
:param num_proc: 使用进程的数量。

.. note::


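A hedged usage sketch for the two methods whose docstrings are touched above (the data and field names are illustrative only):

```python
from fastNLP import DataSet

ds = DataSet({'text': ['Hello World', 'Good Morning'], 'label': ['pos', 'neg']})

# apply_field: feed one field to func and write the result into new_field_name
ds.apply_field(lambda t: t.lower().split(), field_name='text', new_field_name='words')

# apply_more: func sees the whole Instance and returns a dict of new fields
ds.apply_more(lambda ins: {'n_words': len(ins['words'])})
print(ds)
```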
+6 / -4   fastNLP/core/drivers/torch_driver/torch_fsdp.py

@@ -1,15 +1,17 @@



from fastNLP.envs.imports import _TORCH_GREATER_EQUAL_1_12
from fastNLP.envs.imports import _TORCH_GREATER_EQUAL_1_12, _NEED_IMPORT_TORCH

if _TORCH_GREATER_EQUAL_1_12:
from torch.distributed.fsdp import FullyShardedDataParallel, StateDictType, FullStateDictConfig, OptimStateKeyType

if _NEED_IMPORT_TORCH:
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from typing import Optional, Union, List, Dict, Mapping
from pathlib import Path



+1 / -1   fastNLP/embeddings/torch/static_embedding.py

@@ -86,7 +86,7 @@ class StaticEmbedding(TokenEmbedding):
:param requires_grad: 是否需要梯度。
:param init_method: 如何初始化没有找到的值。可以使用 :mod:`torch.nn.init` 中的各种方法,传入的方法应该接受一个 tensor,并
inplace 地修改其值。
:param lower: 是否将 ``vocab`` 中的词语小写后再和预训练的词表进行匹配。如果的词表中包含大写的词语,或者就是需要单独
:param lower: 是否将 ``vocab`` 中的词语小写后再和预训练的词表进行匹配。如果的词表中包含大写的词语,或者就是需要单独
为大写的词语开辟一个 vector 表示,则将 ``lower`` 设置为 ``False``。
:param dropout: 以多大的概率对 embedding 的表示进行 Dropout。0.1 即随机将 10% 的值置为 0。
:param word_dropout: 按照一定概率随机将 word 设置为 ``unk_index`` ,这样可以使得 ``<UNK>`` 这个 token 得到足够的训练,
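
A hedged sketch of the ``lower`` behaviour documented above (the embedding name is illustrative; any static embedding supported by fastNLP would do):

```python
from fastNLP import Vocabulary
from fastNLP.embeddings.torch import StaticEmbedding

vocab = Vocabulary()
vocab.add_word_lst(['The', 'the', 'apple'])

# with lower=True, 'The' and 'the' are matched against the same pretrained vector
embed = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50d', lower=True)
```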


+0 / -1   fastNLP/transformers/__init__.py

@@ -1,4 +1,3 @@
"""
:mod:`transformers` 模块,包含了常用的预训练模型。
"""
import sphinx-multiversion
