diff --git a/tutorials/fastnlp_tutorial_0.ipynb b/tutorials/fastnlp_tutorial_0.ipynb index 4368652a..4e4ce55e 100644 --- a/tutorials/fastnlp_tutorial_0.ipynb +++ b/tutorials/fastnlp_tutorial_0.ipynb @@ -434,7 +434,7 @@ "\n", "  通过`progress_bar`设定进度条格式,默认为`\"auto\"`,此外还有`\"rich\"`、`\"raw\"`和`None`\n", "\n", - "    但对于`\"auto\"`和`\"rich\"`格式,训练结束后进度条会不显示(???)\n", + "    但对于`\"auto\"`和`\"rich\"`格式,在notebook中,进度条在训练结束后会被丢弃\n", "\n", "  通过`n_epochs`设定优化迭代轮数,默认为20;全部`Trainer`的全部变量与函数可以通过`dir(trainer)`查询" ] @@ -523,19 +523,6 @@ "metadata": {}, "output_type": "display_data" }, - { - "data": { - "text/html": [ - "
\n",
-       "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, { "data": { "text/html": [ @@ -600,7 +587,7 @@ "\n", "  其中,可以通过参数`num_eval_batch_per_dl`决定每个`evaluate_dataloader`运行多少个`batch`停止,默认全部\n", "\n", - "  最终,输出形如`{'acc#acc': acc}`的字典,中间的进度条会在运行结束后丢弃掉(???)" + "  最终,输出形如`{'acc#acc': acc}`的字典,在notebook中,进度条在评测结束后会被丢弃" ] }, { @@ -626,21 +613,11 @@ { "data": { "text/html": [ - "
\n"
-      ],
-      "text/plain": []
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "text/html": [
-       "
\n",
+       "
{'acc#acc': 0.41, 'total#acc': 100.0, 'correct#acc': 41.0}\n",
        "
\n" ], "text/plain": [ - "\n" + "\u001b[1m{\u001b[0m\u001b[32m'acc#acc'\u001b[0m: \u001b[1;36m0.41\u001b[0m, \u001b[32m'total#acc'\u001b[0m: \u001b[1;36m100.0\u001b[0m, \u001b[32m'correct#acc'\u001b[0m: \u001b[1;36m41.0\u001b[0m\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, @@ -648,21 +625,8 @@ }, { "data": { - "text/html": [ - "
{'acc#acc': 0.39}\n",
-       "
\n" - ], "text/plain": [ - "\u001b[1m{\u001b[0m\u001b[32m'acc#acc'\u001b[0m: \u001b[1;36m0.39\u001b[0m\u001b[1m}\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/plain": [ - "{'acc#acc': 0.39}" + "{'acc#acc': 0.41, 'total#acc': 100.0, 'correct#acc': 41.0}" ] }, "execution_count": 9, @@ -683,7 +647,7 @@ "\n", "通过在初始化`trainer`实例时加入`evaluate_dataloaders`和`metrics`,可以实现在训练过程中进行评测\n", "\n", - "  通过`progress_bar`同时设定训练和评估进度条格式,训练结束后进度条会不显示(???)\n", + "  通过`progress_bar`同时设定训练和评估进度条格式,在notebook中,在进度条训练结束后会被丢弃\n", "\n", "  **通过`evaluate_every`设定评估频率**,可以为负数、正数或者函数:\n", "\n", @@ -767,29 +731,47 @@ }, "metadata": {}, "output_type": "display_data" - }, + } + ], + "source": [ + "trainer.run()" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "c4e9c619", + "metadata": {}, + "outputs": [ { "data": { "text/html": [ - "
\n",
-       "
\n" + "
\n"
       ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
       "text/plain": [
-       "\n"
+       "{'acc#acc': 0.46, 'total#acc': 100.0, 'correct#acc': 46.0}"
       ]
      },
+     "execution_count": 12,
      "metadata": {},
-     "output_type": "display_data"
+     "output_type": "execute_result"
     }
    ],
    "source": [
-    "trainer.run()"
+    "trainer.evaluator.run()"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "c4e9c619",
+   "id": "db784d5b",
    "metadata": {},
    "outputs": [],
    "source": []
diff --git a/tutorials/fastnlp_tutorial_1.ipynb b/tutorials/fastnlp_tutorial_1.ipynb
index ba7452b9..09e8821d 100644
--- a/tutorials/fastnlp_tutorial_1.ipynb
+++ b/tutorials/fastnlp_tutorial_1.ipynb
@@ -153,7 +153,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "1630555358408 1630228349768\n",
+      "2492313174344 2491986424200\n",
       "+-----+------------------------+------------------------+-----+\n",
       "| idx | sentence               | words                  | num |\n",
       "+-----+------------------------+------------------------+-----+\n",
@@ -198,7 +198,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "1630228349768 1630228349768\n",
+      "2491986424200 2491986424200\n",
       "+-----+------------------------+------------------------+-----+\n",
       "| idx | sentence               | words                  | num |\n",
       "+-----+------------------------+------------------------+-----+\n",
@@ -680,9 +680,9 @@
     {
      "data": {
       "text/plain": [
-       "{'sentence': ,\n",
-       " 'words': ,\n",
-       " 'num': }"
+       "{'sentence': ,\n",
+       " 'words': ,\n",
+       " 'num': }"
       ]
      },
      "execution_count": 15,
diff --git a/tutorials/fastnlp_tutorial_2.ipynb b/tutorials/fastnlp_tutorial_2.ipynb
index ba9ad109..74a0cb49 100644
--- a/tutorials/fastnlp_tutorial_2.ipynb
+++ b/tutorials/fastnlp_tutorial_2.ipynb
@@ -4,36 +4,28 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# T2. Basic usage of dataloader and tokenizer\n",
+    "# T2. Basic usage of databundle and tokenizer\n",
     "\n",
-    "  1   dataloader in fastNLP\n",
+    "  1   Extensions of dataset in fastNLP\n",
     "\n",
-    "    1.1   Structure and usage of databundle\n",
-    "\n",
-    "    1.2   Structure and usage of dataloader\n",
+    "    1.1   Concept and usage of databundle\n",
     "\n",
     "  2   tokenizer in fastNLP\n",
     " \n",
-    "    2.1   Loading traditional GloVe word embeddings\n",
-    " \n",
-    "    2.2   The concept of PreTrainedTokenizer\n",
-    "\n",
-    "    2.3   Basic usage of BertTokenizer\n",
+    "    2.1   The concept of PreTrainedTokenizer\n",
     "\n",
-    "  3   Example: the complete loading process for the NG20 dataset\n",
+    "    2.2   Basic usage of BertTokenizer\n",
     " \n",
-    "    3.1   \n",
-    "\n",
-    "    3.2   "
+    "    2.3   Supplement: using GloVe word embeddings"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
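-    "## 1. dataloader in fastNLP\n",
+    "## 1. Extensions of dataset in fastNLP\n",
     "\n",
-    "### 1.1 Structure and usage of databundle\n",
+    "### 1.1 Concept and usage of databundle\n",
     "\n",
     "In `fastNLP 0.8`, between the commonly used data-loading module `DataLoader` and the dataset module `DataSet`, there is also\n",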
     "\n",
@@ -43,13 +35,13 @@
     "\n",
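     "  which are stored in the two variables `datasets` and `vocabs` respectively, so before getting to know the `databundle` data bundle\n",
     "\n",
-    "  you first need to **review the `dataset` and the `vocabulary`**; **here is a piece of code**: **do you know roughly what it does?**\n",
+    "you first need to **review the `dataset` and the `vocabulary`**; **here is a piece of code**: **do you know roughly what it does?**\n",
     "\n",
-    "A necessary hint: `NG20`, in full [`News Group 20`](http://qwone.com/~jason/20Newsgroups/), is a news text classification dataset with 20 major categories and a number of subcategories\n",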
+    ""
    ]
   },
   {
@@ -78,21 +70,7 @@
        "version_minor": 0
       },
       "text/plain": [
-       "Processing:   0%|          | 0/10 [00:00': 0, '': 1, 'rec': 2, 'talk': 3, 'comp': 4, 'soc': 5, 'misc': 6, 'sci': 7}\n"
+      "+------------------------------------------+----------+\n",
+      "| text                                     | label    |\n",
+      "+------------------------------------------+----------+\n",
+      "| ['this', 'quiet', ',', 'introspective... | positive |\n",
+      "| ['a', 'comedy-drama', 'of', 'nearly',... | positive |\n",
+      "| ['a', 'positively', 'thrilling', 'com... | neutral  |\n",
+      "| ['a', 'series', 'of', 'escapades', 'd... | negative |\n",
+      "+------------------------------------------+----------+\n",
+      "+------------------------------------------+----------+\n",
+      "| text                                     | label    |\n",
+      "+------------------------------------------+----------+\n",
+      "| ['even', 'fans', 'of', 'ismail', 'mer... | negative |\n",
+      "| ['the', 'importance', 'of', 'being', ... | neutral  |\n",
+      "+------------------------------------------+----------+\n",
+      "{'<pad>': 0, '<unk>': 1, 'positive': 2, 'neutral': 3, 'negative': 4}\n"
      ]
     }
    ],
@@ -127,16 +105,17 @@
     "from fastNLP import Vocabulary\n",
     "from fastNLP.io import DataBundle\n",
     "\n",
-    "datasets = {}\n",
-    "datasets['train'] = DataSet.from_pandas(pd.read_csv('./data/ng20_train.csv').sample(frac=1)[:10])\n",
-    "datasets['train'].apply_more(lambda ins:{'label': ins['label'].lower().split('.')[0], \n",
-    "                                         'text': ins['text'].lower().split()},\n",
-    "                             progress_bar='tqdm')\n",
-    "datasets['test'] = DataSet.from_pandas(pd.read_csv('./data/ng20_test.csv').sample(frac=1)[:10])\n",
-    "datasets['test'].apply_more(lambda ins:{'label': ins['label'].lower().split('.')[0], \n",
-    "                                        'text': ins['text'].lower().split()},\n",
-    "                            progress_bar='tqdm')\n",
+    "datasets = DataSet.from_pandas(pd.read_csv('./data/test4dataset.tsv', sep='\\t'))\n",
+    "datasets.rename_field('Sentence', 'text')\n",
+    "datasets.rename_field('Sentiment', 'label')\n",
+    "datasets.apply_more(lambda ins:{'label': ins['label'].lower(), \n",
+    "                                'text': ins['text'].lower().split()},\n",
+    "                    progress_bar='tqdm')\n",
+    "datasets.delete_field('SentenceId')\n",
+    "train_ds, test_ds = datasets.split(ratio=0.7)\n",
+    "datasets = {'train': train_ds, 'test': test_ds}\n",
     "print(datasets['train'])\n",
+    "print(datasets['test'])\n",
     "\n",
     "vocabs = {}\n",
     "vocabs['label'] = Vocabulary().from_dataset(datasets['train'].concat(datasets['test'], inplace=False), field_name='label')\n",
@@ -148,9 +127,19 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "\n",
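+    "The code above does the following: from the 6 examples in `test4dataset`, it splits off 4 for the training set (`int(6*0.7) = 4`) and 2 for the test set\n",
+    "\n",
+    "    Before the split, it renames the relevant fields, deletes the index field, lowercases all labels, and tokenizes the text\n",
+    "\n",
+    "  Next, the `concat` method joins the training and test sets; note that `inplace=False` is set, so a temporary new dataset is created\n",
+    "\n",
+    "  The `from_dataset` method then builds a vocabulary from the concatenated dataset, in preparation for replacing the words in the dataset with indices\n",
     "\n",
-    "Datasets (for example, separate training, validation and test sets) and the vocabulary for each field.\n",
-    "    This object is usually produced by the load function of the various Loaders in fastNLP; its contents can be retrieved with the methods below"
+    "This yields **the dataset dictionary `datasets`** (**one entry each for the training and test sets**) and **the vocabulary dictionary `vocabs`** (**one entry per dataset field**)\n",
+    "\n",
+    "  The `databundle` can then be initialized; `print` shows its overall structure, as below"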
    ]
   },
   {
@@ -163,48 +152,137 @@
      "output_type": "stream",
      "text": [
       "In total 2 datasets:\n",
-      "\ttrain has 10 instances.\n",
-      "\ttest has 10 instances.\n",
+      "\ttrain has 4 instances.\n",
+      "\ttest has 2 instances.\n",
       "In total 2 vocabs:\n",
-      "\tlabel has 8 entries.\n",
-      "\ttext has 1687 entries.\n",
-      "\n"
+      "\tlabel has 5 entries.\n",
+      "\ttext has 96 entries.\n",
+      "\n",
+      "['train', 'test']\n",
+      "['label', 'text']\n"
      ]
     }
    ],
    "source": [
     "data_bundle = DataBundle(datasets=datasets, vocabs=vocabs)\n",
-    "print(data_bundle)"
+    "print(data_bundle)\n",
+    "print(data_bundle.get_dataset_names())\n",
+    "print(data_bundle.get_vocab_names())"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
+   "cell_type": "markdown",
    "metadata": {},
-   "outputs": [],
-   "source": []
+   "source": [
+    "In addition, `num_dataset` and `num_vocab` of `data_bundle` return the number of datasets and vocabularies\n",
+    "\n",
+    "  while `iter_datasets` and `iter_vocabs` of `data_bundle` iterate over the datasets and vocabularies"
+   ]
   },
   {
-   "cell_type": "markdown",
+   "cell_type": "code",
+   "execution_count": 3,
    "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "In total 2 datasets:\n",
+      "\ttrain has 4 instances.\n",
+      "\ttest has 2 instances.\n",
+      "In total 2 vocabs:\n",
+      "\tlabel has 5 entries.\n",
+      "\ttext has 96 entries.\n"
+     ]
+    }
+   ],
    "source": [
-    "### 1.2 Structure and usage of dataloader"
+    "print(\"In total %d datasets:\" % data_bundle.num_dataset)\n",
+    "for name, dataset in data_bundle.iter_datasets():\n",
+    "    print(\"\\t%s has %d instances.\" % (name, len(dataset)))\n",
+    "print(\"In total %d vocabs:\" % data_bundle.num_vocab)\n",
+    "for name, vocab in data_bundle.iter_vocabs():\n",
+    "    print(\"\\t%s has %d entries.\" % (name, len(vocab)))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 2. tokenizer in fastNLP\n",
+    "Like the `dataset`, the `databundle` also has four `apply` functions, namely\n",
+    "\n",
+    "  `apply`, `apply_field`, `apply_field_more` and `apply_more`\n",
+    "\n",
+    "  which preprocess the datasets; below is an example of `apply_more`, and the other functions work similarly\n",
+    "\n",
+    "In addition, the `get_dataset` function looks up a dataset by its name `name`\n",
     "\n",
-    "### 2.1 Loading traditional GloVe word embeddings"
+    "  and the `get_vocab` function looks up a vocabulary by its name `field_name`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Processing:   0%|          | 0/4 [00:00, max_length=32, truncation=True, return_attention_mask=True)\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Processing:   0%|          | 0/4 [00:00')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "51bf0878",
+   "metadata": {},
+   "source": [
+    "  "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3fd2486f",
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f9bbd9a7",
+   "metadata": {},
+   "source": [
+    "### 3.2 Structure and usage of dataloader"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "651baef6",
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from fastNLP import prepare_torch_dataloader\n",
+    "\n",
+    "dl_bundle = prepare_torch_dataloader(data_bundle, train_batch_size=2)\n",
+    "\n",
+    "print(type(dl_bundle), type(dl_bundle['train']))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "726ba357",
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "dataloader = prepare_torch_dataloader(datasets['train'], train_batch_size=2)\n",
+    "print(type(dataloader))\n",
+    "print(dir(dataloader))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d0795b3e",
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "dataloader.collate_fn"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b0c3c58d",
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "dataloader.batch_sampler"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7ed431cc",
+   "metadata": {},
+   "source": [
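+    "### 3.3 Example: loading and preprocessing NG20\n",
+    "\n",
+    "In `fastNLP 0.8`, **the `Trainer` and `Evaluator` modules are the “trainer” and the “evaluator” respectively**\n",
+    "\n",
+    "  They correspond to the `Trainer` and `Tester` modules in earlier versions of `fastNLP`, and are defined as shown below\n",
+    "\n",
+    "Note that in `fastNLP 0.8`, when one `python` script first trains with `Trainer` and then evaluates with `Evaluator`\n",
+    "\n",
+    "  the key question is **how to set the `driver` of both correctly**. This raises another question: what is a `driver`?"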
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a89ef613",
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "from fastNLP import DataSet\n",
+    "from fastNLP import Vocabulary\n",
+    "\n",
+    "dataset = DataSet.from_pandas(pd.read_csv('./data/ng20_test.csv'))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1624b0fa",
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from functools import partial\n",
+    "\n",
+    "encode = partial(tokenizer.encode_plus, max_length=100, truncation=True,\n",
+    "                 return_attention_mask=True)\n",
+    "# adds three new fields: input_ids, attention_mask and token_type_ids\n",
+    "dataset.apply_field_more(encode, field_name='text')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0991a8ee",
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "target_vocab = Vocabulary(padding=None, unknown=None)\n",
+    "\n",
+    "target_vocab.from_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='label')\n",
+    "target_vocab.index_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='label',\n",
+    "                           new_field_name='labels')\n",
+    "# the pad value of input_ids must be set to the tokenizer's pad value\n",
+    "dataset.set_pad('input_ids', pad_val=tokenizer.pad_token_id)\n",
+    "dataset.set_ignore('label', 'text')  # label is the raw str and is no longer needed, so ignore it and keep it out of the batch output"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -52,6 +252,15 @@
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.7.4"
+  },
+  "pycharm": {
+   "stem_cell": {
+    "cell_type": "raw",
+    "metadata": {
+     "collapsed": false
+    },
+    "source": []
+   }
   }
  },
  "nbformat": 4,