{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# T2. Basic usage of databundle and tokenizer\n", "\n", " 1 Extending dataset in fastNLP\n", "\n", " 1.1 The concept and usage of databundle\n", "\n", " 2 tokenizer in fastNLP\n", " \n", " 2.1 The concept of PreTrainedTokenizer\n", "\n", " 2.2 Basic usage of BertTokenizer\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Extending dataset in fastNLP\n", "\n", "### 1.1 The concept and usage of databundle\n", "\n", "In `fastNLP 0.8`, an intermediate module sits between the commonly used data-loading module `DataLoader` and the dataset module `DataSet`:\n", "the **`DataBundle` module**, which can be imported from `fastNLP.io`.\n", "\n", "In `fastNLP 0.8`, **a `databundle` contains several `dataset` objects and `vocabulary` objects**,\n", "stored in its `datasets` and `vocabs` variables respectively.\n", "\n", "Before looking at `databundle`, it therefore helps to **review `dataset` and `vocabulary`**. **Do you know roughly what the code below does?**\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n",
"\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------------------------------+----------+\n",
"|                   text                   |  label   |\n",
"+------------------------------------------+----------+\n",
"| ['a', 'series', 'of', 'escapades', 'd... | negative |\n",
"| ['this', 'quiet', ',', 'introspective... | positive |\n",
"| ['even', 'fans', 'of', 'ismail', 'mer... | negative |\n",
"| ['the', 'importance', 'of', 'being', ... | neutral  |\n",
"+------------------------------------------+----------+\n",
"+------------------------------------------+----------+\n",
"|                   text                   |  label   |\n",
"+------------------------------------------+----------+\n",
"| ['a', 'comedy-drama', 'of', 'nearly',... | positive |\n",
"| ['a', 'positively', 'thrilling', 'com... | neutral  |\n",
"+------------------------------------------+----------+\n",
"{'