{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading and Processing Datasets with Loader and Pipe\n",
"\n",
"This section is a tutorial on how to load datasets.\n",
"\n",
"## Part I: The dataset container DataBundle\n",
"\n",
"Since the training, development, and test sets of a task share the same vocabulary and the same target values, fastNLP uses a DataBundle to hold the multiple DataSet objects of one task together with their Vocabulary objects. Examples of how to use a DataBundle follow below.\n",
"\n",
"Within fastNLP, DataBundle is used mainly by the various Loader and Pipe classes, which we introduce next.\n",
"\n",
"## Part II: Loaders for various datasets\n",
"\n",
"In fastNLP, every Loader documents the data formats it can read and the format of the DataSet it returns; see e.g. ChnSentiCorpLoader. Each Loader provides three functions:\n",
"\n",
"- download(): automatically downloads the dataset to a cache directory, ~/.fastNLP/datasets/ by default, and returns the cache path of the downloaded files. For copyright and similar reasons, not every Loader implements this method.\n",
"\n",
"- _load(): reads data from a single data file and returns a DataSet. The format of the returned DataSet can be found in the Loader's documentation.\n",
"\n",
"- load(): reads data from a file or folder into DataSet objects and assembles them into a DataBundle. It accepts the following argument types (see the sketch after this list):\n",
"\n",
" - None: tries to read the automatically cached data; only supported by Loaders that provide automatic download.\n",
" - a folder path: by default, tries to match files in that folder whose names contain train, test, or dev; if several files contain the same keyword, the data cannot be read this way.\n",
" - a dict, e.g. {'train': \"/path/to/tr.conll\", 'dev': \"/to/validate.conll\", 'test': \"/to/te.conll\"}.\n",
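"\n",
"For example, loading from explicitly named files via the dict form could look like the following minimal sketch (the file paths are hypothetical placeholders):\n",
"\n",
"```python\n",
"from fastNLP.io import CWSLoader\n",
"\n",
"loader = CWSLoader(dataset_name='pku')\n",
"# hypothetical paths; point these at your own copies of the data\n",
"paths = {'train': '/path/to/train.txt', 'dev': '/path/to/dev.txt', 'test': '/path/to/test.txt'}\n",
"data_bundle = loader.load(paths)\n",
"```"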
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In total 3 datasets:\n",
"\ttest has 1944 instances.\n",
"\ttrain has 17196 instances.\n",
"\tdev has 1858 instances.\n",
"\n"
]
}
],
"source": [
"from fastNLP.io import CWSLoader\n",
"\n",
"loader = CWSLoader(dataset_name='pku')\n",
"data_bundle = loader.load()\n",
"print(data_bundle)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This output shows that there are 3 datasets in total:\n",
"\n",
" The 3 datasets are named train, dev, and test, with 17196, 1858, and 1944 instances respectively.\n",
"\n",
"We can also retrieve a DataSet and print its contents:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----------------------------------------------------------------+\n",
"| raw_words |\n",
"+----------------------------------------------------------------+\n",
"| 迈向 充满 希望 的 新 世纪 —— 一九九八年 新年 讲话 ... |\n",
"| 中共中央 总书记 、 国家 主席 江 泽民 |\n",
"+----------------------------------------------------------------+\n"
]
}
],
"source": [
"tr_data = data_bundle.get_dataset('train')\n",
"print(tr_data[:2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part III: Preprocessing datasets with Pipe\n",
"\n",
"A Loader reads text data in, but that data cannot be fed to a neural network directly; some preprocessing is still required.\n",
"\n",
"In fastNLP, subclasses of Pipe handle data preprocessing. Loaders and Pipes generally come in one-to-one pairs that can be recognized by name, e.g. CWSLoader pairs with CWSPipe. A Pipe typically performs the following steps:\n",
"1. tokenize raw_words or raw_chars into separate words or characters;\n",
"2. build a Vocabulary over those words or characters and convert them to indices;\n",
"3. build a vocabulary over the target column and convert it to indices as well.\n",
"\n",
"Every Pipe documents the DataSet formats it supports and the Vocabulary objects in the DataBundle it returns; see e.g. OntoNotesNERPipe.\n",
"\n",
"Every dataset Pipe provides the following two functions:\n",
"\n",
"- process(): processes the input DataBundle and returns the processed DataBundle. Its documentation describes the DataSet formats the Pipe supports.\n",
"- process_from_file(): takes the folder containing the dataset, reads the data with the corresponding Loader (so the argument types it accepts are determined by that Loader's load()), and then calls process() on the result. In effect it runs load and process in a single call, as sketched below.\n",
"\n",
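"A minimal sketch of process_from_file (the folder path is a hypothetical placeholder):\n",
"\n",
"```python\n",
"from fastNLP.io import CWSPipe\n",
"\n",
"# equivalent to loading with CWSLoader and then calling CWSPipe().process(...)\n",
"data_bundle = CWSPipe().process_from_file('/path/to/cws/data')  # hypothetical path\n",
"```\n",
"\n",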
"Continuing the CWSLoader example above, let us demonstrate what CWSPipe does:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In total 3 datasets:\n",
"\ttest has 1944 instances.\n",
"\ttrain has 17196 instances.\n",
"\tdev has 1858 instances.\n",
"In total 2 vocabs:\n",
"\tchars has 4777 entries.\n",
"\ttarget has 4 entries.\n",
"\n"
]
}
],
"source": [
"from fastNLP.io import CWSPipe\n",
"\n",
"data_bundle = CWSPipe().process(data_bundle)\n",
"print(data_bundle)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This shows 3 datasets and 2 vocabularies in total:\n",
"\n",
"- The 3 datasets are named train, dev, and test, with 17196, 1858, and 1944 instances respectively.\n",
"- The 2 vocabularies are the chars vocabulary and the target vocabulary. The chars vocabulary is built from the sentence text and contains 4777 distinct characters; the target vocabulary is built from the target labels and contains 4 labels.\n",
"\n",
"Compared with the DataBundle read by CWSLoader earlier, two Vocabulary objects have been added. Let us print the processed DataSet:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------------------+---------------------+---------------------+---------+\n",
"| raw_words | chars | target | seq_len |\n",
"+---------------------+---------------------+---------------------+---------+\n",
"| 迈向 充满 希望... | [1224, 178, 674,... | [0, 1, 0, 1, 0, ... | 29 |\n",
"| 中共中央 总书记... | [11, 212, 11, 33... | [0, 3, 3, 1, 0, ... | 15 |\n",
"+---------------------+---------------------+---------------------+---------+\n"
]
}
],
"source": [
"tr_data = data_bundle.get_dataset('train')\n",
"print(tr_data[:2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are now two int-valued fields: chars and target. Their names are also the names of the Vocabulary objects in the DataBundle, which can be retrieved and inspected with the following code:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Vocabulary(['B', 'E', 'S', 'M']...)\n"
]
}
],
"source": [
"vocab = data_bundle.get_vocab('target')\n",
"print(vocab)"
]
},
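{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Vocabulary maps in both directions between tokens and indices. As a minimal sketch (assuming Vocabulary's to_index()/to_word() methods), the segmentation tags above can be converted like this:\n",
"\n",
"```python\n",
"idx = vocab.to_index('B')  # tag -> index\n",
"tag = vocab.to_word(idx)   # index -> tag, gives back 'B'\n",
"```"
]
},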
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part IV: Loaders and Pipes bundled with fastNLP\n",
"\n",
"fastNLP bundles Loaders and Pipes for many tasks/datasets and provides automatic downloading; for details see the documentation [datasets](https://docs.qq.com/sheet/DVnpkTnF6VW9UeXdh?c=A1A0A0). With such a pair, downloading and preprocessing collapse into a single call, as sketched below.\n",
"\n",
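"A minimal sketch using the ChnSentiCorp pair mentioned in Part II (assuming this dataset supports automatic download; availability may vary by fastNLP version):\n",
"\n",
"```python\n",
"from fastNLP.io import ChnSentiCorpPipe\n",
"\n",
"# downloads the dataset to ~/.fastNLP/datasets/ if it is not cached yet,\n",
"# then loads and preprocesses it in one call\n",
"data_bundle = ChnSentiCorpPipe().process_from_file()\n",
"```\n",
"\n",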
"## Part V: Basic Loaders for different file formats\n",
"\n",
"Besides the task-specific Loaders above, fastNLP also provides Loaders for CSV and JSON files.\n",
"\n",
"**CSVLoader** reads CSV-style dataset files. For example:\n",
"\n",
"```python\n",
"from fastNLP.io.loader import CSVLoader\n",
"data_set_loader = CSVLoader(\n",
"    headers=('raw_words', 'target'), sep='\\t'\n",
")\n",
"```\n",
"\n",
"This puts the first item of each row of the CSV file into the 'raw_words' field and the second item into the 'target' field; items are separated by '\\t'.\n",
"\n",
"```python\n",
"data_set = data_set_loader._load('path/to/your/file')\n",
"```\n",
"\n",
"A sample of the file contents:\n",
"\n",
"```csv\n",
"But it does not leave you with much . 1\n",
"You could hate it for the same reason . 1\n",
"The performances are an absolute joy . 4\n",
"```\n",
"\n",
"The resulting DataSet has the following fields:\n",
"\n",
"| raw_words | target |\n",
"| --------------------------------------- | ------ |\n",
"| But it does not leave you with much . | 1 |\n",
"| You could hate it for the same reason . | 1 |\n",
"| The performances are an absolute joy . | 4 |\n",
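"\n",
"Note that _load() reads a single file into a DataSet, while the public load() assembles several files into a DataBundle. A minimal sketch (the file paths are hypothetical placeholders):\n",
"\n",
"```python\n",
"data_bundle = data_set_loader.load(\n",
"    {'train': 'path/to/train.tsv', 'test': 'path/to/test.tsv'}  # hypothetical paths\n",
")\n",
"```"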
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**JsonLoader** reads JSON-style dataset files. The data must be stored line by line, each line being a JSON object with various attributes. For example:\n",
"\n",
"```python\n",
"from fastNLP.io.loader import JsonLoader\n",
"loader = JsonLoader(\n",
"    fields={'sentence1': 'raw_words1', 'sentence2': 'raw_words2', 'gold_label': 'target'}\n",
")\n",
"```\n",
"\n",
"This assigns the values of 'sentence1', 'sentence2', and 'gold_label' in each JSON object to the fields 'raw_words1', 'raw_words2', and 'target' respectively.\n",
"\n",
"```python\n",
"data_set = loader._load('path/to/your/file')\n",
"```\n",
"\n",
"A sample of the dataset contents:\n",
"```\n",
"{\"annotator_labels\": [\"neutral\"], \"captionID\": \"3416050480.jpg#4\", \"gold_label\": \"neutral\", ... }\n",
"{\"annotator_labels\": [\"contradiction\"], \"captionID\": \"3416050480.jpg#4\", \"gold_label\": \"contradiction\", ... }\n",
"{\"annotator_labels\": [\"entailment\"], \"captionID\": \"3416050480.jpg#4\", \"gold_label\": \"entailment\", ... }\n",
"```\n",
"\n",
"The resulting DataSet has the following fields:\n",
"\n",
"| raw_words1 | raw_words2 | target |\n",
"| ------------------------------------------------------ | ------------------------------------------------- | ------------- |\n",
"| A person on a horse jumps over a broken down airplane. | A person is training his horse for a competition. | neutral |\n",
"| A person on a horse jumps over a broken down airplane. | A person is at a diner, ordering an omelette. | contradiction |\n",
"| A person on a horse jumps over a broken down airplane. | A person is outdoors, on a horse. | entailment |"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python Now",
"language": "python",
"name": "now"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}