Browse Source

modify the structure of the docs

tags/v0.4.10
ChenXin 5 years ago
parent
commit
e933df4227
3 changed files with 431 additions and 115 deletions
  1. BIN
      docs/source/quickstart/cn_cls_example.png
  2. +426
    -0
      docs/source/quickstart/文本分类.rst
  3. +5
    -115
      docs/source/user/quickstart.rst

BIN
docs/source/quickstart/cn_cls_example.png View File

Before After
Width: 722  |  Height: 317  |  Size: 162 kB

+ 426
- 0
docs/source/quickstart/文本分类.rst View File

@@ -0,0 +1,426 @@
文本分类(Text classification)
=============================

文本分类任务是将一句话或一段话划分到某个具体的类别。比如垃圾邮件识别,文本情绪分类等。

.. code-block:: text

1, 商务大床房,房间很大,床有2M宽,整体感觉经济实惠不错!

其中开头的1是只这条评论的标签,表示是正面的情绪。我们将使用到的数据可以通过 `此链接 <http://dbcloud.irocn.cn:8989/api/public/dl/dataset/chn\_senti\_corp.zip>`_
下载并解压,当然也可以通过fastNLP自动下载该数据。

数据中的内容如下图所示。接下来,我们将用fastNLP在这个数据上训练一个分类网络。

.. figure:: ./cn_cls_example.png
:alt: jupyter

jupyter

步骤
----

一共有以下的几个步骤:

1. `读取数据 <#id4>`_

2. `预处理数据 <#id5>`_

3. `选择预训练词向量 <#id6>`_

4. `创建模型 <#id7>`_

5. `训练模型 <#id8>`_

(1) 读取数据
~~~~~~~~~~~~~~~~~~~~

fastNLP提供多种数据的自动下载与自动加载功能,对于这里我们要用到的数据,我们可以用 :class:`~fastNLP.io.Loader` 自动下载并加载该数据。
更多有关Loader的使用可以参考 :mod:`~fastNLP.io.loader`

.. code-block:: python

from fastNLP.io import ChnSentiCorpLoader
loader = ChnSentiCorpLoader() # 初始化一个中文情感分类的loader
data_dir = loader.download() # 这一行代码将自动下载数据到默认的缓存地址, 并将该地址返回
data_bundle = loader.load(data_dir) # 这一行代码将从{data_dir}处读取数据至DataBundle


DataBundle的相关介绍,可以参考 :class:`~fastNLP.io.DataBundle` 。我们可以打印该data\_bundle的基本信息。

.. code-block:: python

print(data_bundle)


.. code-block:: text

In total 3 datasets:
dev has 1200 instances.
train has 9600 instances.
test has 1200 instances.
In total 0 vocabs:


可以看出,该data\_bundle中一个含有三个 :class:`~fastNLP.DataSet` 。通过下面的代码,我们可以查看DataSet的基本情况

.. code-block:: python

print(data_bundle.get_dataset('train')[:2]) # 查看Train集前两个sample


.. code-block:: text

DataSet({'raw_chars': 选择珠江花园的原因就是方便,有电动扶梯直接到达海边,周围餐馆、食廊、商场、超市、摊位一应俱全。酒店装修一般,但还算整洁。 泳池在大堂的屋顶,因此很小,不过女儿倒是喜欢。 包的早餐是西式的,还算丰富。 服务吗,一般 type=str,
'target': 1 type=str},
{'raw_chars': 15.4寸笔记本的键盘确实爽,基本跟台式机差不多了,蛮喜欢数字小键盘,输数字特方便,样子也很美观,做工也相当不错 type=str,
'target': 1 type=str})


(2) 预处理数据
~~~~~~~~~~~~~~~~~~~~

在NLP任务中,预处理一般包括:

(a) 将一整句话切分成汉字或者词;

(b) 将文本转换为index

fastNLP中也提供了多种数据集的处理类,这里我们直接使用fastNLP的ChnSentiCorpPipe。更多关于Pipe的说明可以参考 :mod:`~fastNLP.io.pipe` 。

.. code-block:: python

from fastNLP.io import ChnSentiCorpPipe

pipe = ChnSentiCorpPipe()
data_bundle = pipe.process(data_bundle) # 所有的Pipe都实现了process()方法,且输入输出都为DataBundle类型

print(data_bundle) # 打印data_bundle,查看其变化


.. code-block:: text

In total 3 datasets:
dev has 1200 instances.
train has 9600 instances.
test has 1200 instances.
In total 2 vocabs:
chars has 4409 entries.
target has 2 entries.



可以看到除了之前已经包含的3个 :class:`~fastNLP.DataSet` ,还新增了两个 :class:`~fastNLP.Vocabulary` 。我们可以打印DataSet中的内容

.. code-block:: python

print(data_bundle.get_dataset('train')[:2])


.. code-block:: text

DataSet({'raw_chars': 选择珠江花园的原因就是方便,有电动扶梯直接到达海边,周围餐馆、食廊、商场、超市、摊位一应俱全。酒店装修一般,但还算整洁。 泳池在大堂的屋顶,因此很小,不过女儿倒是喜欢。 包的早餐是西式的,还算丰富。 服务吗,一般 type=str,
'target': 1 type=int,
'chars': [338, 464, 1400, 784, 468, 739, 3, 289, 151, 21, 5, 88, 143, 2, 9, 81, 134, 2573, 766, 233, 196, 23, 536, 342, 297, 2, 405, 698, 132, 281, 74, 744, 1048, 74, 420, 387, 74, 412, 433, 74, 2021, 180, 8, 219, 1929, 213, 4, 34, 31, 96, 363, 8, 230, 2, 66, 18, 229, 331, 768, 4, 11, 1094, 479, 17, 35, 593, 3, 1126, 967, 2, 151, 245, 12, 44, 2, 6, 52, 260, 263, 635, 5, 152, 162, 4, 11, 336, 3, 154, 132, 5, 236, 443, 3, 2, 18, 229, 761, 700, 4, 11, 48, 59, 653, 2, 8, 230] type=list,
'seq_len': 106 type=int},
{'raw_chars': 15.4寸笔记本的键盘确实爽,基本跟台式机差不多了,蛮喜欢数字小键盘,输数字特方便,样子也很美观,做工也相当不错 type=str,
'target': 1 type=int,
'chars': [50, 133, 20, 135, 945, 520, 343, 24, 3, 301, 176, 350, 86, 785, 2, 456, 24, 461, 163, 443, 128, 109, 6, 47, 7, 2, 916, 152, 162, 524, 296, 44, 301, 176, 2, 1384, 524, 296, 259, 88, 143, 2, 92, 67, 26, 12, 277, 269, 2, 188, 223, 26, 228, 83, 6, 63] type=list,
'seq_len': 56 type=int})


新增了一列为数字列表的chars,以及变为数字的target列。可以看出这两列的名称和刚好与data\_bundle中两个Vocabulary的名称是一致的,我们可以打印一下Vocabulary看一下里面的内容。

.. code-block:: python

char_vocab = data_bundle.get_vocab('chars')
print(char_vocab)


.. code-block:: text

Vocabulary(['选', '择', '珠', '江', '花']...)


Vocabulary是一个记录着词语与index之间映射关系的类,比如

.. code-block:: python

index = char_vocab.to_index('选')
print("'选'的index是{}".format(index)) # 这个值与上面打印出来的第一个instance的chars的第一个index是一致的
print("index:{}对应的汉字是{}".format(index, char_vocab.to_word(index)))


.. code-block:: text

'选'的index是338
index:338对应的汉字是选


(3) 选择预训练词向量
~~~~~~~~~~~~~~~~~~~~

由于Word2vec, Glove, Elmo,
Bert等预训练模型可以增强模型的性能,所以在训练具体任务前,选择合适的预训练词向量非常重要。
在fastNLP中我们提供了多种Embedding使得加载这些预训练模型的过程变得更加便捷。
这里我们先给出一个使用word2vec的中文汉字预训练的示例,之后再给出一个使用Bert的文本分类。
这里使用的预训练词向量为'cn-fastnlp-100d',fastNLP将自动下载该embedding至本地缓存,
fastNLP支持使用名字指定的Embedding以及相关说明可以参见 :mod:`fastNLP.embeddings`

.. code-block:: python

from fastNLP.embeddings import StaticEmbedding

word2vec_embed = StaticEmbedding(char_vocab, model_dir_or_name='cn-char-fastnlp-100d')


.. code-block:: text

Found 4321 out of 4409 compound in the pre-training embedding.

(4) 创建模型
~~~~~~~~~~~~

这里我们使用到的模型结构如下所示

.. todo::
补图

.. code-block:: python

from torch import nn
from fastNLP.modules import LSTM
import torch
# 定义模型
class BiLSTMMaxPoolCls(nn.Module):
def __init__(self, embed, num_classes, hidden_size=400, num_layers=1, dropout=0.3):
super().__init__()
self.embed = embed
self.lstm = LSTM(self.embed.embedding_dim, hidden_size=hidden_size//2, num_layers=num_layers,
batch_first=True, bidirectional=True)
self.dropout_layer = nn.Dropout(dropout)
self.fc = nn.Linear(hidden_size, num_classes)
def forward(self, chars, seq_len): # 这里的名称必须和DataSet中相应的field对应,比如之前我们DataSet中有chars,这里就必须为chars
# chars:[batch_size, max_len]
# seq_len: [batch_size, ]
chars = self.embed(chars)
outputs, _ = self.lstm(chars, seq_len)
outputs = self.dropout_layer(outputs)
outputs, _ = torch.max(outputs, dim=1)
outputs = self.fc(outputs)
return {'pred':outputs} # [batch_size,], 返回值必须是dict类型,且预测值的key建议设为pred
# 初始化模型
model = BiLSTMMaxPoolCls(word2vec_embed, len(data_bundle.get_vocab('target')))

(5) 训练模型
~~~~~~~~~~~~

fastNLP提供了Trainer对象来组织训练过程,包括完成loss计算(所以在初始化Trainer的时候需要指定loss类型),梯度更新(所以在初始化Trainer的时候需要提供优化器optimizer)以及在验证集上的性能验证(所以在初始化时需要提供一个Metric)

.. code-block:: python

from fastNLP import Trainer
from fastNLP import CrossEntropyLoss
from torch.optim import Adam
from fastNLP import AccuracyMetric
loss = CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=0.001)
metric = AccuracyMetric()
device = 0 if torch.cuda.is_available() else 'cpu' # 如果有gpu的话在gpu上运行,训练速度会更快
trainer = Trainer(train_data=data_bundle.get_dataset('train'), model=model, loss=loss,
optimizer=optimizer, batch_size=32, dev_data=data_bundle.get_dataset('dev'),
metrics=metric, device=device)
trainer.train() # 开始训练,训练完成之后默认会加载在dev上表现最好的模型
# 在测试集上测试一下模型的性能
from fastNLP import Tester
print("Performance on test is:")
tester = Tester(data=data_bundle.get_dataset('test'), model=model, metrics=metric, batch_size=64, device=device)
tester.test()


.. code-block:: text

input fields after batch(if batch size is 2):
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
chars: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 106])
seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
target fields after batch(if batch size is 2):
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
Evaluate data in 0.01 seconds!
training epochs started 2019-09-03-23-57-10

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=3000), HTML(value='')), layout=Layout(display…

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=38), HTML(value='')), layout=Layout(display='…

Evaluate data in 0.43 seconds!
Evaluation on dev at Epoch 1/10. Step:300/3000:
AccuracyMetric: acc=0.81

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=38), HTML(value='')), layout=Layout(display='…

Evaluate data in 0.44 seconds!
Evaluation on dev at Epoch 2/10. Step:600/3000:
AccuracyMetric: acc=0.8675

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=38), HTML(value='')), layout=Layout(display='…

Evaluate data in 0.44 seconds!
Evaluation on dev at Epoch 3/10. Step:900/3000:
AccuracyMetric: acc=0.878333

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=38), HTML(value='')), layout=Layout(display='…

Evaluate data in 0.43 seconds!
Evaluation on dev at Epoch 4/10. Step:1200/3000:
AccuracyMetric: acc=0.873333

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=38), HTML(value='')), layout=Layout(display='…

Evaluate data in 0.44 seconds!
Evaluation on dev at Epoch 5/10. Step:1500/3000:
AccuracyMetric: acc=0.878333

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=38), HTML(value='')), layout=Layout(display='…

Evaluate data in 0.42 seconds!
Evaluation on dev at Epoch 6/10. Step:1800/3000:
AccuracyMetric: acc=0.895833

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=38), HTML(value='')), layout=Layout(display='…

Evaluate data in 0.44 seconds!
Evaluation on dev at Epoch 7/10. Step:2100/3000:
AccuracyMetric: acc=0.8975

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=38), HTML(value='')), layout=Layout(display='…

Evaluate data in 0.43 seconds!
Evaluation on dev at Epoch 8/10. Step:2400/3000:
AccuracyMetric: acc=0.894167

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=38), HTML(value='')), layout=Layout(display='…

Evaluate data in 0.48 seconds!
Evaluation on dev at Epoch 9/10. Step:2700/3000:
AccuracyMetric: acc=0.8875

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=38), HTML(value='')), layout=Layout(display='…

Evaluate data in 0.43 seconds!
Evaluation on dev at Epoch 10/10. Step:3000/3000:
AccuracyMetric: acc=0.895833
In Epoch:7/Step:2100, got best dev performance:
AccuracyMetric: acc=0.8975
Reloaded the best model.

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=19), HTML(value='')), layout=Layout(display='…

Evaluate data in 0.34 seconds!
[tester]
AccuracyMetric: acc=0.8975

{'AccuracyMetric': {'acc': 0.8975}}



使用Bert进行文本分类
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

# 只需要切换一下Embedding即可
from fastNLP.embeddings import BertEmbedding
# 这里为了演示一下效果,所以默认Bert不更新权重
bert_embed = BertEmbedding(char_vocab, model_dir_or_name='cn', auto_truncate=True, requires_grad=False)
model = BiLSTMMaxPoolCls(bert_embed, len(data_bundle.get_vocab('target')), )
import torch
from fastNLP import Trainer
from fastNLP import CrossEntropyLoss
from torch.optim import Adam
from fastNLP import AccuracyMetric
loss = CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=2e-5)
metric = AccuracyMetric()
device = 0 if torch.cuda.is_available() else 'cpu' # 如果有gpu的话在gpu上运行,训练速度会更快
trainer = Trainer(train_data=data_bundle.get_dataset('train'), model=model, loss=loss,
optimizer=optimizer, batch_size=16, dev_data=data_bundle.get_dataset('test'),
metrics=metric, device=device, n_epochs=3)
trainer.train() # 开始训练,训练完成之后默认会加载在dev上表现最好的模型
# 在测试集上测试一下模型的性能
from fastNLP import Tester
print("Performance on test is:")
tester = Tester(data=data_bundle.get_dataset('test'), model=model, metrics=metric, batch_size=64, device=device)
tester.test()


.. code-block:: text

loading vocabulary file /home/yh/.fastNLP/embedding/bert-chinese-wwm/vocab.txt
Load pre-trained BERT parameters from file /home/yh/.fastNLP/embedding/bert-chinese-wwm/chinese_wwm_pytorch.bin.
Start to generating word pieces for word.
Found(Or segment into word pieces) 4286 words out of 4409.
input fields after batch(if batch size is 2):
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
chars: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 106])
seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
target fields after batch(if batch size is 2):
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
Evaluate data in 0.05 seconds!
training epochs started 2019-09-04-00-02-37

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=3600), HTML(value='')), layout=Layout(display…

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=150), HTML(value='')), layout=Layout(display=…

Evaluate data in 15.89 seconds!
Evaluation on dev at Epoch 1/3. Step:1200/3600:
AccuracyMetric: acc=0.9

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=150), HTML(value='')), layout=Layout(display=…

Evaluate data in 15.92 seconds!
Evaluation on dev at Epoch 2/3. Step:2400/3600:
AccuracyMetric: acc=0.904167

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=150), HTML(value='')), layout=Layout(display=…

Evaluate data in 15.91 seconds!
Evaluation on dev at Epoch 3/3. Step:3600/3600:
AccuracyMetric: acc=0.918333

In Epoch:3/Step:3600, got best dev performance:
AccuracyMetric: acc=0.918333
Reloaded the best model.
Performance on test is:

HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=19), HTML(value='')), layout=Layout(display='…

Evaluate data in 29.24 seconds!
[tester]
AccuracyMetric: acc=0.919167

{'AccuracyMetric': {'acc': 0.919167}}



+ 5
- 115
docs/source/user/quickstart.rst View File

@@ -2,123 +2,13 @@
快速入门
===============

这是一个简单的分类任务 (数据来源 `kaggle <https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews>`_ )。
给出一段文字,预测它的标签是0~4中的哪一个。
如果你想用 fastNLP 来快速地解决某类自然语言处理问题,你可以参考以下教程之一

我们可以使用 fastNLP 中 io 模块中的 :class:`~fastNLP.io.CSVLoader` 类,轻松地从 csv 文件读取我们的数据。
.. toctree::
:maxdepth: 1

.. code-block:: python
/quickstart/文本分类

from fastNLP.io import CSVLoader

loader = CSVLoader(headers=('raw_sentence', 'label'), sep='\t')
dataset = loader.load("./sample_data/tutorial_sample_dataset.csv")
这些教程是简单地介绍了使用 fastNLP 的流程,更多的教程分析见 :doc:`/user/tutorials`

此时的 `dataset[0]` 的值如下,可以看到,数据集中的每个数据包含 ``raw_sentence`` 和 ``label`` 两个字段,他们的类型都是 ``str``::

{'raw_sentence': A series of escapades demonstrating the adage that what is good for the
goose is also good for the gander , some of which occasionally amuses but none of which
amounts to much of a story . type=str,
'label': 1 type=str}


我们使用 :class:`~fastNLP.DataSet` 类的 :meth:`~fastNLP.DataSet.apply` 方法将 ``raw_sentence`` 中字母变成小写,并将句子分词。

.. code-block:: python

dataset.apply(lambda x: x['raw_sentence'].lower(), new_field_name='sentence')
dataset.apply(lambda x: x['sentence'].split(), new_field_name='words', is_input=True)

然后我们再用 :class:`~fastNLP.Vocabulary` 类来统计数据中出现的单词,并将单词序列转化为训练可用的数字序列。

.. code-block:: python

from fastNLP import Vocabulary
vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')
vocab.index_dataset(dataset, field_name='words',new_field_name='words')

同时,我们也将原来 str 类型的标签转化为数字,并设置为训练中的标准答案 ``target``

.. code-block:: python

dataset.apply(lambda x: int(x['label']), new_field_name='target', is_target=True)

现在我们可以导入 fastNLP 内置的文本分类模型 :class:`~fastNLP.models.CNNText` ,


.. code-block:: python

from fastNLP.models import CNNText
model = CNNText((len(vocab),50), num_classes=5, dropout=0.1)

:class:`~fastNLP.models.CNNText` 的网络结构如下::

CNNText(
(embed): Embedding(
177, 50
(dropout): Dropout(p=0.0)
)
(conv_pool): ConvMaxpool(
(convs): ModuleList(
(0): Conv1d(50, 3, kernel_size=(3,), stride=(1,), padding=(2,))
(1): Conv1d(50, 4, kernel_size=(4,), stride=(1,), padding=(2,))
(2): Conv1d(50, 5, kernel_size=(5,), stride=(1,), padding=(2,))
)
)
(dropout): Dropout(p=0.1)
(fc): Linear(in_features=12, out_features=5, bias=True)
)

下面我们用 :class:`~fastNLP.DataSet` 类的 :meth:`~fastNLP.DataSet.split` 方法将数据集划分为 ``train_data`` 和 ``dev_data``
两个部分,分别用于训练和验证

.. code-block:: python

train_data, dev_data = dataset.split(0.2)

最后我们用 fastNLP 的 :class:`~fastNLP.Trainer` 进行训练,训练的过程中需要传入模型 ``model`` ,训练数据集 ``train_data`` ,
验证数据集 ``dev_data`` ,损失函数 ``loss`` 和衡量标准 ``metrics`` 。
其中损失函数使用的是 fastNLP 提供的 :class:`~fastNLP.CrossEntropyLoss` 损失函数;
衡量标准使用的是 fastNLP 提供的 :class:`~fastNLP.AccuracyMetric` 正确率指标。

.. code-block:: python

from fastNLP import Trainer, CrossEntropyLoss, AccuracyMetric

trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data,
loss=CrossEntropyLoss(), metrics=AccuracyMetric())
trainer.train()

训练过程的输出如下::

input fields after batch(if batch size is 2):
words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
target fields after batch(if batch size is 2):
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])

training epochs started 2019-05-09-10-59-39
Evaluation at Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.333333

Evaluation at Epoch 2/10. Step:4/20. AccuracyMetric: acc=0.533333

Evaluation at Epoch 3/10. Step:6/20. AccuracyMetric: acc=0.533333

Evaluation at Epoch 4/10. Step:8/20. AccuracyMetric: acc=0.533333

Evaluation at Epoch 5/10. Step:10/20. AccuracyMetric: acc=0.6

Evaluation at Epoch 6/10. Step:12/20. AccuracyMetric: acc=0.8

Evaluation at Epoch 7/10. Step:14/20. AccuracyMetric: acc=0.8

Evaluation at Epoch 8/10. Step:16/20. AccuracyMetric: acc=0.733333

Evaluation at Epoch 9/10. Step:18/20. AccuracyMetric: acc=0.733333

Evaluation at Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.733333


In Epoch:6/Step:12, got best dev performance:AccuracyMetric: acc=0.8
Reloaded the best model.

这份教程只是简单地介绍了使用 fastNLP 工作的流程,更多的教程分析见 :doc:`/user/tutorials`

Loading…
Cancel
Save