@@ -6,50 +6,59 @@ | |||
![Hex.pm](https://img.shields.io/hexpm/l/plug.svg) | |||
[![Documentation Status](https://readthedocs.org/projects/fastnlp/badge/?version=latest)](http://fastnlp.readthedocs.io/?badge=latest) | |||
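fastNLP is a lightweight NLP toolkit. You can use it to quickly complete tasks such as sequence labeling ([NER](reproduction/seqence_labelling/ner/), POS tagging, etc.), Chinese word segmentation, text classification, [Matching](reproduction/matching/), coreference resolution, and summarization; you can also use it to build complex network models for research. It has the following features:
fastNLP is a lightweight NLP toolkit. You can use it to quickly complete tasks such as sequence labeling ([NER](reproduction/seqence_labelling/ner), POS tagging, etc.), Chinese word segmentation, [text classification](reproduction/text_classification), [Matching](reproduction/matching), [coreference resolution](reproduction/coreference_resolution), and [summarization](reproduction/Summarization); you can also use it to build complex network models for research. It has the following features: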
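- A unified tabular data container that keeps preprocessing simple and clear, with built-in DataSet Loaders for many datasets so you can skip writing preprocessing code;
- A variety of training and testing components, such as the Trainer, the Tester, and various evaluation metrics;
- Handy NLP utilities, such as loading pretrained embeddings (including EMLo and BERT) and caching intermediate data;
- Detailed Chinese [documentation](https://fastnlp.readthedocs.io/) and tutorials for reference;
- Handy NLP utilities, such as loading pretrained embeddings (including ELMo and BERT) and caching intermediate data;
- Detailed Chinese [documentation](https://fastnlp.readthedocs.io/) and [tutorials](https://fastnlp.readthedocs.io/zh/latest/user/tutorials.html) for reference;
- Many advanced modules, such as Variational LSTM, Transformer, and CRF;
- Ready-to-use models for sequence labeling, Chinese word segmentation, text classification, Matching, coreference resolution, summarization, and other tasks; [details](reproduction/)
- Ready-to-use models for sequence labeling, Chinese word segmentation, text classification, Matching, coreference resolution, summarization, and other tasks; see the [reproduction](reproduction) section for details;
- A convenient and extensible trainer, with many built-in callback functions for experiment logging, exception handling, and more.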
## Installation
fastNLP depends on the following packages:
fastNLP depends on the following packages:
+ numpy>=1.14.2 | |||
+ torch>=1.0.0 | |||
+ tqdm>=4.28.1 | |||
+ nltk>=3.4.1 | |||
+ requests | |||
+ spacy | |||
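Installing torch may depend on your operating system and CUDA version; see the [PyTorch website](https://pytorch.org/).
Once the dependencies are installed, run the following commands on the command line to complete the installation: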
```shell | |||
pip install fastNLP | |||
python -m spacy download en | |||
``` | |||
## References
## fastNLP Tutorials
- [Documentation](https://fastnlp.readthedocs.io/zh/latest/)
- [Source code](https://github.com/fastnlp/fastNLP)
- [1. Preprocess text with DataSet](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_1_data_preprocess.html)
- [2. Load datasets with DataSetLoader](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_2_load_dataset.html)
- [3. Convert text to vectors with the Embedding module](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_3_embedding.html)
- [4. Build a text classifier I: quick training and testing with Trainer and Tester](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_4_loss_optimizer.html)
- [5. Build a text classifier II: a custom training loop with DataSetIter](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_5_datasetiter.html)
- [6. Quickly implement a sequence labeling model](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_6_seq_labeling.html)
- [7. Quickly build custom models with Modules and Models](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_7_modules_models.html)
- [8. Quickly evaluate your model with Metric](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_8_metrics.html)
- [9. Customize your training process with Callback](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_9_callback.html)
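## Built-in Components
Most neural networks used for NLP tasks can be viewed as a combination of three kinds of modules: encoder, aggregator, and decoder.
Most neural networks used for NLP tasks can be viewed as a combination of two kinds of modules: encoder and decoder.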
![](./docs/source/figures/text_classification.png) | |||
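fastNLP ships many built-in components for these three kinds of modules in its modules package, helping users quickly build the networks they need. The functions and common components of the three kinds of modules are as follows:
fastNLP ships many built-in components for these two kinds of modules in its modules package, helping users quickly build the networks they need. The functions and common components of the two kinds of modules are as follows: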
<table> | |||
<tr> | |||
@@ -59,29 +68,17 @@ fastNLP 在 modules 模块中内置了三种模块的诸多组件,可以帮助 | |||
</tr> | |||
<tr> | |||
<td> encoder </td> | |||
<td> Encodes the input into vectors with representational power </td>
<td> Encodes the input into vectors with representational power </td>
<td> embedding, RNN, CNN, transformer </td>
</tr> | |||
<tr> | |||
<td> aggregator </td> | |||
<td> Aggregates information from multiple vectors </td>
<td> self-attention, max-pooling </td> | |||
</tr> | |||
<tr> | |||
<td> decoder </td> | |||
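<td> Decodes vectors carrying some representational meaning into the required output form </td>
<td> Decodes vectors carrying some representational meaning into the required output form </td>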
<td> MLP, CRF </td> | |||
</tr> | |||
</table> | |||
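## Complete Models
fastNLP implements many complete models for different NLP tasks, all of which have been trained and tested.
You can find more information in the following two places:
- [Model overview](reproduction/)
- [Model source code](fastNLP/models/)
## Project Structure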
![](./docs/source/figures/workflow.png) | |||
@@ -37,7 +37,7 @@ __all__ = [ | |||
"AccuracyMetric", | |||
"SpanFPreRecMetric", | |||
"SQuADMetric", | |||
"ExtractiveQAMetric", | |||
"Optimizer", | |||
"SGD", | |||
@@ -61,3 +61,4 @@ __version__ = '0.4.0' | |||
from .core import * | |||
from . import models | |||
from . import modules | |||
from .io import data_loader |
@@ -21,7 +21,7 @@ from .dataset import DataSet | |||
from .field import FieldArray, Padder, AutoPadder, EngChar2DPadder | |||
from .instance import Instance | |||
from .losses import LossFunc, CrossEntropyLoss, L1Loss, BCELoss, NLLLoss, LossInForward | |||
from .metrics import AccuracyMetric, SpanFPreRecMetric, SQuADMetric | |||
from .metrics import AccuracyMetric, SpanFPreRecMetric, ExtractiveQAMetric | |||
from .optimizer import Optimizer, SGD, Adam | |||
from .sampler import SequentialSampler, BucketSampler, RandomSampler, Sampler | |||
from .tester import Tester | |||
@@ -3,7 +3,6 @@ batch 模块实现了 fastNLP 所需的 Batch 类。 | |||
""" | |||
__all__ = [ | |||
"BatchIter", | |||
"DataSetIter", | |||
"TorchLoaderIter", | |||
] | |||
@@ -50,6 +49,7 @@ class DataSetGetter: | |||
return len(self.dataset) | |||
def collate_fn(self, batch: list): | |||
# TODO support defining collate_fn in DataSet, since different fields sometimes need to be merged, e.g. for BERT
batch_x = {n:[] for n in self.inputs.keys()} | |||
batch_y = {n:[] for n in self.targets.keys()} | |||
indices = [] | |||
@@ -136,6 +136,31 @@ class BatchIter: | |||
class DataSetIter(BatchIter):
    """
    Alias: :class:`fastNLP.DataSetIter` :class:`fastNLP.core.batch.DataSetIter`

    DataSetIter draws data from a `DataSet` in a given order, ``batch_size`` samples at a time,
    and assembles them into `x` and `y`::

        batch = DataSetIter(data_set, batch_size=16, sampler=SequentialSampler())
        num_batch = len(batch)
        for batch_x, batch_y in batch:
            # do stuff ...

    :param dataset: a :class:`~fastNLP.DataSet` object, the dataset
    :param int batch_size: size of each batch
    :param sampler: the :class:`~fastNLP.Sampler` to use. If ``None``, :class:`~fastNLP.SequentialSampler` is used.

        Default: ``None``
    :param bool as_numpy: if ``True``, batches are output as numpy.array; otherwise as :class:`torch.Tensor`.

        Default: ``False``
    :param int num_workers: number of processes used to preprocess the data
    :param bool pin_memory: whether to place the produced tensors in pinned memory, which may speed things up.
    :param bool drop_last: if the last batch has fewer than batch_size samples, drop it
    :param timeout:
    :param worker_init_fn: called when each worker starts; it is passed one value, the index of the worker.
    """

    def __init__(self, dataset, batch_size=1, sampler=None, as_numpy=False,
                 num_workers=0, pin_memory=False, drop_last=False,
                 timeout=0, worker_init_fn=None):
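The batching behaviour described in the `DataSetIter` docstring (sequential order, fixed `batch_size`, optional `drop_last`) can be illustrated with a small standalone sketch. This is not the fastNLP implementation: `iter_batches` is a hypothetical helper that works on plain indices and assumes a sequential sampler.

```python
from typing import Iterator, List, Sequence


def iter_batches(indices: Sequence[int], batch_size: int,
                 drop_last: bool = False) -> Iterator[List[int]]:
    """Yield consecutive index batches, as a sequential sampler would order them."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(indices), batch_size):
        batch = list(indices[start:start + batch_size])
        if drop_last and len(batch) < batch_size:
            return  # discard the incomplete final batch
        yield batch


# Ten samples with batch_size=4 leave a final batch of two samples.
batches = list(iter_batches(range(10), batch_size=4))
batches_dropped = list(iter_batches(range(10), batch_size=4, drop_last=True))
print(batches)
print(batches_dropped)
```

With `drop_last=True` the trailing two-sample batch is skipped, which keeps every batch the same size at the cost of not visiting those samples in that pass.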
@@ -66,6 +66,8 @@ import os | |||
import torch | |||
from copy import deepcopy | |||
import sys | |||
from .utils import _save_model | |||
try: | |||
from tensorboardX import SummaryWriter | |||
@@ -737,6 +739,132 @@ class TensorboardCallback(Callback): | |||
del self._summary_writer | |||
class WarmupCallback(Callback):
    """
    Adjusts the learning rate with a warmup phase followed by the chosen schedule.

    :param int,float warmup: if warmup is an int, the learning rate follows the schedule's warmup strategy before that
        step; if warmup is a float such as 0.1, the first 10% of the steps follow the schedule's warmup strategy.
    :param str schedule: how the learning rate is adjusted. linear: the learning rate rises to the specified value
        (taken from the optimizer in the Trainer) over the warmup steps, then decreases to 0 over the steps after
        warmup; constant: the learning rate rises to the specified value over the warmup steps and is then held.
    """

    def __init__(self, warmup=0.1, schedule='constant'):
        super().__init__()
        self.warmup = max(warmup, 0.)

        self.initial_lrs = []  # learning rate of each param_group
        if schedule == 'constant':
            self.get_lr = self._get_constant_lr
        elif schedule == 'linear':
            self.get_lr = self._get_linear_lr
        else:
            raise RuntimeError("Only support 'linear', 'constant'.")

    def _get_constant_lr(self, progress):
        if progress < self.warmup:
            return progress / self.warmup
        return 1

    def _get_linear_lr(self, progress):
        if progress < self.warmup:
            return progress / self.warmup
        return max((progress - 1.) / (self.warmup - 1.), 0.)

    def on_train_begin(self):
        self.t_steps = (len(self.trainer.train_data) // (self.batch_size * self.update_every) +
                        int(len(self.trainer.train_data) % (self.batch_size * self.update_every) != 0)) * self.n_epochs
        if self.warmup > 1:
            self.warmup = self.warmup / self.t_steps
        self.t_steps = max(2, self.t_steps)  # must not be smaller than 2
        # record the initial learning rate of each param_group
        for group in self.optimizer.param_groups:
            self.initial_lrs.append(group['lr'])

    def on_backward_end(self):
        if self.step % self.update_every == 0:
            progress = (self.step / self.update_every) / self.t_steps
            for lr, group in zip(self.initial_lrs, self.optimizer.param_groups):
                group['lr'] = lr * self.get_lr(progress)
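To make the two schedules in `WarmupCallback` concrete, the sketch below evaluates the same multiplier formulas as standalone functions. The function names and the sample `base_lr` are illustrative assumptions; only the arithmetic is taken from `_get_constant_lr` and `_get_linear_lr`.

```python
def constant_multiplier(progress: float, warmup: float) -> float:
    """Ramp linearly from 0 to 1 during warmup, then hold at 1."""
    if progress < warmup:
        return progress / warmup
    return 1.0


def linear_multiplier(progress: float, warmup: float) -> float:
    """Ramp linearly from 0 to 1 during warmup, then decay linearly to 0 at progress == 1."""
    if progress < warmup:
        return progress / warmup
    return max((progress - 1.0) / (warmup - 1.0), 0.0)


base_lr = 0.01   # hypothetical optimizer learning rate
warmup = 0.1     # first 10% of the steps are warmup
for progress in (0.0, 0.05, 0.1, 0.55, 1.0):
    print(progress,
          base_lr * constant_multiplier(progress, warmup),
          base_lr * linear_multiplier(progress, warmup))
```

Here `progress` is the fraction of optimizer updates completed so far, which is how `on_backward_end` computes it from `step`, `update_every`, and `t_steps`.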
class SaveModelCallback(Callback):
    """
    Because the Trainer only saves the best model during training, this callback offers additional ways to store results.
    It creates a folder under save_dir named after the training start timestamp and stores multiple models in it::

        -save_dir
            -2019-07-03-15-06-36
                -epoch:0_step:20_{metric_key}:{evaluate_performance}.pt   # metric_key is the given metric key, evaluate_performance is the score
                -epoch:1_step:40_{metric_key}:{evaluate_performance}.pt
            -2019-07-03-15-10-00
                -epoch:0_step:20_{metric_key}:{evaluate_performance}.pt   # metric_key is the given metric key, evaluate_performance is the score

    :param str save_dir: directory in which to store models; a timestamp-named subdirectory is created there to hold them
    :param int top: how many of the best-performing models on dev to keep. -1 keeps all models.
    :param bool only_param: whether to save only the model's weights.
    :param save_on_exception: whether to save a copy of the model when an exception occurs. The model is named
        epoch:x_step:x_Exception:{exception_name}.
    """

    def __init__(self, save_dir, top=3, only_param=False, save_on_exception=False):
        super().__init__()

        if not os.path.isdir(save_dir):
            # NotADirectoryError matches the condition being reported
            raise NotADirectoryError("{} is not a directory.".format(save_dir))
        self.save_dir = save_dir
        if top < 0:
            self.top = sys.maxsize
        else:
            self.top = top
        # List[Tuple]: Tuple[0] is the metric, Tuple[1] is the path. Metrics are ordered from worst to best,
        # so models are deleted from the front.
        self._ordered_save_models = []

        self.only_param = only_param
        self.save_on_exception = save_on_exception

    def on_train_begin(self):
        self.save_dir = os.path.join(self.save_dir, self.trainer.start_time)

    def on_valid_end(self, eval_result, metric_key, optimizer, is_better_eval):
        metric_value = list(eval_result.values())[0][metric_key]
        self._save_this_model(metric_value)

    def _insert_into_ordered_save_models(self, pair):
        # pair: (metric_value, model_name)
        # Returns the pair to save and the pair to delete. The first element of a pair is the metric value,
        # the second is the model name.
        index = -1
        for _pair in self._ordered_save_models:
            if _pair[0] >= pair[0] and self.trainer.increase_better:
                break
            if not self.trainer.increase_better and _pair[0] <= pair[0]:
                break
            index += 1
        save_pair = None
        if len(self._ordered_save_models) < self.top or (len(self._ordered_save_models) >= self.top and index != -1):
            save_pair = pair
            self._ordered_save_models.insert(index + 1, pair)
        delete_pair = None
        if len(self._ordered_save_models) > self.top:
            delete_pair = self._ordered_save_models.pop(0)
        return save_pair, delete_pair

    def _save_this_model(self, metric_value):
        name = "epoch:{}_step:{}_{}:{:.6f}.pt".format(self.epoch, self.step, self.trainer.metric_key, metric_value)
        save_pair, delete_pair = self._insert_into_ordered_save_models((metric_value, name))
        if save_pair:
            try:
                _save_model(self.model, model_name=name, save_dir=self.save_dir, only_param=self.only_param)
            except Exception as e:
                print(f"The following exception:{e} happens when save model to {self.save_dir}.")
        if delete_pair:
            try:
                delete_model_path = os.path.join(self.save_dir, delete_pair[1])
                if os.path.exists(delete_model_path):
                    os.remove(delete_model_path)
            except Exception as e:
                # report the model that failed to be deleted, not the one just saved
                print(f"Fail to delete model {delete_pair[1]} at {self.save_dir} caused by exception:{e}.")

    def on_exception(self, exception):
        if self.save_on_exception:
            name = "epoch:{}_step:{}_Exception:{}.pt".format(self.epoch, self.step, exception.__class__.__name__)
            _save_model(self.model, model_name=name, save_dir=self.save_dir, only_param=self.only_param)
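The keep-the-top-k bookkeeping in `SaveModelCallback` can be sketched without any file I/O. The class below is a simplified stand-in, not the callback's own logic: `TopKTracker` and `offer` are hypothetical names, it uses `bisect` instead of a manual scan, it requires `top >= 1`, and it rejects a candidate that merely ties with the current worst kept model.

```python
import bisect
from typing import List, Optional, Tuple


class TopKTracker:
    """Track which checkpoints to keep; entries are ordered from worst to best."""

    def __init__(self, top: int, increase_better: bool = True):
        if top < 1:
            raise ValueError("top must be at least 1 in this sketch")
        self.top = top
        # Flip the sign so that a larger score is always better internally.
        self.sign = 1.0 if increase_better else -1.0
        self._kept: List[Tuple[float, str]] = []

    def offer(self, metric: float, name: str) -> Tuple[bool, Optional[str]]:
        """Return (should_save, name_of_checkpoint_to_delete)."""
        score = self.sign * metric
        if len(self._kept) >= self.top and score <= self._kept[0][0]:
            return False, None  # not better than the worst kept checkpoint
        bisect.insort(self._kept, (score, name))
        evicted = None
        if len(self._kept) > self.top:
            evicted = self._kept.pop(0)[1]  # drop the worst one
        return True, evicted


tracker = TopKTracker(top=2)
decisions = [tracker.offer(m, n) for m, n in
             [(0.5, "a.pt"), (0.7, "b.pt"), (0.4, "c.pt"), (0.6, "d.pt")]]
print(decisions)
```

A caller would save the model when the first element is true and remove the file named by the second element when it is not `None`.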
class CallbackException(BaseException): | |||
""" | |||
To break out of training from a callback, raise a CallbackException and catch it in on_exception.
@@ -6,7 +6,7 @@ __all__ = [ | |||
"MetricBase", | |||
"AccuracyMetric", | |||
"SpanFPreRecMetric", | |||
"SQuADMetric" | |||
"ExtractiveQAMetric" | |||
] | |||
import inspect | |||
@@ -24,6 +24,7 @@ from .utils import seq_len_to_mask | |||
from .vocabulary import Vocabulary | |||
from abc import abstractmethod | |||
class MetricBase(object): | |||
""" | |||
Base class for all metrics. Every Metric passed to Trainer or Tester must inherit from this class and override the evaluate() and get_metric() methods.
@@ -735,11 +736,11 @@ def _pred_topk(y_prob, k=1): | |||
return y_pred_topk, y_prob_topk | |||
class SQuADMetric(MetricBase): | |||
class ExtractiveQAMetric(MetricBase): | |||
r""" | |||
Alias: :class:`fastNLP.SQuADMetric` :class:`fastNLP.core.metrics.SQuADMetric`
Alias: :class:`fastNLP.ExtractiveQAMetric` :class:`fastNLP.core.metrics.ExtractiveQAMetric`
Metric for the SQuAD dataset
Metric for extractive QA (e.g. SQuAD).
:param pred1: mapping for `pred1` in the parameter map; None means the mapping is `pred1` -> `pred1`
:param pred2: mapping for `pred2` in the parameter map; None means the mapping is `pred2` -> `pred2`
@@ -755,7 +756,7 @@ class SQuADMetric(MetricBase): | |||
def __init__(self, pred1=None, pred2=None, target1=None, target2=None, | |||
beta=1, right_open=True, print_predict_stat=False): | |||
super(SQuADMetric, self).__init__() | |||
super(ExtractiveQAMetric, self).__init__() | |||
self._init_param_map(pred1=pred1, pred2=pred2, target1=target1, target2=target2) | |||
@@ -16,6 +16,7 @@ from collections import Counter, namedtuple | |||
import numpy as np | |||
import torch | |||
import torch.nn as nn | |||
from typing import List | |||
_CheckRes = namedtuple('_CheckRes', ['missing', 'unused', 'duplicated', 'required', 'all_needed', | |||
'varargs']) | |||
@@ -162,6 +163,30 @@ def cache_results(_cache_fp, _refresh=False, _verbose=1): | |||
return wrapper_ | |||
def _save_model(model, model_name, save_dir, only_param=False):
    """ Save a state_dict or model that carries no GPU device information

    :param model: the model to save
    :param model_name: file name to save the model under
    :param save_dir: directory to save into
    :param only_param: whether to save only the state_dict
    :return:
    """
    model_path = os.path.join(save_dir, model_name)
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir, exist_ok=True)
    if isinstance(model, nn.DataParallel):
        model = model.module
    if only_param:
        state_dict = model.state_dict()
        for key in state_dict:
            state_dict[key] = state_dict[key].cpu()
        torch.save(state_dict, model_path)
    else:
        # move to CPU for saving, then restore the original device
        _model_device = _get_model_device(model)
        model.cpu()
        torch.save(model, model_path)
        model.to(_model_device)
# def save_pickle(obj, pickle_path, file_name): | |||
# """Save an object into a pickle file. | |||
@@ -277,7 +302,6 @@ def _move_model_to_device(model, device): | |||
return model | |||
def _get_model_device(model): | |||
""" | |||
Takes an nn.Module model and returns the device it is on
@@ -285,7 +309,7 @@ def _get_model_device(model): | |||
:param model: nn.Module | |||
:return: torch.device or None. A return value of None means the model has no parameters.
""" | |||
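# TODO this function is somewhat risky, because some parameters of a model may not be on the GPU, e.g. BertEmbedding
# TODO this function is somewhat risky, because some parameters of a model may not be on the GPU, e.g. BertEmbedding, or the model may span several GPUs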
assert isinstance(model, nn.Module) | |||
parameters = list(model.parameters()) | |||
@@ -712,3 +736,52 @@ class _pseudo_tqdm: | |||
def __exit__(self, exc_type, exc_val, exc_tb): | |||
del self | |||
def iob2(tags: List[str]) -> List[str]:
    """
    Checks that the data is valid IOB data; IOB1 input is automatically converted to IOB2. For the difference, see
    https://datascience.stackexchange.com/questions/37824/difference-between-iob-and-iob2-format

    :param tags: the tags to convert; they must be upper-case BIO tags. The list is modified in place.
    """
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        split = tag.split("-")
        if len(split) != 2 or split[0] not in ["I", "B"]:
            raise TypeError("The encoding schema is not a valid IOB type.")
        if split[0] == "B":
            continue
        elif i == 0 or tags[i - 1] == "O":  # conversion IOB1 to IOB2
            tags[i] = "B" + tag[1:]
        elif tags[i - 1][1:] == tag[1:]:
            continue
        else:  # conversion IOB1 to IOB2
            tags[i] = "B" + tag[1:]
    return tags


def iob2bioes(tags: List[str]) -> List[str]:
    """
    Converts IOB tags to the BIOES encoding

    :param tags: List[str]. The encoding must be upper case.
    :return:
    """
    new_tags = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            new_tags.append(tag)
        else:
            split = tag.split('-')[0]
            if split == 'B':
                if i + 1 != len(tags) and tags[i + 1].split('-')[0] == 'I':
                    new_tags.append(tag)
                else:
                    new_tags.append(tag.replace('B-', 'S-'))
            elif split == 'I':
                if i + 1 < len(tags) and tags[i + 1].split('-')[0] == 'I':
                    new_tags.append(tag)
                else:
                    new_tags.append(tag.replace('I-', 'E-'))
            else:
                raise TypeError("Invalid IOB format.")
    return new_tags
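A worked example may help show what the two conversions do to a tag sequence. The functions below are a compact, non-mutating re-implementation written for illustration; `to_iob2` and `to_bioes` are hypothetical names, and they raise `ValueError` rather than `TypeError`, so they mirror the behaviour of `iob2` and `iob2bioes` only for well-formed input.

```python
from typing import List


def to_iob2(tags: List[str]) -> List[str]:
    """Return an IOB2 copy of `tags`, which may be IOB1 or IOB2."""
    result = []
    prev = "O"
    for tag in tags:
        if tag == "O":
            result.append(tag)
        else:
            prefix, sep, label = tag.partition("-")
            if prefix not in ("B", "I") or not sep or not label:
                raise ValueError(f"invalid IOB tag: {tag!r}")
            # An I- tag that opens a span (after O or a different label) becomes B-.
            if prefix == "I" and (prev == "O" or prev[2:] != label):
                prefix = "B"
            result.append(f"{prefix}-{label}")
        prev = tag
    return result


def to_bioes(tags: List[str]) -> List[str]:
    """Convert IOB2 tags to BIOES: single-token spans get S-, span ends get E-."""
    result = []
    for i, tag in enumerate(tags):
        if tag == "O":
            result.append(tag)
            continue
        prefix, _, label = tag.partition("-")
        next_is_inside = i + 1 < len(tags) and tags[i + 1].startswith("I-")
        if prefix == "B":
            result.append(tag if next_is_inside else f"S-{label}")
        elif prefix == "I":
            result.append(tag if next_is_inside else f"E-{label}")
        else:
            raise ValueError(f"invalid IOB tag: {tag!r}")
    return result


iob1_tags = ["I-PER", "I-PER", "O", "I-LOC", "B-LOC", "O"]
iob2_tags = to_iob2(iob1_tags)
bioes_tags = to_bioes(iob2_tags)
print(iob2_tags)   # ['B-PER', 'I-PER', 'O', 'B-LOC', 'B-LOC', 'O']
print(bioes_tags)  # ['B-PER', 'E-PER', 'O', 'S-LOC', 'S-LOC', 'O']
```

In the IOB1 input, `B-LOC` appears only to separate two adjacent `LOC` entities; after conversion each entity starts with its own `B-` tag, and BIOES then marks both as single-token spans.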
@@ -91,47 +91,84 @@ class Vocabulary(object): | |||
self.idx2word = None | |||
self.rebuild = True | |||
# Holds words that should not get their own entry; see the from_dataset() method for details
self._no_create_word = defaultdict(int) | |||
self._no_create_word = Counter() | |||
@_check_build_status | |||
def update(self, word_lst): | |||
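def update(self, word_lst, no_create_entry=False):
"""Increments the frequency in the vocabulary of each word in the sequence, in order
:param list word_lst: a list of strings
"""
:param bool no_create_entry: how to handle this word if it is not found in the pretrained vocabulary when a
pretrained model is loaded with fastNLP.TokenEmbedding. If True, no separate entry is created for the word and it
always points to the unk representation; if False, a separate entry is created for it. Typically set this to True
for words from dev or test and to False for words from train. Two special cases: if a word is added with
no_create_entry=True but is already in the Vocabulary and is not no_create_entry, a separate vector is still
created for it; if a word is added with no_create_entry=False but is already in the Vocabulary and is
no_create_entry, it is treated as needing a separate vector.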
""" | |||
self._add_no_create_entry(word_lst, no_create_entry) | |||
self.word_count.update(word_lst) | |||
@_check_build_status | |||
def add(self, word): | |||
def add(self, word, no_create_entry=False): | |||
""" | |||
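Increments the frequency of a new word in the vocabulary
:param str word: the new word
"""
:param bool no_create_entry: how to handle this word if it is not found in the pretrained vocabulary when a
pretrained model is loaded with fastNLP.TokenEmbedding. If True, no separate entry is created for the word and it
always points to the unk representation; if False, a separate entry is created for it. Typically set this to True
for words from dev or test and to False for words from train. Two special cases: if a word is added with
no_create_entry=True but is already in the Vocabulary and is not no_create_entry, a separate vector is still
created for it; if a word is added with no_create_entry=False but is already in the Vocabulary and is
no_create_entry, it is treated as needing a separate vector.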
""" | |||
self._add_no_create_entry(word, no_create_entry) | |||
self.word_count[word] += 1 | |||
    def _add_no_create_entry(self, word, no_create_entry):
        """
        Checks and updates the _no_create_word bookkeeping when a new word is added.

        :param str, List[str] word:
        :param bool no_create_entry:
        :return:
        """
        if isinstance(word, str):
            word = [word]
        for w in word:
            # keep the flag only while every occurrence so far was added with no_create_entry=True
            if no_create_entry and self.word_count.get(w, 0) == self._no_create_word.get(w, 0):
                self._no_create_word[w] += 1
            elif not no_create_entry and w in self._no_create_word:
                self._no_create_word.pop(w)
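The rule encoded in `_add_no_create_entry` is easier to follow on a toy vocabulary. The sketch below is an assumption-laden miniature, not the `Vocabulary` class: `TinyVocab` and `is_no_create_entry` are hypothetical names, and only the two counters and the update rule are modelled.

```python
from collections import Counter


class TinyVocab:
    """Minimal model of the word_count / _no_create_word bookkeeping."""

    def __init__(self):
        self.word_count = Counter()
        self._no_create_word = Counter()

    def add(self, word: str, no_create_entry: bool = False) -> None:
        if no_create_entry:
            # Keep the flag only if every earlier occurrence was also flagged.
            if self.word_count[word] == self._no_create_word[word]:
                self._no_create_word[word] += 1
        else:
            # Any regular occurrence means the word needs its own entry.
            self._no_create_word.pop(word, None)
        self.word_count[word] += 1

    def is_no_create_entry(self, word: str) -> bool:
        return word in self._no_create_word


vocab = TinyVocab()
vocab.add("cat")                        # seen in train
vocab.add("cat", no_create_entry=True)  # seen again in dev: stays a regular word
vocab.add("dog", no_create_entry=True)  # seen only in dev: flagged
vocab.add("bird", no_create_entry=True)
vocab.add("bird")                       # later seen in train: flag removed
print({w: vocab.is_no_create_entry(w) for w in ("cat", "dog", "bird")})
```

Both cases from the docstring show up here: a word first seen in train keeps its own entry even when later added with `no_create_entry=True`, and a flagged word loses the flag as soon as it is added normally.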
@_check_build_status | |||
def add_word(self, word): | |||
def add_word(self, word, no_create_entry=False): | |||
""" | |||
Increments the frequency of a new word in the vocabulary
:param str word: the new word
""" | |||
if word in self._no_create_word: | |||
self._no_create_word.pop(word) | |||
self.add(word) | |||
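:param bool no_create_entry: how to handle this word if it is not found in the pretrained vocabulary when a
pretrained model is loaded with fastNLP.TokenEmbedding. If True, no separate entry is created for the word and it
always points to the unk representation; if False, a separate entry is created for it. Typically set this to True
for words from dev or test and to False for words from train. Two special cases: if a word is added with
no_create_entry=True but is already in the Vocabulary and is not no_create_entry, a separate vector is still
created for it; if a word is added with no_create_entry=False but is already in the Vocabulary and is
no_create_entry, it is treated as needing a separate vector.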
""" | |||
self.add(word, no_create_entry=no_create_entry) | |||
@_check_build_status | |||
def add_word_lst(self, word_lst): | |||
def add_word_lst(self, word_lst, no_create_entry=False): | |||
""" | |||
Increments the frequency in the vocabulary of each word in the sequence, in order
:param list[str] word_lst: the sequence of words
""" | |||
for word in word_lst: | |||
if word in self._no_create_word: | |||
self._no_create_word.pop(word) | |||
self.update(word_lst) | |||
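:param bool no_create_entry: how to handle this word if it is not found in the pretrained vocabulary when a
pretrained model is loaded with fastNLP.TokenEmbedding. If True, no separate entry is created for the word and it
always points to the unk representation; if False, a separate entry is created for it. Typically set this to True
for words from dev or test and to False for words from train. Two special cases: if a word is added with
no_create_entry=True but is already in the Vocabulary and is not no_create_entry, a separate vector is still
created for it; if a word is added with no_create_entry=False but is already in the Vocabulary and is
no_create_entry, it is treated as needing a separate vector.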
""" | |||
self.update(word_lst, no_create_entry=no_create_entry) | |||
def build_vocab(self): | |||
""" | |||
@@ -141,10 +178,10 @@ class Vocabulary(object): | |||
""" | |||
if self.word2idx is None: | |||
self.word2idx = {} | |||
if self.padding is not None: | |||
self.word2idx[self.padding] = len(self.word2idx) | |||
if self.unknown is not None: | |||
self.word2idx[self.unknown] = len(self.word2idx) | |||
if self.padding is not None: | |||
self.word2idx[self.padding] = len(self.word2idx) | |||
if self.unknown is not None: | |||
self.word2idx[self.unknown] = len(self.word2idx) | |||
max_size = min(self.max_size, len(self.word_count)) if self.max_size else None | |||
words = self.word_count.most_common(max_size) | |||
@@ -283,23 +320,17 @@ class Vocabulary(object): | |||
for fn in field_name: | |||
field = ins[fn] | |||
if isinstance(field, str): | |||
if no_create_entry and field not in self.word_count: | |||
self._no_create_word[field] += 1 | |||
self.add_word(field) | |||
self.add_word(field, no_create_entry=no_create_entry) | |||
elif isinstance(field, (list, np.ndarray)): | |||
if not isinstance(field[0], (list, np.ndarray)): | |||
for word in field: | |||
if no_create_entry and word not in self.word_count: | |||
self._no_create_word[word] += 1 | |||
self.add_word(word) | |||
self.add_word(word, no_create_entry=no_create_entry) | |||
else: | |||
if isinstance(field[0][0], (list, np.ndarray)): | |||
raise RuntimeError("Only support field with 2 dimensions.") | |||
for words in field: | |||
for word in words: | |||
if no_create_entry and word not in self.word_count: | |||
self._no_create_word[word] += 1 | |||
self.add_word(word) | |||
self.add_word(word, no_create_entry=no_create_entry) | |||
for idx, dataset in enumerate(datasets): | |||
if isinstance(dataset, DataSet): | |||
@@ -12,22 +12,22 @@ | |||
__all__ = [ | |||
'EmbedLoader', | |||
'DataInfo', | |||
'DataBundle', | |||
'DataSetLoader', | |||
'CSVLoader', | |||
'JsonLoader', | |||
'ConllLoader', | |||
'PeopleDailyCorpusLoader', | |||
'Conll2003Loader', | |||
'ModelLoader', | |||
'ModelSaver', | |||
'SSTLoader', | |||
'ConllLoader', | |||
'Conll2003Loader', | |||
'MatchingLoader', | |||
'PeopleDailyCorpusLoader', | |||
'SNLILoader', | |||
'SSTLoader', | |||
'SST2Loader', | |||
'MNLILoader', | |||
'QNLILoader', | |||
'QuoraLoader', | |||
@@ -35,11 +35,8 @@ __all__ = [ | |||
] | |||
from .embed_loader import EmbedLoader | |||
from .base_loader import DataInfo, DataSetLoader | |||
from .dataset_loader import CSVLoader, JsonLoader, ConllLoader, \ | |||
PeopleDailyCorpusLoader, Conll2003Loader | |||
from .base_loader import DataBundle, DataSetLoader | |||
from .dataset_loader import CSVLoader, JsonLoader | |||
from .model_io import ModelLoader, ModelSaver | |||
from .data_loader.sst import SSTLoader | |||
from .data_loader.matching import MatchingLoader, SNLILoader, \ | |||
MNLILoader, QNLILoader, QuoraLoader, RTELoader | |||
from .data_loader import * |
@@ -1,6 +1,6 @@ | |||
__all__ = [ | |||
"BaseLoader", | |||
'DataInfo', | |||
'DataBundle', | |||
'DataSetLoader', | |||
] | |||
@@ -109,7 +109,7 @@ def _uncompress(src, dst): | |||
raise ValueError('unsupported file {}'.format(src)) | |||
class DataInfo: | |||
class DataBundle: | |||
""" | |||
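Processed data information, including a series of datasets (for example, separate training, validation, and test sets) together with the vocabularies and word embeddings they use.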
@@ -201,20 +201,20 @@ class DataSetLoader: | |||
""" | |||
raise NotImplementedError | |||
def process(self, paths: Union[str, Dict[str, str]], **options) -> DataInfo: | |||
def process(self, paths: Union[str, Dict[str, str]], **options) -> DataBundle: | |||
""" | |||
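For a specific task and dataset, reads and processes the data and returns a processed DataInfo object or a dict.
Reads data from the files at one or more given paths; a DataInfo object can contain one or more datasets.
When multiple paths are processed, the keys of the dict passed in match the keys of the dict in the returned DataInfo.
The returned :class:`DataInfo` object has the following attributes:
The returned :class:`DataBundle` object has the following attributes:
- vocabs: a dict made up of the vocabularies obtained from the datasets
- datasets: a dict containing a series of :class:`~fastNLP.DataSet` objects, whose fields are named following :mod:`~fastNLP.core.const`
:param paths: path(s) from which the raw data is read
:param options: custom parameters designed for the particular task and dataset
:return: a DataInfo
:return: a DataBundle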
""" | |||
raise NotImplementedError |
@@ -4,16 +4,32 @@ | |||
These modules are used as follows:
""" | |||
__all__ = [ | |||
'SSTLoader', | |||
'ConllLoader', | |||
'Conll2003Loader', | |||
'IMDBLoader', | |||
'MatchingLoader', | |||
'SNLILoader', | |||
'MNLILoader', | |||
'MTL16Loader', | |||
'PeopleDailyCorpusLoader', | |||
'QNLILoader', | |||
'QuoraLoader', | |||
'RTELoader', | |||
'SSTLoader', | |||
'SST2Loader', | |||
'SNLILoader', | |||
'YelpLoader', | |||
] | |||
from .sst import SSTLoader | |||
from .matching import MatchingLoader, SNLILoader, \ | |||
MNLILoader, QNLILoader, QuoraLoader, RTELoader | |||
from .conll import ConllLoader, Conll2003Loader | |||
from .imdb import IMDBLoader | |||
from .matching import MatchingLoader | |||
from .mnli import MNLILoader | |||
from .mtl import MTL16Loader | |||
from .people_daily import PeopleDailyCorpusLoader | |||
from .qnli import QNLILoader | |||
from .quora import QuoraLoader | |||
from .rte import RTELoader | |||
from .snli import SNLILoader | |||
from .sst import SSTLoader, SST2Loader | |||
from .yelp import YelpLoader |
@@ -0,0 +1,73 @@ | |||
from ...core.dataset import DataSet | |||
from ...core.instance import Instance | |||
from ..base_loader import DataSetLoader | |||
from ..file_reader import _read_conll | |||
class ConllLoader(DataSetLoader):
    """
    Alias: :class:`fastNLP.io.ConllLoader` :class:`fastNLP.io.data_loader.ConllLoader`

    Reads data in CoNLL format. See http://conll.cemantix.org/2012/data.html for the format. Lines starting with
    "-DOCSTART-" are ignored, because that marker is used as a document separator in CoNLL 2003.

    Column numbers start from 0; the columns contain::

        Column  Type
        0       Document ID
        1       Part number
        2       Word number
        3       Word itself
        4       Part-of-Speech
        5       Parse bit
        6       Predicate lemma
        7       Predicate Frameset ID
        8       Word sense
        9       Speaker/Author
        10      Named Entities
        11:N    Predicate Arguments
        N       Coreference

    :param headers: name of each data column, a List or Tuple of str. ``headers`` corresponds one-to-one with ``indexes``
    :param indexes: indices of the data columns to keep, starting from 0. If ``None``, all columns are kept. Default: ``None``
    :param dropna: whether to skip invalid data. If ``False``, a ``ValueError`` is raised on invalid data. Default: ``False``
    """

    def __init__(self, headers, indexes=None, dropna=False):
        super(ConllLoader, self).__init__()
        if not isinstance(headers, (list, tuple)):
            raise TypeError(
                'invalid headers: {}, should be list of strings'.format(headers))
        self.headers = headers
        self.dropna = dropna
        if indexes is None:
            self.indexes = list(range(len(self.headers)))
        else:
            if len(indexes) != len(headers):
                raise ValueError('indexes and headers must have the same length')
            self.indexes = indexes

    def _load(self, path):
        ds = DataSet()
        for idx, data in _read_conll(path, indexes=self.indexes, dropna=self.dropna):
            ins = {h: data[i] for i, h in enumerate(self.headers)}
            ds.append(Instance(**ins))
        return ds


class Conll2003Loader(ConllLoader):
    """
    Alias: :class:`fastNLP.io.Conll2003Loader` :class:`fastNLP.io.data_loader.Conll2003Loader`

    Reads CoNLL 2003 data

    For more information about the dataset, see:
    https://sites.google.com/site/ermasoftware/getting-started/ne-tagging-conll2003-data
    """

    def __init__(self):
        headers = [
            'tokens', 'pos', 'chunks', 'ner',
        ]
        super(Conll2003Loader, self).__init__(headers=headers)
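How `headers` and `indexes` pair up can be shown with a small reader over in-memory lines. This is a sketch under assumptions, not the library's `_read_conll`: `read_conll_columns` is a hypothetical function that treats blank lines as sentence boundaries, splits columns on whitespace, and skips `-DOCSTART-` lines as the docstring describes.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Sequence


def read_conll_columns(lines: Iterable[str], headers: Sequence[str],
                       indexes: Optional[Sequence[int]] = None
                       ) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per sentence, mapping each header to its column values."""
    if indexes is None:
        indexes = list(range(len(headers)))
    if len(indexes) != len(headers):
        raise ValueError("indexes and headers must have the same length")

    def build(rows: List[List[str]]) -> Dict[str, List[str]]:
        return {h: [row[i] for row in rows] for h, i in zip(headers, indexes)}

    rows: List[List[str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:                      # blank line closes the current sentence
            if rows:
                yield build(rows)
                rows = []
            continue
        if line.startswith("-DOCSTART-"):  # document separator, ignored
            continue
        rows.append(line.split())
    if rows:                              # last sentence without trailing blank line
        yield build(rows)


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER
""".splitlines()

sentences = list(read_conll_columns(sample, headers=["tokens", "ner"], indexes=[0, 3]))
print(sentences)
```

Each yielded dict has the shape that `_load` turns into an `Instance`: one list per header, aligned token by token.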
@@ -0,0 +1,96 @@ | |||
from typing import Union, Dict | |||
from ..embed_loader import EmbeddingOption, EmbedLoader | |||
from ..base_loader import DataSetLoader, DataBundle | |||
from ...core.vocabulary import VocabularyOption, Vocabulary | |||
from ...core.dataset import DataSet | |||
from ...core.instance import Instance | |||
from ...core.const import Const | |||
from ..utils import get_tokenizer | |||
class IMDBLoader(DataSetLoader):
    """
    Reads the IMDB dataset. The DataSet contains the following fields:

        words: list(str), the text to classify
        target: str, the label of the text

    """

    def __init__(self):
        super(IMDBLoader, self).__init__()
        self.tokenizer = get_tokenizer()

    def _load(self, path):
        dataset = DataSet()
        with open(path, 'r', encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # each line is "<label>\t<text>"
                parts = line.split('\t')
                target = parts[0]
                words = self.tokenizer(parts[1].lower())
                dataset.append(Instance(words=words, target=target))

        if len(dataset) == 0:
            raise RuntimeError(f"{path} has no valid data.")

        return dataset

    def process(self,
                paths: Union[str, Dict[str, str]],
                src_vocab_opt: VocabularyOption = None,
                tgt_vocab_opt: VocabularyOption = None,
                char_level_op=False):

        datasets = {}
        info = DataBundle()
        for name, path in paths.items():
            dataset = self.load(path)
            datasets[name] = dataset

        def wordtochar(words):
            # flatten words into characters, with '' marking word boundaries
            chars = []
            for word in words:
                word = word.lower()
                for char in word:
                    chars.append(char)
                chars.append('')
            if chars:  # drop the trailing boundary marker; guard against empty input
                chars.pop()
            return chars

        if char_level_op:
            for dataset in datasets.values():
                dataset.apply_field(wordtochar, field_name="words", new_field_name='chars')

        datasets["train"], datasets["dev"] = datasets["train"].split(0.1, shuffle=False)

        src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt)
        src_vocab.from_dataset(datasets['train'], field_name='words')

        src_vocab.index_dataset(*datasets.values(), field_name='words')

        tgt_vocab = Vocabulary(unknown=None, padding=None) \
            if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt)
        tgt_vocab.from_dataset(datasets['train'], field_name='target')
        tgt_vocab.index_dataset(*datasets.values(), field_name='target')

        info.vocabs = {
            Const.INPUT: src_vocab,
            Const.TARGET: tgt_vocab
        }

        info.datasets = datasets

        for name, dataset in info.datasets.items():
            dataset.set_input(Const.INPUT)
            dataset.set_target(Const.TARGET)

        return info
@@ -1,18 +1,17 @@ | |||
import os | |||
from typing import Union, Dict | |||
from typing import Union, Dict, List | |||
from ...core.const import Const | |||
from ...core.vocabulary import Vocabulary | |||
from ..base_loader import DataInfo, DataSetLoader | |||
from ..dataset_loader import JsonLoader, CSVLoader | |||
from ..base_loader import DataBundle, DataSetLoader | |||
from ..file_utils import _get_base_url, cached_path, PRETRAINED_BERT_MODEL_DIR | |||
from ...modules.encoder._bert import BertTokenizer | |||
class MatchingLoader(DataSetLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.MatchingLoader` :class:`fastNLP.io.dataset_loader.MatchingLoader`
Alias: :class:`fastNLP.io.MatchingLoader` :class:`fastNLP.io.data_loader.MatchingLoader`
Reads datasets for the Matching task
@@ -34,7 +33,8 @@ class MatchingLoader(DataSetLoader): | |||
to_lower=False, seq_len_type: str=None, bert_tokenizer: str=None, | |||
cut_text: int = None, get_index=True, auto_pad_length: int=None, | |||
auto_pad_token: str='<pad>', set_input: Union[list, str, bool]=True, | |||
set_target: Union[list, str, bool] = True, concat: Union[str, list, bool]=None, ) -> DataInfo: | |||
set_target: Union[list, str, bool]=True, concat: Union[str, list, bool]=None, | |||
extra_split: List[str]=None, ) -> DataBundle: | |||
""" | |||
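:param paths: str or Dict[str, str]. If a str, it is either the directory containing the dataset or a full file path; for a directory, | |||
the dataset names and file names are looked up in self.paths. If a Dict, it holds dataset names (e.g. train, dev, test) and | |||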
@@ -57,6 +57,7 @@ class MatchingLoader(DataSetLoader): | |||
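:param concat: Whether to concatenate the two sentences. If False, they are not concatenated. If True, a <sep> is inserted between the two sentences. | |||
If a list of length 4 is passed, its items are the markers inserted before the first sentence, after the first sentence, before the second sentence, and after the second sentence. If | |||
the string ``bert`` is passed, BERT-style concatenation is used, equivalent to ['[CLS]', '[SEP]', '', '[SEP]']. | |||
:param extra_split: Extra separators, i.e. characters other than spaces that are used to split tokens. | |||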
:return: | |||
""" | |||
if isinstance(set_input, str): | |||
@@ -79,7 +80,7 @@ class MatchingLoader(DataSetLoader): | |||
else: | |||
path = paths | |||
data_info = DataInfo() | |||
data_info = DataBundle() | |||
for data_name in path.keys(): | |||
data_info.datasets[data_name] = self._load(path[data_name]) | |||
@@ -90,6 +91,24 @@ class MatchingLoader(DataSetLoader): | |||
if Const.TARGET in data_set.get_field_names(): | |||
data_set.set_target(Const.TARGET) | |||
if extra_split is not None: | |||
for data_name, data_set in data_info.datasets.items(): | |||
data_set.apply(lambda x: ' '.join(x[Const.INPUTS(0)]), new_field_name=Const.INPUTS(0)) | |||
data_set.apply(lambda x: ' '.join(x[Const.INPUTS(1)]), new_field_name=Const.INPUTS(1)) | |||
for s in extra_split: | |||
data_set.apply(lambda x: x[Const.INPUTS(0)].replace(s, ' ' + s + ' '), | |||
new_field_name=Const.INPUTS(0)) | |||
data_set.apply(lambda x: x[Const.INPUTS(1)].replace(s, ' ' + s + ' '), | |||
new_field_name=Const.INPUTS(1)) | |||
_filt = lambda x: x | |||
data_set.apply(lambda x: list(filter(_filt, x[Const.INPUTS(0)].split(' '))), | |||
new_field_name=Const.INPUTS(0), is_input=auto_set_input) | |||
data_set.apply(lambda x: list(filter(_filt, x[Const.INPUTS(1)].split(' '))), | |||
new_field_name=Const.INPUTS(1), is_input=auto_set_input) | |||
_filt = None | |||
if to_lower: | |||
for data_name, data_set in data_info.datasets.items(): | |||
data_set.apply(lambda x: [w.lower() for w in x[Const.INPUTS(0)]], new_field_name=Const.INPUTS(0), | |||
@@ -227,204 +246,3 @@ class MatchingLoader(DataSetLoader): | |||
data_set.set_target(*[target for target in set_target if target in data_set.get_field_names()]) | |||
return data_info | |||
class SNLILoader(MatchingLoader, JsonLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.SNLILoader` :class:`fastNLP.io.dataset_loader.SNLILoader` | |||
Loads the SNLI dataset. The resulting DataSet contains the fields:: | |||
words1: list(str), the first sentence (premise) | |||
words2: list(str), the second sentence (hypothesis) | |||
target: str, the gold label | |||
Data source: https://nlp.stanford.edu/projects/snli/snli_1.0.zip | |||
""" | |||
def __init__(self, paths: dict=None): | |||
fields = { | |||
'sentence1_binary_parse': Const.INPUTS(0), | |||
'sentence2_binary_parse': Const.INPUTS(1), | |||
'gold_label': Const.TARGET, | |||
} | |||
paths = paths if paths is not None else { | |||
'train': 'snli_1.0_train.jsonl', | |||
'dev': 'snli_1.0_dev.jsonl', | |||
'test': 'snli_1.0_test.jsonl'} | |||
MatchingLoader.__init__(self, paths=paths) | |||
JsonLoader.__init__(self, fields=fields) | |||
def _load(self, path): | |||
ds = JsonLoader._load(self, path) | |||
parentheses_table = str.maketrans({'(': None, ')': None}) | |||
ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(), | |||
new_field_name=Const.INPUTS(0)) | |||
ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(), | |||
new_field_name=Const.INPUTS(1)) | |||
ds.drop(lambda x: x[Const.TARGET] == '-') | |||
return ds | |||
class RTELoader(MatchingLoader, CSVLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.RTELoader` :class:`fastNLP.io.dataset_loader.RTELoader` | |||
Loads the RTE dataset. The resulting DataSet contains the fields:: | |||
words1: list(str), the first sentence (premise) | |||
words2: list(str), the second sentence (hypothesis) | |||
target: str, the gold label | |||
Data source: | |||
""" | |||
def __init__(self, paths: dict=None): | |||
paths = paths if paths is not None else { | |||
'train': 'train.tsv', | |||
'dev': 'dev.tsv', | |||
'test': 'test.tsv' # the test set has no labels | |||
} | |||
MatchingLoader.__init__(self, paths=paths) | |||
self.fields = { | |||
'sentence1': Const.INPUTS(0), | |||
'sentence2': Const.INPUTS(1), | |||
'label': Const.TARGET, | |||
} | |||
CSVLoader.__init__(self, sep='\t') | |||
def _load(self, path): | |||
ds = CSVLoader._load(self, path) | |||
for k, v in self.fields.items(): | |||
if v in ds.get_field_names(): | |||
ds.rename_field(k, v) | |||
for fields in ds.get_all_fields(): | |||
if Const.INPUT in fields: | |||
ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields) | |||
return ds | |||
class QNLILoader(MatchingLoader, CSVLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.QNLILoader` :class:`fastNLP.io.dataset_loader.QNLILoader` | |||
Loads the QNLI dataset. The resulting DataSet contains the fields:: | |||
words1: list(str), the first sentence (premise) | |||
words2: list(str), the second sentence (hypothesis) | |||
target: str, the gold label | |||
Data source: | |||
""" | |||
def __init__(self, paths: dict=None): | |||
paths = paths if paths is not None else { | |||
'train': 'train.tsv', | |||
'dev': 'dev.tsv', | |||
'test': 'test.tsv' # the test set has no labels | |||
} | |||
MatchingLoader.__init__(self, paths=paths) | |||
self.fields = { | |||
'question': Const.INPUTS(0), | |||
'sentence': Const.INPUTS(1), | |||
'label': Const.TARGET, | |||
} | |||
CSVLoader.__init__(self, sep='\t') | |||
def _load(self, path): | |||
ds = CSVLoader._load(self, path) | |||
for k, v in self.fields.items(): | |||
if v in ds.get_field_names(): | |||
ds.rename_field(k, v) | |||
for fields in ds.get_all_fields(): | |||
if Const.INPUT in fields: | |||
ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields) | |||
return ds | |||
class MNLILoader(MatchingLoader, CSVLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.MNLILoader` :class:`fastNLP.io.dataset_loader.MNLILoader` | |||
Loads the MNLI dataset. The resulting DataSet contains the fields:: | |||
words1: list(str), the first sentence (premise) | |||
words2: list(str), the second sentence (hypothesis) | |||
target: str, the gold label | |||
Data source: | |||
""" | |||
def __init__(self, paths: dict=None): | |||
paths = paths if paths is not None else { | |||
'train': 'train.tsv', | |||
'dev_matched': 'dev_matched.tsv', | |||
'dev_mismatched': 'dev_mismatched.tsv', | |||
'test_matched': 'test_matched.tsv', | |||
'test_mismatched': 'test_mismatched.tsv', | |||
# 'test_0.9_matched': 'multinli_0.9_test_matched_unlabeled.txt', | |||
# 'test_0.9_mismatched': 'multinli_0.9_test_mismatched_unlabeled.txt', | |||
# test_0.9_matched and test_0.9_mismatched come from MNLI version 0.9 (data source: Kaggle) | |||
} | |||
MatchingLoader.__init__(self, paths=paths) | |||
CSVLoader.__init__(self, sep='\t') | |||
self.fields = { | |||
'sentence1_binary_parse': Const.INPUTS(0), | |||
'sentence2_binary_parse': Const.INPUTS(1), | |||
'gold_label': Const.TARGET, | |||
} | |||
def _load(self, path): | |||
ds = CSVLoader._load(self, path) | |||
for k, v in self.fields.items(): | |||
if k in ds.get_field_names(): | |||
ds.rename_field(k, v) | |||
if Const.TARGET in ds.get_field_names(): | |||
if ds[0][Const.TARGET] == 'hidden': | |||
ds.delete_field(Const.TARGET) | |||
parentheses_table = str.maketrans({'(': None, ')': None}) | |||
ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(), | |||
new_field_name=Const.INPUTS(0)) | |||
ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(), | |||
new_field_name=Const.INPUTS(1)) | |||
if Const.TARGET in ds.get_field_names(): | |||
ds.drop(lambda x: x[Const.TARGET] == '-') | |||
return ds | |||
class QuoraLoader(MatchingLoader, CSVLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.QuoraLoader` :class:`fastNLP.io.dataset_loader.QuoraLoader` | |||
Loads the Quora dataset. The resulting DataSet contains the fields:: | |||
words1: list(str), the first sentence (premise) | |||
words2: list(str), the second sentence (hypothesis) | |||
target: str, the gold label | |||
Data source: | |||
""" | |||
def __init__(self, paths: dict=None): | |||
paths = paths if paths is not None else { | |||
'train': 'train.tsv', | |||
'dev': 'dev.tsv', | |||
'test': 'test.tsv', | |||
} | |||
MatchingLoader.__init__(self, paths=paths) | |||
CSVLoader.__init__(self, sep='\t', headers=(Const.TARGET, Const.INPUTS(0), Const.INPUTS(1), 'pairID')) | |||
def _load(self, path): | |||
ds = CSVLoader._load(self, path) | |||
return ds |
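The `extra_split` handling added to `MatchingLoader.process` in this diff can be illustrated without fastNLP. The following is a minimal plain-Python sketch of the same idea, assuming the input is already a list of tokens; `split_with_extra` is a hypothetical helper name and not part of fastNLP.

```python
from typing import List


def split_with_extra(tokens: List[str], extra_split: List[str]) -> List[str]:
    """Re-tokenize `tokens`, also splitting around every string in `extra_split`.

    Hypothetical helper: it mirrors the join / replace / split / filter steps
    of the loader, but operates on a plain list instead of a DataSet field.
    """
    text = ' '.join(tokens)
    for sep in extra_split:
        # Pad each extra separator with spaces so that splitting isolates it.
        text = text.replace(sep, ' ' + sep + ' ')
    # Consecutive spaces yield empty strings; drop them.
    return [tok for tok in text.split(' ') if tok]


print(split_with_extra(['state-of-the-art', 'models/systems'], ['-', '/']))
# ['state', '-', 'of', '-', 'the', '-', 'art', 'models', '/', 'systems']
```

Note that the separators themselves are kept as tokens rather than discarded, which matches the replace-with-padding approach in the diff.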
@@ -0,0 +1,60 @@ | |||
from ...core.const import Const | |||
from .matching import MatchingLoader | |||
from ..dataset_loader import CSVLoader | |||
class MNLILoader(MatchingLoader, CSVLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.MNLILoader` :class:`fastNLP.io.data_loader.MNLILoader` | |||
Loads the MNLI dataset. The resulting DataSet contains the fields:: | |||
words1: list(str), the first sentence (premise) | |||
words2: list(str), the second sentence (hypothesis) | |||
target: str, the gold label | |||
Data source: | |||
""" | |||
def __init__(self, paths: dict=None): | |||
paths = paths if paths is not None else { | |||
'train': 'train.tsv', | |||
'dev_matched': 'dev_matched.tsv', | |||
'dev_mismatched': 'dev_mismatched.tsv', | |||
'test_matched': 'test_matched.tsv', | |||
'test_mismatched': 'test_mismatched.tsv', | |||
# 'test_0.9_matched': 'multinli_0.9_test_matched_unlabeled.txt', | |||
# 'test_0.9_mismatched': 'multinli_0.9_test_mismatched_unlabeled.txt', | |||
# test_0.9_matched and test_0.9_mismatched come from MNLI version 0.9 (data source: Kaggle) | |||
} | |||
MatchingLoader.__init__(self, paths=paths) | |||
CSVLoader.__init__(self, sep='\t') | |||
self.fields = { | |||
'sentence1_binary_parse': Const.INPUTS(0), | |||
'sentence2_binary_parse': Const.INPUTS(1), | |||
'gold_label': Const.TARGET, | |||
} | |||
def _load(self, path): | |||
ds = CSVLoader._load(self, path) | |||
for k, v in self.fields.items(): | |||
if k in ds.get_field_names(): | |||
ds.rename_field(k, v) | |||
if Const.TARGET in ds.get_field_names(): | |||
if ds[0][Const.TARGET] == 'hidden': | |||
ds.delete_field(Const.TARGET) | |||
parentheses_table = str.maketrans({'(': None, ')': None}) | |||
ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(), | |||
new_field_name=Const.INPUTS(0)) | |||
ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(), | |||
new_field_name=Const.INPUTS(1)) | |||
if Const.TARGET in ds.get_field_names(): | |||
ds.drop(lambda x: x[Const.TARGET] == '-') | |||
return ds |
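The `_load` method above turns the `sentence*_binary_parse` columns into token lists by deleting parentheses and splitting on whitespace. A small standalone sketch of that step, using only the standard library (`parse_to_tokens` is an illustrative name, not a fastNLP function):

```python
# Translation table that deletes both kinds of parentheses.
PARENS = str.maketrans({'(': None, ')': None})


def parse_to_tokens(binary_parse: str) -> list:
    """Strip the bracketing of a binary parse string and return its tokens."""
    return binary_parse.translate(PARENS).strip().split()


print(parse_to_tokens('( ( A dog ) ( runs . ) )'))
# ['A', 'dog', 'runs', '.']
```

Because `str.split()` without arguments collapses runs of whitespace, the gaps left behind by the removed parentheses do not produce empty tokens.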
@@ -0,0 +1,65 @@ | |||
from typing import Union, Dict | |||
from ..base_loader import DataBundle | |||
from ..dataset_loader import CSVLoader | |||
from ...core.vocabulary import Vocabulary, VocabularyOption | |||
from ...core.const import Const | |||
from ..utils import check_dataloader_paths | |||
class MTL16Loader(CSVLoader): | |||
""" | |||
Loads the MTL16 dataset. The DataSet contains the following fields: | |||
words: list(str), the text to classify | |||
target: str, the label of the text | |||
Data source: https://pan.baidu.com/s/1c2L6vdA | |||
""" | |||
def __init__(self): | |||
super(MTL16Loader, self).__init__(headers=(Const.TARGET, Const.INPUT), sep='\t') | |||
def _load(self, path): | |||
dataset = super(MTL16Loader, self)._load(path) | |||
dataset.apply(lambda x: x[Const.INPUT].lower().split(), new_field_name=Const.INPUT) | |||
if len(dataset) == 0: | |||
raise RuntimeError(f"{path} has no valid data.") | |||
return dataset | |||
def process(self, | |||
paths: Union[str, Dict[str, str]], | |||
src_vocab_opt: VocabularyOption = None, | |||
tgt_vocab_opt: VocabularyOption = None,): | |||
paths = check_dataloader_paths(paths) | |||
datasets = {} | |||
info = DataBundle() | |||
for name, path in paths.items(): | |||
dataset = self.load(path) | |||
datasets[name] = dataset | |||
src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt) | |||
src_vocab.from_dataset(datasets['train'], field_name=Const.INPUT) | |||
src_vocab.index_dataset(*datasets.values(), field_name=Const.INPUT) | |||
tgt_vocab = Vocabulary(unknown=None, padding=None) \ | |||
if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt) | |||
tgt_vocab.from_dataset(datasets['train'], field_name=Const.TARGET) | |||
tgt_vocab.index_dataset(*datasets.values(), field_name=Const.TARGET) | |||
info.vocabs = { | |||
Const.INPUT: src_vocab, | |||
Const.TARGET: tgt_vocab | |||
} | |||
info.datasets = datasets | |||
for name, dataset in info.datasets.items(): | |||
dataset.set_input(Const.INPUT) | |||
dataset.set_target(Const.TARGET) | |||
return info |
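The `process` method above follows a pattern that recurs across these loaders: build the vocabulary from the training split only, then index every split with it. The sketch below shows that pattern with plain dictionaries. It is an approximation under stated assumptions and does not reproduce fastNLP's `Vocabulary` class, which has its own options and may order entries differently; `build_vocab` and `index_sentences` are hypothetical helpers.

```python
from typing import Dict, Iterable, List, Optional


def build_vocab(sentences: Iterable[List[str]],
                padding: Optional[str] = '<pad>',
                unknown: Optional[str] = '<unk>') -> Dict[str, int]:
    """Assign an index to each word in order of first appearance (assumption)."""
    vocab: Dict[str, int] = {}
    for special in (padding, unknown):
        if special is not None:
            vocab[special] = len(vocab)
    for sent in sentences:
        for word in sent:
            vocab.setdefault(word, len(vocab))
    return vocab


def index_sentences(sentences, vocab, unknown='<unk>'):
    """Map words to indices; unseen words fall back to the unknown index."""
    unk = vocab.get(unknown)
    return [[vocab.get(word, unk) for word in sent] for sent in sentences]


train = [['good', 'movie'], ['bad', 'movie']]
test = [['great', 'movie']]
vocab = build_vocab(train)           # built from the training split only
print(index_sentences(test, vocab))  # 'great' was never seen in train
# [[1, 3]]
```

Passing `padding=None, unknown=None`, as the loaders do for the target vocabulary, leaves only the label strings in the mapping.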
@@ -0,0 +1,85 @@ | |||
from ..base_loader import DataSetLoader | |||
from ...core.dataset import DataSet | |||
from ...core.instance import Instance | |||
from ...core.const import Const | |||
class PeopleDailyCorpusLoader(DataSetLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.PeopleDailyCorpusLoader` :class:`fastNLP.io.dataset_loader.PeopleDailyCorpusLoader` | |||
Loads the People's Daily corpus. | |||
""" | |||
def __init__(self, pos=True, ner=True): | |||
super(PeopleDailyCorpusLoader, self).__init__() | |||
self.pos = pos | |||
self.ner = ner | |||
def _load(self, data_path): | |||
with open(data_path, "r", encoding="utf-8") as f: | |||
sents = f.readlines() | |||
examples = [] | |||
for sent in sents: | |||
if len(sent) <= 2: | |||
continue | |||
inside_ne = False | |||
sent_pos_tag = [] | |||
sent_words = [] | |||
sent_ner = [] | |||
words = sent.strip().split()[1:] | |||
for word in words: | |||
if "[" in word and "]" in word: | |||
ner_tag = "U" | |||
print(word) | |||
elif "[" in word: | |||
inside_ne = True | |||
ner_tag = "B" | |||
word = word[1:] | |||
elif "]" in word: | |||
ner_tag = "L" | |||
word = word[:word.index("]")] | |||
if inside_ne is True: | |||
inside_ne = False | |||
else: | |||
raise RuntimeError("only ] appears!") | |||
else: | |||
if inside_ne is True: | |||
ner_tag = "I" | |||
else: | |||
ner_tag = "O" | |||
tmp = word.split("/") | |||
token, pos = tmp[0], tmp[1] | |||
sent_ner.append(ner_tag) | |||
sent_pos_tag.append(pos) | |||
sent_words.append(token) | |||
example = [sent_words] | |||
if self.pos is True: | |||
example.append(sent_pos_tag) | |||
if self.ner is True: | |||
example.append(sent_ner) | |||
examples.append(example) | |||
return self.convert(examples) | |||
def convert(self, data): | |||
""" | |||
:param data: a built-in Python object | |||
:return: a :class:`~fastNLP.DataSet` object | |||
""" | |||
data_set = DataSet() | |||
for item in data: | |||
sent_words = item[0] | |||
if self.pos is True and self.ner is True: | |||
instance = Instance( | |||
words=sent_words, pos_tags=item[1], ner=item[2]) | |||
elif self.pos is True: | |||
instance = Instance(words=sent_words, pos_tags=item[1]) | |||
elif self.ner is True: | |||
instance = Instance(words=sent_words, ner=item[1]) | |||
else: | |||
instance = Instance(words=sent_words) | |||
data_set.append(instance) | |||
data_set.apply(lambda ins: len(ins[Const.INPUT]), new_field_name=Const.INPUT_LEN) | |||
return data_set |
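The bracket handling in `_load` above assigns B/I/L/O/U tags to the tokens of bracketed entities. The sketch below restates that loop as a standalone function so the tagging can be tried on a single line. The sample tokens are made up for illustration, and one detail differs from the loader: the single-token (`U`) branch here also strips the brackets, which is an assumption about the intended behavior.

```python
def tag_tokens(tokens):
    """Return (words, pos_tags, ner_tags) for tokens of the form 'word/pos'."""
    words, pos_tags, ner_tags = [], [], []
    inside = False
    for tok in tokens:
        if '[' in tok and ']' in tok:
            tag = 'U'                      # entity consisting of a single token
            tok = tok[1:tok.index(']')]    # assumption: strip brackets here too
        elif '[' in tok:
            inside, tag = True, 'B'        # first token of an entity
            tok = tok[1:]
        elif ']' in tok:
            if not inside:
                raise RuntimeError("only ] appears!")
            inside, tag = False, 'L'       # last token of an entity
            tok = tok[:tok.index(']')]
        else:
            tag = 'I' if inside else 'O'
        word, pos = tok.split('/')[:2]
        words.append(word)
        pos_tags.append(pos)
        ner_tags.append(tag)
    return words, pos_tags, ner_tags


sample = ['[中央/n', '人民/n', '广播/vn', '电台/n]nt', '播出/v']
print(tag_tokens(sample))
# (['中央', '人民', '广播', '电台', '播出'], ['n', 'n', 'vn', 'n', 'v'], ['B', 'I', 'I', 'L', 'O'])
```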
@@ -0,0 +1,45 @@ | |||
from ...core.const import Const | |||
from .matching import MatchingLoader | |||
from ..dataset_loader import CSVLoader | |||
class QNLILoader(MatchingLoader, CSVLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.QNLILoader` :class:`fastNLP.io.data_loader.QNLILoader` | |||
Loads the QNLI dataset. The resulting DataSet contains the fields:: | |||
words1: list(str), the first sentence (premise) | |||
words2: list(str), the second sentence (hypothesis) | |||
target: str, the gold label | |||
Data source: | |||
""" | |||
def __init__(self, paths: dict=None): | |||
paths = paths if paths is not None else { | |||
'train': 'train.tsv', | |||
'dev': 'dev.tsv', | |||
'test': 'test.tsv' # the test set has no labels | |||
} | |||
MatchingLoader.__init__(self, paths=paths) | |||
self.fields = { | |||
'question': Const.INPUTS(0), | |||
'sentence': Const.INPUTS(1), | |||
'label': Const.TARGET, | |||
} | |||
CSVLoader.__init__(self, sep='\t') | |||
def _load(self, path): | |||
ds = CSVLoader._load(self, path) | |||
for k, v in self.fields.items(): | |||
if k in ds.get_field_names(): | |||
ds.rename_field(k, v) | |||
for fields in ds.get_all_fields(): | |||
if Const.INPUT in fields: | |||
ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields) | |||
return ds |
@@ -0,0 +1,32 @@ | |||
from ...core.const import Const | |||
from .matching import MatchingLoader | |||
from ..dataset_loader import CSVLoader | |||
class QuoraLoader(MatchingLoader, CSVLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.QuoraLoader` :class:`fastNLP.io.data_loader.QuoraLoader` | |||
Loads the Quora dataset. The resulting DataSet contains the fields:: | |||
words1: list(str), the first sentence (premise) | |||
words2: list(str), the second sentence (hypothesis) | |||
target: str, the gold label | |||
Data source: | |||
""" | |||
def __init__(self, paths: dict=None): | |||
paths = paths if paths is not None else { | |||
'train': 'train.tsv', | |||
'dev': 'dev.tsv', | |||
'test': 'test.tsv', | |||
} | |||
MatchingLoader.__init__(self, paths=paths) | |||
CSVLoader.__init__(self, sep='\t', headers=(Const.TARGET, Const.INPUTS(0), Const.INPUTS(1), 'pairID')) | |||
def _load(self, path): | |||
ds = CSVLoader._load(self, path) | |||
return ds |
@@ -0,0 +1,45 @@ | |||
from ...core.const import Const | |||
from .matching import MatchingLoader | |||
from ..dataset_loader import CSVLoader | |||
class RTELoader(MatchingLoader, CSVLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.RTELoader` :class:`fastNLP.io.data_loader.RTELoader` | |||
Loads the RTE dataset. The resulting DataSet contains the fields:: | |||
words1: list(str), the first sentence (premise) | |||
words2: list(str), the second sentence (hypothesis) | |||
target: str, the gold label | |||
Data source: | |||
""" | |||
def __init__(self, paths: dict=None): | |||
paths = paths if paths is not None else { | |||
'train': 'train.tsv', | |||
'dev': 'dev.tsv', | |||
'test': 'test.tsv' # the test set has no labels | |||
} | |||
MatchingLoader.__init__(self, paths=paths) | |||
self.fields = { | |||
'sentence1': Const.INPUTS(0), | |||
'sentence2': Const.INPUTS(1), | |||
'label': Const.TARGET, | |||
} | |||
CSVLoader.__init__(self, sep='\t') | |||
def _load(self, path): | |||
ds = CSVLoader._load(self, path) | |||
for k, v in self.fields.items(): | |||
if k in ds.get_field_names(): | |||
ds.rename_field(k, v) | |||
for fields in ds.get_all_fields(): | |||
if Const.INPUT in fields: | |||
ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields) | |||
return ds |
@@ -0,0 +1,44 @@ | |||
from ...core.const import Const | |||
from .matching import MatchingLoader | |||
from ..dataset_loader import JsonLoader | |||
class SNLILoader(MatchingLoader, JsonLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.SNLILoader` :class:`fastNLP.io.data_loader.SNLILoader` | |||
Loads the SNLI dataset. The resulting DataSet contains the fields:: | |||
words1: list(str), the first sentence (premise) | |||
words2: list(str), the second sentence (hypothesis) | |||
target: str, the gold label | |||
Data source: https://nlp.stanford.edu/projects/snli/snli_1.0.zip | |||
""" | |||
def __init__(self, paths: dict=None): | |||
fields = { | |||
'sentence1_binary_parse': Const.INPUTS(0), | |||
'sentence2_binary_parse': Const.INPUTS(1), | |||
'gold_label': Const.TARGET, | |||
} | |||
paths = paths if paths is not None else { | |||
'train': 'snli_1.0_train.jsonl', | |||
'dev': 'snli_1.0_dev.jsonl', | |||
'test': 'snli_1.0_test.jsonl'} | |||
MatchingLoader.__init__(self, paths=paths) | |||
JsonLoader.__init__(self, fields=fields) | |||
def _load(self, path): | |||
ds = JsonLoader._load(self, path) | |||
parentheses_table = str.maketrans({'(': None, ')': None}) | |||
ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(), | |||
new_field_name=Const.INPUTS(0)) | |||
ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(), | |||
new_field_name=Const.INPUTS(1)) | |||
ds.drop(lambda x: x[Const.TARGET] == '-') | |||
return ds |
@@ -1,19 +1,19 @@ | |||
from typing import Iterable | |||
from typing import Union, Dict | |||
from nltk import Tree | |||
import spacy | |||
from ..base_loader import DataInfo, DataSetLoader | |||
from ..base_loader import DataBundle, DataSetLoader | |||
from ..dataset_loader import CSVLoader | |||
from ...core.vocabulary import VocabularyOption, Vocabulary | |||
from ...core.dataset import DataSet | |||
from ...core.const import Const | |||
from ...core.instance import Instance | |||
from ..utils import check_dataloader_paths, get_tokenizer | |||
class SSTLoader(DataSetLoader): | |||
URL = 'https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip' | |||
DATA_DIR = 'sst/' | |||
""" | |||
Alias: :class:`fastNLP.io.SSTLoader` :class:`fastNLP.io.dataset_loader.SSTLoader` | |||
Alias: :class:`fastNLP.io.SSTLoader` :class:`fastNLP.io.data_loader.SSTLoader` | |||
Loads the SST dataset. The DataSet contains the fields:: | |||
@@ -26,6 +26,9 @@ class SSTLoader(DataSetLoader): | |||
:param fine_grained: Whether to use the SST-5 label set; if ``False``, SST-2 is used. Default: ``False`` | |||
""" | |||
URL = 'https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip' | |||
DATA_DIR = 'sst/' | |||
def __init__(self, subtree=False, fine_grained=False): | |||
self.subtree = subtree | |||
@@ -57,8 +60,8 @@ class SSTLoader(DataSetLoader): | |||
def _get_one(self, data, subtree): | |||
tree = Tree.fromstring(data) | |||
if subtree: | |||
return [([x.text for x in self.tokenizer(' '.join(t.leaves()))], t.label()) for t in tree.subtrees() ] | |||
return [([x.text for x in self.tokenizer(' '.join(tree.leaves()))], tree.label())] | |||
return [(self.tokenizer(' '.join(t.leaves())), t.label()) for t in tree.subtrees() ] | |||
return [(self.tokenizer(' '.join(tree.leaves())), tree.label())] | |||
def process(self, | |||
paths, train_subtree=True, | |||
@@ -70,7 +73,7 @@ class SSTLoader(DataSetLoader): | |||
tgt_vocab = Vocabulary(unknown=None, padding=None) \ | |||
if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op) | |||
info = DataInfo() | |||
info = DataBundle() | |||
origin_subtree = self.subtree | |||
self.subtree = train_subtree | |||
info.datasets['train'] = self._load(paths['train']) | |||
@@ -98,3 +101,75 @@ class SSTLoader(DataSetLoader): | |||
return info | |||
class SST2Loader(CSVLoader): | |||
""" | |||
Data source "SST": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8', | |||
""" | |||
def __init__(self): | |||
super(SST2Loader, self).__init__(sep='\t') | |||
self.tokenizer = get_tokenizer() | |||
self.field = {'sentence': Const.INPUT, 'label': Const.TARGET} | |||
def _load(self, path: str) -> DataSet: | |||
ds = super(SST2Loader, self)._load(path) | |||
for k, v in self.field.items(): | |||
if k in ds.get_field_names(): | |||
ds.rename_field(k, v) | |||
ds.apply(lambda x: self.tokenizer(x[Const.INPUT]), new_field_name=Const.INPUT) | |||
print("all count:", len(ds)) | |||
return ds | |||
def process(self, | |||
paths: Union[str, Dict[str, str]], | |||
src_vocab_opt: VocabularyOption = None, | |||
tgt_vocab_opt: VocabularyOption = None, | |||
char_level_op=False): | |||
paths = check_dataloader_paths(paths) | |||
datasets = {} | |||
info = DataBundle() | |||
for name, path in paths.items(): | |||
dataset = self.load(path) | |||
datasets[name] = dataset | |||
def wordtochar(words): | |||
chars = [] | |||
for word in words: | |||
word = word.lower() | |||
for char in word: | |||
chars.append(char) | |||
chars.append('') | |||
chars.pop() | |||
return chars | |||
input_name, target_name = Const.INPUT, Const.TARGET | |||
info.vocabs={} | |||
# split the words into characters | |||
if char_level_op: | |||
for dataset in datasets.values(): | |||
dataset.apply_field(wordtochar, field_name=Const.INPUT, new_field_name=Const.CHAR_INPUT) | |||
src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt) | |||
src_vocab.from_dataset(datasets['train'], field_name=Const.INPUT) | |||
src_vocab.index_dataset(*datasets.values(), field_name=Const.INPUT) | |||
tgt_vocab = Vocabulary(unknown=None, padding=None) \ | |||
if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt) | |||
tgt_vocab.from_dataset(datasets['train'], field_name=Const.TARGET) | |||
tgt_vocab.index_dataset(*datasets.values(), field_name=Const.TARGET) | |||
info.vocabs = { | |||
Const.INPUT: src_vocab, | |||
Const.TARGET: tgt_vocab | |||
} | |||
info.datasets = datasets | |||
for name, dataset in info.datasets.items(): | |||
dataset.set_input(Const.INPUT) | |||
dataset.set_target(Const.TARGET) | |||
return info | |||
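The nested `wordtochar` helper in `SST2Loader.process` flattens a list of words into lowercase characters, with an empty string marking each word boundary. A standalone sketch of that transformation follows; `word_to_chars` is an illustrative name, and the guard for an empty input is an addition that the helper in the diff does not have.

```python
def word_to_chars(words):
    """Flatten words into lowercase characters, separated by '' at word boundaries."""
    chars = []
    for word in words:
        chars.extend(word.lower())
        chars.append('')   # empty string marks the end of a word
    if chars:              # added guard: avoid pop() on an empty list
        chars.pop()        # drop the boundary marker after the last word
    return chars


print(word_to_chars(['Hi', 'yo']))
# ['h', 'i', '', 'y', 'o']
```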
@@ -0,0 +1,127 @@ | |||
import csv | |||
from typing import Iterable | |||
from ...core.const import Const | |||
from ...core.dataset import DataSet | |||
from ...core.instance import Instance | |||
from ...core.vocabulary import VocabularyOption, Vocabulary | |||
from ..base_loader import DataBundle, DataSetLoader | |||
from typing import Union, Dict | |||
from ..utils import check_dataloader_paths, get_tokenizer | |||
class YelpLoader(DataSetLoader): | |||
""" | |||
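Loads the Yelp_full/Yelp_polarity dataset. The DataSet contains the fields: | |||
words: list(str), the text to classify | |||
target: str, the label of the text | |||
chars: list(str), the un-indexed list of characters | |||
Datasets: yelp_full/yelp_polarity | |||
:param fine_grained: Whether to keep the five-class labels; if ``False``, 'very negative' and 'very positive' are merged into 'negative' and 'positive'. Default: ``False`` | |||
:param lower: Whether to lowercase the text automatically. Default: ``False``. | |||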
""" | |||
def __init__(self, fine_grained=False, lower=False): | |||
super(YelpLoader, self).__init__() | |||
tag_v = {'1.0': 'very negative', '2.0': 'negative', '3.0': 'neutral', | |||
'4.0': 'positive', '5.0': 'very positive'} | |||
if not fine_grained: | |||
tag_v['1.0'] = tag_v['2.0'] | |||
tag_v['5.0'] = tag_v['4.0'] | |||
self.fine_grained = fine_grained | |||
self.tag_v = tag_v | |||
self.lower = lower | |||
self.tokenizer = get_tokenizer() | |||
def _load(self, path): | |||
ds = DataSet() | |||
csv_reader = csv.reader(open(path, encoding='utf-8')) | |||
all_count = 0 | |||
real_count = 0 | |||
for row in csv_reader: | |||
all_count += 1 | |||
if len(row) == 2: | |||
target = self.tag_v[row[0] + ".0"] | |||
words = clean_str(row[1], self.tokenizer, self.lower) | |||
if len(words) != 0: | |||
ds.append(Instance(words=words, target=target)) | |||
real_count += 1 | |||
print("all count:", all_count) | |||
print("real count:", real_count) | |||
return ds | |||
def process(self, paths: Union[str, Dict[str, str]], | |||
train_ds: Iterable[str] = None, | |||
src_vocab_op: VocabularyOption = None, | |||
tgt_vocab_op: VocabularyOption = None, | |||
char_level_op=False): | |||
paths = check_dataloader_paths(paths) | |||
info = DataBundle(datasets=self.load(paths)) | |||
src_vocab = Vocabulary() if src_vocab_op is None else Vocabulary(**src_vocab_op) | |||
tgt_vocab = Vocabulary(unknown=None, padding=None) \ | |||
if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op) | |||
_train_ds = [info.datasets[name] | |||
for name in train_ds] if train_ds else info.datasets.values() | |||
def wordtochar(words): | |||
chars = [] | |||
for word in words: | |||
word = word.lower() | |||
for char in word: | |||
chars.append(char) | |||
chars.append('') | |||
chars.pop() | |||
return chars | |||
input_name, target_name = Const.INPUT, Const.TARGET | |||
info.vocabs = {} | |||
# split the words into characters | |||
if char_level_op: | |||
for dataset in info.datasets.values(): | |||
dataset.apply_field(wordtochar, field_name=Const.INPUT, new_field_name=Const.CHAR_INPUT) | |||
else: | |||
src_vocab.from_dataset(*_train_ds, field_name=input_name) | |||
src_vocab.index_dataset(*info.datasets.values(), field_name=input_name, new_field_name=input_name) | |||
info.vocabs[input_name] = src_vocab | |||
tgt_vocab.from_dataset(*_train_ds, field_name=target_name) | |||
tgt_vocab.index_dataset( | |||
*info.datasets.values(), | |||
field_name=target_name, new_field_name=target_name) | |||
info.vocabs[target_name] = tgt_vocab | |||
info.datasets['train'], info.datasets['dev'] = info.datasets['train'].split(0.1, shuffle=False) | |||
for name, dataset in info.datasets.items(): | |||
dataset.set_input(Const.INPUT) | |||
dataset.set_target(Const.TARGET) | |||
return info | |||
def clean_str(sentence, tokenizer, char_lower=False): | |||
""" | |||
heavily borrowed from github | |||
https://github.com/LukeZhuang/Hierarchical-Attention-Network/blob/master/yelp-preprocess.ipynb | |||
:param sentence: is a str | |||
:return: | |||
""" | |||
if char_lower: | |||
sentence = sentence.lower() | |||
import re | |||
nonalpnum = re.compile('[^0-9a-zA-Z?!\']+') | |||
words = tokenizer(sentence) | |||
words_collection = [] | |||
for word in words: | |||
if word in ['-lrb-', '-rrb-', '<sssss>', '-r', '-l', 'b-']: | |||
continue | |||
tt = nonalpnum.split(word) | |||
t = ''.join(tt) | |||
if t != '': | |||
words_collection.append(t) | |||
return words_collection | |||
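To see what `clean_str` keeps and drops, the following sketch restates its filtering logic as a standalone function. It assumes a whitespace tokenizer (`str.split`) in place of the tokenizer returned by `get_tokenizer()`, so the token boundaries may differ from what the loader produces; `clean_tokens` is an illustrative name.

```python
import re

# Anything that is not a digit, an ASCII letter, '?', '!' or an apostrophe.
NON_ALNUM = re.compile(r"[^0-9a-zA-Z?!']+")
# Bracket placeholders and sentence markers that are skipped entirely.
SKIP = {'-lrb-', '-rrb-', '<sssss>', '-r', '-l', 'b-'}


def clean_tokens(sentence, tokenizer=str.split, lower=False):
    """Tokenize, drop placeholder tokens, and strip disallowed characters."""
    if lower:
        sentence = sentence.lower()
    cleaned = []
    for word in tokenizer(sentence):
        if word in SKIP:
            continue
        kept = ''.join(NON_ALNUM.split(word))
        if kept:
            cleaned.append(kept)
    return cleaned


print(clean_tokens("Great food!!! -LRB- it's $5.99 -RRB-", lower=True))
# ['great', 'food!!!', "it's", '599']
```

The skip list contains only lowercase entries, so placeholders such as `-LRB-` are filtered out only when the text has been lowercased first.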
@@ -15,199 +15,13 @@ The dataset_loader module implements many DataSetLoader classes, used to read different formats of | |||
__all__ = [ | |||
'CSVLoader', | |||
'JsonLoader', | |||
'ConllLoader', | |||
'PeopleDailyCorpusLoader', | |||
'Conll2003Loader', | |||
] | |||
import os | |||
from nltk import Tree | |||
from typing import Union, Dict | |||
from ..core.vocabulary import Vocabulary | |||
from ..core.dataset import DataSet | |||
from ..core.instance import Instance | |||
from .file_reader import _read_csv, _read_json, _read_conll | |||
from .base_loader import DataSetLoader, DataInfo | |||
from ..core.const import Const | |||
from ..modules.encoder._bert import BertTokenizer | |||
class PeopleDailyCorpusLoader(DataSetLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.PeopleDailyCorpusLoader` :class:`fastNLP.io.dataset_loader.PeopleDailyCorpusLoader` | |||
Loads the People's Daily corpus. | |||
""" | |||
def __init__(self, pos=True, ner=True): | |||
super(PeopleDailyCorpusLoader, self).__init__() | |||
self.pos = pos | |||
self.ner = ner | |||
def _load(self, data_path): | |||
with open(data_path, "r", encoding="utf-8") as f: | |||
sents = f.readlines() | |||
examples = [] | |||
for sent in sents: | |||
if len(sent) <= 2: | |||
continue | |||
inside_ne = False | |||
sent_pos_tag = [] | |||
sent_words = [] | |||
sent_ner = [] | |||
words = sent.strip().split()[1:] | |||
for word in words: | |||
if "[" in word and "]" in word: | |||
ner_tag = "U" | |||
print(word) | |||
elif "[" in word: | |||
inside_ne = True | |||
ner_tag = "B" | |||
word = word[1:] | |||
elif "]" in word: | |||
ner_tag = "L" | |||
word = word[:word.index("]")] | |||
if inside_ne is True: | |||
inside_ne = False | |||
else: | |||
raise RuntimeError("only ] appears!") | |||
else: | |||
if inside_ne is True: | |||
ner_tag = "I" | |||
else: | |||
ner_tag = "O" | |||
tmp = word.split("/") | |||
token, pos = tmp[0], tmp[1] | |||
sent_ner.append(ner_tag) | |||
sent_pos_tag.append(pos) | |||
sent_words.append(token) | |||
example = [sent_words] | |||
if self.pos is True: | |||
example.append(sent_pos_tag) | |||
if self.ner is True: | |||
example.append(sent_ner) | |||
examples.append(example) | |||
return self.convert(examples) | |||
def convert(self, data): | |||
""" | |||
:param data: a built-in Python object | |||
:return: a :class:`~fastNLP.DataSet` object | |||
""" | |||
data_set = DataSet() | |||
for item in data: | |||
sent_words = item[0] | |||
if self.pos is True and self.ner is True: | |||
instance = Instance( | |||
words=sent_words, pos_tags=item[1], ner=item[2]) | |||
elif self.pos is True: | |||
instance = Instance(words=sent_words, pos_tags=item[1]) | |||
elif self.ner is True: | |||
instance = Instance(words=sent_words, ner=item[1]) | |||
else: | |||
instance = Instance(words=sent_words) | |||
data_set.append(instance) | |||
data_set.apply(lambda ins: len(ins[Const.INPUT]), new_field_name=Const.INPUT_LEN) | |||
return data_set | |||
class ConllLoader(DataSetLoader): | |||
""" | |||
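Alias: :class:`fastNLP.io.ConllLoader` :class:`fastNLP.io.dataset_loader.ConllLoader` | |||
Loads data in CoNLL format. See http://conll.cemantix.org/2012/data.html for the format. Lines starting with "-DOCSTART-" are ignored, because | |||
this marker is used as a document separator in CoNLL 2003. | |||
Column numbers start at 0. The columns contain:: | |||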
Column Type | |||
0 Document ID | |||
1 Part number | |||
2 Word number | |||
3 Word itself | |||
4 Part-of-Speech | |||
5 Parse bit | |||
6 Predicate lemma | |||
7 Predicate Frameset ID | |||
8 Word sense | |||
9 Speaker/Author | |||
10 Named Entities | |||
11:N Predicate Arguments | |||
N Coreference | |||
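:param headers: Names of the data columns; must be a list or tuple of str. ``headers`` and ``indexes`` correspond one to one | |||
:param indexes: Indices of the columns to keep, starting at 0. If ``None``, all columns are kept. Default: ``None`` | |||
:param dropna: Whether to skip invalid data; if ``False``, a ``ValueError`` is raised on invalid data. Default: ``False`` | |||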
""" | |||
def __init__(self, headers, indexes=None, dropna=False): | |||
super(ConllLoader, self).__init__() | |||
if not isinstance(headers, (list, tuple)): | |||
raise TypeError( | |||
'invalid headers: {}, should be list of strings'.format(headers)) | |||
self.headers = headers | |||
self.dropna = dropna | |||
if indexes is None: | |||
self.indexes = list(range(len(self.headers))) | |||
else: | |||
if len(indexes) != len(headers): | |||
raise ValueError('indexes and headers must have the same length') | |||
self.indexes = indexes | |||
def _load(self, path): | |||
ds = DataSet() | |||
for idx, data in _read_conll(path, indexes=self.indexes, dropna=self.dropna): | |||
ins = {h: data[i] for i, h in enumerate(self.headers)} | |||
ds.append(Instance(**ins)) | |||
return ds | |||
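The pairing of ``headers`` with ``indexes`` described in the docstring can be illustrated with a small plain-Python sketch. The helper `read_conll_columns` below is hypothetical and is not fastNLP's `_read_conll`; it assumes whitespace-separated columns, blank lines between sentences, and that `-DOCSTART-` lines are skipped.

```python
def read_conll_columns(text, headers, indexes=None):
    """Hypothetical stand-in for a CoNLL reader: returns one dict per sentence."""
    if indexes is None:
        indexes = list(range(len(headers)))
    if len(indexes) != len(headers):
        raise ValueError("indexes and headers must have the same length")

    sentences, rows = [], []
    # The extra "" at the end flushes the last sentence.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if line.startswith("-DOCSTART-"):
            continue  # document separator, ignored
        if not line:
            if rows:
                columns = list(zip(*rows))  # transpose: rows -> columns
                sentences.append(
                    {h: list(columns[i]) for h, i in zip(headers, indexes)}
                )
                rows = []
            continue
        rows.append(line.split())
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
"""

# Keep only column 0 (the word) and column 3 (the NER tag).
result = read_conll_columns(sample, headers=["tokens", "ner"], indexes=[0, 3])
print(result)
```

Each header names the column selected by the index at the same position, which is why the two sequences must have equal length.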
class Conll2003Loader(ConllLoader): | |||
""" | |||
Alias: :class:`fastNLP.io.Conll2003Loader` :class:`fastNLP.io.dataset_loader.Conll2003Loader` | |||
Reads CoNLL-2003 data | |||
For more information about the dataset, see: | |||
https://sites.google.com/site/ermasoftware/getting-started/ne-tagging-conll2003-data | |||
""" | |||
def __init__(self): | |||
headers = [ | |||
'tokens', 'pos', 'chunks', 'ner', | |||
] | |||
super(Conll2003Loader, self).__init__(headers=headers) | |||
def _cut_long_sentence(sent, max_sample_length=200): | |||
    """ | |||
    Split a sentence longer than max_sample_length into several pieces. Cuts happen only at spaces, | |||
    so the resulting pieces may be longer or shorter than max_sample_length. | |||
    :param sent: str. | |||
    :param max_sample_length: int. | |||
    :return: list of str. | |||
    """ | |||
    sent_no_space = sent.replace(' ', '') | |||
    cutted_sentence = [] | |||
    if len(sent_no_space) > max_sample_length: | |||
        parts = sent.strip().split() | |||
        new_line = '' | |||
        length = 0 | |||
        for part in parts: | |||
            length += len(part) | |||
            new_line += part + ' ' | |||
            if length > max_sample_length: | |||
                new_line = new_line[:-1] | |||
                cutted_sentence.append(new_line) | |||
                length = 0 | |||
                new_line = '' | |||
        if new_line != '': | |||
            cutted_sentence.append(new_line[:-1]) | |||
    else: | |||
        cutted_sentence.append(sent) | |||
    return cutted_sentence | |||
from .file_reader import _read_csv, _read_json | |||
from .base_loader import DataSetLoader | |||
class JsonLoader(DataSetLoader): | |||
@@ -272,6 +86,36 @@ class CSVLoader(DataSetLoader): | |||
return ds | |||
def _cut_long_sentence(sent, max_sample_length=200): | |||
    """ | |||
    Split a sentence longer than max_sample_length into several pieces. Cuts happen only at spaces, | |||
    so the resulting pieces may be longer or shorter than max_sample_length. | |||
    :param sent: str. | |||
    :param max_sample_length: int. | |||
    :return: list of str. | |||
    """ | |||
    sent_no_space = sent.replace(' ', '') | |||
    cutted_sentence = [] | |||
    if len(sent_no_space) > max_sample_length: | |||
        parts = sent.strip().split() | |||
        new_line = '' | |||
        length = 0 | |||
        for part in parts: | |||
            length += len(part) | |||
            new_line += part + ' ' | |||
            if length > max_sample_length: | |||
                new_line = new_line[:-1] | |||
                cutted_sentence.append(new_line) | |||
                length = 0 | |||
                new_line = '' | |||
        if new_line != '': | |||
            cutted_sentence.append(new_line[:-1]) | |||
    else: | |||
        cutted_sentence.append(sent) | |||
    return cutted_sentence | |||
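To see why the pieces can end up longer or shorter than the limit, here is a condensed restatement of the same logic with a few small inputs. The name `cut_long_sentence` and the sample strings are illustrative only; the behavior is meant to mirror `_cut_long_sentence` above.

```python
def cut_long_sentence(sent, max_sample_length=200):
    """Condensed restatement of _cut_long_sentence for illustration."""
    if len(sent.replace(' ', '')) <= max_sample_length:
        return [sent]
    pieces, current, length = [], [], 0
    for part in sent.strip().split():
        current.append(part)
        length += len(part)          # spaces are not counted
        if length > max_sample_length:
            pieces.append(' '.join(current))
            current, length = [], 0
    if current:                      # leftover words form a final, shorter piece
        pieces.append(' '.join(current))
    return pieces


# A piece is closed only after the limit has been exceeded, so each full
# piece here carries 8 non-space characters although the limit is 6.
print(cut_long_sentence("aaaa bbbb cccc dddd", max_sample_length=6))
# ['aaaa bbbb', 'cccc dddd']

# Leftover words at the end give a piece shorter than the limit.
print(cut_long_sentence("aaaa bbbb cc", max_sample_length=6))
# ['aaaa bbbb', 'cc']

# Short input is returned unchanged as a single-element list.
print(cut_long_sentence("ab cd"))
# ['ab cd']
```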
def _add_seg_tag(data): | |||
""" | |||
@@ -17,6 +17,10 @@ PRETRAINED_BERT_MODEL_DIR = { | |||
'en-large-uncased': 'bert-large-uncased-20939f45.zip', | |||
'en-large-cased': 'bert-large-cased-e0cf90fc.zip', | |||
'en-large-cased-wwm': 'bert-large-cased-wwm-a457f118.zip', | |||
'en-large-uncased-wwm': 'bert-large-uncased-wwm-92a50aeb.zip', | |||
'en-base-cased-mrpc': 'bert-base-cased-finetuned-mrpc-c7099855.zip', | |||
'cn': 'bert-base-chinese-29d0a84a.zip', | |||
'cn-base': 'bert-base-chinese-29d0a84a.zip', | |||
@@ -68,6 +72,7 @@ def cached_path(url_or_filename: str, cache_dir: Path=None) -> Path: | |||
"unable to parse {} as a URL or as a local path".format(url_or_filename) | |||
) | |||
def get_filepath(filepath): | |||
""" | |||
If filepath contains only one file, return its full path directly | |||
@@ -82,6 +87,7 @@ def get_filepath(filepath): | |||
return filepath | |||
return filepath | |||
def get_defalt_path(): | |||
""" | |||
Get the default fastNLP cache path. If FASTNLP_CACHE_PATH is set as an environment variable, its value is used, so that not every user has to download the resources. | |||
@@ -98,6 +104,7 @@ def get_defalt_path(): | |||
fastnlp_cache_dir = os.path.expanduser(os.path.join("~", ".fastNLP")) | |||
return fastnlp_cache_dir | |||
def _get_base_url(name): | |||
# The returned URL must end with / | |||
if 'FASTNLP_BASE_URL' in os.environ: | |||
@@ -105,6 +112,7 @@ def _get_base_url(name): | |||
return fastnlp_base_url | |||
raise RuntimeError("This function is not available right now.") | |||
def split_filename_suffix(filepath): | |||
""" | |||
Given filepath, return the corresponding name and suffix | |||
@@ -116,6 +124,7 @@ def split_filename_suffix(filepath): | |||
return filename[:-7], '.tar.gz' | |||
return os.path.splitext(filename) | |||
def get_from_cache(url: str, cache_dir: Path = None) -> Path: | |||
""" | |||
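Try to find the resource defined by url in cache_dir. If it is not found, download it from url and store the result under cache_dir; the cache name is inferred from the url. | |||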
@@ -226,6 +235,7 @@ def get_from_cache(url: str, cache_dir: Path = None) -> Path: | |||
return get_filepath(cache_path) | |||
def unzip_file(file: Path, to: Path): | |||
# unpack the zip archive `file` into the directory `to` | |||
from zipfile import ZipFile | |||
@@ -234,13 +244,15 @@ def unzip_file(file: Path, to: Path): | |||
# Extract all the contents of zip file in current directory | |||
zipObj.extractall(to) | |||
def untar_gz_file(file:Path, to:Path): | |||
import tarfile | |||
with tarfile.open(file, 'r:gz') as tar: | |||
tar.extractall(to) | |||
def match_file(dir_name:str, cache_dir:str)->str: | |||
def match_file(dir_name: str, cache_dir: str) -> str: | |||
""" | |||
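Matching rule: a file under cache_dir matches if it (1) is identical to dir_name, or (2) is identical to dir_name apart from its suffix. | |||
If two matches are found, an error is raised. If one match is found, the name of the matched file is returned; if none is found, an empty string is returned. | |||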
@@ -261,6 +273,7 @@ def match_file(dir_name:str, cache_dir:str)->str: | |||
else: | |||
raise RuntimeError(f"Duplicate matched files:{matched_filenames}, this should be caused by a bug.") | |||
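The matching rule from the `match_file` docstring can be sketched as follows. This is a minimal illustration, not the library's code: `match_name` is a hypothetical helper that takes a list of file names instead of listing `cache_dir`, and it assumes that `.tar.gz` is treated as a single suffix, as in `split_filename_suffix`.

```python
import os


def match_name(dir_name, filenames):
    """Return the single file name matching dir_name, or '' if none matches."""
    matched = []
    for name in filenames:
        # Treat '.tar.gz' as one suffix; otherwise strip the last extension.
        if name.endswith(".tar.gz"):
            stem = name[:-7]
        else:
            stem = os.path.splitext(name)[0]
        # (1) exact match, or (2) match apart from the suffix
        if name == dir_name or stem == dir_name:
            matched.append(name)
    if not matched:
        return ""
    if len(matched) == 1:
        return matched[0]
    raise RuntimeError(f"Duplicate matched files: {matched}")


cached = ["bert-base.zip", "elmo.tar.gz"]
print(match_name("bert-base", cached))  # bert-base.zip
print(match_name("elmo", cached))       # elmo.tar.gz
print(repr(match_name("glove", cached)))  # ''
```

Raising on two matches rather than picking one keeps an ambiguous cache from silently loading the wrong resource.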
if __name__ == '__main__': | |||
cache_dir = Path('caches') | |||
cache_dir = None | |||
@@ -4,149 +4,209 @@ __all__ = [ | |||
import torch | |||
import torch.nn as nn | |||
import torch.nn.functional as F | |||
from .base_model import BaseModel | |||
from ..core.const import Const | |||
from ..modules import decoder as Decoder | |||
from ..modules import encoder as Encoder | |||
from ..modules import aggregator as Aggregator | |||
from ..core.utils import seq_len_to_mask | |||
from torch.nn import CrossEntropyLoss | |||
my_inf = 10e12 | |||
from fastNLP.models import BaseModel | |||
from fastNLP.modules.encoder.embedding import TokenEmbedding | |||
from fastNLP.modules.encoder.lstm import LSTM | |||
from fastNLP.core.const import Const | |||
from fastNLP.core.utils import seq_len_to_mask | |||
class ESIM(BaseModel): | |||
""" | |||
Alias: :class:`fastNLP.models.ESIM` :class:`fastNLP.models.snli.ESIM` | |||
A PyTorch implementation of the ESIM model. | |||
Paper for the ESIM model: Enhanced LSTM for Natural Language Inference (arXiv: 1609.06038) | |||
"""A PyTorch implementation of the ESIM model | |||
For the paper, see: https://arxiv.org/pdf/1609.06038.pdf | |||
:param int vocab_size: vocabulary size | |||
:param int embed_dim: word embedding dimension | |||
:param int hidden_size: LSTM hidden layer size | |||
:param float dropout: dropout rate, default 0 | |||
:param int num_classes: number of labels, default 3 | |||
:param numpy.array init_embedding: initial word embedding matrix of shape (vocab_size, embed_dim); default None, i.e. the word embedding matrix is randomly initialized | |||
:param fastNLP.TokenEmbedding init_embedding: the initialized TokenEmbedding | |||
:param int hidden_size: hidden layer size; defaults to the Embedding dimension | |||
:param int num_labels: number of target label classes, default 3 | |||
:param float dropout_rate: dropout rate, default 0.3 | |||
:param float dropout_embed: dropout rate applied to the Embedding, default 0.1 | |||
""" | |||
def __init__(self, vocab_size, embed_dim, hidden_size, dropout=0.0, num_classes=3, init_embedding=None): | |||
def __init__(self, init_embedding: TokenEmbedding, hidden_size=None, num_labels=3, dropout_rate=0.3, | |||
dropout_embed=0.1): | |||
super(ESIM, self).__init__() | |||
self.vocab_size = vocab_size | |||
self.embed_dim = embed_dim | |||
self.hidden_size = hidden_size | |||
self.dropout = dropout | |||
self.n_labels = num_classes | |||
self.drop = nn.Dropout(self.dropout) | |||
self.embedding = Encoder.Embedding( | |||
(self.vocab_size, self.embed_dim), dropout=self.dropout, | |||
) | |||
self.embedding_layer = nn.Linear(self.embed_dim, self.hidden_size) | |||
self.encoder = Encoder.LSTM( | |||
input_size=self.embed_dim, hidden_size=self.hidden_size, num_layers=1, bias=True, | |||
batch_first=True, bidirectional=True | |||
) | |||
self.bi_attention = Aggregator.BiAttention() | |||
self.mean_pooling = Aggregator.AvgPoolWithMask() | |||
self.max_pooling = Aggregator.MaxPoolWithMask() | |||
self.inference_layer = nn.Linear(self.hidden_size * 4, self.hidden_size) | |||
self.decoder = Encoder.LSTM( | |||
input_size=self.hidden_size, hidden_size=self.hidden_size, num_layers=1, bias=True, | |||
batch_first=True, bidirectional=True | |||
) | |||
self.output = Decoder.MLP([4 * self.hidden_size, self.hidden_size, self.n_labels], 'tanh', dropout=self.dropout) | |||
def forward(self, words1, words2, seq_len1=None, seq_len2=None, target=None): | |||
""" Forward function | |||
:param torch.Tensor words1: [batch size(B), premise seq len(PL)] token representation of the premise | |||
:param torch.Tensor words2: [B, hypothesis seq len(HL)] token representation of the hypothesis | |||
:param torch.LongTensor seq_len1: [B] length of the premise | |||
:param torch.LongTensor seq_len2: [B] length of the hypothesis | |||
:param torch.LongTensor target: [B] ground-truth target values | |||
:return: dict prediction: [B, n_labels(N)] prediction result | |||
self.embedding = init_embedding | |||
self.dropout_embed = EmbedDropout(p=dropout_embed) | |||
if hidden_size is None: | |||
hidden_size = self.embedding.embed_size | |||
self.rnn = BiRNN(self.embedding.embed_size, hidden_size, dropout_rate=dropout_rate) | |||
# self.rnn = LSTM(self.embedding.embed_size, hidden_size, dropout=dropout_rate, bidirectional=True) | |||
self.interfere = nn.Sequential(nn.Dropout(p=dropout_rate), | |||
nn.Linear(8 * hidden_size, hidden_size), | |||
nn.ReLU()) | |||
nn.init.xavier_uniform_(self.interfere[1].weight.data) | |||
self.bi_attention = SoftmaxAttention() | |||
self.rnn_high = BiRNN(hidden_size, hidden_size, dropout_rate=dropout_rate)  # input is the output of self.interfere, which has hidden_size features | |||
# self.rnn_high = LSTM(hidden_size, hidden_size, dropout=dropout_rate, bidirectional=True,) | |||
self.classifier = nn.Sequential(nn.Dropout(p=dropout_rate), | |||
nn.Linear(8 * hidden_size, hidden_size), | |||
nn.Tanh(), | |||
nn.Dropout(p=dropout_rate), | |||
nn.Linear(hidden_size, num_labels)) | |||
self.dropout_rnn = nn.Dropout(p=dropout_rate) | |||
nn.init.xavier_uniform_(self.classifier[1].weight.data) | |||
nn.init.xavier_uniform_(self.classifier[4].weight.data) | |||
def forward(self, words1, words2, seq_len1, seq_len2, target=None): | |||
""" | |||
premise0 = self.embedding_layer(self.embedding(words1)) | |||
hypothesis0 = self.embedding_layer(self.embedding(words2)) | |||
if seq_len1 is not None: | |||
seq_len1 = seq_len_to_mask(seq_len1) | |||
else: | |||
seq_len1 = torch.ones(premise0.size(0), premise0.size(1)) | |||
seq_len1 = (seq_len1.long()).to(device=premise0.device) | |||
if seq_len2 is not None: | |||
seq_len2 = seq_len_to_mask(seq_len2) | |||
else: | |||
seq_len2 = torch.ones(hypothesis0.size(0), hypothesis0.size(1)) | |||
seq_len2 = (seq_len2.long()).to(device=hypothesis0.device) | |||
_BP, _PSL, _HP = premise0.size() | |||
_BH, _HSL, _HH = hypothesis0.size() | |||
_BPL, _PLL = seq_len1.size() | |||
_HPL, _HLL = seq_len2.size() | |||
assert _BP == _BH and _BPL == _HPL and _BP == _BPL | |||
assert _HP == _HH | |||
assert _PSL == _PLL and _HSL == _HLL | |||
B, PL, H = premise0.size() | |||
B, HL, H = hypothesis0.size() | |||
a0 = self.encoder(self.drop(premise0)) # a0: [B, PL, H * 2] | |||
b0 = self.encoder(self.drop(hypothesis0)) # b0: [B, HL, H * 2] | |||
a = torch.mean(a0.view(B, PL, -1, H), dim=2) # a: [B, PL, H] | |||
b = torch.mean(b0.view(B, HL, -1, H), dim=2) # b: [B, HL, H] | |||
ai, bi = self.bi_attention(a, b, seq_len1, seq_len2) | |||
ma = torch.cat((a, ai, a - ai, a * ai), dim=2) # ma: [B, PL, 4 * H] | |||
mb = torch.cat((b, bi, b - bi, b * bi), dim=2) # mb: [B, HL, 4 * H] | |||
f_ma = self.inference_layer(ma) | |||
f_mb = self.inference_layer(mb) | |||
vat = self.decoder(self.drop(f_ma)) | |||
vbt = self.decoder(self.drop(f_mb)) | |||
va = torch.mean(vat.view(B, PL, -1, H), dim=2) # va: [B, PL, H] | |||
vb = torch.mean(vbt.view(B, HL, -1, H), dim=2) # vb: [B, HL, H] | |||
va_ave = self.mean_pooling(va, seq_len1, dim=1) # va_ave: [B, H] | |||
va_max, va_arg_max = self.max_pooling(va, seq_len1, dim=1) # va_max: [B, H] | |||
vb_ave = self.mean_pooling(vb, seq_len2, dim=1) # vb_ave: [B, H] | |||
vb_max, vb_arg_max = self.max_pooling(vb, seq_len2, dim=1) # vb_max: [B, H] | |||
v = torch.cat((va_ave, va_max, vb_ave, vb_max), dim=1) # v: [B, 4 * H] | |||
prediction = torch.tanh(self.output(v)) # prediction: [B, N] | |||
if target is not None: | |||
func = nn.CrossEntropyLoss() | |||
loss = func(prediction, target) | |||
return {Const.OUTPUT: prediction, Const.LOSS: loss} | |||
return {Const.OUTPUT: prediction} | |||
def predict(self, words1, words2, seq_len1=None, seq_len2=None, target=None): | |||
""" Predict function | |||
:param torch.Tensor words1: [batch size(B), premise seq len(PL)] token representation of the premise | |||
:param torch.Tensor words2: [B, hypothesis seq len(HL)] token representation of the hypothesis | |||
:param torch.LongTensor seq_len1: [B] length of the premise | |||
:param torch.LongTensor seq_len2: [B] length of the hypothesis | |||
:param torch.LongTensor target: [B] ground-truth target values | |||
:return: dict prediction: [B, n_labels(N)] prediction result | |||
:param words1: [batch, seq_len] | |||
:param words2: [batch, seq_len] | |||
:param seq_len1: [batch] | |||
:param seq_len2: [batch] | |||
:param target: | |||
:return: | |||
""" | |||
prediction = self.forward(words1, words2, seq_len1, seq_len2)[Const.OUTPUT] | |||
return {Const.OUTPUT: torch.argmax(prediction, dim=-1)} | |||
mask1 = seq_len_to_mask(seq_len1, words1.size(1)) | |||
mask2 = seq_len_to_mask(seq_len2, words2.size(1)) | |||
a0 = self.embedding(words1) # B * len * emb_dim | |||
b0 = self.embedding(words2) | |||
a0, b0 = self.dropout_embed(a0), self.dropout_embed(b0) | |||
a = self.rnn(a0, mask1.byte()) # a: [B, PL, 2 * H] | |||
b = self.rnn(b0, mask2.byte()) | |||
# a = self.dropout_rnn(self.rnn(a0, seq_len1)[0]) # a: [B, PL, 2 * H] | |||
# b = self.dropout_rnn(self.rnn(b0, seq_len2)[0]) | |||
ai, bi = self.bi_attention(a, mask1, b, mask2) | |||
a_ = torch.cat((a, ai, a - ai, a * ai), dim=2) # ma: [B, PL, 8 * H] | |||
b_ = torch.cat((b, bi, b - bi, b * bi), dim=2) | |||
a_f = self.interfere(a_) | |||
b_f = self.interfere(b_) | |||
a_h = self.rnn_high(a_f, mask1.byte()) # ma: [B, PL, 2 * H] | |||
b_h = self.rnn_high(b_f, mask2.byte()) | |||
# a_h = self.dropout_rnn(self.rnn_high(a_f, seq_len1)[0]) # ma: [B, PL, 2 * H] | |||
# b_h = self.dropout_rnn(self.rnn_high(b_f, seq_len2)[0]) | |||
a_avg = self.mean_pooling(a_h, mask1, dim=1) | |||
a_max, _ = self.max_pooling(a_h, mask1, dim=1) | |||
b_avg = self.mean_pooling(b_h, mask2, dim=1) | |||
b_max, _ = self.max_pooling(b_h, mask2, dim=1) | |||
out = torch.cat((a_avg, a_max, b_avg, b_max), dim=1) # v: [B, 8 * H] | |||
logits = torch.tanh(self.classifier(out)) | |||
if target is not None: | |||
loss_fct = CrossEntropyLoss() | |||
loss = loss_fct(logits, target) | |||
return {Const.LOSS: loss, Const.OUTPUT: logits} | |||
else: | |||
return {Const.OUTPUT: logits} | |||
def predict(self, **kwargs): | |||
pred = self.forward(**kwargs)[Const.OUTPUT].argmax(-1) | |||
return {Const.OUTPUT: pred} | |||
# input [batch_size, len , hidden] | |||
# mask [batch_size, len] (111...00) | |||
@staticmethod | |||
def mean_pooling(input, mask, dim=1): | |||
masks = mask.view(mask.size(0), mask.size(1), -1).float() | |||
return torch.sum(input * masks, dim=dim) / torch.sum(masks, dim=1) | |||
@staticmethod | |||
def max_pooling(input, mask, dim=1): | |||
my_inf = 10e12 | |||
masks = mask.view(mask.size(0), mask.size(1), -1) | |||
masks = masks.expand(-1, -1, input.size(2)).float() | |||
return torch.max(input + masks.le(0.5).float() * -my_inf, dim=dim) | |||
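The two static methods pool over the sequence dimension while ignoring padding: the mean multiplies padded positions by zero and divides by the number of real tokens, and the max pushes padded positions down by a very large constant. A plain-Python sketch of the same idea for a single sequence is shown below; the function names and list-based data layout are illustrative assumptions, not part of fastNLP.

```python
MY_INF = 10e12  # same constant as in max_pooling above


def mean_pooling(rows, mask):
    """rows: one feature list per token; mask: 1 for real tokens, 0 for padding."""
    n_real = sum(mask)
    return [sum(v * m for v, m in zip(col, mask)) / n_real for col in zip(*rows)]


def max_pooling(rows, mask):
    """Padded positions are shifted by -MY_INF so they can never be the maximum."""
    return [
        max(v + (-MY_INF if m <= 0.5 else 0.0) for v, m in zip(col, mask))
        for col in zip(*rows)
    ]


rows = [[1.0, -2.0], [3.0, -4.0], [9.0, 9.0]]  # the last token is padding
mask = [1, 1, 0]

print(mean_pooling(rows, mask))  # [2.0, -3.0]
print(max_pooling(rows, mask))   # [3.0, -2.0]
```

Without the mask, the padded row `[9.0, 9.0]` would dominate both results.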
class EmbedDropout(nn.Dropout): | |||
def forward(self, sequences_batch): | |||
ones = sequences_batch.data.new_ones(sequences_batch.shape[0], sequences_batch.shape[-1]) | |||
dropout_mask = nn.functional.dropout(ones, self.p, self.training, inplace=False) | |||
return dropout_mask.unsqueeze(1) * sequences_batch | |||
class BiRNN(nn.Module): | |||
def __init__(self, input_size, hidden_size, dropout_rate=0.3): | |||
super(BiRNN, self).__init__() | |||
self.dropout_rate = dropout_rate | |||
self.rnn = nn.LSTM(input_size, hidden_size, | |||
num_layers=1, | |||
bidirectional=True, | |||
batch_first=True) | |||
def forward(self, x, x_mask): | |||
# Sort x | |||
lengths = x_mask.data.eq(1).long().sum(1) | |||
_, idx_sort = torch.sort(lengths, dim=0, descending=True) | |||
_, idx_unsort = torch.sort(idx_sort, dim=0) | |||
lengths = list(lengths[idx_sort]) | |||
x = x.index_select(0, idx_sort) | |||
# Pack it up | |||
rnn_input = nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True) | |||
# Apply dropout to input | |||
if self.dropout_rate > 0: | |||
dropout_input = F.dropout(rnn_input.data, p=self.dropout_rate, training=self.training) | |||
rnn_input = nn.utils.rnn.PackedSequence(dropout_input, rnn_input.batch_sizes) | |||
output = self.rnn(rnn_input)[0] | |||
# Unpack everything | |||
output = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)[0] | |||
output = output.index_select(0, idx_unsort) | |||
if output.size(1) != x_mask.size(1): | |||
padding = torch.zeros(output.size(0), | |||
x_mask.size(1) - output.size(1), | |||
output.size(2)).type(output.data.type()) | |||
output = torch.cat([output, padding], 1) | |||
return output | |||
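`BiRNN.forward` sorts the batch by length before packing and then restores the original order with `idx_unsort`. The index trick can be checked in plain Python; the helper name `sort_and_restore` and the toy batch below are illustrative assumptions.

```python
def sort_and_restore(lengths):
    """Return (idx_sort, idx_unsort) for sorting by descending length."""
    # idx_sort[k] is the original position of the k-th longest sequence.
    idx_sort = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    # idx_unsort is the argsort of idx_sort, i.e. the inverse permutation.
    idx_unsort = sorted(range(len(idx_sort)), key=lambda i: idx_sort[i])
    return idx_sort, idx_unsort


lengths = [3, 5, 2, 4]
batch = ["s0", "s1", "s2", "s3"]

idx_sort, idx_unsort = sort_and_restore(lengths)
sorted_batch = [batch[i] for i in idx_sort]          # like x.index_select(0, idx_sort)
restored = [sorted_batch[i] for i in idx_unsort]     # like output.index_select(0, idx_unsort)

print(idx_sort)      # [1, 3, 0, 2]
print(idx_unsort)    # [2, 0, 3, 1]
print(sorted_batch)  # ['s1', 's3', 's0', 's2']
print(restored)      # ['s0', 's1', 's2', 's3']
```

Sorting the sort indices yields the inverse permutation, so selecting with `idx_unsort` puts every output row back at its original batch position.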
def masked_softmax(tensor, mask): | |||
    tensor_shape = tensor.size() | |||
    reshaped_tensor = tensor.view(-1, tensor_shape[-1]) | |||
    # Reshape the mask so it matches the size of the input tensor. | |||
    while mask.dim() < tensor.dim(): | |||
        mask = mask.unsqueeze(1) | |||
    mask = mask.expand_as(tensor).contiguous().float() | |||
    reshaped_mask = mask.view(-1, mask.size()[-1]) | |||
    result = F.softmax(reshaped_tensor * reshaped_mask, dim=-1) | |||
    result = result * reshaped_mask | |||
    # 1e-13 is added to avoid divisions by zero. | |||
    result = result / (result.sum(dim=-1, keepdim=True) + 1e-13) | |||
    return result.view(*tensor_shape) | |||
def weighted_sum(tensor, weights, mask): | |||
    w_sum = weights.bmm(tensor) | |||
    while mask.dim() < w_sum.dim(): | |||
        mask = mask.unsqueeze(1) | |||
    mask = mask.transpose(-1, -2) | |||
    mask = mask.expand_as(w_sum).contiguous().float() | |||
    return w_sum * mask | |||
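The arithmetic of `masked_softmax` can be followed on a single row of scores. The sketch below is a plain-Python illustration under the assumption of one row and a 0/1 mask; `masked_softmax_row` is a hypothetical name. It applies the same three steps: softmax over the masked scores, zeroing of the padded positions, and renormalization with the `1e-13` guard.

```python
import math


def masked_softmax_row(scores, mask):
    """Softmax over one row of scores that ignores positions where mask == 0."""
    masked = [s * m for s, m in zip(scores, mask)]
    peak = max(masked)                       # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in masked]
    total = sum(exps)
    probs = [e / total * m for e, m in zip(exps, mask)]  # zero out padded positions
    norm = sum(probs) + 1e-13                # guard against division by zero
    return [p / norm for p in probs]


weights = masked_softmax_row([2.0, 1.0, 3.0], [1, 1, 0])
print(weights)
```

The padded third position receives weight 0, and the remaining two weights sum to 1 (up to the `1e-13` guard), so they equal the softmax over the two real scores alone.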
class SoftmaxAttention(nn.Module): | |||
def forward(self, premise_batch, premise_mask, hypothesis_batch, hypothesis_mask): | |||
similarity_matrix = premise_batch.bmm(hypothesis_batch.transpose(2, 1) | |||
.contiguous()) | |||
prem_hyp_attn = masked_softmax(similarity_matrix, hypothesis_mask) | |||
hyp_prem_attn = masked_softmax(similarity_matrix.transpose(1, 2) | |||
.contiguous(), | |||
premise_mask) | |||
attended_premises = weighted_sum(hypothesis_batch, | |||
prem_hyp_attn, | |||
premise_mask) | |||
attended_hypotheses = weighted_sum(premise_batch, | |||
hyp_prem_attn, | |||
hypothesis_mask) | |||
return attended_premises, attended_hypotheses |
@@ -46,8 +46,8 @@ class StarTransEnc(nn.Module): | |||
super(StarTransEnc, self).__init__() | |||
self.embedding = get_embeddings(init_embed) | |||
emb_dim = self.embedding.embedding_dim | |||
#self.emb_fc = nn.Linear(emb_dim, hidden_size) | |||
self.emb_drop = nn.Dropout(emb_dropout) | |||
self.emb_fc = nn.Linear(emb_dim, hidden_size) | |||
# self.emb_drop = nn.Dropout(emb_dropout) | |||
self.encoder = StarTransformer(hidden_size=hidden_size, | |||
num_layers=num_layers, | |||
num_head=num_head, | |||
@@ -65,7 +65,7 @@ class StarTransEnc(nn.Module): | |||
[batch, hidden] the global relay node; see the paper for details | |||
""" | |||
x = self.embedding(x) | |||
#x = self.emb_fc(self.emb_drop(x)) | |||
x = self.emb_fc(x) | |||
nodes, relay = self.encoder(x, mask) | |||
return nodes, relay | |||
@@ -1,11 +1,11 @@ | |||
""" | |||
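Most neural networks used for NLP tasks can be viewed as being composed of encoder :mod:`~fastNLP.modules.encoder`, | |||
aggregator :mod:`~fastNLP.modules.aggregator`, and decoder :mod:`~fastNLP.modules.decoder` modules, three kinds in total. | |||
and decoder :mod:`~fastNLP.modules.decoder` modules, two kinds in total. | |||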
.. image:: figures/text_classification.png | |||
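:mod:`~fastNLP.modules` implements the many module components provided by fastNLP, which help users quickly build the networks they need. | |||
The functions and common components of the three kinds of modules are as follows: | |||
The functions and common components of the two kinds of modules are as follows: | |||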
+-----------------------+-----------------------+-----------------------+ | |||
| module type | functionality | example | | |||
@@ -13,9 +13,6 @@ | |||
| encoder | encode the input into | embedding, RNN, CNN, | | |||
| | expressive vectors | transformer | | |||
+-----------------------+-----------------------+-----------------------+ | |||
| aggregator | aggregate information | self-attention, | | |||
| | from multiple vectors | max-pooling | | |||
+-----------------------+-----------------------+-----------------------+ | |||
| decoder | decode vectors that | MLP, CRF | | |||
| | carry meaning into | | | |||
| | the required output | | | |||
@@ -46,10 +43,8 @@ __all__ = [ | |||
"allowed_transitions", | |||
] | |||
from . import aggregator | |||
from . import decoder | |||
from . import encoder | |||
from .aggregator import * | |||
from .decoder import * | |||
from .dropout import TimestepDropout | |||
from .encoder import * | |||
@@ -1,14 +0,0 @@ | |||
__all__ = [ | |||
"MaxPool", | |||
"MaxPoolWithMask", | |||
"AvgPool", | |||
"MultiHeadAttention", | |||
] | |||
from .pooling import MaxPool | |||
from .pooling import MaxPoolWithMask | |||
from .pooling import AvgPool | |||
from .pooling import AvgPoolWithMask | |||
from .attention import MultiHeadAttention |
@@ -22,7 +22,14 @@ __all__ = [ | |||
"VarRNN", | |||
"VarLSTM", | |||
"VarGRU" | |||
"VarGRU", | |||
"MaxPool", | |||
"MaxPoolWithMask", | |||
"AvgPool", | |||
"AvgPoolWithMask", | |||
"MultiHeadAttention", | |||
] | |||
from ._bert import BertModel | |||
from .bert import BertWordPieceEncoder | |||
@@ -34,3 +41,6 @@ from .lstm import LSTM | |||
from .star_transformer import StarTransformer | |||
from .transformer import TransformerEncoder | |||
from .variational_rnn import VarRNN, VarLSTM, VarGRU | |||
from .pooling import MaxPool, MaxPoolWithMask, AvgPool, AvgPoolWithMask | |||
from .attention import MultiHeadAttention |
@@ -6,14 +6,13 @@ from typing import Optional, Tuple, List, Callable | |||
import os | |||
import h5py | |||
import numpy | |||
import torch | |||
import torch.nn as nn | |||
import torch.nn.functional as F | |||
from torch.nn.utils.rnn import PackedSequence, pad_packed_sequence | |||
from ...core.vocabulary import Vocabulary | |||
import json | |||
import pickle | |||
from ..utils import get_dropout_mask | |||
import codecs | |||
@@ -244,13 +243,13 @@ class LstmbiLm(nn.Module): | |||
def __init__(self, config): | |||
super(LstmbiLm, self).__init__() | |||
self.config = config | |||
self.encoder = nn.LSTM(self.config['encoder']['projection_dim'], | |||
self.config['encoder']['dim'], | |||
num_layers=self.config['encoder']['n_layers'], | |||
self.encoder = nn.LSTM(self.config['lstm']['projection_dim'], | |||
self.config['lstm']['dim'], | |||
num_layers=self.config['lstm']['n_layers'], | |||
bidirectional=True, | |||
batch_first=True, | |||
dropout=self.config['dropout']) | |||
self.projection = nn.Linear(self.config['encoder']['dim'], self.config['encoder']['projection_dim'], bias=True) | |||
self.projection = nn.Linear(self.config['lstm']['dim'], self.config['lstm']['projection_dim'], bias=True) | |||
def forward(self, inputs, seq_len): | |||
sort_lens, sort_idx = torch.sort(seq_len, dim=0, descending=True) | |||
@@ -260,7 +259,7 @@ class LstmbiLm(nn.Module): | |||
output, _ = nn.utils.rnn.pad_packed_sequence(output, batch_first=self.batch_first) | |||
_, unsort_idx = torch.sort(sort_idx, dim=0, descending=False) | |||
output = output[unsort_idx] | |||
forward, backward = output.split(self.config['encoder']['dim'], 2) | |||
forward, backward = output.split(self.config['lstm']['dim'], 2) | |||
return torch.cat([self.projection(forward), self.projection(backward)], dim=2) | |||
@@ -268,13 +267,13 @@ class ElmobiLm(torch.nn.Module): | |||
def __init__(self, config): | |||
super(ElmobiLm, self).__init__() | |||
self.config = config | |||
input_size = config['encoder']['projection_dim'] | |||
hidden_size = config['encoder']['projection_dim'] | |||
cell_size = config['encoder']['dim'] | |||
num_layers = config['encoder']['n_layers'] | |||
memory_cell_clip_value = config['encoder']['cell_clip'] | |||
state_projection_clip_value = config['encoder']['proj_clip'] | |||
recurrent_dropout_probability = config['dropout'] | |||
input_size = config['lstm']['projection_dim'] | |||
hidden_size = config['lstm']['projection_dim'] | |||
cell_size = config['lstm']['dim'] | |||
num_layers = config['lstm']['n_layers'] | |||
memory_cell_clip_value = config['lstm']['cell_clip'] | |||
state_projection_clip_value = config['lstm']['proj_clip'] | |||
recurrent_dropout_probability = 0.0 | |||
self.input_size = input_size | |||
self.hidden_size = hidden_size | |||
@@ -409,199 +408,50 @@ class ElmobiLm(torch.nn.Module): | |||
torch.cat(final_memory_states, 0)) | |||
return stacked_sequence_outputs, final_state_tuple | |||
def load_weights(self, weight_file: str) -> None: | |||
""" | |||
Load the pre-trained weights from the file. | |||
""" | |||
requires_grad = False | |||
with h5py.File(weight_file, 'r') as fin: | |||
for i_layer, lstms in enumerate( | |||
zip(self.forward_layers, self.backward_layers) | |||
): | |||
for j_direction, lstm in enumerate(lstms): | |||
# lstm is an instance of LSTMCellWithProjection | |||
cell_size = lstm.cell_size | |||
dataset = fin['RNN_%s' % j_direction]['RNN']['MultiRNNCell']['Cell%s' % i_layer | |||
]['LSTMCell'] | |||
# tensorflow packs together both W and U matrices into one matrix, | |||
# but pytorch maintains individual matrices. In addition, tensorflow | |||
# packs the gates as input, memory, forget, output but pytorch | |||
# uses input, forget, memory, output. So we need to modify the weights. | |||
tf_weights = numpy.transpose(dataset['W_0'][...]) | |||
torch_weights = tf_weights.copy() | |||
# split the W from U matrices | |||
input_size = lstm.input_size | |||
input_weights = torch_weights[:, :input_size] | |||
recurrent_weights = torch_weights[:, input_size:] | |||
tf_input_weights = tf_weights[:, :input_size] | |||
tf_recurrent_weights = tf_weights[:, input_size:] | |||
# handle the different gate order convention | |||
for torch_w, tf_w in [[input_weights, tf_input_weights], | |||
[recurrent_weights, tf_recurrent_weights]]: | |||
torch_w[(1 * cell_size):(2 * cell_size), :] = tf_w[(2 * cell_size):(3 * cell_size), :] | |||
torch_w[(2 * cell_size):(3 * cell_size), :] = tf_w[(1 * cell_size):(2 * cell_size), :] | |||
lstm.input_linearity.weight.data.copy_(torch.FloatTensor(input_weights)) | |||
lstm.state_linearity.weight.data.copy_(torch.FloatTensor(recurrent_weights)) | |||
lstm.input_linearity.weight.requires_grad = requires_grad | |||
lstm.state_linearity.weight.requires_grad = requires_grad | |||
# the bias weights | |||
tf_bias = dataset['B'][...] | |||
# tensorflow adds 1.0 to forget gate bias instead of modifying the | |||
# parameters... | |||
tf_bias[(2 * cell_size):(3 * cell_size)] += 1 | |||
torch_bias = tf_bias.copy() | |||
torch_bias[(1 * cell_size):(2 * cell_size) | |||
] = tf_bias[(2 * cell_size):(3 * cell_size)] | |||
torch_bias[(2 * cell_size):(3 * cell_size) | |||
] = tf_bias[(1 * cell_size):(2 * cell_size)] | |||
lstm.state_linearity.bias.data.copy_(torch.FloatTensor(torch_bias)) | |||
lstm.state_linearity.bias.requires_grad = requires_grad | |||
# the projection weights | |||
proj_weights = numpy.transpose(dataset['W_P_0'][...]) | |||
lstm.state_projection.weight.data.copy_(torch.FloatTensor(proj_weights)) | |||
lstm.state_projection.weight.requires_grad = requires_grad | |||
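The comments in `load_weights` say that TensorFlow packs the LSTM gates as input, memory, forget, output, while the PyTorch cell expects input, forget, memory, output, so the second and third blocks of `cell_size` rows are swapped. A small list-based sketch of that reordering follows; `reorder_gates` and the labelled rows are illustrative assumptions, with strings standing in for weight rows.

```python
def reorder_gates(tf_rows, cell_size):
    """Swap the memory and forget blocks: (i, m, f, o) -> (i, f, m, o)."""
    torch_rows = list(tf_rows)  # copy, so the source layout stays untouched
    torch_rows[1 * cell_size:2 * cell_size] = tf_rows[2 * cell_size:3 * cell_size]
    torch_rows[2 * cell_size:3 * cell_size] = tf_rows[1 * cell_size:2 * cell_size]
    return torch_rows


# Two rows per gate, labelled by gate: input, memory, forget, output.
tf_rows = ["i0", "i1", "m0", "m1", "f0", "f1", "o0", "o1"]
torch_rows = reorder_gates(tf_rows, cell_size=2)
print(torch_rows)
# ['i0', 'i1', 'f0', 'f1', 'm0', 'm1', 'o0', 'o1']
```

Reading from the untouched source while writing into a copy is what makes the swap safe; doing both assignments in place on one array would overwrite the memory block before it is moved.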
class LstmTokenEmbedder(nn.Module): | |||
def __init__(self, config, word_emb_layer, char_emb_layer): | |||
super(LstmTokenEmbedder, self).__init__() | |||
self.config = config | |||
class ConvTokenEmbedder(nn.Module): | |||
def __init__(self, config, weight_file, word_emb_layer, char_emb_layer): | |||
super(ConvTokenEmbedder, self).__init__() | |||
self.weight_file = weight_file | |||
self.word_emb_layer = word_emb_layer | |||
self.char_emb_layer = char_emb_layer | |||
self.output_dim = config['encoder']['projection_dim'] | |||
emb_dim = 0 | |||
if word_emb_layer is not None: | |||
emb_dim += word_emb_layer.n_d | |||
if char_emb_layer is not None: | |||
emb_dim += char_emb_layer.n_d * 2 | |||
self.char_lstm = nn.LSTM(char_emb_layer.n_d, char_emb_layer.n_d, num_layers=1, bidirectional=True, | |||
batch_first=True, dropout=config['dropout']) | |||
self.projection = nn.Linear(emb_dim, self.output_dim, bias=True) | |||
self.output_dim = config['lstm']['projection_dim'] | |||
self._options = config | |||
def forward(self, words, chars): | |||
embs = [] | |||
if self.word_emb_layer is not None: | |||
if hasattr(self, 'words_to_words'): | |||
words = self.words_to_words[words] | |||
word_emb = self.word_emb_layer(words) | |||
embs.append(word_emb) | |||
char_cnn_options = self._options['char_cnn'] | |||
if char_cnn_options['activation'] == 'tanh': | |||
self.activation = torch.tanh | |||
elif char_cnn_options['activation'] == 'relu': | |||
self.activation = torch.nn.functional.relu | |||
else: | |||
raise Exception("Unknown activation") | |||
if self.char_emb_layer is not None: | |||
batch_size, seq_len, _ = chars.shape | |||
chars = chars.view(batch_size * seq_len, -1) | |||
chars_emb = self.char_emb_layer(chars) | |||
# TODO seq_len should be taken into account here | |||
_, (chars_outputs, __) = self.char_lstm(chars_emb) | |||
chars_outputs = chars_outputs.contiguous().view(-1, self.config['token_embedder']['embedding']['dim'] * 2) | |||
embs.append(chars_outputs) | |||
if char_emb_layer is not None: | |||
self.char_conv = [] | |||
cnn_config = config['char_cnn'] | |||
filters = cnn_config['filters'] | |||
char_embed_dim = cnn_config['embedding']['dim'] | |||
convolutions = [] | |||
token_embedding = torch.cat(embs, dim=2) | |||
for i, (width, num) in enumerate(filters): | |||
conv = torch.nn.Conv1d( | |||
in_channels=char_embed_dim, | |||
out_channels=num, | |||
kernel_size=width, | |||
bias=True | |||
) | |||
convolutions.append(conv) | |||
self.add_module('char_conv_{}'.format(i), conv) | |||
return self.projection(token_embedding) | |||
self._convolutions = convolutions | |||
n_filters = sum(f[1] for f in filters) | |||
n_highway = cnn_config['n_highway'] | |||
class ConvTokenEmbedder(nn.Module): | |||
def __init__(self, config, weight_file, word_emb_layer, char_emb_layer, char_vocab): | |||
super(ConvTokenEmbedder, self).__init__() | |||
self.weight_file = weight_file | |||
self.word_emb_layer = word_emb_layer | |||
self.char_emb_layer = char_emb_layer | |||
self._highways = Highway(n_filters, n_highway, activation=torch.nn.functional.relu) | |||
self.output_dim = config['encoder']['projection_dim'] | |||
self._options = config | |||
self.requires_grad = False | |||
self._load_weights() | |||
self._char_embedding_weights = char_emb_layer.weight.data | |||
def _load_weights(self): | |||
self._load_cnn_weights() | |||
self._load_highway() | |||
self._load_projection() | |||
def _load_cnn_weights(self): | |||
cnn_options = self._options['token_embedder'] | |||
filters = cnn_options['filters'] | |||
char_embed_dim = cnn_options['embedding']['dim'] | |||
convolutions = [] | |||
for i, (width, num) in enumerate(filters): | |||
conv = torch.nn.Conv1d( | |||
in_channels=char_embed_dim, | |||
out_channels=num, | |||
kernel_size=width, | |||
bias=True | |||
) | |||
# load the weights | |||
with h5py.File(self.weight_file, 'r') as fin: | |||
weight = fin['CNN']['W_cnn_{}'.format(i)][...] | |||
bias = fin['CNN']['b_cnn_{}'.format(i)][...] | |||
w_reshaped = numpy.transpose(weight.squeeze(axis=0), axes=(2, 1, 0)) | |||
if w_reshaped.shape != tuple(conv.weight.data.shape): | |||
raise ValueError("Invalid weight file") | |||
conv.weight.data.copy_(torch.FloatTensor(w_reshaped)) | |||
conv.bias.data.copy_(torch.FloatTensor(bias)) | |||
conv.weight.requires_grad = self.requires_grad | |||
conv.bias.requires_grad = self.requires_grad | |||
convolutions.append(conv) | |||
self.add_module('char_conv_{}'.format(i), conv) | |||
self._convolutions = convolutions | |||
def _load_highway(self): | |||
# the highway layers have same dimensionality as the number of cnn filters | |||
cnn_options = self._options['token_embedder'] | |||
filters = cnn_options['filters'] | |||
n_filters = sum(f[1] for f in filters) | |||
n_highway = cnn_options['n_highway'] | |||
# create the layers, and load the weights | |||
self._highways = Highway(n_filters, n_highway, activation=torch.nn.functional.relu) | |||
for k in range(n_highway): | |||
# The AllenNLP highway is one matrix multiplication with concatenation of | |||
# transform and carry weights. | |||
with h5py.File(self.weight_file, 'r') as fin: | |||
# The weights are transposed due to multiplication order assumptions in tf | |||
# vs pytorch (tf.matmul(X, W) vs pytorch.matmul(W, X)) | |||
w_transform = numpy.transpose(fin['CNN_high_{}'.format(k)]['W_transform'][...]) | |||
# -1.0 since AllenNLP is g * x + (1 - g) * f(x) but tf is (1 - g) * x + g * f(x) | |||
w_carry = -1.0 * numpy.transpose(fin['CNN_high_{}'.format(k)]['W_carry'][...]) | |||
weight = numpy.concatenate([w_transform, w_carry], axis=0) | |||
self._highways._layers[k].weight.data.copy_(torch.FloatTensor(weight)) | |||
self._highways._layers[k].weight.requires_grad = self.requires_grad | |||
b_transform = fin['CNN_high_{}'.format(k)]['b_transform'][...] | |||
b_carry = -1.0 * fin['CNN_high_{}'.format(k)]['b_carry'][...] | |||
bias = numpy.concatenate([b_transform, b_carry], axis=0) | |||
self._highways._layers[k].bias.data.copy_(torch.FloatTensor(bias)) | |||
self._highways._layers[k].bias.requires_grad = self.requires_grad | |||
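The `-1.0` factor applied to the carry weights and bias above relies on the identity sigmoid(-z) = 1 - sigmoid(z): negating the gate's pre-activation swaps which branch the gate selects. A minimal scalar sketch of that identity follows. It uses plain Python rather than the library's `Highway` module, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_tf(x, fx, z):
    # Convention quoted for the TensorFlow weights: (1 - g) * x + g * f(x)
    g = sigmoid(z)
    return (1.0 - g) * x + g * fx


def highway_allennlp(x, fx, z):
    # Convention quoted for AllenNLP: g * x + (1 - g) * f(x)
    g = sigmoid(z)
    return g * x + (1.0 - g) * fx


# z stands for the gate pre-activation (carry weights times input plus carry bias).
x, fx, z = 0.3, -1.2, 0.7
tf_out = highway_tf(x, fx, z)
# Negating the carry weights and bias negates z, which reproduces the other convention.
allennlp_out = highway_allennlp(x, fx, -z)
```

Both outputs agree up to floating point error, which is why the loader can reuse the stored weights after flipping the sign of the carry parameters.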
def _load_projection(self): | |||
cnn_options = self._options['token_embedder'] | |||
filters = cnn_options['filters'] | |||
n_filters = sum(f[1] for f in filters) | |||
self._projection = torch.nn.Linear(n_filters, self.output_dim, bias=True) | |||
with h5py.File(self.weight_file, 'r') as fin: | |||
weight = fin['CNN_proj']['W_proj'][...] | |||
bias = fin['CNN_proj']['b_proj'][...] | |||
self._projection.weight.data.copy_(torch.FloatTensor(numpy.transpose(weight))) | |||
self._projection.bias.data.copy_(torch.FloatTensor(bias)) | |||
self._projection.weight.requires_grad = self.requires_grad | |||
self._projection.bias.requires_grad = self.requires_grad | |||
self._projection = torch.nn.Linear(n_filters, self.output_dim, bias=True) | |||
def forward(self, words, chars): | |||
""" | |||
@@ -616,15 +466,8 @@ class ConvTokenEmbedder(nn.Module): | |||
# self._char_embedding_weights | |||
# ) | |||
batch_size, sequence_length, max_char_len = chars.size() | |||
character_embedding = self.char_emb_layer(chars).reshape(batch_size*sequence_length, max_char_len, -1) | |||
character_embedding = self.char_emb_layer(chars).reshape(batch_size * sequence_length, max_char_len, -1) | |||
# run convolutions | |||
cnn_options = self._options['token_embedder'] | |||
if cnn_options['activation'] == 'tanh': | |||
activation = torch.tanh | |||
elif cnn_options['activation'] == 'relu': | |||
activation = torch.nn.functional.relu | |||
else: | |||
raise Exception("Unknown activation") | |||
# (batch_size * sequence_length, embed_dim, max_chars_per_token) | |||
character_embedding = torch.transpose(character_embedding, 1, 2) | |||
@@ -634,7 +477,7 @@ class ConvTokenEmbedder(nn.Module): | |||
convolved = conv(character_embedding) | |||
# (batch_size * sequence_length, n_filters for this width) | |||
convolved, _ = torch.max(convolved, dim=-1) | |||
convolved = activation(convolved) | |||
convolved = self.activation(convolved) | |||
convs.append(convolved) | |||
# (batch_size * sequence_length, n_filters) | |||
@@ -712,8 +555,8 @@ class _ElmoModel(nn.Module): | |||
def __init__(self, model_dir: str, vocab: Vocabulary = None, cache_word_reprs: bool = False): | |||
super(_ElmoModel, self).__init__() | |||
dir = os.walk(model_dir) | |||
self.model_dir = model_dir | |||
dir = os.walk(self.model_dir) | |||
config_file = None | |||
weight_file = None | |||
config_count = 0 | |||
@@ -723,7 +566,7 @@ class _ElmoModel(nn.Module): | |||
if file_name.__contains__(".json"): | |||
config_file = file_name | |||
config_count += 1 | |||
elif file_name.__contains__(".hdf5"): | |||
elif file_name.__contains__(".pkl"): | |||
weight_file = file_name | |||
weight_count += 1 | |||
if config_count > 1 or weight_count > 1: | |||
@@ -734,7 +577,6 @@ class _ElmoModel(nn.Module): | |||
config = json.load(open(os.path.join(model_dir, config_file), 'r')) | |||
self.weight_file = os.path.join(model_dir, weight_file) | |||
self.config = config | |||
self.requires_grad = False | |||
OOV_TAG = '<oov>' | |||
PAD_TAG = '<pad>' | |||
@@ -744,102 +586,84 @@ class _ElmoModel(nn.Module): | |||
EOW_TAG = '<eow>' | |||
# For the model trained with character-based word encoder. | |||
if config['token_embedder']['embedding']['dim'] > 0: | |||
char_lexicon = {} | |||
with codecs.open(os.path.join(model_dir, 'char.dic'), 'r', encoding='utf-8') as fpi: | |||
for line in fpi: | |||
tokens = line.strip().split('\t') | |||
if len(tokens) == 1: | |||
tokens.insert(0, '\u3000') | |||
token, i = tokens | |||
char_lexicon[token] = int(i) | |||
# Run some sanity checks | |||
for special_word in [PAD_TAG, OOV_TAG, BOW_TAG, EOW_TAG]: | |||
assert special_word in char_lexicon, f"{special_word} not found in char.dic." | |||
# Build char_vocab from vocab | |||
char_vocab = Vocabulary(unknown=OOV_TAG, padding=PAD_TAG) | |||
# Make sure <bow> and <eow> are included | |||
char_vocab.add_word_lst([BOW_TAG, EOW_TAG, BOS_TAG, EOS_TAG]) | |||
for word, index in vocab: | |||
char_vocab.add_word_lst(list(word)) | |||
self.bos_index, self.eos_index, self._pad_index = len(vocab), len(vocab)+1, vocab.padding_idx | |||
# Adjusted according to char_lexicon; one extra slot is reserved for word padding (the char representation at that position is all zeros) | |||
char_emb_layer = nn.Embedding(len(char_vocab)+1, int(config['token_embedder']['embedding']['dim']), | |||
padding_idx=len(char_vocab)) | |||
with h5py.File(self.weight_file, 'r') as fin: | |||
char_embed_weights = fin['char_embed'][...] | |||
char_embed_weights = torch.from_numpy(char_embed_weights) | |||
found_char_count = 0 | |||
for char, index in char_vocab: # adjust the character embedding | |||
if char in char_lexicon: | |||
index_in_pre = char_lexicon.get(char) | |||
found_char_count += 1 | |||
else: | |||
index_in_pre = char_lexicon[OOV_TAG] | |||
char_emb_layer.weight.data[index] = char_embed_weights[index_in_pre] | |||
print(f"{found_char_count} out of {len(char_vocab)} characters were found in pretrained elmo embedding.") | |||
# Build the mapping from words to chars | |||
if config['token_embedder']['name'].lower() == 'cnn': | |||
max_chars = config['token_embedder']['max_characters_per_token'] | |||
elif config['token_embedder']['name'].lower() == 'lstm': | |||
max_chars = max(map(lambda x: len(x[0]), vocab)) + 2 # two extra slots are needed for <bow> and <eow> | |||
char_lexicon = {} | |||
with codecs.open(os.path.join(model_dir, 'char.dic'), 'r', encoding='utf-8') as fpi: | |||
for line in fpi: | |||
tokens = line.strip().split('\t') | |||
if len(tokens) == 1: | |||
tokens.insert(0, '\u3000') | |||
token, i = tokens | |||
char_lexicon[token] = int(i) | |||
# Run some sanity checks | |||
for special_word in [PAD_TAG, OOV_TAG, BOW_TAG, EOW_TAG]: | |||
assert special_word in char_lexicon, f"{special_word} not found in char.dic." | |||
# Build char_vocab from vocab | |||
char_vocab = Vocabulary(unknown=OOV_TAG, padding=PAD_TAG) | |||
# Make sure <bow> and <eow> are included | |||
char_vocab.add_word_lst([BOW_TAG, EOW_TAG, BOS_TAG, EOS_TAG]) | |||
for word, index in vocab: | |||
char_vocab.add_word_lst(list(word)) | |||
self.bos_index, self.eos_index, self._pad_index = len(vocab), len(vocab) + 1, vocab.padding_idx | |||
# Adjusted according to char_lexicon; one extra slot is reserved for word padding (the char representation at that position is all zeros) | |||
char_emb_layer = nn.Embedding(len(char_vocab) + 1, int(config['char_cnn']['embedding']['dim']), | |||
padding_idx=len(char_vocab)) | |||
# Load the pretrained weights; elmo_model holds the state_dict of both char_cnn and lstm | |||
elmo_model = torch.load(os.path.join(self.model_dir, weight_file), map_location='cpu') | |||
char_embed_weights = elmo_model["char_cnn"]['char_emb_layer.weight'] | |||
found_char_count = 0 | |||
for char, index in char_vocab: # adjust the character embedding | |||
if char in char_lexicon: | |||
index_in_pre = char_lexicon.get(char) | |||
found_char_count += 1 | |||
else: | |||
raise ValueError('Unknown token_embedder: {0}'.format(config['token_embedder']['name'])) | |||
self.words_to_chars_embedding = nn.Parameter(torch.full((len(vocab)+2, max_chars), | |||
fill_value=len(char_vocab), | |||
dtype=torch.long), | |||
requires_grad=False) | |||
for word, index in list(iter(vocab)) + [(BOS_TAG, len(vocab)), (EOS_TAG, len(vocab)+1)]: | |||
if len(word) + 2 > max_chars: | |||
word = word[:max_chars - 2] | |||
if index == self._pad_index: | |||
continue | |||
elif word == BOS_TAG or word == EOS_TAG: | |||
char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(word)] + [ | |||
char_vocab.to_index(EOW_TAG)] | |||
char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids)) | |||
else: | |||
char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(c) for c in word] + [ | |||
char_vocab.to_index(EOW_TAG)] | |||
char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids)) | |||
self.words_to_chars_embedding[index] = torch.LongTensor(char_ids) | |||
self.char_vocab = char_vocab | |||
else: | |||
char_emb_layer = None | |||
if config['token_embedder']['name'].lower() == 'cnn': | |||
self.token_embedder = ConvTokenEmbedder( | |||
config, self.weight_file, None, char_emb_layer, self.char_vocab) | |||
elif config['token_embedder']['name'].lower() == 'lstm': | |||
self.token_embedder = LstmTokenEmbedder( | |||
config, None, char_emb_layer) | |||
if config['token_embedder']['word_dim'] > 0 \ | |||
and vocab._no_create_word_length > 0: # a mapping is needed so that indices coming from dev and test point to unk | |||
words_to_words = nn.Parameter(torch.arange(len(vocab) + 2).long(), requires_grad=False) | |||
for word, idx in vocab: | |||
if vocab._is_word_no_create_entry(word): | |||
words_to_words[idx] = vocab.unknown_idx | |||
setattr(self.token_embedder, 'words_to_words', words_to_words) | |||
self.output_dim = config['encoder']['projection_dim'] | |||
# Only elmo is considered for now | |||
if config['encoder']['name'].lower() == 'elmo': | |||
self.encoder = ElmobiLm(config) | |||
elif config['encoder']['name'].lower() == 'lstm': | |||
self.encoder = LstmbiLm(config) | |||
self.encoder.load_weights(self.weight_file) | |||
index_in_pre = char_lexicon[OOV_TAG] | |||
char_emb_layer.weight.data[index] = char_embed_weights[index_in_pre] | |||
print(f"{found_char_count} out of {len(char_vocab)} characters were found in pretrained elmo embedding.") | |||
# Build the mapping from words to chars | |||
max_chars = config['char_cnn']['max_characters_per_token'] | |||
self.words_to_chars_embedding = nn.Parameter(torch.full((len(vocab) + 2, max_chars), | |||
fill_value=len(char_vocab), | |||
dtype=torch.long), | |||
requires_grad=False) | |||
for word, index in list(iter(vocab)) + [(BOS_TAG, len(vocab)), (EOS_TAG, len(vocab) + 1)]: | |||
if len(word) + 2 > max_chars: | |||
word = word[:max_chars - 2] | |||
if index == self._pad_index: | |||
continue | |||
elif word == BOS_TAG or word == EOS_TAG: | |||
char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(word)] + [ | |||
char_vocab.to_index(EOW_TAG)] | |||
char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids)) | |||
else: | |||
char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(c) for c in word] + [ | |||
char_vocab.to_index(EOW_TAG)] | |||
char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids)) | |||
self.words_to_chars_embedding[index] = torch.LongTensor(char_ids) | |||
self.char_vocab = char_vocab | |||
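The loop above turns every word into a fixed-length row of character ids: a begin-of-word marker, the characters, an end-of-word marker, then padding, with over-long words truncated so the two markers still fit. A minimal plain-Python sketch of that mapping follows. The helper name and the dictionary-based vocabulary are hypothetical stand-ins for `Vocabulary.to_index`.

```python
def word_to_char_ids(word, char_to_index, max_chars,
                     bow="<bow>", eow="<eow>", pad="<pad>", oov="<oov>"):
    # Truncate so that <bow> + characters + <eow> fits into max_chars slots.
    if len(word) + 2 > max_chars:
        word = word[:max_chars - 2]
    ids = [char_to_index[bow]]
    # Unknown characters fall back to the <oov> id (an assumption of this sketch).
    ids += [char_to_index.get(c, char_to_index[oov]) for c in word]
    ids.append(char_to_index[eow])
    # Right-pad to the fixed width.
    ids += [char_to_index[pad]] * (max_chars - len(ids))
    return ids


char_to_index = {"<pad>": 0, "<oov>": 1, "<bow>": 2, "<eow>": 3, "a": 4, "b": 5}
short_row = word_to_char_ids("ab", char_to_index, max_chars=6)
long_row = word_to_char_ids("abab", char_to_index, max_chars=4)
unknown_row = word_to_char_ids("az", char_to_index, max_chars=6)
```

In the diff itself the padding rows are filled with `len(char_vocab)` first and the sentence-level `<bos>`/`<eos>` tags are treated as single characters; the sketch leaves both details out.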
self.token_embedder = ConvTokenEmbedder( | |||
config, self.weight_file, None, char_emb_layer) | |||
elmo_model["char_cnn"]['char_emb_layer.weight'] = char_emb_layer.weight | |||
self.token_embedder.load_state_dict(elmo_model["char_cnn"]) | |||
self.output_dim = config['lstm']['projection_dim'] | |||
# lstm encoder | |||
self.encoder = ElmobiLm(config) | |||
self.encoder.load_state_dict(elmo_model["lstm"]) | |||
if cache_word_reprs: | |||
if config['token_embedder']['embedding']['dim'] > 0: # only useful when chars are used | |||
if config['char_cnn']['embedding']['dim'] > 0: # only useful when chars are used | |||
print("Start to generate cache word representations.") | |||
batch_size = 320 | |||
# bos eos | |||
@@ -848,7 +672,7 @@ class _ElmoModel(nn.Module): | |||
int(word_size % batch_size != 0) | |||
self.cached_word_embedding = nn.Embedding(word_size, | |||
config['encoder']['projection_dim']) | |||
config['lstm']['projection_dim']) | |||
with torch.no_grad(): | |||
for i in range(num_batches): | |||
words = torch.arange(i * batch_size, | |||
@@ -877,6 +701,8 @@ class _ElmoModel(nn.Module): | |||
expanded_words[:, 0].fill_(self.bos_index) | |||
expanded_words[torch.arange(batch_size).to(words), seq_len + 1] = self.eos_index | |||
seq_len = seq_len + 2 | |||
zero_tensor = expanded_words.new_zeros(expanded_words.shape) | |||
mask = (expanded_words == zero_tensor).unsqueeze(-1) | |||
if hasattr(self, 'cached_word_embedding'): | |||
token_embedding = self.cached_word_embedding(expanded_words) | |||
else: | |||
@@ -886,20 +712,16 @@ class _ElmoModel(nn.Module): | |||
chars = None | |||
token_embedding = self.token_embedder(expanded_words, chars) # batch_size x max_len x embed_dim | |||
if self.config['encoder']['name'] == 'elmo': | |||
encoder_output = self.encoder(token_embedding, seq_len) | |||
if encoder_output.size(2) < max_len + 2: | |||
num_layers, _, output_len, hidden_size = encoder_output.size() | |||
dummy_tensor = encoder_output.new_zeros(num_layers, batch_size, | |||
max_len + 2 - output_len, hidden_size) | |||
encoder_output = torch.cat((encoder_output, dummy_tensor), 2) | |||
sz = encoder_output.size() # 2, batch_size, max_len, hidden_size | |||
token_embedding = torch.cat((token_embedding, token_embedding), dim=2).view(1, sz[1], sz[2], sz[3]) | |||
encoder_output = torch.cat((token_embedding, encoder_output), dim=0) | |||
elif self.config['encoder']['name'] == 'lstm': | |||
encoder_output = self.encoder(token_embedding, seq_len) | |||
else: | |||
raise ValueError('Unknown encoder: {0}'.format(self.config['encoder']['name'])) | |||
encoder_output = self.encoder(token_embedding, seq_len) | |||
if encoder_output.size(2) < max_len + 2: | |||
num_layers, _, output_len, hidden_size = encoder_output.size() | |||
dummy_tensor = encoder_output.new_zeros(num_layers, batch_size, | |||
max_len + 2 - output_len, hidden_size) | |||
encoder_output = torch.cat((encoder_output, dummy_tensor), 2) | |||
sz = encoder_output.size() # 2, batch_size, max_len, hidden_size | |||
token_embedding = token_embedding.masked_fill(mask, 0) | |||
token_embedding = torch.cat((token_embedding, token_embedding), dim=2).view(1, sz[1], sz[2], sz[3]) | |||
encoder_output = torch.cat((token_embedding, encoder_output), dim=0) | |||
# Remove <eos> and <bos>. The removal is not exact, but it should not affect the final result. | |||
encoder_output = encoder_output[:, :, 1:-1] | |||
@@ -8,9 +8,9 @@ import torch | |||
import torch.nn.functional as F | |||
from torch import nn | |||
from ..dropout import TimestepDropout | |||
from fastNLP.modules.dropout import TimestepDropout | |||
from ..utils import initial_parameter | |||
from fastNLP.modules.utils import initial_parameter | |||
class DotAttention(nn.Module): | |||
@@ -45,8 +45,7 @@ class DotAttention(nn.Module): | |||
class MultiHeadAttention(nn.Module): | |||
""" | |||
Alias: :class:`fastNLP.modules.MultiHeadAttention` :class:`fastNLP.modules.aggregator.attention.MultiHeadAttention` | |||
Alias: :class:`fastNLP.modules.MultiHeadAttention` :class:`fastNLP.modules.encoder.attention.MultiHeadAttention` | |||
:param input_size: int, size of the input dimension; it is also the size of the output dimension. | |||
:param key_size: int, dimension of each head. |
@@ -2,35 +2,22 @@ | |||
import os | |||
from torch import nn | |||
import torch | |||
from ...io.file_utils import _get_base_url, cached_path | |||
from ...io.file_utils import _get_base_url, cached_path, PRETRAINED_BERT_MODEL_DIR | |||
from ._bert import _WordPieceBertModel, BertModel | |||
class BertWordPieceEncoder(nn.Module): | |||
""" | |||
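Loads a BERT model; after loading, call the index_dataset method to generate the word_pieces column in the dataset. | |||
:param fastNLP.Vocabulary vocab: vocabulary | |||
:param str model_dir_or_name: directory containing the model, or the name of the model. Defaults to ``en-base-uncased`` | |||
:param str layers: layers used in the final representation, given as layer indices separated by ','; negative numbers index from the last layer | |||
:param bool requires_grad: whether gradients are required. | |||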
""" | |||
def __init__(self, model_dir_or_name:str='en-base-uncased', layers:str='-1', | |||
requires_grad:bool=False): | |||
def __init__(self, model_dir_or_name: str='en-base-uncased', layers: str='-1', | |||
requires_grad: bool=False): | |||
super().__init__() | |||
PRETRAIN_URL = _get_base_url('bert') | |||
PRETRAINED_BERT_MODEL_DIR = {'en': 'bert-base-cased-f89bfe08.zip', | |||
'en-base-uncased': 'bert-base-uncased-3413b23c.zip', | |||
'en-base-cased': 'bert-base-cased-f89bfe08.zip', | |||
'en-large-uncased': 'bert-large-uncased-20939f45.zip', | |||
'en-large-cased': 'bert-large-cased-e0cf90fc.zip', | |||
'cn': 'bert-base-chinese-29d0a84a.zip', | |||
'cn-base': 'bert-base-chinese-29d0a84a.zip', | |||
'multilingual': 'bert-base-multilingual-cased-1bd364ee.zip', | |||
'multilingual-base-uncased': 'bert-base-multilingual-uncased-f8730fe4.zip', | |||
'multilingual-base-cased': 'bert-base-multilingual-cased-1bd364ee.zip', | |||
} | |||
if model_dir_or_name in PRETRAINED_BERT_MODEL_DIR: | |||
model_name = PRETRAINED_BERT_MODEL_DIR[model_dir_or_name] | |||
@@ -89,4 +76,4 @@ class BertWordPieceEncoder(nn.Module): | |||
outputs = self.model(word_pieces, token_type_ids) | |||
outputs = torch.cat([*outputs], dim=-1) | |||
return outputs | |||
return outputs |
@@ -35,15 +35,15 @@ class Embedding(nn.Module): | |||
Embedding component. Use self.num_embeddings to get the vocabulary size and self.embedding_dim to get the embedding dimension.""" | |||
def __init__(self, init_embed, dropout=0.0, dropout_word=0, unk_index=None): | |||
def __init__(self, init_embed, word_dropout=0, dropout=0.0, unk_index=None): | |||
""" | |||
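:param tuple(int,int),torch.FloatTensor,nn.Embedding,numpy.ndarray init_embed: size of the Embedding (pass a tuple(int, int), | |||
where the first int is vocab_size and the second int is embed_dim); a Tensor, Embedding, ndarray, etc. is used directly to initialize the Embedding; | |||
a TokenEmbedding object may also be passed in | |||
:param float word_dropout: probability of randomly replacing a word with unk_index, so that the unk token receives enough training; it also has | |||
some regularizing effect on the network. | |||
:param float dropout: dropout applied to the output of the Embedding. | |||
:param float dropout_word: proportion of words randomly replaced with the unk index, so that the unk token receives enough training | |||
:param int unk_index: index used as the replacement when a word is dropped; not needed if init_embed is a TokenEmbedding. | |||
:param int unk_index: index used as the replacement when a word is dropped. In fastNLP's Vocabulary, unk_index defaults to 1. | |||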
""" | |||
super(Embedding, self).__init__() | |||
@@ -52,21 +52,21 @@ class Embedding(nn.Module): | |||
self.dropout = nn.Dropout(dropout) | |||
if not isinstance(self.embed, TokenEmbedding): | |||
self._embed_size = self.embed.weight.size(1) | |||
if dropout_word>0 and not isinstance(unk_index, int): | |||
if word_dropout>0 and not isinstance(unk_index, int): | |||
raise ValueError("When drop word is set, you need to pass in the unk_index.") | |||
else: | |||
self._embed_size = self.embed.embed_size | |||
unk_index = self.embed.get_word_vocab().unknown_idx | |||
self.unk_index = unk_index | |||
self.dropout_word = dropout_word | |||
self.word_dropout = word_dropout | |||
def forward(self, x): | |||
""" | |||
:param torch.LongTensor x: [batch, seq_len] | |||
:return: torch.Tensor : [batch, seq_len, embed_dim] | |||
""" | |||
if self.dropout_word>0 and self.training: | |||
mask = torch.ones_like(x).float() * self.dropout_word | |||
if self.word_dropout>0 and self.training: | |||
mask = torch.ones_like(x).float() * self.word_dropout | |||
mask = torch.bernoulli(mask).byte() # the larger the word dropout rate, the more positions are 1 | |||
x = x.masked_fill(mask, self.unk_index) | |||
x = self.embed(x) | |||
@@ -117,11 +117,38 @@ class Embedding(nn.Module): | |||
class TokenEmbedding(nn.Module): | |||
def __init__(self, vocab): | |||
def __init__(self, vocab, word_dropout=0.0, dropout=0.0): | |||
super(TokenEmbedding, self).__init__() | |||
assert vocab.padding_idx is not None, "You vocabulary must have padding." | |||
assert vocab.padding is not None, "Vocabulary must have a padding entry." | |||
self._word_vocab = vocab | |||
self._word_pad_index = vocab.padding_idx | |||
if word_dropout>0: | |||
assert vocab.unknown is not None, "Vocabulary must have unknown entry when you want to drop a word." | |||
self.word_dropout = word_dropout | |||
self._word_unk_index = vocab.unknown_idx | |||
self.dropout_layer = nn.Dropout(dropout) | |||
def drop_word(self, words): | |||
""" | |||
Randomly replaces entries of words with unknown_index according to the configured probability. | |||
:param torch.LongTensor words: batch_size x max_len | |||
:return: | |||
""" | |||
if self.word_dropout > 0 and self.training: | |||
mask = torch.ones_like(words).float() * self.word_dropout | |||
mask = torch.bernoulli(mask).byte() # the larger the word dropout rate, the more positions are 1 | |||
words = words.masked_fill(mask, self._word_unk_index) | |||
return words | |||
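Word dropout, as implemented by `drop_word` above, draws one Bernoulli sample per token and replaces the selected tokens with the unknown index, but only in training mode. A minimal plain-Python sketch of the same idea follows. It works on a list of ids instead of a `torch.LongTensor`, and the function name and `rng` parameter are hypothetical.

```python
import random


def drop_words(word_ids, word_dropout, unk_index, training=True, rng=random):
    # At evaluation time, or with a zero rate, the ids pass through unchanged.
    if not training or word_dropout <= 0:
        return list(word_ids)
    # Each position is replaced independently with probability word_dropout.
    return [unk_index if rng.random() < word_dropout else w for w in word_ids]


sentence = [5, 9, 12, 7]
unchanged = drop_words(sentence, word_dropout=0.0, unk_index=1)
all_unk = drop_words(sentence, word_dropout=1.0, unk_index=1)
eval_mode = drop_words(sentence, word_dropout=1.0, unk_index=1, training=False)
```

Because the replacement happens on indices before the embedding lookup, the unknown-token vector receives gradient updates even when the training data contains few genuinely unknown words.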
def dropout(self, words): | |||
""" | |||
Applies dropout to the word representations after embedding. | |||
:param torch.FloatTensor words: batch_size x max_len x embed_size | |||
:return: | |||
""" | |||
return self.dropout_layer(words) | |||
@property | |||
def requires_grad(self): | |||
@@ -147,8 +174,16 @@ class TokenEmbedding(nn.Module): | |||
def embed_size(self) -> int: | |||
return self._embed_size | |||
@property | |||
def embedding_dim(self) -> int: | |||
return self._embed_size | |||
@property | |||
def num_embedding(self) -> int: | |||
""" | |||
This value may be larger than the actual size of the embedding matrix. | |||
:return: | |||
""" | |||
return len(self._word_vocab) | |||
def get_word_vocab(self): | |||
@@ -163,6 +198,9 @@ class TokenEmbedding(nn.Module): | |||
def size(self): | |||
return torch.Size((self.num_embedding, self._embed_size)) | |||
@abstractmethod | |||
def forward(self, *input): | |||
raise NotImplementedError | |||
class StaticEmbedding(TokenEmbedding): | |||
""" | |||
@@ -181,13 +219,15 @@ class StaticEmbedding(TokenEmbedding): | |||
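`en-word2vec-300` : GoogleNews-vectors-negative300}. In the second case the cache is checked for the model automatically, and the model is downloaded if it is missing. | |||
:param bool requires_grad: whether gradients are required. Defaults to True | |||
:param callable init_method: how to initialize values that were not found. Any of the torch.nn.init.* methods can be used; the method is called with a tensor object. | |||
:param bool normalize: whether to normalize the vectors so that each vector has norm 1. | |||
:param bool lower: whether to lowercase the words in vocab before matching them against the pretrained vocabulary. If your vocabulary contains uppercase words, or you specifically need | |||
a separate vector for uppercase words, set lower to False. | |||
:param float word_dropout: probability of replacing a word with unk. This both trains unk and acts as a form of regularization. | |||
:param float dropout: probability of applying Dropout to the embedding representation. 0.1 means randomly setting 10% of the values to 0. | |||
:param bool normalize: whether to normalize the vectors so that each vector has norm 1. | |||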
""" | |||
def __init__(self, vocab: Vocabulary, model_dir_or_name: str='en', requires_grad: bool=True, init_method=None, | |||
normalize=False, lower=False): | |||
super(StaticEmbedding, self).__init__(vocab) | |||
lower=False, dropout=0, word_dropout=0, normalize=False): | |||
super(StaticEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout) | |||
# Get cache_path | |||
if model_dir_or_name.lower() in PRETRAIN_STATIC_FILES: | |||
@@ -362,12 +402,15 @@ class StaticEmbedding(TokenEmbedding): | |||
""" | |||
if hasattr(self, 'words_to_words'): | |||
words = self.words_to_words[words] | |||
return self.embedding(words) | |||
words = self.drop_word(words) | |||
words = self.embedding(words) | |||
words = self.dropout(words) | |||
return words | |||
class ContextualEmbedding(TokenEmbedding): | |||
def __init__(self, vocab: Vocabulary): | |||
super(ContextualEmbedding, self).__init__(vocab) | |||
def __init__(self, vocab: Vocabulary, word_dropout:float=0.0, dropout:float=0.0): | |||
super(ContextualEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout) | |||
def add_sentence_cache(self, *datasets, batch_size=32, device='cpu', delete_weights: bool=True): | |||
""" | |||
@@ -473,12 +516,14 @@ class ElmoEmbedding(ContextualEmbedding): | |||
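concatenated in this order. Defaults to '2'. 'mix' combines the representations of the different layers with learnable weights (whether the weights are trainable follows requires_grad; | |||
the initial weights mean-pool the outputs of the three layers; ElmoEmbedding.set_mix_weights_requires_grad() can be used to make only the mix weights learnable). | |||
:param requires_grad: bool, whether this layer requires gradients. Defaults to False. | |||
:param float word_dropout: probability of replacing a word with unk. This both trains unk and acts as a form of regularization. | |||
:param float dropout: probability of applying Dropout to the embedding representation. 0.1 means randomly setting 10% of the values to 0. | |||
:param cache_word_reprs: optionally cache the word representations; if True, an embedding is generated for every word at initialization | |||
and the character encoder is deleted, after which the cached embeddings are used directly. Defaults to False. | |||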
""" | |||
def __init__(self, vocab: Vocabulary, model_dir_or_name: str='en', | |||
layers: str='2', requires_grad: bool=False, cache_word_reprs: bool=False): | |||
super(ElmoEmbedding, self).__init__(vocab) | |||
def __init__(self, vocab: Vocabulary, model_dir_or_name: str='en', layers: str='2', requires_grad: bool=False, | |||
word_dropout=0.0, dropout=0.0, cache_word_reprs: bool=False): | |||
super(ElmoEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout) | |||
# Check whether model_dir_or_name exists, downloading it if necessary | |||
if model_dir_or_name.lower() in PRETRAINED_ELMO_MODEL_DIR: | |||
@@ -494,11 +539,11 @@ class ElmoEmbedding(ContextualEmbedding): | |||
self.model = _ElmoModel(model_dir, vocab, cache_word_reprs=cache_word_reprs) | |||
if layers=='mix': | |||
self.layer_weights = nn.Parameter(torch.zeros(self.model.config['encoder']['n_layers']+1), | |||
self.layer_weights = nn.Parameter(torch.zeros(self.model.config['lstm']['n_layers']+1), | |||
requires_grad=requires_grad) | |||
self.gamma = nn.Parameter(torch.ones(1), requires_grad=requires_grad) | |||
self._get_outputs = self._get_mixed_outputs | |||
self._embed_size = self.model.config['encoder']['projection_dim'] * 2 | |||
self._embed_size = self.model.config['lstm']['projection_dim'] * 2 | |||
else: | |||
layers = list(map(int, layers.split(','))) | |||
assert len(layers) > 0, "Must choose one output" | |||
@@ -506,7 +551,7 @@ class ElmoEmbedding(ContextualEmbedding): | |||
assert 0 <= layer <= 2, "Layer index should be in range [0, 2]." | |||
self.layers = layers | |||
self._get_outputs = self._get_layer_outputs | |||
self._embed_size = len(self.layers) * self.model.config['encoder']['projection_dim'] * 2 | |||
self._embed_size = len(self.layers) * self.model.config['lstm']['projection_dim'] * 2 | |||
self.requires_grad = requires_grad | |||
@@ -545,11 +590,13 @@ class ElmoEmbedding(ContextualEmbedding): | |||
:param words: batch_size x max_len | |||
:return: torch.FloatTensor. batch_size x max_len x (512*len(self.layers)) | |||
""" | |||
words = self.drop_word(words) | |||
outputs = self._get_sent_reprs(words) | |||
if outputs is not None: | |||
return outputs | |||
return self.dropout(outputs) | |||
outputs = self.model(words) | |||
return self._get_outputs(outputs) | |||
outputs = self._get_outputs(outputs) | |||
return self.dropout(outputs) | |||
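The 'mix' option described in the docstring combines the layer outputs with learnable weights that start at zero, so the initial combination is a plain mean over layers. The body of `_get_mixed_outputs` is not shown in this diff; the sketch below assumes a softmax over the layer weights followed by a scalar `gamma`, which is one common way to build such a mix. It uses plain Python lists, and the function names are hypothetical.

```python
import math


def softmax(weights):
    # Numerically stable softmax over a short list of scalars.
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_vectors, layer_weights, gamma=1.0):
    # Weighted sum over layers, computed independently for every dimension.
    probs = softmax(layer_weights)
    dim = len(layer_vectors[0])
    return [gamma * sum(p * vec[i] for p, vec in zip(probs, layer_vectors))
            for i in range(dim)]


# Three layers (token layer plus two LSTM layers), each a 2-dimensional vector.
layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mixed = mix_layers(layers, layer_weights=[0.0, 0.0, 0.0])
```

With all-zero weights the softmax is uniform, so `mixed` is the per-dimension mean of the three layers, matching the mean-pooling initialization mentioned in the docstring.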
def _delete_model_weights(self): | |||
for name in ['layers', 'model', 'layer_weights', 'gamma']: | |||
@@ -595,13 +642,16 @@ class BertEmbedding(ContextualEmbedding): | |||
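:param str layers: layers used in the final representation, given as layer indices separated by ','; negative numbers index from the last layer | |||
:param str pool_method: in BERT each word is represented as several word pieces; this selects how the representation of a word is computed from its word pieces. | |||
Supports ``last``, ``first``, ``avg``, ``max``. | |||
:param float word_dropout: probability of replacing a word with unk. This both trains unk and acts as a form of regularization. | |||
:param float dropout: probability of applying Dropout to the embedding representation. 0.1 means randomly setting 10% of the values to 0. | |||
:param bool include_cls_sep: bool. When BERT computes the sentence representation, [CLS] and [SEP] are added around the sentence; this controls whether they are kept in the result. Keeping them | |||
makes the word embedding result two tokens longer than the input, which may cause problems when used with :class:`StackEmbedding`. | |||
:param bool requires_grad: whether gradients are required. | |||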
""" | |||
def __init__(self, vocab: Vocabulary, model_dir_or_name: str='en-base-uncased', layers: str='-1', | |||
pool_method: str='first', include_cls_sep: bool=False, requires_grad: bool=False): | |||
super(BertEmbedding, self).__init__(vocab) | |||
pool_method: str='first', word_dropout=0, dropout=0, requires_grad: bool=False, | |||
include_cls_sep: bool=False): | |||
super(BertEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout) | |||
# Check whether model_dir_or_name exists, downloading it if necessary | |||
if model_dir_or_name.lower() in PRETRAINED_BERT_MODEL_DIR: | |||
@@ -632,13 +682,14 @@ class BertEmbedding(ContextualEmbedding): | |||
:param torch.LongTensor words: [batch_size, max_len] | |||
:return: torch.FloatTensor. batch_size x max_len x (768*len(self.layers)) | |||
""" | |||
words = self.drop_word(words) | |||
outputs = self._get_sent_reprs(words) | |||
if outputs is not None: | |||
return outputs | |||
return self.dropout(outputs) | |||
outputs = self.model(words) | |||
outputs = torch.cat([*outputs], dim=-1) | |||
return outputs | |||
return self.dropout(outputs) | |||
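The `pool_method` parameter decides how the vectors of a word's word pieces collapse into one word vector. A minimal plain-Python sketch of the four documented options follows. It operates on the pieces of a single word given as lists of floats; the function name is hypothetical and this is not the library's implementation.

```python
def pool_word_pieces(piece_vectors, pool_method="first"):
    """Collapse the vectors of one word's word pieces into a single vector."""
    if pool_method == "first":
        return list(piece_vectors[0])
    if pool_method == "last":
        return list(piece_vectors[-1])
    columns = list(zip(*piece_vectors))  # one tuple per embedding dimension
    if pool_method == "avg":
        return [sum(col) / len(col) for col in columns]
    if pool_method == "max":
        return [max(col) for col in columns]
    raise ValueError("Unknown pool_method: {}".format(pool_method))


# A word split into two word pieces, each with a 2-dimensional vector.
pieces = [[1.0, 4.0], [3.0, 0.0]]
first_vec = pool_word_pieces(pieces, "first")
last_vec = pool_word_pieces(pieces, "last")
avg_vec = pool_word_pieces(pieces, "avg")
max_vec = pool_word_pieces(pieces, "max")
```

`first` and `last` pick a single piece, while `avg` and `max` aggregate across pieces in every dimension.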
@property | |||
def requires_grad(self): | |||
@@ -680,8 +731,8 @@ class CNNCharEmbedding(TokenEmbedding): | |||
""" | |||
别名::class:`fastNLP.modules.CNNCharEmbedding` :class:`fastNLP.modules.encoder.embedding.CNNCharEmbedding` | |||
Generates character embeddings with a CNN. The pipeline is embed(x) -> Dropout(x) -> CNN(x) -> activation(x) -> pool | |||
-> fc. The results of filters with different kernel sizes are concatenated. | |||
Generates character embeddings with a CNN. The pipeline is embed(x) -> Dropout(x) -> CNN(x) -> activation(x) -> pool -> fc -> Dropout. | |||
The results of filters with different kernel sizes are concatenated. | |||
Example:: | |||
@@ -691,23 +742,24 @@ class CNNCharEmbedding(TokenEmbedding): | |||
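:param vocab: vocabulary | |||
:param embed_size: size of this word embedding. Defaults to 50. | |||
:param char_emb_size: size of the character embedding. The characters are generated from vocab. Defaults to 50. | |||
:param dropout: dropout probability | |||
:param float word_dropout: probability of replacing a word with unk. This both trains unk and acts as a form of regularization. | |||
:param float dropout: dropout probability | |||
:param filter_nums: number of filters; its length must match kernel_sizes. Defaults to [40, 30, 20]. | |||
:param kernel_sizes: kernel sizes. Defaults to [5, 3, 1]. | |||
:param pool_method: pooling method used to merge the character representations into one representation; supports 'avg' and 'max'. | |||
:param activation: activation applied after the CNN; supports 'relu', 'sigmoid', 'tanh' or a custom function. | |||
:param min_char_freq: minimum number of occurrences of a character. Defaults to 2. | |||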
""" | |||
def __init__(self, vocab: Vocabulary, embed_size: int=50, char_emb_size: int=50, dropout:float=0.5, | |||
filter_nums: List[int]=(40, 30, 20), kernel_sizes: List[int]=(5, 3, 1), pool_method: str='max', | |||
activation='relu', min_char_freq: int=2): | |||
super(CNNCharEmbedding, self).__init__(vocab) | |||
def __init__(self, vocab: Vocabulary, embed_size: int=50, char_emb_size: int=50, word_dropout:float=0, | |||
dropout:float=0.5, filter_nums: List[int]=(40, 30, 20), kernel_sizes: List[int]=(5, 3, 1), | |||
pool_method: str='max', activation='relu', min_char_freq: int=2): | |||
super(CNNCharEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout) | |||
for kernel in kernel_sizes: | |||
assert kernel % 2 == 1, "Only odd kernel is allowed." | |||
assert pool_method in ('max', 'avg') | |||
self.dropout = nn.Dropout(dropout, inplace=True) | |||
self.dropout = nn.Dropout(dropout) | |||
self.pool_method = pool_method | |||
# activation function | |||
if isinstance(activation, str): | |||
@@ -757,6 +809,7 @@ class CNNCharEmbedding(TokenEmbedding): | |||
:param words: [batch_size, max_len] | |||
:return: [batch_size, max_len, embed_size] | |||
""" | |||
words = self.drop_word(words) | |||
batch_size, max_len = words.size() | |||
chars = self.words_to_chars_embedding[words] # batch_size x max_len x max_word_len | |||
word_lengths = self.word_lengths[words] # batch_size x max_len | |||
@@ -765,7 +818,7 @@ class CNNCharEmbedding(TokenEmbedding): | |||
# positions equal to 1 are masked | |||
chars_masks = chars.eq(self.char_pad_index) # batch_size x max_len x max_word_len; 1 marks a padding position | |||
chars = self.char_embedding(chars) # batch_size x max_len x max_word_len x embed_size | |||
self.dropout(chars) | |||
chars = self.dropout(chars) | |||
reshaped_chars = chars.reshape(batch_size*max_len, max_word_len, -1) | |||
reshaped_chars = reshaped_chars.transpose(1, 2) # B' x E x M | |||
conv_chars = [conv(reshaped_chars).transpose(1, 2).reshape(batch_size, max_len, max_word_len, -1) | |||
@@ -779,7 +832,7 @@ class CNNCharEmbedding(TokenEmbedding): | |||
conv_chars = conv_chars.masked_fill(chars_masks.unsqueeze(-1), 0) | |||
chars = torch.sum(conv_chars, dim=-2)/chars_masks.eq(0).sum(dim=-1, keepdim=True).float() | |||
chars = self.fc(chars) | |||
return chars | |||
return self.dropout(chars) | |||
@property | |||
def requires_grad(self): | |||
@@ -826,6 +879,7 @@ class LSTMCharEmbedding(TokenEmbedding): | |||
:param vocab: the vocabulary | |||
:param embed_size: size of the output embedding. Default: 50. | |||
:param char_emb_size: size of the character embedding. Default: 50. | |||
:param float word_dropout: probability of replacing a word with unk. This both trains the unk representation and acts as regularization. | |||
:param dropout: dropout probability | |||
:param hidden_size: hidden size of the LSTM; if bidirectional, each direction uses half of it. Default: 50. | |||
:param pool_method: supports 'max' and 'avg' | |||
@@ -833,15 +887,16 @@ class LSTMCharEmbedding(TokenEmbedding): | |||
:param min_char_freq: minimum number of occurrences of a character. Default: 2. | |||
:param bidirectional: whether to encode with a bidirectional LSTM. Default: True. | |||
""" | |||
def __init__(self, vocab: Vocabulary, embed_size: int=50, char_emb_size: int=50, dropout:float=0.5, hidden_size=50, | |||
pool_method: str='max', activation='relu', min_char_freq: int=2, bidirectional=True): | |||
def __init__(self, vocab: Vocabulary, embed_size: int=50, char_emb_size: int=50, word_dropout:float=0, | |||
dropout:float=0.5, hidden_size=50,pool_method: str='max', activation='relu', min_char_freq: int=2, | |||
bidirectional=True): | |||
super(LSTMCharEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout) | |||
assert hidden_size % 2 == 0, "Only even hidden_size is allowed." | |||
assert pool_method in ('max', 'avg') | |||
self.pool_method = pool_method | |||
self.dropout = nn.Dropout(dropout, inplace=True) | |||
self.dropout = nn.Dropout(dropout) | |||
# activation function | |||
if isinstance(activation, str): | |||
if activation.lower() == 'relu': | |||
@@ -890,6 +945,7 @@ class LSTMCharEmbedding(TokenEmbedding): | |||
:param words: [batch_size, max_len] | |||
:return: [batch_size, max_len, embed_size] | |||
""" | |||
words = self.drop_word(words) | |||
batch_size, max_len = words.size() | |||
chars = self.words_to_chars_embedding[words] # batch_size x max_len x max_word_len | |||
word_lengths = self.word_lengths[words] # batch_size x max_len | |||
@@ -914,7 +970,7 @@ class LSTMCharEmbedding(TokenEmbedding): | |||
chars = self.fc(chars) | |||
return chars | |||
return self.dropout(chars) | |||
@property | |||
def requires_grad(self): | |||
@@ -953,9 +1009,12 @@ class StackEmbedding(TokenEmbedding): | |||
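:param embeds: a list of TokenEmbedding objects; all of them must share the same vocabulary | |||
:param float word_dropout: probability of replacing a word with unk. This both trains the unk representation and acts as regularization. The same positions | |||
are set to unknown in every embedding. If dropout is set here, the component embeddings should not set their own dropout. | |||
:param float dropout: dropout probability applied to the embedding output. 0.1 means 10% of the values are randomly set to 0. | |||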
""" | |||
def __init__(self, embeds: List[TokenEmbedding]): | |||
def __init__(self, embeds: List[TokenEmbedding], word_dropout=0, dropout=0): | |||
vocabs = [] | |||
for embed in embeds: | |||
if hasattr(embed, 'get_word_vocab'): | |||
@@ -964,7 +1023,7 @@ class StackEmbedding(TokenEmbedding): | |||
for vocab in vocabs[1:]: | |||
assert vocab == _vocab, "All embeddings in StackEmbedding should use the same word vocabulary." | |||
super(StackEmbedding, self).__init__(_vocab) | |||
super(StackEmbedding, self).__init__(_vocab, word_dropout=word_dropout, dropout=dropout) | |||
assert isinstance(embeds, list) | |||
for embed in embeds: | |||
assert isinstance(embed, TokenEmbedding), "Only TokenEmbedding type is supported." | |||
@@ -1016,7 +1075,9 @@ class StackEmbedding(TokenEmbedding): | |||
:return: the output shape depends on the embeddings that make up this stack embedding | |||
""" | |||
outputs = [] | |||
words = self.drop_word(words) | |||
for embed in self.embeds: | |||
outputs.append(embed(words)) | |||
return torch.cat(outputs, dim=-1) | |||
outputs = self.dropout(torch.cat(outputs, dim=-1)) | |||
return outputs | |||
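Review note: the `word_dropout` behaviour described in the docstrings above (replace a word with unk with some probability) can be illustrated with a minimal, framework-free sketch. The function below is hypothetical and is not fastNLP's `drop_word`; in particular, skipping the padding index and the index values used here are assumptions made for illustration.

```python
import random


def drop_word(word_ids, word_dropout, unk_index, pad_index, rng=random):
    """Replace each non-padding id with unk_index with probability word_dropout.

    Hypothetical helper: the real implementation works on tensors and may
    treat padding differently.
    """
    if word_dropout <= 0:
        return list(word_ids)
    return [
        unk_index if (w != pad_index and rng.random() < word_dropout) else w
        for w in word_ids
    ]


# 0 is the assumed padding index, 1 the assumed unk index.
sentence = [5, 17, 42, 0, 0]
print(drop_word(sentence, word_dropout=0.3, unk_index=1, pad_index=0,
                rng=random.Random(0)))
```

Because StackEmbedding applies this step once before calling its component embeddings, every component sees the same replaced positions, which is what the docstring means by "the same positions are set to unknown".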
@@ -1,7 +1,8 @@ | |||
__all__ = [ | |||
"MaxPool", | |||
"MaxPoolWithMask", | |||
"AvgPool" | |||
"AvgPool", | |||
"AvgPoolWithMask" | |||
] | |||
import torch | |||
import torch.nn as nn | |||
@@ -9,7 +10,7 @@ import torch.nn as nn | |||
class MaxPool(nn.Module): | |||
""" | |||
Alias: :class:`fastNLP.modules.MaxPool` :class:`fastNLP.modules.aggregator.pooling.MaxPool` | |||
Alias: :class:`fastNLP.modules.MaxPool` :class:`fastNLP.modules.encoder.pooling.MaxPool` | |||
Max-pooling module. | |||
@@ -58,7 +59,7 @@ class MaxPool(nn.Module): | |||
class MaxPoolWithMask(nn.Module): | |||
""" | |||
Alias: :class:`fastNLP.modules.MaxPoolWithMask` :class:`fastNLP.modules.aggregator.pooling.MaxPoolWithMask` | |||
Alias: :class:`fastNLP.modules.MaxPoolWithMask` :class:`fastNLP.modules.encoder.pooling.MaxPoolWithMask` | |||
Max pooling with a mask matrix. Positions where the mask is 0 are ignored during max-pooling. | |||
""" | |||
@@ -98,7 +99,7 @@ class KMaxPool(nn.Module): | |||
class AvgPool(nn.Module): | |||
""" | |||
Alias: :class:`fastNLP.modules.AvgPool` :class:`fastNLP.modules.aggregator.pooling.AvgPool` | |||
Alias: :class:`fastNLP.modules.AvgPool` :class:`fastNLP.modules.encoder.pooling.AvgPool` | |||
Given an input of shape [batch_size, max_len, hidden_size], average-pools over max_len. The output has shape [batch_size, hidden_size]. | |||
""" | |||
@@ -125,7 +126,7 @@ class AvgPool(nn.Module): | |||
class AvgPoolWithMask(nn.Module): | |||
""" | |||
Alias: :class:`fastNLP.modules.AvgPoolWithMask` :class:`fastNLP.modules.aggregator.pooling.AvgPoolWithMask` | |||
Alias: :class:`fastNLP.modules.AvgPoolWithMask` :class:`fastNLP.modules.encoder.pooling.AvgPoolWithMask` | |||
Given an input of shape [batch_size, max_len, hidden_size], average-pools over max_len and returns [batch_size, hidden_size]; | |||
only positions where the mask is 1 are included in the pooling. |
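Review note: the masked average pooling described in the AvgPoolWithMask docstring can be sketched on plain nested lists. This is only an illustration of the arithmetic, not the module's tensor implementation; the function name and the assumption that every sequence has at least one unmasked position are mine.

```python
def avg_pool_with_mask(tensor, mask):
    """Average over max_len, counting only positions whose mask is 1.

    tensor: nested list shaped [batch_size][max_len][hidden_size]
    mask:   nested list shaped [batch_size][max_len] with 0/1 entries
    Returns a nested list shaped [batch_size][hidden_size].
    Assumes each sequence has at least one position with mask == 1.
    """
    pooled = []
    for seq, seq_mask in zip(tensor, mask):
        hidden_size = len(seq[0])
        kept = sum(seq_mask)  # number of unmasked positions
        sums = [0.0] * hidden_size
        for vec, m in zip(seq, seq_mask):
            if m:
                for i, value in enumerate(vec):
                    sums[i] += value
        pooled.append([s / kept for s in sums])
    return pooled


batch = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
batch_mask = [[1, 1, 0]]  # the third position is padding
print(avg_pool_with_mask(batch, batch_mask))
```

Dividing by the number of unmasked positions rather than by max_len keeps padded sequences from being pulled toward zero.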
@@ -34,8 +34,8 @@ class StarTransformer(nn.Module): | |||
super(StarTransformer, self).__init__() | |||
self.iters = num_layers | |||
self.norm = nn.ModuleList([nn.LayerNorm(hidden_size) for _ in range(self.iters)]) | |||
self.emb_fc = nn.Conv2d(hidden_size, hidden_size, 1) | |||
self.norm = nn.ModuleList([nn.LayerNorm(hidden_size, eps=1e-6) for _ in range(self.iters)]) | |||
# self.emb_fc = nn.Conv2d(hidden_size, hidden_size, 1) | |||
self.emb_drop = nn.Dropout(dropout) | |||
self.ring_att = nn.ModuleList( | |||
[_MSA1(hidden_size, nhead=num_head, head_dim=head_dim, dropout=0.0) | |||
@@ -3,7 +3,7 @@ __all__ = [ | |||
] | |||
from torch import nn | |||
from ..aggregator.attention import MultiHeadAttention | |||
from fastNLP.modules.encoder.attention import MultiHeadAttention | |||
from ..dropout import TimestepDropout | |||
@@ -8,7 +8,8 @@ import os | |||
from fastNLP.core.dataset import DataSet | |||
from .utils import load_url | |||
from .processor import ModelProcessor | |||
from fastNLP.io.dataset_loader import _cut_long_sentence, ConllLoader | |||
from fastNLP.io.dataset_loader import _cut_long_sentence | |||
from fastNLP.io.data_loader import ConllLoader | |||
from fastNLP.core.instance import Instance | |||
from ..api.pipeline import Pipeline | |||
from fastNLP.core.metrics import SpanFPreRecMetric | |||
@@ -2,14 +2,14 @@ | |||
This directory reproduces models implemented in fastNLP, aiming to match the performance reported in the papers. | |||
Reproduced models: | |||
- [Star-Transformer](Star_transformer/) | |||
- [Star-Transformer](Star_transformer) | |||
- [Biaffine](https://github.com/fastnlp/fastNLP/blob/999a14381747068e9e6a7cc370037b320197db00/fastNLP/models/biaffine_parser.py#L239) | |||
- [CNNText](https://github.com/fastnlp/fastNLP/blob/999a14381747068e9e6a7cc370037b320197db00/fastNLP/models/cnn_text_classification.py#L12) | |||
- ... | |||
# Task reproduction | |||
## Text Classification | |||
- still in progress | |||
- [Text Classification task reproduction](text_classification) | |||
## Matching (natural language inference / sentence matching) | |||
@@ -20,12 +20,12 @@ | |||
- [NER](seqence_labelling/ner) | |||
## Coreference resolution | |||
- still in progress | |||
## Coreference Resolution | |||
- [Coreference Resolution task reproduction](coreference_resolution) | |||
## Summarization | |||
- still in progress | |||
- [Summarization task reproduction](Summarization) | |||
## ... |
@@ -9,26 +9,3 @@ paper: [Star-Transformer](https://arxiv.org/abs/1902.09113) | |||
|Text Classification|SST|-|51.2| | |||
|Natural Language Inference|SNLI|-|83.76| | |||
## Usage | |||
``` python | |||
# for sequence labeling(ner, pos tagging, etc) | |||
from fastNLP.models.star_transformer import STSeqLabel | |||
model = STSeqLabel( | |||
vocab_size=10000, num_cls=50, | |||
emb_dim=300) | |||
# for sequence classification | |||
from fastNLP.models.star_transformer import STSeqCls | |||
model = STSeqCls( | |||
vocab_size=10000, num_cls=50, | |||
emb_dim=300) | |||
# for natural language inference | |||
from fastNLP.models.star_transformer import STNLICls | |||
model = STNLICls( | |||
vocab_size=10000, num_cls=50, | |||
emb_dim=300) | |||
``` |
@@ -2,8 +2,7 @@ import torch | |||
import json | |||
import os | |||
from fastNLP import Vocabulary | |||
from fastNLP.io.dataset_loader import ConllLoader | |||
from fastNLP.io.data_loader import SSTLoader, SNLILoader | |||
from fastNLP.io.data_loader import ConllLoader, SSTLoader, SNLILoader | |||
from fastNLP.core import Const as C | |||
import numpy as np | |||
@@ -10,7 +10,8 @@ from fastNLP.models.star_transformer import STSeqLabel, STSeqCls, STNLICls | |||
from fastNLP.core.const import Const as C | |||
import sys | |||
#sys.path.append('/remote-home/yfshao/workdir/dev_fastnlp/') | |||
pre_dir = '/home/ec2-user/fast_data/' | |||
import os | |||
pre_dir = os.path.join(os.environ['HOME'], 'workdir/datasets/') | |||
g_model_select = { | |||
'pos': STSeqLabel, | |||
@@ -19,7 +20,7 @@ g_model_select = { | |||
'nli': STNLICls, | |||
} | |||
g_emb_file_path = {'en': pre_dir + 'glove.840B.300d.txt', | |||
g_emb_file_path = {'en': pre_dir + 'word_vector/glove.840B.300d.txt', | |||
'zh': pre_dir + 'cc.zh.300.vec'} | |||
g_args = None | |||
@@ -55,7 +56,7 @@ def get_conll2012_ner(): | |||
def get_sst(): | |||
path = pre_dir + 'sst' | |||
path = pre_dir + 'SST' | |||
files = ['train.txt', 'dev.txt', 'test.txt'] | |||
return load_sst(path, files) | |||
@@ -171,10 +172,10 @@ def train(): | |||
sampler=FN.BucketSampler(100, g_args.bsz, C.INPUT_LEN), | |||
callbacks=[MyCallback()]) | |||
trainer.train() | |||
print(trainer.train()) | |||
tester = FN.Tester(data=test_data, model=model, metrics=metric, | |||
batch_size=128, device=device) | |||
tester.test() | |||
print(tester.test()) | |||
def test(): | |||
@@ -0,0 +1,12 @@ | |||
{ | |||
"n_layers": 16, | |||
"layer_sum": false, | |||
"layer_cat": false, | |||
"lstm_hidden_size": 300, | |||
"ffn_inner_hidden_size": 2048, | |||
"n_head": 6, | |||
"recurrent_dropout_prob": 0.1, | |||
"atten_dropout_prob": 0.1, | |||
"ffn_dropout_prob": 0.1, | |||
"fix_mask": true | |||
} |
@@ -0,0 +1,3 @@ | |||
{ | |||
} |
@@ -0,0 +1,9 @@ | |||
{ | |||
"n_layers": 12, | |||
"hidden_size": 512, | |||
"ffn_inner_hidden_size": 2048, | |||
"n_head": 8, | |||
"recurrent_dropout_prob": 0.1, | |||
"atten_dropout_prob": 0.1, | |||
"ffn_dropout_prob": 0.1 | |||
} |
@@ -0,0 +1,188 @@ | |||
import pickle | |||
import numpy as np | |||
from fastNLP.core.vocabulary import Vocabulary | |||
from fastNLP.io.base_loader import DataBundle | |||
from fastNLP.io.dataset_loader import JsonLoader | |||
from fastNLP.core.const import Const | |||
from tools.logger import * | |||
WORD_PAD = "[PAD]" | |||
WORD_UNK = "[UNK]" | |||
DOMAIN_UNK = "X" | |||
TAG_UNK = "X" | |||
class SummarizationLoader(JsonLoader): | |||
""" | |||
Loads a summarization dataset. The loaded DataSet contains the fields:: | |||
text: list(str), document | |||
summary: list(str), summary | |||
text_wd: list(list(str)), tokenized document | |||
summary_wd: list(list(str)), tokenized summary | |||
labels: list(int), | |||
flatten_label: list(int), 0 or 1, flatten labels | |||
domain: str, optional | |||
tag: list(str), optional | |||
Data sources: CNN_DailyMail, Newsroom, DUC | |||
""" | |||
def __init__(self): | |||
super(SummarizationLoader, self).__init__() | |||
def _load(self, path): | |||
ds = super(SummarizationLoader, self)._load(path) | |||
def _lower_text(text_list): | |||
return [text.lower() for text in text_list] | |||
def _split_list(text_list): | |||
return [text.split() for text in text_list] | |||
def _convert_label(label, sent_len): | |||
np_label = np.zeros(sent_len, dtype=int) | |||
if label != []: | |||
np_label[np.array(label)] = 1 | |||
return np_label.tolist() | |||
ds.apply(lambda x: _lower_text(x['text']), new_field_name='text') | |||
ds.apply(lambda x: _lower_text(x['summary']), new_field_name='summary') | |||
ds.apply(lambda x:_split_list(x['text']), new_field_name='text_wd') | |||
ds.apply(lambda x:_split_list(x['summary']), new_field_name='summary_wd') | |||
ds.apply(lambda x:_convert_label(x["label"], len(x["text"])), new_field_name="flatten_label") | |||
return ds | |||
def process(self, paths, vocab_size, vocab_path, sent_max_len, doc_max_timesteps, domain=False, tag=False, load_vocab=True): | |||
""" | |||
:param paths: dict path for each dataset | |||
:param vocab_size: int max_size for vocab | |||
:param vocab_path: str vocab path | |||
:param sent_max_len: int max token number of the sentence | |||
:param doc_max_timesteps: int max sentence number of the document | |||
:param domain: bool build vocab for publication, use 'X' for unknown | |||
:param tag: bool build vocab for tag, use 'X' for unknown | |||
:param load_vocab: bool build vocab (False) or load vocab (True) | |||
:return: DataBundle | |||
datasets: dict keys correspond to the paths dict | |||
vocabs: dict key: vocab(if "train" in paths), domain(if domain=True), tag(if tag=True) | |||
embeddings: optional | |||
""" | |||
def _pad_sent(text_wd): | |||
pad_text_wd = [] | |||
for sent_wd in text_wd: | |||
if len(sent_wd) < sent_max_len: | |||
pad_num = sent_max_len - len(sent_wd) | |||
# build a new list so the original text_wd field is not padded in place | |||
sent_wd = sent_wd + [WORD_PAD] * pad_num | |||
else: | |||
sent_wd = sent_wd[:sent_max_len] | |||
pad_text_wd.append(sent_wd) | |||
return pad_text_wd | |||
def _token_mask(text_wd): | |||
token_mask_list = [] | |||
for sent_wd in text_wd: | |||
token_num = len(sent_wd) | |||
if token_num < sent_max_len: | |||
mask = [1] * token_num + [0] * (sent_max_len - token_num) | |||
else: | |||
mask = [1] * sent_max_len | |||
token_mask_list.append(mask) | |||
return token_mask_list | |||
def _pad_label(label): | |||
text_len = len(label) | |||
if text_len < doc_max_timesteps: | |||
pad_label = label + [0] * (doc_max_timesteps - text_len) | |||
else: | |||
pad_label = label[:doc_max_timesteps] | |||
return pad_label | |||
def _pad_doc(text_wd): | |||
text_len = len(text_wd) | |||
if text_len < doc_max_timesteps: | |||
padding = [WORD_PAD] * sent_max_len | |||
pad_text = text_wd + [padding] * (doc_max_timesteps - text_len) | |||
else: | |||
pad_text = text_wd[:doc_max_timesteps] | |||
return pad_text | |||
def _sent_mask(text_wd): | |||
text_len = len(text_wd) | |||
if text_len < doc_max_timesteps: | |||
sent_mask = [1] * text_len + [0] * (doc_max_timesteps - text_len) | |||
else: | |||
sent_mask = [1] * doc_max_timesteps | |||
return sent_mask | |||
datasets = {} | |||
train_ds = None | |||
for key, value in paths.items(): | |||
ds = self.load(value) | |||
# pad sent | |||
ds.apply(lambda x:_pad_sent(x["text_wd"]), new_field_name="pad_text_wd") | |||
ds.apply(lambda x:_token_mask(x["text_wd"]), new_field_name="pad_token_mask") | |||
# pad document | |||
ds.apply(lambda x:_pad_doc(x["pad_text_wd"]), new_field_name="pad_text") | |||
ds.apply(lambda x:_sent_mask(x["pad_text_wd"]), new_field_name="seq_len") | |||
ds.apply(lambda x:_pad_label(x["flatten_label"]), new_field_name="pad_label") | |||
# rename field | |||
ds.rename_field("pad_text", Const.INPUT) | |||
ds.rename_field("seq_len", Const.INPUT_LEN) | |||
ds.rename_field("pad_label", Const.TARGET) | |||
# set input and target | |||
ds.set_input(Const.INPUT, Const.INPUT_LEN) | |||
ds.set_target(Const.TARGET, Const.INPUT_LEN) | |||
datasets[key] = ds | |||
if "train" in key: | |||
train_ds = datasets[key] | |||
vocab_dict = {} | |||
if not load_vocab: | |||
logger.info("[INFO] Build new vocab from training dataset!") | |||
if train_ds is None: | |||
raise ValueError("Lack train file to build vocabulary!") | |||
vocabs = Vocabulary(max_size=vocab_size, padding=WORD_PAD, unknown=WORD_UNK) | |||
vocabs.from_dataset(train_ds, field_name=["text_wd","summary_wd"]) | |||
vocab_dict["vocab"] = vocabs | |||
else: | |||
logger.info("[INFO] Load existing vocab from %s!" % vocab_path) | |||
word_list = [] | |||
with open(vocab_path, 'r', encoding='utf8') as vocab_f: | |||
cnt = 2 # pad and unk | |||
for line in vocab_f: | |||
pieces = line.split("\t") | |||
word_list.append(pieces[0]) | |||
cnt += 1 | |||
if cnt > vocab_size: | |||
break | |||
vocabs = Vocabulary(max_size=vocab_size, padding=WORD_PAD, unknown=WORD_UNK) | |||
vocabs.add_word_lst(word_list) | |||
vocabs.build_vocab() | |||
vocab_dict["vocab"] = vocabs | |||
if domain: | |||
domaindict = Vocabulary(padding=None, unknown=DOMAIN_UNK) | |||
domaindict.from_dataset(train_ds, field_name="publication") | |||
vocab_dict["domain"] = domaindict | |||
if tag: | |||
tagdict = Vocabulary(padding=None, unknown=TAG_UNK) | |||
tagdict.from_dataset(train_ds, field_name="tag") | |||
vocab_dict["tag"] = tagdict | |||
for ds in datasets.values(): | |||
vocab_dict["vocab"].index_dataset(ds, field_name=Const.INPUT, new_field_name=Const.INPUT) | |||
return DataBundle(vocabs=vocab_dict, datasets=datasets) | |||
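Review note: the two-level padding performed in `process` (tokens within a sentence, then sentences within a document) can be sketched without fastNLP. The helpers below are simplified stand-ins written for this note, not the loader's own closures; they take the limits as explicit arguments and always build new lists, so the input document is left untouched.

```python
WORD_PAD = "[PAD]"


def pad_sent(sent_wd, sent_max_len):
    """Return a new list with exactly sent_max_len tokens (truncate or right-pad)."""
    kept = sent_wd[:sent_max_len]
    return kept + [WORD_PAD] * (sent_max_len - len(kept))


def token_mask(sent_wd, sent_max_len):
    """1 for real tokens, 0 for padding, computed from the unpadded sentence."""
    n = min(len(sent_wd), sent_max_len)
    return [1] * n + [0] * (sent_max_len - n)


def pad_doc(text_wd, sent_max_len, doc_max_timesteps):
    """Pad every sentence, then pad or truncate the document to doc_max_timesteps."""
    sents = [pad_sent(s, sent_max_len) for s in text_wd[:doc_max_timesteps]]
    missing = doc_max_timesteps - len(sents)
    return sents + [[WORD_PAD] * sent_max_len for _ in range(missing)]


def sent_mask(text_wd, doc_max_timesteps):
    """1 for real sentences, 0 for padding sentences."""
    n = min(len(text_wd), doc_max_timesteps)
    return [1] * n + [0] * (doc_max_timesteps - n)


doc = [["a", "b"], ["c", "d", "e", "f"]]
padded = pad_doc(doc, sent_max_len=3, doc_max_timesteps=3)
masks = [token_mask(s, 3) for s in doc]
print(padded, masks, sent_mask(doc, 3))
```

Computing the token mask from the unpadded sentence matters: if padding were applied in place first, every sentence would already have `sent_max_len` tokens and the mask would be all ones.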
@@ -0,0 +1,136 @@ | |||
import numpy as np | |||
import torch | |||
import torch.nn as nn | |||
import torch.nn.init as init | |||
import torch.nn.functional as F | |||
from torch.autograd import Variable | |||
from torch.distributions import Bernoulli | |||
class DeepLSTM(nn.Module): | |||
def __init__(self, input_size, hidden_size, num_layers, recurrent_dropout, use_orthnormal_init=True, fix_mask=True, use_cuda=True): | |||
super(DeepLSTM, self).__init__() | |||
self.fix_mask = fix_mask | |||
self.use_cuda = use_cuda | |||
self.input_size = input_size | |||
self.num_layers = num_layers | |||
self.hidden_size = hidden_size | |||
self.recurrent_dropout = recurrent_dropout | |||
self.lstms = nn.ModuleList([None] * self.num_layers) | |||
self.highway_gate_input = nn.ModuleList([None] * self.num_layers) | |||
self.highway_gate_state = nn.ModuleList([nn.Linear(hidden_size, hidden_size) for _ in range(self.num_layers)]) | |||
self.highway_linear_input = nn.ModuleList([None] * self.num_layers) | |||
# self._input_w = nn.Parameter(torch.Tensor(input_size, hidden_size)) | |||
# init.xavier_normal_(self._input_w) | |||
for l in range(self.num_layers): | |||
input_dim = input_size if l == 0 else hidden_size | |||
self.lstms[l] = nn.LSTMCell(input_size=input_dim, hidden_size=hidden_size) | |||
self.highway_gate_input[l] = nn.Linear(input_dim, hidden_size) | |||
self.highway_linear_input[l] = nn.Linear(input_dim, hidden_size, bias=False) | |||
# logger.info("[INFO] Initing W for LSTM .......") | |||
for l in range(self.num_layers): | |||
if use_orthnormal_init: | |||
# logger.info("[INFO] Initing W using orthnormal init .......") | |||
init.orthogonal_(self.lstms[l].weight_ih) | |||
init.orthogonal_(self.lstms[l].weight_hh) | |||
init.orthogonal_(self.highway_gate_input[l].weight.data) | |||
init.orthogonal_(self.highway_gate_state[l].weight.data) | |||
init.orthogonal_(self.highway_linear_input[l].weight.data) | |||
else: | |||
# logger.info("[INFO] Initing W using xavier_normal .......") | |||
init_weight_value = 6.0 | |||
init.xavier_normal_(self.lstms[l].weight_ih, gain=np.sqrt(init_weight_value)) | |||
init.xavier_normal_(self.lstms[l].weight_hh, gain=np.sqrt(init_weight_value)) | |||
init.xavier_normal_(self.highway_gate_input[l].weight.data, gain=np.sqrt(init_weight_value)) | |||
init.xavier_normal_(self.highway_gate_state[l].weight.data, gain=np.sqrt(init_weight_value)) | |||
init.xavier_normal_(self.highway_linear_input[l].weight.data, gain=np.sqrt(init_weight_value)) | |||
def init_hidden(self, batch_size, hidden_size): | |||
# the first is the hidden h | |||
# the second is the cell c | |||
if self.use_cuda: | |||
return (torch.zeros(batch_size, hidden_size).cuda(), | |||
torch.zeros(batch_size, hidden_size).cuda()) | |||
else: | |||
return (torch.zeros(batch_size, hidden_size), | |||
torch.zeros(batch_size, hidden_size)) | |||
def forward(self, inputs, input_masks, Train): | |||
''' | |||
inputs: [[seq_len, batch, Co * kernel_sizes], n_layer * [None]] (list) | |||
input_masks: [[seq_len, batch, Co * kernel_sizes], n_layer * [None]] (list) | |||
''' | |||
batch_size, seq_len = inputs[0].size(1), inputs[0].size(0) | |||
# inputs[0] = torch.matmul(inputs[0], self._input_w) | |||
# input_masks[0] = input_masks[0].unsqueeze(-1).expand(seq_len, batch_size, self.hidden_size) | |||
self.inputs = inputs | |||
self.input_masks = input_masks | |||
if self.fix_mask: | |||
self.output_dropout_layers = [None] * self.num_layers | |||
for l in range(self.num_layers): | |||
binary_mask = torch.rand((batch_size, self.hidden_size)) > self.recurrent_dropout | |||
# This scaling ensures expected values and variances of the output of applying this mask and the original tensor are the same. | |||
# from allennlp.nn.util.py | |||
self.output_dropout_layers[l] = binary_mask.float().div(1.0 - self.recurrent_dropout) | |||
if self.use_cuda: | |||
self.output_dropout_layers[l] = self.output_dropout_layers[l].cuda() | |||
for l in range(self.num_layers): | |||
h, c = self.init_hidden(batch_size, self.hidden_size) | |||
outputs_list = [] | |||
for t in range(len(self.inputs[l])): | |||
x = self.inputs[l][t] | |||
m = self.input_masks[l][t].float() | |||
h_temp, c_temp = self.lstms[l].forward(x, (h, c)) # [batch, hidden_size] | |||
r = torch.sigmoid(self.highway_gate_input[l](x) + self.highway_gate_state[l](h)) | |||
lx = self.highway_linear_input[l](x) # [batch, hidden_size] | |||
h_temp = r * h_temp + (1 - r) * lx | |||
if Train: | |||
if self.fix_mask: | |||
h_temp = self.output_dropout_layers[l] * h_temp | |||
else: | |||
h_temp = F.dropout(h_temp, p=self.recurrent_dropout) | |||
h = m * h_temp + (1 - m) * h | |||
c = m * c_temp + (1 - m) * c | |||
outputs_list.append(h) | |||
outputs = torch.stack(outputs_list, 0) # [seq_len, batch, hidden_size] | |||
self.inputs[l + 1] = DeepLSTM.flip(outputs, 0) # reverse [seq_len, batch, hidden_size] | |||
self.input_masks[l + 1] = DeepLSTM.flip(self.input_masks[l], 0) | |||
self.output_state = self.inputs # num_layers * [seq_len, batch, hidden_size] | |||
# flip -2 layer | |||
# self.output_state[-2] = DeepLSTM.flip(self.output_state[-2], 0) | |||
# concat last two layer | |||
# self.output_state = torch.cat([self.output_state[-1], self.output_state[-2]], dim=-1).transpose(0, 1) | |||
self.output_state = self.output_state[-1].transpose(0, 1) | |||
assert self.output_state.size() == (batch_size, seq_len, self.hidden_size) | |||
return self.output_state | |||
@staticmethod | |||
def flip(x, dim): | |||
xsize = x.size() | |||
dim = x.dim() + dim if dim < 0 else dim | |||
x = x.contiguous() | |||
x = x.view(-1, *xsize[dim:]).contiguous() | |||
x = x.view(x.size(0), x.size(1), -1)[:, getattr(torch.arange(x.size(1) - 1, | |||
-1, -1), ('cpu','cuda')[x.is_cuda])().long(), :] | |||
return x.view(xsize) |
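Review note: when per-layer modules are built for a stack such as DeepLSTM, list multiplication and a list comprehension behave differently. `[obj] * n` repeats one object n times, so all "layers" share the same parameters, while a comprehension creates n independent objects. The sketch below shows the difference with a hypothetical stand-in class instead of `nn.Linear`, so it runs without PyTorch.

```python
class Layer:
    """Hypothetical stand-in for a module that owns a weight."""

    def __init__(self):
        self.weight = [0.0]


num_layers = 3
shared = [Layer()] * num_layers                      # one object, three references
independent = [Layer() for _ in range(num_layers)]   # three separate objects

# Updating "layer 0" changes every entry of the shared list,
# but only the first entry of the independent list.
shared[0].weight[0] = 1.0
independent[0].weight[0] = 1.0

print(len({id(layer) for layer in shared}))
print(len({id(layer) for layer in independent}))
```

Whether weight sharing across layers is intended is a modelling decision; the comprehension form is the one to use when each layer should learn its own parameters.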
@@ -0,0 +1,566 @@ | |||
from __future__ import absolute_import | |||
from __future__ import division | |||
from __future__ import print_function | |||
import numpy as np | |||
import torch | |||
import torch.nn as nn | |||
import torch.nn.functional as F | |||
import torch.nn.init as init | |||
from fastNLP.core.vocabulary import Vocabulary | |||
from fastNLP.io.embed_loader import EmbedLoader | |||
# from tools.logger import * | |||
from tools.PositionEmbedding import get_sinusoid_encoding_table | |||
WORD_PAD = "[PAD]" | |||
class Encoder(nn.Module): | |||
def __init__(self, hps, embed): | |||
""" | |||
:param hps: | |||
word_emb_dim: word embedding dimension | |||
sent_max_len: max token number in the sentence | |||
output_channel: output channel for cnn | |||
min_kernel_size: min kernel size for cnn | |||
max_kernel_size: max kernel size for cnn | |||
word_embedding: bool, use word embedding or not | |||
embedding_path: word embedding path | |||
embed_train: bool, whether to train word embedding | |||
cuda: bool, use cuda or not | |||
:param vocab: FastNLP.Vocabulary | |||
""" | |||
super(Encoder, self).__init__() | |||
self._hps = hps | |||
self.sent_max_len = hps.sent_max_len | |||
embed_size = hps.word_emb_dim | |||
sent_max_len = hps.sent_max_len | |||
input_channels = 1 | |||
out_channels = hps.output_channel | |||
min_kernel_size = hps.min_kernel_size | |||
max_kernel_size = hps.max_kernel_size | |||
width = embed_size | |||
# word embedding | |||
self.embed = embed | |||
# position embedding | |||
self.position_embedding = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(sent_max_len + 1, embed_size, padding_idx=0), freeze=True) | |||
# cnn | |||
self.convs = nn.ModuleList([nn.Conv2d(input_channels, out_channels, kernel_size = (height, width)) for height in range(min_kernel_size, max_kernel_size+1)]) | |||
print("[INFO] Initing W for CNN.......") | |||
for conv in self.convs: | |||
init_weight_value = 6.0 | |||
init.xavier_normal_(conv.weight.data, gain=np.sqrt(init_weight_value)) | |||
fan_in, fan_out = Encoder.calculate_fan_in_and_fan_out(conv.weight.data) | |||
std = np.sqrt(init_weight_value) * np.sqrt(2.0 / (fan_in + fan_out)) | |||
@staticmethod | |||
def calculate_fan_in_and_fan_out(tensor): | |||
dimensions = tensor.ndimension() | |||
if dimensions < 2: | |||
print("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
raise ValueError("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
if dimensions == 2: # Linear | |||
fan_in = tensor.size(1) | |||
fan_out = tensor.size(0) | |||
else: | |||
num_input_fmaps = tensor.size(1) | |||
num_output_fmaps = tensor.size(0) | |||
receptive_field_size = 1 | |||
if tensor.dim() > 2: | |||
receptive_field_size = tensor[0][0].numel() | |||
fan_in = num_input_fmaps * receptive_field_size | |||
fan_out = num_output_fmaps * receptive_field_size | |||
return fan_in, fan_out | |||
def forward(self, input): | |||
# input: a batch of Example object [batch_size, N, seq_len] | |||
batch_size, N, _ = input.size() | |||
input = input.view(-1, input.size(2)) # [batch_size*N, L] | |||
input_sent_len = ((input!=0).sum(dim=1)).int() # [batch_size*N] | |||
enc_embed_input = self.embed(input) # [batch_size*N, L, D] | |||
input_pos = torch.Tensor([np.hstack((np.arange(1, sentlen + 1), np.zeros(self.sent_max_len - sentlen))) for sentlen in input_sent_len]) | |||
if self._hps.cuda: | |||
input_pos = input_pos.cuda() | |||
enc_pos_embed_input = self.position_embedding(input_pos.long()) # [batch_size*N, L, D] | |||
enc_conv_input = enc_embed_input + enc_pos_embed_input | |||
enc_conv_input = enc_conv_input.unsqueeze(1) # (batch * N,Ci,L,D) | |||
enc_conv_output = [F.relu(conv(enc_conv_input)).squeeze(3) for conv in self.convs] # kernel_sizes * (batch*N, Co, W) | |||
enc_maxpool_output = [F.max_pool1d(x, x.size(2)).squeeze(2) for x in enc_conv_output] # kernel_sizes * (batch*N, Co) | |||
sent_embedding = torch.cat(enc_maxpool_output, 1) # (batch*N, Co * kernel_sizes) | |||
sent_embedding = sent_embedding.view(batch_size, N, -1) | |||
return sent_embedding | |||
class DomainEncoder(Encoder): | |||
def __init__(self, hps, vocab, domaindict): | |||
super(DomainEncoder, self).__init__(hps, vocab) | |||
# domain embedding | |||
self.domain_embedding = nn.Embedding(domaindict.size(), hps.domain_emb_dim) | |||
self.domain_embedding.weight.requires_grad = True | |||
def forward(self, input, domain): | |||
""" | |||
:param input: [batch_size, N, seq_len], N sentence number, seq_len token number | |||
:param domain: [batch_size] | |||
:return: sent_embedding: [batch_size, N, Co * kernel_sizes] | |||
""" | |||
batch_size, N, _ = input.size() | |||
sent_embedding = super().forward(input) | |||
enc_domain_input = self.domain_embedding(domain) # [batch, D] | |||
enc_domain_input = enc_domain_input.unsqueeze(1).expand(batch_size, N, -1) # [batch, N, D] | |||
sent_embedding = torch.cat((sent_embedding, enc_domain_input), dim=2) | |||
return sent_embedding | |||
class MultiDomainEncoder(Encoder): | |||
def __init__(self, hps, vocab, domaindict): | |||
super(MultiDomainEncoder, self).__init__(hps, vocab) | |||
self.domain_size = domaindict.size() | |||
# domain embedding | |||
self.domain_embedding = nn.Embedding(self.domain_size, hps.domain_emb_dim) | |||
self.domain_embedding.weight.requires_grad = True | |||
def forward(self, input, domain): | |||
""" | |||
:param input: [batch_size, N, seq_len], N sentence number, seq_len token number | |||
:param domain: [batch_size, domain_size] | |||
:return: sent_embedding: [batch_size, N, Co * kernel_sizes] | |||
""" | |||
batch_size, N, _ = input.size() | |||
# logger.info(domain[:5, :]) | |||
sent_embedding = super().forward(input) | |||
domain_padding = torch.arange(self.domain_size).unsqueeze(0).expand(batch_size, -1) | |||
domain_padding = domain_padding.cuda().view(-1) if self._hps.cuda else domain_padding.view(-1) # [batch * domain_size] | |||
enc_domain_input = self.domain_embedding(domain_padding) # [batch * domain_size, D] | |||
enc_domain_input = enc_domain_input.view(batch_size, self.domain_size, -1) * domain.unsqueeze(-1).float() # [batch, domain_size, D] | |||
# logger.info(enc_domain_input[:5,:]) # [batch, domain_size, D] | |||
enc_domain_input = enc_domain_input.sum(1) / domain.sum(1).float().unsqueeze(-1) # [batch, D] | |||
enc_domain_input = enc_domain_input.unsqueeze(1).expand(batch_size, N, -1) # [batch, N, D] | |||
sent_embedding = torch.cat((sent_embedding, enc_domain_input), dim=2) | |||
return sent_embedding | |||
class BertEncoder(nn.Module): | |||
def __init__(self, hps): | |||
super(BertEncoder, self).__init__() | |||
from pytorch_pretrained_bert.modeling import BertModel | |||
self._hps = hps | |||
self.sent_max_len = hps.sent_max_len | |||
self._cuda = hps.cuda | |||
embed_size = hps.word_emb_dim | |||
sent_max_len = hps.sent_max_len | |||
input_channels = 1 | |||
out_channels = hps.output_channel | |||
min_kernel_size = hps.min_kernel_size | |||
max_kernel_size = hps.max_kernel_size | |||
width = embed_size | |||
# word embedding | |||
self._bert = BertModel.from_pretrained("/remote-home/dqwang/BERT/pre-train/uncased_L-24_H-1024_A-16") | |||
self._bert.eval() | |||
for p in self._bert.parameters(): | |||
p.requires_grad = False | |||
self.word_embedding_proj = nn.Linear(4096, embed_size) | |||
# position embedding | |||
self.position_embedding = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(sent_max_len + 1, embed_size, padding_idx=0), freeze=True) | |||
# cnn | |||
self.convs = nn.ModuleList([nn.Conv2d(input_channels, out_channels, kernel_size = (height, width)) for height in range(min_kernel_size, max_kernel_size+1)]) | |||
print("[INFO] Initializing W for CNN.......") # logger is not imported in this module | |||
for conv in self.convs: | |||
init_weight_value = 6.0 | |||
init.xavier_normal_(conv.weight.data, gain=np.sqrt(init_weight_value)) | |||
fan_in, fan_out = Encoder.calculate_fan_in_and_fan_out(conv.weight.data) | |||
std = np.sqrt(init_weight_value) * np.sqrt(2.0 / (fan_in + fan_out)) | |||
def calculate_fan_in_and_fan_out(tensor): | |||
dimensions = tensor.ndimension() | |||
if dimensions < 2: | |||
print("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") # logger is not imported in this module | |||
raise ValueError("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
if dimensions == 2: # Linear | |||
fan_in = tensor.size(1) | |||
fan_out = tensor.size(0) | |||
else: | |||
num_input_fmaps = tensor.size(1) | |||
num_output_fmaps = tensor.size(0) | |||
receptive_field_size = 1 | |||
if tensor.dim() > 2: | |||
receptive_field_size = tensor[0][0].numel() | |||
fan_in = num_input_fmaps * receptive_field_size | |||
fan_out = num_output_fmaps * receptive_field_size | |||
return fan_in, fan_out | |||
def pad_encoder_input(self, input_list): | |||
""" | |||
:param input_list: N [seq_len, hidden_state] | |||
:return: enc_sent_input_pad: list, N [max_len, hidden_state] | |||
""" | |||
max_len = self.sent_max_len | |||
enc_sent_input_pad = [] | |||
_, hidden_size = input_list[0].size() | |||
for i in range(len(input_list)): | |||
article_words = input_list[i] # [seq_len, hidden_size] | |||
seq_len = article_words.size(0) | |||
if seq_len > max_len: | |||
pad_words = article_words[:max_len, :] | |||
else: | |||
pad_tensor = torch.zeros(max_len - seq_len, hidden_size).cuda() if self._cuda else torch.zeros(max_len - seq_len, hidden_size) | |||
pad_words = torch.cat([article_words, pad_tensor], dim=0) | |||
enc_sent_input_pad.append(pad_words) | |||
return enc_sent_input_pad | |||
def forward(self, inputs, input_masks, enc_sent_len): | |||
""" | |||
:param inputs: a batch of Example object [batch_size, doc_len=512] | |||
:param input_masks: 0 or 1, [batch, doc_len=512] | |||
:param enc_sent_len: sentence original length [batch, N] | |||
:return: sent_embedding: [batch, N, Co * kernel_sizes] | |||
""" | |||
# Use Bert to get word embedding | |||
batch_size, N = enc_sent_len.size() | |||
input_pad_list = [] | |||
for i in range(batch_size): | |||
tokens_id = inputs[i] | |||
input_mask = input_masks[i] | |||
sent_len = enc_sent_len[i] | |||
input_ids = tokens_id.unsqueeze(0) | |||
input_mask = input_mask.unsqueeze(0) | |||
out, _ = self._bert(input_ids, token_type_ids=None, attention_mask=input_mask) | |||
out = torch.cat(out[-4:], dim=-1).squeeze(0) # [doc_len=512, hidden_state=4096] | |||
_, hidden_size = out.size() | |||
# restore the sentence | |||
last_end = 1 | |||
enc_sent_input = [] | |||
for length in sent_len: | |||
if length != 0 and last_end < 511: | |||
enc_sent_input.append(out[last_end: min(511, last_end + length), :]) | |||
last_end += length | |||
else: | |||
pad_tensor = torch.zeros(self.sent_max_len, hidden_size).cuda() if self._hps.cuda else torch.zeros(self.sent_max_len, hidden_size) | |||
enc_sent_input.append(pad_tensor) | |||
# pad the sentence | |||
enc_sent_input_pad = self.pad_encoder_input(enc_sent_input) # list of N tensors, each [max_len, hidden_state=4096] | |||
input_pad_list.append(torch.stack(enc_sent_input_pad)) | |||
input_pad = torch.stack(input_pad_list) | |||
input_pad = input_pad.view(batch_size*N, self.sent_max_len, -1) | |||
enc_sent_len = enc_sent_len.view(-1) # [batch_size*N] | |||
enc_embed_input = self.word_embedding_proj(input_pad) # [batch_size * N, L, D] | |||
sent_pos_list = [] | |||
for sentlen in enc_sent_len: | |||
sent_pos = list(range(1, min(self.sent_max_len, sentlen) + 1)) | |||
for k in range(self.sent_max_len - sentlen): | |||
sent_pos.append(0) | |||
sent_pos_list.append(sent_pos) | |||
input_pos = torch.Tensor(sent_pos_list).long() | |||
if self._hps.cuda: | |||
input_pos = input_pos.cuda() | |||
enc_pos_embed_input = self.position_embedding(input_pos.long()) # [batch_size*N, L, D] | |||
enc_conv_input = enc_embed_input + enc_pos_embed_input | |||
enc_conv_input = enc_conv_input.unsqueeze(1) # (batch * N,Ci,L,D) | |||
enc_conv_output = [F.relu(conv(enc_conv_input)).squeeze(3) for conv in self.convs] # kernel_sizes * (batch*N, Co, W) | |||
enc_maxpool_output = [F.max_pool1d(x, x.size(2)).squeeze(2) for x in enc_conv_output] # kernel_sizes * (batch*N, Co) | |||
sent_embedding = torch.cat(enc_maxpool_output, 1) # (batch*N, Co * kernel_sizes) | |||
sent_embedding = sent_embedding.view(batch_size, N, -1) | |||
return sent_embedding | |||
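The loop in `BertEncoder.forward` that restores sentences from the flat BERT output is plain index arithmetic. The sketch below is a minimal illustration under stated assumptions: position 0 is skipped (presumably the `[CLS]` token) and position 511 is the cut-off, as in the code above. `sentence_spans` is a hypothetical helper, not part of this PR, and it uses Python ints in place of tensors so it runs without PyTorch.

```python
def sentence_spans(sent_lens, start=1, limit=511):
    """Return one (begin, end) slice per sentence, or None where the
    encoder would insert an all-zero padding sentence instead."""
    spans = []
    last_end = start
    for length in sent_lens:
        if length != 0 and last_end < limit:
            # Slice is clipped at the limit, but the cursor still advances
            # by the full sentence length, as in the original loop.
            spans.append((last_end, min(limit, last_end + length)))
            last_end += length
        else:
            spans.append(None)
    return spans


spans = sentence_spans([3, 4, 0])
truncated = sentence_spans([500, 20, 5])
```

The second call shows the truncation behaviour: once the running offset passes the limit, every remaining sentence is replaced by padding.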
class BertTagEncoder(BertEncoder): | |||
def __init__(self, hps, domaindict): | |||
super(BertTagEncoder, self).__init__(hps) | |||
# domain embedding | |||
self.domain_embedding = nn.Embedding(domaindict.size(), hps.domain_emb_dim) | |||
self.domain_embedding.weight.requires_grad = True | |||
def forward(self, inputs, input_masks, enc_sent_len, domain): | |||
sent_embedding = super().forward(inputs, input_masks, enc_sent_len) | |||
batch_size, N = enc_sent_len.size() | |||
enc_domain_input = self.domain_embedding(domain) # [batch, D] | |||
enc_domain_input = enc_domain_input.unsqueeze(1).expand(batch_size, N, -1) # [batch, N, D] | |||
sent_embedding = torch.cat((sent_embedding, enc_domain_input), dim=2) | |||
return sent_embedding | |||
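`BertTagEncoder.forward` broadcasts one domain vector per document to all N sentences and concatenates it along the feature axis. A rough sketch of that shape manipulation with nested lists standing in for tensors; `append_domain` and the sample values are illustrative assumptions, not code from this PR.

```python
def append_domain(sent_embedding, domain_vectors):
    """sent_embedding: [batch][N][D_sent]; domain_vectors: [batch][D_dom].
    Returns [batch][N][D_sent + D_dom]."""
    return [
        [sent + dom for sent in doc]  # list concatenation == cat on the last dim
        for doc, dom in zip(sent_embedding, domain_vectors)
    ]


sent_embedding = [[[0.1, 0.2], [0.3, 0.4]]]  # batch=1, N=2, D_sent=2
domain_vectors = [[9.0]]                      # batch=1, D_dom=1
tagged = append_domain(sent_embedding, domain_vectors)
```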
class ELMoEndoer(nn.Module): | |||
def __init__(self, hps): | |||
super(ELMoEndoer, self).__init__() | |||
self._hps = hps | |||
self.sent_max_len = hps.sent_max_len | |||
from allennlp.modules.elmo import Elmo | |||
elmo_dim = 1024 | |||
options_file = "/remote-home/dqwang/ELMo/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json" | |||
weight_file = "/remote-home/dqwang/ELMo/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5" | |||
# elmo_dim = 512 | |||
# options_file = "/remote-home/dqwang/ELMo/elmo_2x2048_256_2048cnn_1xhighway_options.json" | |||
# weight_file = "/remote-home/dqwang/ELMo/elmo_2x2048_256_2048cnn_1xhighway_weights.hdf5" | |||
embed_size = hps.word_emb_dim | |||
sent_max_len = hps.sent_max_len | |||
input_channels = 1 | |||
out_channels = hps.output_channel | |||
min_kernel_size = hps.min_kernel_size | |||
max_kernel_size = hps.max_kernel_size | |||
width = embed_size | |||
# elmo embedding | |||
self.elmo = Elmo(options_file, weight_file, 1, dropout=0) | |||
self.embed_proj = nn.Linear(elmo_dim, embed_size) | |||
# position embedding | |||
self.position_embedding = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(sent_max_len + 1, embed_size, padding_idx=0), freeze=True) | |||
# cnn | |||
self.convs = nn.ModuleList([nn.Conv2d(input_channels, out_channels, kernel_size = (height, width)) for height in range(min_kernel_size, max_kernel_size+1)]) | |||
logger.info("[INFO] Initializing W for CNN.......") | |||
for conv in self.convs: | |||
init_weight_value = 6.0 | |||
init.xavier_normal_(conv.weight.data, gain=np.sqrt(init_weight_value)) | |||
fan_in, fan_out = Encoder.calculate_fan_in_and_fan_out(conv.weight.data) | |||
std = np.sqrt(init_weight_value) * np.sqrt(2.0 / (fan_in + fan_out)) | |||
@staticmethod | |||
def calculate_fan_in_and_fan_out(tensor): | |||
dimensions = tensor.ndimension() | |||
if dimensions < 2: | |||
logger.error("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
raise ValueError("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
if dimensions == 2: # Linear | |||
fan_in = tensor.size(1) | |||
fan_out = tensor.size(0) | |||
else: | |||
num_input_fmaps = tensor.size(1) | |||
num_output_fmaps = tensor.size(0) | |||
receptive_field_size = 1 | |||
if tensor.dim() > 2: | |||
receptive_field_size = tensor[0][0].numel() | |||
fan_in = num_input_fmaps * receptive_field_size | |||
fan_out = num_output_fmaps * receptive_field_size | |||
return fan_in, fan_out | |||
def forward(self, input): | |||
# input: a batch of Example object [batch_size, N, seq_len, character_len] | |||
batch_size, N, seq_len, _ = input.size() | |||
input = input.view(batch_size * N, seq_len, -1) # [batch_size*N, seq_len, character_len] | |||
input_sent_len = ((input.sum(-1)!=0).sum(dim=1)).int() # [batch_size*N] | |||
# logger.debug(input_sent_len.view(batch_size, -1)) | |||
enc_embed_input = self.elmo(input)['elmo_representations'][0] # [batch_size*N, L, D] | |||
enc_embed_input = self.embed_proj(enc_embed_input) | |||
# input_pos = torch.Tensor([np.hstack((np.arange(1, sentlen + 1), np.zeros(self.sent_max_len - sentlen))) for sentlen in input_sent_len]) | |||
sent_pos_list = [] | |||
for sentlen in input_sent_len: | |||
sent_pos = list(range(1, min(self.sent_max_len, sentlen) + 1)) | |||
for k in range(self.sent_max_len - sentlen): | |||
sent_pos.append(0) | |||
sent_pos_list.append(sent_pos) | |||
input_pos = torch.Tensor(sent_pos_list).long() | |||
if self._hps.cuda: | |||
input_pos = input_pos.cuda() | |||
enc_pos_embed_input = self.position_embedding(input_pos.long()) # [batch_size*N, L, D] | |||
enc_conv_input = enc_embed_input + enc_pos_embed_input | |||
enc_conv_input = enc_conv_input.unsqueeze(1) # (batch * N,Ci,L,D) | |||
enc_conv_output = [F.relu(conv(enc_conv_input)).squeeze(3) for conv in self.convs] # kernel_sizes * (batch*N, Co, W) | |||
enc_maxpool_output = [F.max_pool1d(x, x.size(2)).squeeze(2) for x in enc_conv_output] # kernel_sizes * (batch*N, Co) | |||
sent_embedding = torch.cat(enc_maxpool_output, 1) # (batch*N, Co * kernel_sizes) | |||
sent_embedding = sent_embedding.view(batch_size, N, -1) | |||
return sent_embedding | |||
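The position indices fed to the frozen sinusoid table are 1-based for real tokens and 0 for padding, which matches `padding_idx=0`. A minimal sketch of the per-sentence list built in the loop above; `position_ids` is a hypothetical name and plain ints replace tensor elements.

```python
def position_ids(sent_len, sent_max_len):
    """1..min(sent_len, sent_max_len) followed by zeros up to sent_max_len."""
    ids = list(range(1, min(sent_max_len, sent_len) + 1))
    # A negative repeat count yields an empty list, so over-long
    # sentences are simply capped at sent_max_len positions.
    ids.extend([0] * (sent_max_len - sent_len))
    return ids


short = position_ids(3, 5)
too_long = position_ids(7, 5)
```

Both results have length `sent_max_len`, which is what lets the encoder stack them into one `[batch_size*N, L]` index tensor.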
class ELMoEndoer2(nn.Module): | |||
def __init__(self, hps): | |||
super(ELMoEndoer2, self).__init__() | |||
self._hps = hps | |||
self._cuda = hps.cuda | |||
self.sent_max_len = hps.sent_max_len | |||
from allennlp.modules.elmo import Elmo | |||
elmo_dim = 1024 | |||
options_file = "/remote-home/dqwang/ELMo/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json" | |||
weight_file = "/remote-home/dqwang/ELMo/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5" | |||
# elmo_dim = 512 | |||
# options_file = "/remote-home/dqwang/ELMo/elmo_2x2048_256_2048cnn_1xhighway_options.json" | |||
# weight_file = "/remote-home/dqwang/ELMo/elmo_2x2048_256_2048cnn_1xhighway_weights.hdf5" | |||
embed_size = hps.word_emb_dim | |||
sent_max_len = hps.sent_max_len | |||
input_channels = 1 | |||
out_channels = hps.output_channel | |||
min_kernel_size = hps.min_kernel_size | |||
max_kernel_size = hps.max_kernel_size | |||
width = embed_size | |||
# elmo embedding | |||
self.elmo = Elmo(options_file, weight_file, 1, dropout=0) | |||
self.embed_proj = nn.Linear(elmo_dim, embed_size) | |||
# position embedding | |||
self.position_embedding = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(sent_max_len + 1, embed_size, padding_idx=0), freeze=True) | |||
# cnn | |||
self.convs = nn.ModuleList([nn.Conv2d(input_channels, out_channels, kernel_size = (height, width)) for height in range(min_kernel_size, max_kernel_size+1)]) | |||
logger.info("[INFO] Initializing W for CNN.......") | |||
for conv in self.convs: | |||
init_weight_value = 6.0 | |||
init.xavier_normal_(conv.weight.data, gain=np.sqrt(init_weight_value)) | |||
fan_in, fan_out = Encoder.calculate_fan_in_and_fan_out(conv.weight.data) | |||
std = np.sqrt(init_weight_value) * np.sqrt(2.0 / (fan_in + fan_out)) | |||
@staticmethod | |||
def calculate_fan_in_and_fan_out(tensor): | |||
dimensions = tensor.ndimension() | |||
if dimensions < 2: | |||
logger.error("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
raise ValueError("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
if dimensions == 2: # Linear | |||
fan_in = tensor.size(1) | |||
fan_out = tensor.size(0) | |||
else: | |||
num_input_fmaps = tensor.size(1) | |||
num_output_fmaps = tensor.size(0) | |||
receptive_field_size = 1 | |||
if tensor.dim() > 2: | |||
receptive_field_size = tensor[0][0].numel() | |||
fan_in = num_input_fmaps * receptive_field_size | |||
fan_out = num_output_fmaps * receptive_field_size | |||
return fan_in, fan_out | |||
def pad_encoder_input(self, input_list): | |||
""" | |||
:param input_list: N [seq_len, hidden_state] | |||
:return: enc_sent_input_pad: list, N [max_len, hidden_state] | |||
""" | |||
max_len = self.sent_max_len | |||
enc_sent_input_pad = [] | |||
_, hidden_size = input_list[0].size() | |||
for i in range(len(input_list)): | |||
article_words = input_list[i] # [seq_len, hidden_size] | |||
seq_len = article_words.size(0) | |||
if seq_len > max_len: | |||
pad_words = article_words[:max_len, :] | |||
else: | |||
pad_tensor = torch.zeros(max_len - seq_len, hidden_size).cuda() if self._cuda else torch.zeros(max_len - seq_len, hidden_size) | |||
pad_words = torch.cat([article_words, pad_tensor], dim=0) | |||
enc_sent_input_pad.append(pad_words) | |||
return enc_sent_input_pad | |||
def forward(self, inputs, input_masks, enc_sent_len): | |||
""" | |||
:param inputs: a batch of Example object [batch_size, doc_len=512, character_len=50] | |||
:param input_masks: 0 or 1, [batch, doc_len=512] | |||
:param enc_sent_len: sentence original length [batch, N] | |||
:return: | |||
sent_embedding: [batch, N, D] | |||
""" | |||
# Use ELMo to get word embedding | |||
batch_size, N = enc_sent_len.size() | |||
input_pad_list = [] | |||
elmo_output = self.elmo(inputs)['elmo_representations'][0] # [batch_size, 512, D] | |||
elmo_output = elmo_output * input_masks.unsqueeze(-1).float() | |||
# print("END elmo") | |||
for i in range(batch_size): | |||
sent_len = enc_sent_len[i] # [1, N] | |||
out = elmo_output[i] | |||
_, hidden_size = out.size() | |||
# restore the sentence | |||
last_end = 0 | |||
enc_sent_input = [] | |||
for length in sent_len: | |||
if length != 0 and last_end < 512: | |||
enc_sent_input.append(out[last_end : min(512, last_end + length), :]) | |||
last_end += length | |||
else: | |||
pad_tensor = torch.zeros(self.sent_max_len, hidden_size).cuda() if self._hps.cuda else torch.zeros(self.sent_max_len, hidden_size) | |||
enc_sent_input.append(pad_tensor) | |||
# pad the sentence | |||
enc_sent_input_pad = self.pad_encoder_input(enc_sent_input) # list of N tensors, each [max_len, hidden_state] | |||
input_pad_list.append(torch.stack(enc_sent_input_pad)) # batch * [N, max_len, hidden_state] | |||
input_pad = torch.stack(input_pad_list) | |||
input_pad = input_pad.view(batch_size * N, self.sent_max_len, -1) | |||
enc_sent_len = enc_sent_len.view(-1) # [batch_size*N] | |||
enc_embed_input = self.embed_proj(input_pad) # [batch_size * N, L, D] | |||
# input_pos = torch.Tensor([np.hstack((np.arange(1, sentlen + 1), np.zeros(self.sent_max_len - sentlen))) for sentlen in input_sent_len]) | |||
sent_pos_list = [] | |||
for sentlen in enc_sent_len: | |||
sent_pos = list(range(1, min(self.sent_max_len, sentlen) + 1)) | |||
for k in range(self.sent_max_len - sentlen): | |||
sent_pos.append(0) | |||
sent_pos_list.append(sent_pos) | |||
input_pos = torch.Tensor(sent_pos_list).long() | |||
if self._hps.cuda: | |||
input_pos = input_pos.cuda() | |||
enc_pos_embed_input = self.position_embedding(input_pos.long()) # [batch_size*N, L, D] | |||
enc_conv_input = enc_embed_input + enc_pos_embed_input | |||
enc_conv_input = enc_conv_input.unsqueeze(1) # (batch * N,Ci,L,D) | |||
enc_conv_output = [F.relu(conv(enc_conv_input)).squeeze(3) for conv in self.convs] # kernel_sizes * (batch*N, Co, W) | |||
enc_maxpool_output = [F.max_pool1d(x, x.size(2)).squeeze(2) for x in enc_conv_output] # kernel_sizes * (batch*N, Co) | |||
sent_embedding = torch.cat(enc_maxpool_output, 1) # (batch*N, Co * kernel_sizes) | |||
sent_embedding = sent_embedding.view(batch_size, N, -1) | |||
return sent_embedding |
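All three encoders initialize the convolution weights with `xavier_normal_` and a gain of `sqrt(6)`, then recompute the matching standard deviation by hand. The sketch below reproduces that arithmetic on a weight shape alone, so it runs without PyTorch. The helper names and the example shape `(50, 1, 3, 300)` (50 output channels, kernel height 3, embedding width 300) are illustrative assumptions.

```python
import math


def fan_in_and_fan_out(shape):
    """shape is (out_features, in_features, *kernel_dims)."""
    if len(shape) < 2:
        raise ValueError("fan in and fan out need at least 2 dimensions")
    receptive_field_size = 1
    for dim in shape[2:]:  # empty for a Linear weight
        receptive_field_size *= dim
    return shape[1] * receptive_field_size, shape[0] * receptive_field_size


def xavier_std(shape, gain):
    """Standard deviation used by Xavier normal initialization."""
    fan_in, fan_out = fan_in_and_fan_out(shape)
    return gain * math.sqrt(2.0 / (fan_in + fan_out))


fan_in, fan_out = fan_in_and_fan_out((50, 1, 3, 300))
std = xavier_std((50, 1, 3, 300), gain=math.sqrt(6.0))
```

Note that in the encoders above the hand-computed `std` is never used afterwards; the initialization itself is done by `init.xavier_normal_`.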
@@ -0,0 +1,103 @@ | |||
from __future__ import absolute_import | |||
from __future__ import division | |||
from __future__ import print_function | |||
import torch | |||
import torch.nn as nn | |||
from torch.autograd import * | |||
from torch.distributions import * | |||
from .Encoder import Encoder | |||
from .DeepLSTM import DeepLSTM | |||
from transformer.SubLayers import MultiHeadAttention,PositionwiseFeedForward | |||
class SummarizationModel(nn.Module): | |||
def __init__(self, hps, embed): | |||
""" | |||
:param hps: hyperparameters for the model | |||
:param embed: word embedding passed to the sentence encoder | |||
""" | |||
super(SummarizationModel, self).__init__() | |||
self._hps = hps | |||
# sentence encoder | |||
self.encoder = Encoder(hps, embed) | |||
# Multi-layer highway lstm | |||
self.num_layers = hps.n_layers | |||
self.sent_embedding_size = (hps.max_kernel_size - hps.min_kernel_size + 1) * hps.output_channel | |||
self.lstm_hidden_size = hps.lstm_hidden_size | |||
self.recurrent_dropout = hps.recurrent_dropout_prob | |||
self.deep_lstm = DeepLSTM(self.sent_embedding_size, self.lstm_hidden_size, self.num_layers, self.recurrent_dropout, | |||
hps.use_orthnormal_init, hps.fix_mask, hps.cuda) | |||
# Multi-head attention | |||
self.n_head = hps.n_head | |||
self.d_v = self.d_k = int(self.lstm_hidden_size / hps.n_head) | |||
self.d_inner = hps.ffn_inner_hidden_size | |||
self.slf_attn = MultiHeadAttention(hps.n_head, self.lstm_hidden_size , self.d_k, self.d_v, dropout=hps.atten_dropout_prob) | |||
self.pos_ffn = PositionwiseFeedForward(self.d_v, self.d_inner, dropout = hps.ffn_dropout_prob) | |||
self.wh = nn.Linear(self.d_v, 2) | |||
def forward(self, input, input_len, Train): | |||
""" | |||
:param input: [batch_size, N, seq_len], word idx long tensor | |||
:param input_len: [batch_size, N], 1 for sentence and 0 for padding | |||
:param Train: True for train and False for eval and test | |||
:return: dict with | |||
p_sent: [batch_size, N, 2] | |||
prediction: [batch_size, N], 1 for selected sentences | |||
pred_idx: [batch_size, m] top-m sentence indices, or None when hps.m == 0 | |||
""" | |||
# -- Sentence Encoder | |||
self.sent_embedding = self.encoder(input) # [batch, N, Co * kernel_sizes] | |||
# -- Multi-layer highway lstm | |||
input_len = input_len.float() # [batch, N] | |||
self.inputs = [None] * (self.num_layers + 1) | |||
self.input_masks = [None] * (self.num_layers + 1) | |||
self.inputs[0] = self.sent_embedding.permute(1, 0, 2) # [N, batch, Co * kernel_sizes] | |||
self.input_masks[0] = input_len.permute(1, 0).unsqueeze(2) | |||
self.lstm_output_state = self.deep_lstm(self.inputs, self.input_masks, Train) # [batch, N, hidden_size] | |||
# -- Prepare masks | |||
batch_size, N = input_len.size() | |||
slf_attn_mask = input_len.eq(0.0) # [batch, N], 1 for padding | |||
slf_attn_mask = slf_attn_mask.unsqueeze(1).expand(-1, N, -1) # [batch, N, N] | |||
# -- Multi-head attention | |||
self.atten_output, self.output_slf_attn = self.slf_attn(self.lstm_output_state, self.lstm_output_state, self.lstm_output_state, mask=slf_attn_mask) | |||
self.atten_output *= input_len.unsqueeze(2) # [batch_size, N, lstm_hidden_size = (n_head * d_v)] | |||
self.multi_atten_output = self.atten_output.view(batch_size, N, self.n_head, self.d_v) # [batch_size, N, n_head, d_v] | |||
self.multi_atten_context = self.multi_atten_output[:, :, 0::2, :].sum(2) - self.multi_atten_output[:, :, 1::2, :].sum(2) # [batch_size, N, d_v] | |||
# -- Position-wise Feed-Forward Networks | |||
self.output_state = self.pos_ffn(self.multi_atten_context) | |||
self.output_state = self.output_state * input_len.unsqueeze(2) # [batch_size, N, d_v] | |||
p_sent = self.wh(self.output_state) # [batch, N, 2] | |||
idx = None | |||
if self._hps.m == 0: | |||
prediction = p_sent.view(-1, 2).max(1)[1] | |||
prediction = prediction.view(batch_size, -1) | |||
else: | |||
mask_output = torch.exp(p_sent[:, :, 1]) # [batch, N] | |||
mask_output = mask_output.masked_fill(input_len.eq(0), 0) | |||
topk, idx = torch.topk(mask_output, self._hps.m) | |||
prediction = torch.zeros(batch_size, N).scatter_(1, idx.data.cpu(), 1) | |||
prediction = prediction.long().view(batch_size, -1) | |||
if self._hps.cuda: | |||
prediction = prediction.cuda() | |||
return {"p_sent": p_sent, "prediction": prediction, "pred_idx": idx} |
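When `hps.m` is non-zero, the model picks the `m` highest-scoring real sentences per document and turns the indices into a 0/1 prediction row. A minimal single-document sketch of that selection, with lists in place of tensors; `select_top_m` and the sample scores are hypothetical.

```python
import math


def select_top_m(logits, mask, m):
    """logits: score of label 1 per sentence; mask: 1 for sentence, 0 for padding."""
    # exp() keeps every real score positive, so zeroed padding never outranks it.
    scores = [math.exp(x) if keep else 0.0 for x, keep in zip(logits, mask)]
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:m]
    prediction = [0] * len(scores)
    for i in idx:  # equivalent of scatter_(1, idx, 1) for one row
        prediction[i] = 1
    return idx, prediction


idx, prediction = select_top_m([0.2, -1.0, 1.5, 0.7], [1, 1, 1, 0], m=2)
```

The last sentence has the second-highest raw score but is padding, so it is not selected.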
@@ -0,0 +1,55 @@ | |||
#!/usr/bin/python | |||
# -*- coding: utf-8 -*- | |||
# __author__="Danqing Wang" | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================== | |||
import torch | |||
import torch.nn.functional as F | |||
from fastNLP.core.losses import LossBase | |||
from tools.logger import * | |||
class MyCrossEntropyLoss(LossBase): | |||
def __init__(self, pred=None, target=None, mask=None, padding_idx=-100, reduce='none'): | |||
super().__init__() | |||
self._init_param_map(pred=pred, target=target, mask=mask) | |||
self.padding_idx = padding_idx | |||
self.reduce = reduce | |||
def get_loss(self, pred, target, mask): | |||
""" | |||
:param pred: [batch, N, 2] | |||
:param target: [batch, N] | |||
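:param mask: [batch, N], 1 for sentence and 0 for padding | |||
:return: scalar loss, summed over the sentences of each document and averaged over the batch; requires reduce='none' so the per-sentence losses can be reshaped to [batch, N] | |||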
""" | |||
# logger.debug(pred[0:5, :, :]) | |||
# logger.debug(target[0:5, :]) | |||
batch, N, _ = pred.size() | |||
pred = pred.view(-1, 2) | |||
target = target.view(-1) | |||
loss = F.cross_entropy(input=pred, target=target, | |||
ignore_index=self.padding_idx, reduction=self.reduce) | |||
loss = loss.view(batch, -1) | |||
loss = loss.masked_fill(mask.eq(0), 0) | |||
loss = loss.sum(1).mean() | |||
logger.debug("loss %f", loss) | |||
return loss | |||
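`get_loss` relies on a per-sentence (unreduced) cross entropy: padding positions are zeroed, each document's sentence losses are summed, and the sums are averaged over the batch. The sketch below shows that reduction in plain Python under the assumption that no target equals `padding_idx`; `cross_entropy` and `masked_doc_loss` are illustrative names, not fastNLP or PyTorch APIs.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - logits[target]


def masked_doc_loss(pred, target, mask):
    """pred: [batch][N][2], target: [batch][N], mask: [batch][N] of 0/1."""
    per_doc = []
    for doc_logits, doc_target, doc_mask in zip(pred, target, mask):
        total = sum(
            cross_entropy(logits, t)
            for logits, t, keep in zip(doc_logits, doc_target, doc_mask)
            if keep  # padding sentences contribute nothing
        )
        per_doc.append(total)
    return sum(per_doc) / len(per_doc)


loss = masked_doc_loss(
    pred=[[[0.0, 0.0], [0.0, 0.0]]],
    target=[[0, 1]],
    mask=[[1, 0]],
)
```

With uniform logits each real sentence costs `log 2`, and only one sentence is unmasked here.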
@@ -0,0 +1,171 @@ | |||
#!/usr/bin/python | |||
# -*- coding: utf-8 -*- | |||
# __author__="Danqing Wang" | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================== | |||
from __future__ import division | |||
import torch | |||
from rouge import Rouge | |||
from fastNLP.core.const import Const | |||
from fastNLP.core.metrics import MetricBase | |||
from tools.logger import * | |||
from tools.utils import pyrouge_score_all, pyrouge_score_all_multi | |||
class LabelFMetric(MetricBase): | |||
def __init__(self, pred=None, target=None): | |||
super().__init__() | |||
self._init_param_map(pred=pred, target=target) | |||
self.match = 0.0 | |||
self.pred = 0.0 | |||
self.true = 0.0 | |||
self.match_true = 0.0 | |||
self.total = 0.0 | |||
def evaluate(self, pred, target): | |||
""" | |||
:param pred: [batch, N] int | |||
:param target: [batch, N] int | |||
:return: | |||
""" | |||
target = target.data | |||
pred = pred.data | |||
# logger.debug(pred.size()) | |||
# logger.debug(pred[:5,:]) | |||
batch, N = pred.size() | |||
self.pred += pred.sum() | |||
self.true += target.sum() | |||
self.match += (pred == target).sum() | |||
self.match_true += ((pred == target) & (pred == 1)).sum() | |||
self.total += batch * N | |||
def get_metric(self, reset=True): | |||
# float() accepts both Python numbers and 0-dim tensors, so this also works before evaluate() has been called | |||
match, pred, true, match_true, total = float(self.match), float(self.pred), float(self.true), float(self.match_true), float(self.total) | |||
logger.debug((match, pred, true, match_true, total)) | |||
# Tensor division never raises ZeroDivisionError, so every denominator is guarded explicitly | |||
accu = match / total if total > 0 else 0.0 | |||
precision = match_true / pred if pred > 0 else 0.0 | |||
recall = match_true / true if true > 0 else 0.0 | |||
F = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0 | |||
if precision + recall == 0: | |||
logger.error("[Error] precision and recall are both zero, F is set to 0.0") | |||
if reset: | |||
self.pred, self.true, self.match_true, self.match, self.total = 0.0, 0.0, 0.0, 0.0, 0.0 | |||
ret = {"accu": accu, "p": precision, "r": recall, "f": F} | |||
logger.info(ret) | |||
return ret | |||
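The counters kept by `LabelFMetric` map onto accuracy, precision, recall and F1 for the positive label. A small self-contained sketch of the same bookkeeping on nested lists, with explicit guards for empty denominators; `label_scores` is a hypothetical helper and the inputs are made-up labels.

```python
def label_scores(pred, target):
    """pred, target: [batch][N] lists of 0/1 labels."""
    pairs = [(p, t) for pred_row, target_row in zip(pred, target)
             for p, t in zip(pred_row, target_row)]
    total = len(pairs)
    match = sum(1 for p, t in pairs if p == t)
    match_true = sum(1 for p, t in pairs if p == t and p == 1)
    pred_pos = sum(p for p, _ in pairs)   # number of predicted positives
    true_pos = sum(t for _, t in pairs)   # number of gold positives
    accu = match / total if total else 0.0
    precision = match_true / pred_pos if pred_pos else 0.0
    recall = match_true / true_pos if true_pos else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom else 0.0
    return {"accu": accu, "p": precision, "r": recall, "f": f}


scores = label_scores([[1, 0, 1, 0]], [[1, 1, 0, 0]])
empty = label_scores([[0, 0]], [[0, 0]])
```

The second call covers the case where nothing is predicted or labelled positive, which would otherwise divide by zero.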
class RougeMetric(MetricBase): | |||
def __init__(self, hps, pred=None, text=None, refer=None): | |||
super().__init__() | |||
self._hps = hps | |||
self._init_param_map(pred=pred, text=text, summary=refer) | |||
self.hyps = [] | |||
self.refers = [] | |||
def evaluate(self, pred, text, summary): | |||
""" | |||
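:param pred: [batch, N], 1 for selected sentences | |||
:param text: batch of documents, each a list of sentence strings | |||
:param summary: batch of references, each a list of sentence strings | |||
:return: None; selected hypotheses and references are accumulated in self.hyps and self.refers | |||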
""" | |||
batch_size, N = pred.size() | |||
for j in range(batch_size): | |||
original_article_sents = text[j] | |||
sent_max_number = len(original_article_sents) | |||
refer = "\n".join(summary[j]) | |||
hyps = "\n".join(original_article_sents[id] for id in range(len(pred[j])) if | |||
pred[j][id] == 1 and id < sent_max_number) | |||
if sent_max_number < self._hps.m and len(hyps) <= 1: | |||
print("sent_max_number is too short %d, Skip!" % sent_max_number) | |||
continue | |||
if len(hyps) >= 1 and hyps != '.': | |||
self.hyps.append(hyps) | |||
self.refers.append(refer) | |||
elif refer == "." or refer == "": | |||
logger.error("Refer is None!") | |||
logger.debug(refer) | |||
elif hyps == "." or hyps == "": | |||
logger.error("hyps is None!") | |||
logger.debug("sent_max_number:%d", sent_max_number) | |||
logger.debug("pred:") | |||
logger.debug(pred[j]) | |||
logger.debug(hyps) | |||
else: | |||
logger.error("Do not select any sentences!") | |||
logger.debug("sent_max_number:%d", sent_max_number) | |||
logger.debug(original_article_sents) | |||
logger.debug(refer) | |||
continue | |||
def get_metric(self, reset=True): | |||
pass | |||
class FastRougeMetric(RougeMetric): | |||
def __init__(self, hps, pred=None, text=None, refer=None): | |||
super().__init__(hps, pred, text, refer) | |||
def get_metric(self, reset=True): | |||
logger.info("[INFO] Hyps and Refer number is %d, %d", len(self.hyps), len(self.refers)) | |||
if len(self.hyps) == 0 or len(self.refers) == 0 : | |||
logger.error("During testing, no hyps or refers is selected!") | |||
return | |||
rouge = Rouge() | |||
scores_all = rouge.get_scores(self.hyps, self.refers, avg=True) | |||
if reset: | |||
self.hyps = [] | |||
self.refers = [] | |||
logger.info(scores_all) | |||
return scores_all | |||
class PyRougeMetric(RougeMetric): | |||
def __init__(self, hps, pred=None, text=None, refer=None): | |||
super().__init__(hps, pred, text, refer) | |||
def get_metric(self, reset=True): | |||
logger.info("[INFO] Hyps and Refer number is %d, %d", len(self.hyps), len(self.refers)) | |||
if len(self.hyps) == 0 or len(self.refers) == 0: | |||
logger.error("During testing, no hyps or refers is selected!") | |||
return | |||
if isinstance(self.refers[0], list): | |||
logger.info("Multi Reference summaries!") | |||
scores_all = pyrouge_score_all_multi(self.hyps, self.refers) | |||
else: | |||
scores_all = pyrouge_score_all(self.hyps, self.refers) | |||
if reset: | |||
self.hyps = [] | |||
self.refers = [] | |||
logger.info(scores_all) | |||
return scores_all | |||
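Before any ROUGE scorer runs, `RougeMetric.evaluate` turns each 0/1 prediction row into a newline-joined hypothesis, ignoring positions beyond the real sentence count. A minimal sketch of that step; `build_hypothesis` and the sample sentences are illustrative, not part of this PR.

```python
def build_hypothesis(sentences, pred_row):
    """Join the sentences whose prediction is 1, skipping padded positions."""
    return "\n".join(
        sentences[i]
        for i in range(len(pred_row))
        if i < len(sentences) and pred_row[i] == 1
    )


sentences = ["the cat sat .", "it rained .", "the end ."]
hyp = build_hypothesis(sentences, [1, 0, 1, 1, 0])
nothing = build_hypothesis(sentences, [0, 0, 0, 0, 0])
```

An empty result corresponds to the "Do not select any sentences!" branch above, where the document is skipped.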
@@ -0,0 +1,143 @@ | |||
#!/usr/bin/python | |||
# -*- coding: utf-8 -*- | |||
# __author__="Danqing Wang" | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================== | |||
from __future__ import absolute_import | |||
from __future__ import division | |||
from __future__ import print_function | |||
import numpy as np | |||
import torch | |||
import torch.nn as nn | |||
from .Encoder import Encoder | |||
# from tools.Encoder import Encoder | |||
from tools.PositionEmbedding import get_sinusoid_encoding_table | |||
from tools.logger import * | |||
from fastNLP.core.const import Const | |||
from fastNLP.modules.encoder.transformer import TransformerEncoder | |||
from transformer.Layers import EncoderLayer | |||
class TransformerModel(nn.Module): | |||
def __init__(self, hps, embed): | |||
""" | |||
:param hps: | |||
min_kernel_size: min kernel size for cnn encoder | |||
max_kernel_size: max kernel size for cnn encoder | |||
output_channel: output_channel number for cnn encoder | |||
hidden_size: hidden size for transformer | |||
n_layers: number of transformer encoder layers | |||
n_head: number of attention heads for transformer | |||
ffn_inner_hidden_size: FFN inner hidden size | |||
atten_dropout_prob: attention dropout probability | |||
doc_max_timesteps: max sentence number of the document | |||
:param embed: word embedding passed to the sentence encoder | |||
""" | |||
super(TransformerModel, self).__init__() | |||
self._hps = hps | |||
self.encoder = Encoder(hps, embed) | |||
self.sent_embedding_size = (hps.max_kernel_size - hps.min_kernel_size + 1) * hps.output_channel | |||
self.hidden_size = hps.hidden_size | |||
self.n_head = hps.n_head | |||
self.d_v = self.d_k = int(self.hidden_size / self.n_head) | |||
self.d_inner = hps.ffn_inner_hidden_size | |||
self.num_layers = hps.n_layers | |||
self.projection = nn.Linear(self.sent_embedding_size, self.hidden_size) | |||
self.sent_pos_embed = nn.Embedding.from_pretrained( | |||
get_sinusoid_encoding_table(hps.doc_max_timesteps + 1, self.hidden_size, padding_idx=0), freeze=True) | |||
self.layer_stack = nn.ModuleList([ | |||
EncoderLayer(self.hidden_size, self.d_inner, self.n_head, self.d_k, self.d_v, | |||
dropout=hps.atten_dropout_prob) | |||
for _ in range(self.num_layers)]) | |||
self.wh = nn.Linear(self.hidden_size, 2) | |||
def forward(self, words, seq_len): | |||
""" | |||
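:param words: [batch_size, N, seq_len] | |||
:param seq_len: [batch_size, N], 1 for sentence and 0 for padding | |||
:return: dict with p_sent [batch_size, N, 2], prediction [batch_size, N] and pred_idx (None when hps.m == 0) | |||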
""" | |||
# Sentence Encoder | |||
input = words | |||
input_len = seq_len | |||
self.sent_embedding = self.encoder(input) # [batch, N, Co * kernel_sizes] | |||
input_len = input_len.float() # [batch, N] | |||
# -- Prepare masks | |||
batch_size, N = input_len.size() | |||
self.slf_attn_mask = input_len.eq(0.0) # [batch, N] | |||
self.slf_attn_mask = self.slf_attn_mask.unsqueeze(1).expand(-1, N, -1) # [batch, N, N] | |||
self.non_pad_mask = input_len.unsqueeze(-1) # [batch, N, 1] | |||
input_doc_len = input_len.sum(dim=1).int() # [batch] | |||
sent_pos = torch.Tensor( | |||
[np.hstack((np.arange(1, doclen + 1), np.zeros(N - doclen))) for doclen in input_doc_len]) | |||
sent_pos = sent_pos.long().cuda() if self._hps.cuda else sent_pos.long() | |||
enc_output_state = self.projection(self.sent_embedding) | |||
enc_input = enc_output_state + self.sent_pos_embed(sent_pos) | |||
# self.enc_slf_attn = self.enc_slf_attn * self.non_pad_mask | |||
enc_input_list = [] | |||
for enc_layer in self.layer_stack: | |||
# enc_output = [batch_size, N, hidden_size = n_head * d_v] | |||
# enc_slf_attn = [n_head * batch_size, N, N] | |||
enc_input, enc_slf_atten = enc_layer(enc_input, non_pad_mask=self.non_pad_mask, | |||
slf_attn_mask=self.slf_attn_mask) | |||
enc_input_list += [enc_input] | |||
self.dec_output_state = torch.cat(enc_input_list[-4:]) # [4 * batch_size, N, hidden_size]; assumes n_layers >= 4 | |||
self.dec_output_state = self.dec_output_state.view(4, batch_size, N, -1) | |||
self.dec_output_state = self.dec_output_state.sum(0) | |||
p_sent = self.wh(self.dec_output_state) # [batch, N, 2] | |||
idx = None | |||
if self._hps.m == 0: | |||
prediction = p_sent.view(-1, 2).max(1)[1] | |||
prediction = prediction.view(batch_size, -1) | |||
else: | |||
mask_output = torch.exp(p_sent[:, :, 1]) # [batch, N] | |||
mask_output = mask_output.masked_fill(input_len.eq(0), 0) | |||
topk, idx = torch.topk(mask_output, self._hps.m) | |||
prediction = torch.zeros(batch_size, N).scatter_(1, idx.data.cpu(), 1) | |||
prediction = prediction.long().view(batch_size, -1) | |||
if self._hps.cuda: | |||
prediction = prediction.cuda() | |||
# logger.debug(((p_sent.size(), prediction.size(), idx.size()))) | |||
return {"p_sent": p_sent, "prediction": prediction, "pred_idx": idx} | |||
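The self-attention mask built in `forward` marks padded sentences as keys that no query may attend to: the `[batch, N]` padding row is repeated for each of the N query positions. A single-document sketch with lists in place of tensors; `self_attention_pad_mask` is a hypothetical name.

```python
def self_attention_pad_mask(input_len_row):
    """input_len_row: N flags, 1 for sentence and 0 for padding.
    Returns an N x N mask where mask[q][k] is True if key k is padding."""
    key_is_pad = [flag == 0 for flag in input_len_row]
    # Same row for every query position, like unsqueeze(1).expand(-1, N, -1).
    return [list(key_is_pad) for _ in range(len(input_len_row))]


mask = self_attention_pad_mask([1, 1, 0])
```

Rows for padded queries are not special-cased here; in the model their outputs are zeroed separately through `non_pad_mask`.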
@@ -0,0 +1,138 @@ | |||
#!/usr/bin/python | |||
# -*- coding: utf-8 -*- | |||
# __author__="Danqing Wang" | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================== | |||
from __future__ import absolute_import | |||
from __future__ import division | |||
from __future__ import print_function | |||
import numpy as np | |||
import torch | |||
import torch.nn as nn | |||
from .Encoder import Encoder | |||
from tools.PositionEmbedding import get_sinusoid_encoding_table | |||
from fastNLP.core.const import Const | |||
from fastNLP.modules.encoder.transformer import TransformerEncoder | |||
class TransformerModel(nn.Module):
    def __init__(self, hps, vocab):
        """
        :param hps: hyperparameters; the fields used by this model are
            min_kernel_size: min kernel size for cnn encoder
            max_kernel_size: max kernel size for cnn encoder
            output_channel: output_channel number for cnn encoder
            hidden_size: hidden size for transformer
            n_layers: number of transformer encoder layers
            n_head: number of attention heads for transformer
            ffn_inner_hidden_size: FFN inner hidden size
            atten_dropout_prob: attention dropout probability
            doc_max_timesteps: max sentence number of the document
            m: number of sentences to select (0 means per-sentence argmax)
            cuda: whether tensors are moved to the GPU
        :param vocab: vocabulary passed on to the sentence encoder
        """
        super(TransformerModel, self).__init__()

        self._hps = hps
        self._vocab = vocab

        self.encoder = Encoder(hps, vocab)

        self.sent_embedding_size = (hps.max_kernel_size - hps.min_kernel_size + 1) * hps.output_channel
        self.hidden_size = hps.hidden_size
        self.n_head = hps.n_head
        self.d_v = self.d_k = int(self.hidden_size / self.n_head)
        self.d_inner = hps.ffn_inner_hidden_size
        self.num_layers = hps.n_layers

        self.projection = nn.Linear(self.sent_embedding_size, self.hidden_size)
        self.sent_pos_embed = nn.Embedding.from_pretrained(
            get_sinusoid_encoding_table(hps.doc_max_timesteps + 1, self.hidden_size, padding_idx=0), freeze=True)

        self.layer_stack = nn.ModuleList([
            TransformerEncoder.SubLayer(model_size=self.hidden_size, inner_size=self.d_inner,
                                        key_size=self.d_k, value_size=self.d_v,
                                        num_head=self.n_head, dropout=hps.atten_dropout_prob)
            for _ in range(self.num_layers)])

        self.wh = nn.Linear(self.hidden_size, 2)

    def forward(self, words, seq_len):
        """
        :param words: [batch_size, N, seq_len]
        :param seq_len: [batch_size, N], used below as a 0/1 sentence mask
        :return: dict with "pred" [batch, N, 2], "prediction" [batch, N]
            and "pred_idx" [batch, hps.m] (None when hps.m == 0)
        """
        # Sentence Encoder
        input = words
        input_len = seq_len

        self.sent_embedding = self.encoder(input)  # [batch, N, Co * kernel_sizes]

        input_len = input_len.float()  # [batch, N]

        # -- Prepare masks
        batch_size, N = input_len.size()
        self.slf_attn_mask = input_len.eq(0.0)  # [batch, N]
        self.slf_attn_mask = self.slf_attn_mask.unsqueeze(1).expand(-1, N, -1)  # [batch, N, N]
        self.non_pad_mask = input_len.unsqueeze(-1)  # [batch, N, 1]

        # Position ids: 1..doc_len for real sentences, 0 for padding
        input_doc_len = input_len.sum(dim=1).int()  # [batch]
        sent_pos = torch.Tensor([np.hstack((np.arange(1, doclen + 1), np.zeros(N - doclen))) for doclen in input_doc_len])
        sent_pos = sent_pos.long().cuda() if self._hps.cuda else sent_pos.long()

        enc_output_state = self.projection(self.sent_embedding)
        enc_input = enc_output_state + self.sent_pos_embed(sent_pos)

        # self.enc_slf_attn = self.enc_slf_attn * self.non_pad_mask
        enc_input_list = []
        for enc_layer in self.layer_stack:
            # enc_input = [batch_size, N, hidden_size = n_head * d_v]
            enc_input = enc_layer(enc_input, seq_mask=self.non_pad_mask, atte_mask_out=self.slf_attn_mask)
            enc_input_list += [enc_input]

        # Sum the outputs of the last four layers; this assumes n_layers >= 4
        self.dec_output_state = torch.cat(enc_input_list[-4:])  # [4 * batch_size, N, hidden_state]
        self.dec_output_state = self.dec_output_state.view(4, batch_size, N, -1)  # [4, batch_size, N, hidden_state]
        self.dec_output_state = self.dec_output_state.sum(0)  # [batch_size, N, hidden_state]

        p_sent = self.wh(self.dec_output_state)  # [batch, N, 2]

        idx = None
        if self._hps.m == 0:
            # Label each sentence independently by argmax over the two classes
            prediction = p_sent.view(-1, 2).max(1)[1]
            prediction = prediction.view(batch_size, -1)
        else:
            # Select the m sentences with the highest score for class 1
            mask_output = torch.exp(p_sent[:, :, 1])  # [batch, N]
            mask_output = mask_output * input_len.float()
            topk, idx = torch.topk(mask_output, self._hps.m)
            prediction = torch.zeros(batch_size, N).scatter_(1, idx.data.cpu(), 1)
            prediction = prediction.long().view(batch_size, -1)

            if self._hps.cuda:
                prediction = prediction.cuda()

        # print((p_sent.size(), prediction.size(), idx.size()))
        # [batch, N, 2], [batch, N], [batch, hps.m]
        return {"pred": p_sent, "prediction": prediction, "pred_idx": idx}
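The sentence position ids fed to `sent_pos_embed` count real sentences from 1 and use 0 for padding, which lines up with `padding_idx=0` in the sinusoid table. The sketch below reproduces that rule on plain lists. `sentence_positions` is a hypothetical helper, and like the code above it assumes that padding sits at the end of each document.

```python
def sentence_positions(mask_row):
    """Map a 0/1 sentence mask to position ids: 1..doc_len, then zeros."""
    doc_len = int(sum(mask_row))  # number of real sentences
    return list(range(1, doc_len + 1)) + [0] * (len(mask_row) - doc_len)


# Two documents padded to N = 5 sentences.
batch_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
sent_pos = [sentence_positions(row) for row in batch_mask]
```

Because the document length is taken from the sum of the mask, a mask with a gap in the middle (for example `[1, 0, 1]`) would get positions `[1, 2, 0]`, which is probably not what is wanted. The rule only holds for tail padding.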
@@ -0,0 +1,36 @@ | |||
import unittest | |||
import sys | |||
sys.path.append('..') | |||
from data.dataloader import SummarizationLoader | |||
vocab_size = 100000 | |||
vocab_path = "testdata/vocab" | |||
sent_max_len = 100 | |||
doc_max_timesteps = 50 | |||
class TestSummarizationLoader(unittest.TestCase): | |||
def test_case1(self): | |||
sum_loader = SummarizationLoader() | |||
paths = {"train":"testdata/train.jsonl", "valid":"testdata/val.jsonl", "test":"testdata/test.jsonl"} | |||
data = sum_loader.process(paths=paths, vocab_size=vocab_size, vocab_path=vocab_path, sent_max_len=sent_max_len, doc_max_timesteps=doc_max_timesteps) | |||
print(data.datasets) | |||
def test_case2(self): | |||
sum_loader = SummarizationLoader() | |||
paths = {"train": "testdata/train.jsonl", "valid": "testdata/val.jsonl", "test": "testdata/test.jsonl"} | |||
data = sum_loader.process(paths=paths, vocab_size=vocab_size, vocab_path=vocab_path, sent_max_len=sent_max_len, doc_max_timesteps=doc_max_timesteps, domain=True) | |||
print(data.datasets, data.vocabs) | |||
def test_case3(self): | |||
sum_loader = SummarizationLoader() | |||
paths = {"train": "testdata/train.jsonl", "valid": "testdata/val.jsonl", "test": "testdata/test.jsonl"} | |||
data = sum_loader.process(paths=paths, vocab_size=vocab_size, vocab_path=vocab_path, sent_max_len=sent_max_len, doc_max_timesteps=doc_max_timesteps, tag=True) | |||
print(data.datasets, data.vocabs) | |||
@@ -0,0 +1,36 @@ | |||
import unittest | |||
import sys | |||
sys.path.append('..') | |||
from data.dataloader import SummarizationLoader | |||
vocab_size = 100000 | |||
vocab_path = "testdata/vocab" | |||
sent_max_len = 100 | |||
doc_max_timesteps = 50 | |||
class TestSummarizationLoader(unittest.TestCase): | |||
def test_case1(self): | |||
sum_loader = SummarizationLoader() | |||
paths = {"train":"testdata/train.jsonl", "valid":"testdata/val.jsonl", "test":"testdata/test.jsonl"} | |||
data = sum_loader.process(paths=paths, vocab_size=vocab_size, vocab_path=vocab_path, sent_max_len=sent_max_len, doc_max_timesteps=doc_max_timesteps) | |||
print(data.datasets) | |||
def test_case2(self): | |||
sum_loader = SummarizationLoader() | |||
paths = {"train": "testdata/train.jsonl", "valid": "testdata/val.jsonl", "test": "testdata/test.jsonl"} | |||
data = sum_loader.process(paths=paths, vocab_size=vocab_size, vocab_path=vocab_path, sent_max_len=sent_max_len, doc_max_timesteps=doc_max_timesteps, domain=True) | |||
print(data.datasets, data.vocabs) | |||
def test_case3(self): | |||
sum_loader = SummarizationLoader() | |||
paths = {"train": "testdata/train.jsonl", "valid": "testdata/val.jsonl", "test": "testdata/test.jsonl"} | |||
data = sum_loader.process(paths=paths, vocab_size=vocab_size, vocab_path=vocab_path, sent_max_len=sent_max_len, doc_max_timesteps=doc_max_timesteps, tag=True) | |||
print(data.datasets, data.vocabs) | |||
@@ -0,0 +1,56 @@ | |||
#!/usr/bin/python | |||
# -*- coding: utf-8 -*- | |||
# __author__="Danqing Wang" | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================== | |||
import os | |||
import sys | |||
sys.path.append('/remote-home/dqwang/FastNLP/fastNLP_brxx/') | |||
from fastNLP.core.const import Const | |||
from data.dataloader import SummarizationLoader | |||
from tools.data import ExampleSet, Vocab | |||
vocab_size = 100000 | |||
vocab_path = "test/testdata/vocab" | |||
sent_max_len = 100 | |||
doc_max_timesteps = 50 | |||
# paths = {"train": "test/testdata/train.jsonl", "valid": "test/testdata/val.jsonl"} | |||
paths = {"train": "/remote-home/dqwang/Datasets/CNNDM/train.label.jsonl", "valid": "/remote-home/dqwang/Datasets/CNNDM/val.label.jsonl"} | |||
sum_loader = SummarizationLoader() | |||
dataInfo = sum_loader.process(paths=paths, vocab_size=vocab_size, vocab_path=vocab_path, sent_max_len=sent_max_len, doc_max_timesteps=doc_max_timesteps, load_vocab_file=True) | |||
trainset = dataInfo.datasets["train"] | |||
vocab = Vocab(vocab_path, vocab_size) | |||
dataset = ExampleSet(paths["train"], vocab, doc_max_timesteps, sent_max_len) | |||
# print(trainset[0]["text"]) | |||
# print(dataset.get_example(0).original_article_sents) | |||
# print(trainset[0]["words"]) | |||
# print(dataset[0][0].numpy().tolist()) | |||
b_size = len(trainset) | |||
for i in range(b_size): | |||
if i <= 7327: | |||
continue | |||
print(trainset[i][Const.INPUT]) | |||
print(dataset[i][0].numpy().tolist()) | |||
assert trainset[i][Const.INPUT] == dataset[i][0].numpy().tolist(), i | |||
assert trainset[i][Const.INPUT_LEN] == dataset[i][2].numpy().tolist(), i | |||
assert trainset[i][Const.TARGET] == dataset[i][1].numpy().tolist(), i |
@@ -0,0 +1,135 @@ | |||
#!/usr/bin/python | |||
# -*- coding: utf-8 -*- | |||
# __author__="Danqing Wang" | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================== | |||
import os | |||
import sys | |||
import time | |||
import numpy as np | |||
import torch | |||
from fastNLP.core.const import Const | |||
from fastNLP.io.model_io import ModelSaver | |||
from fastNLP.core.callback import Callback, EarlyStopError | |||
from tools.logger import * | |||
class TrainCallback(Callback):
    def __init__(self, hps, patience=3, quit_all=True):
        super().__init__()
        self._hps = hps
        self.patience = patience
        self.wait = 0

        if not isinstance(quit_all, bool):
            raise ValueError("In KeyBoardInterrupt, quit_all argument must be a bool.")
        self.quit_all = quit_all

    def on_epoch_begin(self):
        self.epoch_start_time = time.time()

    # def on_loss_begin(self, batch_y, predict_y):
    #     """
    #
    #     :param batch_y: dict
    #             input_len: [batch, N]
    #     :param predict_y: dict
    #             p_sent: [batch, N, 2]
    #     :return:
    #     """
    #     input_len = batch_y[Const.INPUT_LEN]
    #     batch_y[Const.TARGET] = batch_y[Const.TARGET] * ((1 - input_len) * -100)
    #     # predict_y["p_sent"] = predict_y["p_sent"] * input_len.unsqueeze(-1)
    #     # logger.debug(predict_y["p_sent"][0:5,:,:])

    def on_backward_begin(self, loss):
        """
        :param loss: [] (scalar tensor)
        :return:
        """
        # torch.isfinite also works for CUDA tensors, unlike np.isfinite
        if not torch.isfinite(loss).all():
            logger.error("train Loss is not finite. Stopping.")
            logger.info(loss)
            for name, param in self.model.named_parameters():
                # Gradients may still be None before the first backward pass
                if param.requires_grad and param.grad is not None:
                    logger.info(name)
                    logger.info(param.grad.data.sum())
            raise Exception("train Loss is not finite. Stopping.")

    def on_backward_end(self):
        if self._hps.grad_clip:
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), self._hps.max_grad_norm)

    def on_epoch_end(self):
        logger.info('   | end of epoch {:3d} | time: {:5.2f}s | '
                    .format(self.epoch, (time.time() - self.epoch_start_time)))

    def on_valid_begin(self):
        self.valid_start_time = time.time()

    def on_valid_end(self, eval_result, metric_key, optimizer, is_better_eval):
        logger.info('   | end of valid {:3d} | time: {:5.2f}s | '
                    .format(self.epoch, (time.time() - self.valid_start_time)))

        # early stop: fires on the (patience + 1)-th consecutive validation without improvement
        if not is_better_eval:
            if self.wait == self.patience:
                train_dir = os.path.join(self._hps.save_root, "train")
                save_file = os.path.join(train_dir, "earlystop.pkl")
                saver = ModelSaver(save_file)
                saver.save_pytorch(self.model)
                logger.info('[INFO] Saving early stop model to %s', save_file)
                raise EarlyStopError("Early stopping raised.")
            else:
                self.wait += 1
        else:
            self.wait = 0

        # lr descent: divide the base lr by (epoch + 1), with a floor of 5e-6
        if self._hps.lr_descent:
            new_lr = max(5e-6, self._hps.lr / (self.epoch + 1))
            for param_group in list(optimizer.param_groups):
                param_group['lr'] = new_lr
            logger.info("[INFO] The learning rate now is %f", new_lr)

    def on_exception(self, exception):
        if isinstance(exception, KeyboardInterrupt):
            logger.error("[Error] Caught keyboard interrupt on worker. Stopping supervisor...")
            train_dir = os.path.join(self._hps.save_root, "train")
            save_file = os.path.join(train_dir, "earlystop.pkl")
            saver = ModelSaver(save_file)
            saver.save_pytorch(self.model)
            logger.info('[INFO] Saving early stop model to %s', save_file)

            if self.quit_all:
                sys.exit(0)  # exit the program directly
        else:
            raise exception  # re-raise any other, unexpected error
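The early-stopping counter and the learning-rate decay in `on_valid_end` are easy to misread, so here is a small stand-alone sketch of both rules. `first_stop_index` and `decayed_lr` are hypothetical helper names. The logic is copied from the callback, with the trainer, the model saving and the logging left out.

```python
def first_stop_index(is_better_flags, patience):
    """Index of the validation round at which early stopping fires, or None."""
    wait = 0
    for step, is_better in enumerate(is_better_flags):
        if is_better:
            wait = 0          # an improvement resets the counter
        elif wait == patience:
            return step       # the callback raises EarlyStopError here
        else:
            wait += 1
    return None


def decayed_lr(base_lr, epoch, floor=5e-6):
    """Learning rate after the given epoch: base_lr / (epoch + 1), never below floor."""
    return max(floor, base_lr / (epoch + 1))


stop_at = first_stop_index([False, False, False, False], patience=3)
lr_after_epoch_1 = decayed_lr(1e-3, epoch=1)
```

With `patience=3` the stop fires on the fourth consecutive validation without improvement, not the third, because the counter is compared before it is incremented.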
@@ -0,0 +1,562 @@ | |||
from __future__ import absolute_import | |||
from __future__ import division | |||
from __future__ import print_function | |||
import numpy as np | |||
import torch | |||
import torch.nn as nn | |||
import torch.nn.functional as F | |||
from torch.autograd import * | |||
import torch.nn.init as init | |||
import data | |||
from tools.logger import * | |||
from transformer.Models import get_sinusoid_encoding_table | |||
class Encoder(nn.Module): | |||
def __init__(self, hps, vocab): | |||
super(Encoder, self).__init__() | |||
self._hps = hps | |||
self._vocab = vocab | |||
self.sent_max_len = hps.sent_max_len | |||
vocab_size = len(vocab) | |||
logger.info("[INFO] Vocabulary size is %d", vocab_size) | |||
embed_size = hps.word_emb_dim | |||
sent_max_len = hps.sent_max_len | |||
input_channels = 1 | |||
out_channels = hps.output_channel | |||
min_kernel_size = hps.min_kernel_size | |||
max_kernel_size = hps.max_kernel_size | |||
width = embed_size | |||
# word embedding | |||
self.embed = nn.Embedding(vocab_size, embed_size, padding_idx=vocab.word2id('[PAD]')) | |||
if hps.word_embedding: | |||
word2vec = data.Word_Embedding(hps.embedding_path, vocab) | |||
word_vecs = word2vec.load_my_vecs(embed_size) | |||
# pretrained_weight = word2vec.add_unknown_words_by_zero(word_vecs, embed_size) | |||
pretrained_weight = word2vec.add_unknown_words_by_avg(word_vecs, embed_size) | |||
pretrained_weight = np.array(pretrained_weight) | |||
self.embed.weight.data.copy_(torch.from_numpy(pretrained_weight)) | |||
self.embed.weight.requires_grad = hps.embed_train | |||
# position embedding | |||
self.position_embedding = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(sent_max_len + 1, embed_size, padding_idx=0), freeze=True) | |||
# cnn | |||
self.convs = nn.ModuleList([nn.Conv2d(input_channels, out_channels, kernel_size = (height, width)) for height in range(min_kernel_size, max_kernel_size+1)]) | |||
logger.info("[INFO] Initing W for CNN.......") | |||
for conv in self.convs: | |||
init_weight_value = 6.0 | |||
init.xavier_normal_(conv.weight.data, gain=np.sqrt(init_weight_value)) | |||
fan_in, fan_out = Encoder.calculate_fan_in_and_fan_out(conv.weight.data) | |||
std = np.sqrt(init_weight_value) * np.sqrt(2.0 / (fan_in + fan_out)) | |||
@staticmethod
def calculate_fan_in_and_fan_out(tensor):
dimensions = tensor.ndimension() | |||
if dimensions < 2: | |||
logger.error("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
raise ValueError("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
if dimensions == 2: # Linear | |||
fan_in = tensor.size(1) | |||
fan_out = tensor.size(0) | |||
else: | |||
num_input_fmaps = tensor.size(1) | |||
num_output_fmaps = tensor.size(0) | |||
receptive_field_size = 1 | |||
if tensor.dim() > 2: | |||
receptive_field_size = tensor[0][0].numel() | |||
fan_in = num_input_fmaps * receptive_field_size | |||
fan_out = num_output_fmaps * receptive_field_size | |||
return fan_in, fan_out | |||
def forward(self, input): | |||
# input: a batch of Example object [batch_size, N, seq_len] | |||
vocab = self._vocab | |||
batch_size, N, _ = input.size() | |||
input = input.view(-1, input.size(2)) # [batch_size*N, L] | |||
input_sent_len = ((input!=vocab.word2id('[PAD]')).sum(dim=1)).int()  # [batch_size*N]
enc_embed_input = self.embed(input) # [batch_size*N, L, D] | |||
input_pos = torch.Tensor([np.hstack((np.arange(1, sentlen + 1), np.zeros(self.sent_max_len - sentlen))) for sentlen in input_sent_len]) | |||
if self._hps.cuda: | |||
input_pos = input_pos.cuda() | |||
enc_pos_embed_input = self.position_embedding(input_pos.long())  # [batch_size*N, L, D]
enc_conv_input = enc_embed_input + enc_pos_embed_input | |||
enc_conv_input = enc_conv_input.unsqueeze(1) # (batch * N,Ci,L,D) | |||
enc_conv_output = [F.relu(conv(enc_conv_input)).squeeze(3) for conv in self.convs] # kernel_sizes * (batch*N, Co, W) | |||
enc_maxpool_output = [F.max_pool1d(x, x.size(2)).squeeze(2) for x in enc_conv_output] # kernel_sizes * (batch*N, Co) | |||
sent_embedding = torch.cat(enc_maxpool_output, 1) # (batch*N, Co * kernel_sizes) | |||
sent_embedding = sent_embedding.view(batch_size, N, -1) | |||
return sent_embedding | |||
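The shape arithmetic behind the CNN sentence encoder above can be checked by hand. The sketch below computes the fan-in and fan-out of one convolution, the Xavier standard deviation that follows from them, and the size of the concatenated sentence embedding. The helper names and the example numbers (50 output channels, 300-dimensional word embeddings, kernel heights 1 to 7) are illustrative assumptions, not values taken from the repository.

```python
import math


def conv_fans(out_channels, in_channels, kernel_h, kernel_w):
    """Fan-in and fan-out of a Conv2d weight of shape (out, in, kernel_h, kernel_w)."""
    receptive_field = kernel_h * kernel_w
    return in_channels * receptive_field, out_channels * receptive_field


def xavier_std(fan_in, fan_out, gain):
    """Standard deviation used by Xavier (Glorot) normal initialisation."""
    return gain * math.sqrt(2.0 / (fan_in + fan_out))


def sent_embedding_size(min_kernel_size, max_kernel_size, out_channels):
    """One max-pooled vector of out_channels values per kernel height, concatenated."""
    return (max_kernel_size - min_kernel_size + 1) * out_channels


# Assumed example: kernel height 3 over 300-dim embeddings, 1 input channel.
fan_in, fan_out = conv_fans(out_channels=50, in_channels=1, kernel_h=3, kernel_w=300)
std = xavier_std(fan_in, fan_out, gain=math.sqrt(6.0))
size = sent_embedding_size(min_kernel_size=1, max_kernel_size=7, out_channels=50)
```

The last helper matches the `Co * kernel_sizes` dimension noted in the comments of `forward`.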
class DomainEncoder(Encoder): | |||
def __init__(self, hps, vocab, domaindict): | |||
super(DomainEncoder, self).__init__(hps, vocab) | |||
# domain embedding | |||
self.domain_embedding = nn.Embedding(domaindict.size(), hps.domain_emb_dim) | |||
self.domain_embedding.weight.requires_grad = True | |||
def forward(self, input, domain): | |||
""" | |||
:param input: [batch_size, N, seq_len], N sentence number, seq_len token number | |||
:param domain: [batch_size] | |||
:return: sent_embedding: [batch_size, N, Co * kernel_sizes] | |||
""" | |||
batch_size, N, _ = input.size() | |||
sent_embedding = super().forward(input) | |||
enc_domain_input = self.domain_embedding(domain) # [batch, D] | |||
enc_domain_input = enc_domain_input.unsqueeze(1).expand(batch_size, N, -1) # [batch, N, D] | |||
sent_embedding = torch.cat((sent_embedding, enc_domain_input), dim=2) | |||
return sent_embedding | |||
class MultiDomainEncoder(Encoder): | |||
def __init__(self, hps, vocab, domaindict): | |||
super(MultiDomainEncoder, self).__init__(hps, vocab) | |||
self.domain_size = domaindict.size() | |||
# domain embedding | |||
self.domain_embedding = nn.Embedding(self.domain_size, hps.domain_emb_dim) | |||
self.domain_embedding.weight.requires_grad = True | |||
def forward(self, input, domain): | |||
""" | |||
:param input: [batch_size, N, seq_len], N sentence number, seq_len token number | |||
:param domain: [batch_size, domain_size] | |||
:return: sent_embedding: [batch_size, N, Co * kernel_sizes] | |||
""" | |||
batch_size, N, _ = input.size() | |||
# logger.info(domain[:5, :]) | |||
sent_embedding = super().forward(input) | |||
domain_padding = torch.arange(self.domain_size).unsqueeze(0).expand(batch_size, -1) | |||
domain_padding = domain_padding.cuda().view(-1) if self._hps.cuda else domain_padding.view(-1) # [batch * domain_size] | |||
enc_domain_input = self.domain_embedding(domain_padding) # [batch * domain_size, D] | |||
enc_domain_input = enc_domain_input.view(batch_size, self.domain_size, -1) * domain.unsqueeze(-1).float() # [batch, domain_size, D] | |||
# logger.info(enc_domain_input[:5,:]) # [batch, domain_size, D] | |||
enc_domain_input = enc_domain_input.sum(1) / domain.sum(1).float().unsqueeze(-1) # [batch, D] | |||
enc_domain_input = enc_domain_input.unsqueeze(1).expand(batch_size, N, -1) # [batch, N, D] | |||
sent_embedding = torch.cat((sent_embedding, enc_domain_input), dim=2) | |||
return sent_embedding | |||
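`MultiDomainEncoder.forward` averages the embeddings of all domains that are switched on in the multi-hot `domain` vector. A list-based sketch of that masked mean for a single example follows. `mean_domain_embedding` is a hypothetical name, and unlike the tensor code it raises an error when no domain is active instead of dividing by zero.

```python
def mean_domain_embedding(domain_table, domain_multi_hot):
    """Average the rows of domain_table whose flag in domain_multi_hot is 1."""
    active = sum(domain_multi_hot)
    if active == 0:
        # Departure from the original, which would divide by zero here.
        raise ValueError("at least one domain must be active")
    dim = len(domain_table[0])
    total = [0.0] * dim
    for row, flag in zip(domain_table, domain_multi_hot):
        for j in range(dim):
            total[j] += row[j] * flag  # inactive domains contribute nothing
    return [value / active for value in total]


# Three domains with 2-dim embeddings; domains 0 and 2 are active.
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
vec = mean_domain_embedding(table, [1, 0, 1])
```

In the model this vector is then repeated for each of the `N` sentences and concatenated to the sentence embeddings.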
class BertEncoder(nn.Module): | |||
def __init__(self, hps): | |||
super(BertEncoder, self).__init__() | |||
from pytorch_pretrained_bert.modeling import BertModel | |||
self._hps = hps | |||
self.sent_max_len = hps.sent_max_len | |||
self._cuda = hps.cuda | |||
embed_size = hps.word_emb_dim | |||
sent_max_len = hps.sent_max_len | |||
input_channels = 1 | |||
out_channels = hps.output_channel | |||
min_kernel_size = hps.min_kernel_size | |||
max_kernel_size = hps.max_kernel_size | |||
width = embed_size | |||
# word embedding | |||
self._bert = BertModel.from_pretrained("/remote-home/dqwang/BERT/pre-train/uncased_L-24_H-1024_A-16") | |||
self._bert.eval() | |||
for p in self._bert.parameters(): | |||
p.requires_grad = False | |||
self.word_embedding_proj = nn.Linear(4096, embed_size) | |||
# position embedding | |||
self.position_embedding = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(sent_max_len + 1, embed_size, padding_idx=0), freeze=True) | |||
# cnn | |||
self.convs = nn.ModuleList([nn.Conv2d(input_channels, out_channels, kernel_size = (height, width)) for height in range(min_kernel_size, max_kernel_size+1)]) | |||
logger.info("[INFO] Initing W for CNN.......") | |||
for conv in self.convs: | |||
init_weight_value = 6.0 | |||
init.xavier_normal_(conv.weight.data, gain=np.sqrt(init_weight_value)) | |||
fan_in, fan_out = Encoder.calculate_fan_in_and_fan_out(conv.weight.data) | |||
std = np.sqrt(init_weight_value) * np.sqrt(2.0 / (fan_in + fan_out)) | |||
@staticmethod
def calculate_fan_in_and_fan_out(tensor):
dimensions = tensor.ndimension() | |||
if dimensions < 2: | |||
logger.error("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
raise ValueError("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
if dimensions == 2: # Linear | |||
fan_in = tensor.size(1) | |||
fan_out = tensor.size(0) | |||
else: | |||
num_input_fmaps = tensor.size(1) | |||
num_output_fmaps = tensor.size(0) | |||
receptive_field_size = 1 | |||
if tensor.dim() > 2: | |||
receptive_field_size = tensor[0][0].numel() | |||
fan_in = num_input_fmaps * receptive_field_size | |||
fan_out = num_output_fmaps * receptive_field_size | |||
return fan_in, fan_out | |||
def pad_encoder_input(self, input_list): | |||
""" | |||
:param input_list: N [seq_len, hidden_state] | |||
:return: enc_sent_input_pad: list, N [max_len, hidden_state] | |||
""" | |||
max_len = self.sent_max_len | |||
enc_sent_input_pad = [] | |||
_, hidden_size = input_list[0].size() | |||
for i in range(len(input_list)): | |||
article_words = input_list[i] # [seq_len, hidden_size] | |||
seq_len = article_words.size(0) | |||
if seq_len > max_len: | |||
pad_words = article_words[:max_len, :] | |||
else: | |||
pad_tensor = torch.zeros(max_len - seq_len, hidden_size).cuda() if self._cuda else torch.zeros(max_len - seq_len, hidden_size) | |||
pad_words = torch.cat([article_words, pad_tensor], dim=0) | |||
enc_sent_input_pad.append(pad_words) | |||
return enc_sent_input_pad | |||
def forward(self, inputs, input_masks, enc_sent_len): | |||
""" | |||
:param inputs: a batch of Example object [batch_size, doc_len=512] | |||
:param input_masks: 0 or 1, [batch, doc_len=512] | |||
:param enc_sent_len: sentence original length [batch, N] | |||
:return: | |||
""" | |||
# Use Bert to get word embedding | |||
batch_size, N = enc_sent_len.size() | |||
input_pad_list = [] | |||
for i in range(batch_size): | |||
tokens_id = inputs[i] | |||
input_mask = input_masks[i] | |||
sent_len = enc_sent_len[i] | |||
input_ids = tokens_id.unsqueeze(0) | |||
input_mask = input_mask.unsqueeze(0) | |||
out, _ = self._bert(input_ids, token_type_ids=None, attention_mask=input_mask) | |||
out = torch.cat(out[-4:], dim=-1).squeeze(0) # [doc_len=512, hidden_state=4096] | |||
_, hidden_size = out.size() | |||
# restore the sentence | |||
last_end = 1 | |||
enc_sent_input = [] | |||
for length in sent_len: | |||
if length != 0 and last_end < 511: | |||
enc_sent_input.append(out[last_end: min(511, last_end + length), :]) | |||
last_end += length | |||
else: | |||
pad_tensor = torch.zeros(self.sent_max_len, hidden_size).cuda() if self._hps.cuda else torch.zeros(self.sent_max_len, hidden_size) | |||
enc_sent_input.append(pad_tensor) | |||
# pad the sentence | |||
enc_sent_input_pad = self.pad_encoder_input(enc_sent_input) # [N, seq_len, hidden_state=4096] | |||
input_pad_list.append(torch.stack(enc_sent_input_pad)) | |||
input_pad = torch.stack(input_pad_list) | |||
input_pad = input_pad.view(batch_size*N, self.sent_max_len, -1) | |||
enc_sent_len = enc_sent_len.view(-1) # [batch_size*N] | |||
enc_embed_input = self.word_embedding_proj(input_pad) # [batch_size * N, L, D] | |||
sent_pos_list = [] | |||
for sentlen in enc_sent_len: | |||
sent_pos = list(range(1, min(self.sent_max_len, sentlen) + 1)) | |||
for k in range(self.sent_max_len - sentlen): | |||
sent_pos.append(0) | |||
sent_pos_list.append(sent_pos) | |||
input_pos = torch.Tensor(sent_pos_list).long() | |||
if self._hps.cuda: | |||
input_pos = input_pos.cuda() | |||
enc_pos_embed_input = self.position_embedding(input_pos.long())  # [batch_size*N, L, D]
enc_conv_input = enc_embed_input + enc_pos_embed_input | |||
enc_conv_input = enc_conv_input.unsqueeze(1) # (batch * N,Ci,L,D) | |||
enc_conv_output = [F.relu(conv(enc_conv_input)).squeeze(3) for conv in self.convs] # kernel_sizes * (batch*N, Co, W) | |||
enc_maxpool_output = [F.max_pool1d(x, x.size(2)).squeeze(2) for x in enc_conv_output] # kernel_sizes * (batch*N, Co) | |||
sent_embedding = torch.cat(enc_maxpool_output, 1) # (batch*N, Co * kernel_sizes) | |||
sent_embedding = sent_embedding.view(batch_size, N, -1) | |||
return sent_embedding | |||
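`BertEncoder.forward` cuts the flat BERT output of a document back into sentences using the per-sentence lengths. The sketch below reproduces only the index bookkeeping of that loop. `sentence_spans` is a hypothetical helper, and reading position 0 as a leading special token such as `[CLS]` is an assumption, since the source only shows that slicing starts at 1 and stops at 511.

```python
def sentence_spans(sent_lens, max_pos=511):
    """Return (start, end) slices per sentence, or None where padding is used."""
    spans = []
    last_end = 1  # position 0 is skipped (assumed to hold a special token)
    for length in sent_lens:
        if length != 0 and last_end < max_pos:
            # Clip the slice so it never runs past max_pos.
            spans.append((last_end, min(max_pos, last_end + length)))
            last_end += length
        else:
            # Empty sentence, or no room left: the model inserts a zero tensor.
            spans.append(None)
    return spans


short_doc = sentence_spans([3, 2, 0])
long_doc = sentence_spans([500, 20, 5])
```

In the long document the second sentence is truncated to 10 tokens and the third is replaced by padding, because the 511-position budget is already used up.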
class BertTagEncoder(BertEncoder): | |||
def __init__(self, hps, domaindict): | |||
super(BertTagEncoder, self).__init__(hps) | |||
# domain embedding | |||
self.domain_embedding = nn.Embedding(domaindict.size(), hps.domain_emb_dim) | |||
self.domain_embedding.weight.requires_grad = True | |||
def forward(self, inputs, input_masks, enc_sent_len, domain): | |||
sent_embedding = super().forward(inputs, input_masks, enc_sent_len) | |||
batch_size, N = enc_sent_len.size() | |||
enc_domain_input = self.domain_embedding(domain) # [batch, D] | |||
enc_domain_input = enc_domain_input.unsqueeze(1).expand(batch_size, N, -1) # [batch, N, D] | |||
sent_embedding = torch.cat((sent_embedding, enc_domain_input), dim=2) | |||
return sent_embedding | |||
class ELMoEndoer(nn.Module): | |||
def __init__(self, hps): | |||
super(ELMoEndoer, self).__init__() | |||
self._hps = hps | |||
self.sent_max_len = hps.sent_max_len | |||
from allennlp.modules.elmo import Elmo | |||
elmo_dim = 1024 | |||
options_file = "/remote-home/dqwang/ELMo/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json" | |||
weight_file = "/remote-home/dqwang/ELMo/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5" | |||
# elmo_dim = 512 | |||
# options_file = "/remote-home/dqwang/ELMo/elmo_2x2048_256_2048cnn_1xhighway_options.json" | |||
# weight_file = "/remote-home/dqwang/ELMo/elmo_2x2048_256_2048cnn_1xhighway_weights.hdf5" | |||
embed_size = hps.word_emb_dim | |||
sent_max_len = hps.sent_max_len | |||
input_channels = 1 | |||
out_channels = hps.output_channel | |||
min_kernel_size = hps.min_kernel_size | |||
max_kernel_size = hps.max_kernel_size | |||
width = embed_size | |||
# elmo embedding | |||
self.elmo = Elmo(options_file, weight_file, 1, dropout=0) | |||
self.embed_proj = nn.Linear(elmo_dim, embed_size) | |||
# position embedding | |||
self.position_embedding = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(sent_max_len + 1, embed_size, padding_idx=0), freeze=True) | |||
# cnn | |||
self.convs = nn.ModuleList([nn.Conv2d(input_channels, out_channels, kernel_size = (height, width)) for height in range(min_kernel_size, max_kernel_size+1)]) | |||
logger.info("[INFO] Initing W for CNN.......") | |||
for conv in self.convs: | |||
init_weight_value = 6.0 | |||
init.xavier_normal_(conv.weight.data, gain=np.sqrt(init_weight_value)) | |||
fan_in, fan_out = Encoder.calculate_fan_in_and_fan_out(conv.weight.data) | |||
std = np.sqrt(init_weight_value) * np.sqrt(2.0 / (fan_in + fan_out)) | |||
@staticmethod
def calculate_fan_in_and_fan_out(tensor):
dimensions = tensor.ndimension() | |||
if dimensions < 2: | |||
logger.error("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
raise ValueError("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
if dimensions == 2: # Linear | |||
fan_in = tensor.size(1) | |||
fan_out = tensor.size(0) | |||
else: | |||
num_input_fmaps = tensor.size(1) | |||
num_output_fmaps = tensor.size(0) | |||
receptive_field_size = 1 | |||
if tensor.dim() > 2: | |||
receptive_field_size = tensor[0][0].numel() | |||
fan_in = num_input_fmaps * receptive_field_size | |||
fan_out = num_output_fmaps * receptive_field_size | |||
return fan_in, fan_out | |||
def forward(self, input): | |||
# input: a batch of Example object [batch_size, N, seq_len, character_len] | |||
batch_size, N, seq_len, _ = input.size() | |||
input = input.view(batch_size * N, seq_len, -1) # [batch_size*N, seq_len, character_len] | |||
input_sent_len = ((input.sum(-1)!=0).sum(dim=1)).int()  # [batch_size*N]
logger.debug(input_sent_len.view(batch_size, -1)) | |||
enc_embed_input = self.elmo(input)['elmo_representations'][0] # [batch_size*N, L, D] | |||
enc_embed_input = self.embed_proj(enc_embed_input) | |||
# input_pos = torch.Tensor([np.hstack((np.arange(1, sentlen + 1), np.zeros(self.sent_max_len - sentlen))) for sentlen in input_sent_len]) | |||
sent_pos_list = [] | |||
for sentlen in input_sent_len: | |||
sent_pos = list(range(1, min(self.sent_max_len, sentlen) + 1)) | |||
for k in range(self.sent_max_len - sentlen): | |||
sent_pos.append(0) | |||
sent_pos_list.append(sent_pos) | |||
input_pos = torch.Tensor(sent_pos_list).long() | |||
if self._hps.cuda: | |||
input_pos = input_pos.cuda() | |||
enc_pos_embed_input = self.position_embedding(input_pos.long())  # [batch_size*N, L, D]
enc_conv_input = enc_embed_input + enc_pos_embed_input | |||
enc_conv_input = enc_conv_input.unsqueeze(1) # (batch * N,Ci,L,D) | |||
enc_conv_output = [F.relu(conv(enc_conv_input)).squeeze(3) for conv in self.convs] # kernel_sizes * (batch*N, Co, W) | |||
enc_maxpool_output = [F.max_pool1d(x, x.size(2)).squeeze(2) for x in enc_conv_output] # kernel_sizes * (batch*N, Co) | |||
sent_embedding = torch.cat(enc_maxpool_output, 1) # (batch*N, Co * kernel_sizes) | |||
sent_embedding = sent_embedding.view(batch_size, N, -1) | |||
return sent_embedding | |||
class ELMoEndoer2(nn.Module): | |||
def __init__(self, hps): | |||
super(ELMoEndoer2, self).__init__() | |||
self._hps = hps | |||
self._cuda = hps.cuda | |||
self.sent_max_len = hps.sent_max_len | |||
from allennlp.modules.elmo import Elmo | |||
elmo_dim = 1024 | |||
options_file = "/remote-home/dqwang/ELMo/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json" | |||
weight_file = "/remote-home/dqwang/ELMo/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5" | |||
# elmo_dim = 512 | |||
# options_file = "/remote-home/dqwang/ELMo/elmo_2x2048_256_2048cnn_1xhighway_options.json" | |||
# weight_file = "/remote-home/dqwang/ELMo/elmo_2x2048_256_2048cnn_1xhighway_weights.hdf5" | |||
embed_size = hps.word_emb_dim | |||
sent_max_len = hps.sent_max_len | |||
input_channels = 1 | |||
out_channels = hps.output_channel | |||
min_kernel_size = hps.min_kernel_size | |||
max_kernel_size = hps.max_kernel_size | |||
width = embed_size | |||
# elmo embedding | |||
self.elmo = Elmo(options_file, weight_file, 1, dropout=0) | |||
self.embed_proj = nn.Linear(elmo_dim, embed_size) | |||
# position embedding | |||
self.position_embedding = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(sent_max_len + 1, embed_size, padding_idx=0), freeze=True) | |||
# cnn | |||
self.convs = nn.ModuleList([nn.Conv2d(input_channels, out_channels, kernel_size = (height, width)) for height in range(min_kernel_size, max_kernel_size+1)]) | |||
logger.info("[INFO] Initing W for CNN.......") | |||
for conv in self.convs: | |||
init_weight_value = 6.0 | |||
init.xavier_normal_(conv.weight.data, gain=np.sqrt(init_weight_value)) | |||
fan_in, fan_out = Encoder.calculate_fan_in_and_fan_out(conv.weight.data) | |||
std = np.sqrt(init_weight_value) * np.sqrt(2.0 / (fan_in + fan_out)) | |||
def calculate_fan_in_and_fan_out(tensor): | |||
dimensions = tensor.ndimension() | |||
if dimensions < 2: | |||
logger.error("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
raise ValueError("[Error] Fan in and fan out can not be computed for tensor with less than 2 dimensions") | |||
if dimensions == 2: # Linear | |||
fan_in = tensor.size(1) | |||
fan_out = tensor.size(0) | |||
else: | |||
num_input_fmaps = tensor.size(1) | |||
num_output_fmaps = tensor.size(0) | |||
receptive_field_size = 1 | |||
if tensor.dim() > 2: | |||
receptive_field_size = tensor[0][0].numel() | |||
fan_in = num_input_fmaps * receptive_field_size | |||
fan_out = num_output_fmaps * receptive_field_size | |||
return fan_in, fan_out | |||
def pad_encoder_input(self, input_list): | |||
""" | |||
:param input_list: N [seq_len, hidden_state] | |||
:return: enc_sent_input_pad: list, N [max_len, hidden_state] | |||
""" | |||
max_len = self.sent_max_len | |||
enc_sent_input_pad = [] | |||
_, hidden_size = input_list[0].size() | |||
for i in range(len(input_list)): | |||
article_words = input_list[i] # [seq_len, hidden_size] | |||
seq_len = article_words.size(0) | |||
if seq_len > max_len: | |||
pad_words = article_words[:max_len, :] | |||
else: | |||
pad_tensor = torch.zeros(max_len - seq_len, hidden_size).cuda() if self._cuda else torch.zeros(max_len - seq_len, hidden_size) | |||
pad_words = torch.cat([article_words, pad_tensor], dim=0) | |||
enc_sent_input_pad.append(pad_words) | |||
return enc_sent_input_pad | |||
def forward(self, inputs, input_masks, enc_sent_len): | |||
""" | |||
:param inputs: a batch of Example object [batch_size, doc_len=512, character_len=50] | |||
:param input_masks: 0 or 1, [batch, doc_len=512] | |||
:param enc_sent_len: sentence original length [batch, N] | |||
:return: | |||
sent_embedding: [batch, N, D] | |||
""" | |||
# Use ELMo to get word embedding | |||
batch_size, N = enc_sent_len.size() | |||
input_pad_list = [] | |||
elmo_output = self.elmo(inputs)['elmo_representations'][0] # [batch_size, 512, D] | |||
elmo_output = elmo_output * input_masks.unsqueeze(-1).float() | |||
# print("END elmo") | |||
for i in range(batch_size): | |||
sent_len = enc_sent_len[i] # [1, N] | |||
out = elmo_output[i] | |||
_, hidden_size = out.size() | |||
# restore the sentence | |||
last_end = 0 | |||
enc_sent_input = [] | |||
for length in sent_len: | |||
if length != 0 and last_end < 512: | |||
enc_sent_input.append(out[last_end : min(512, last_end + length), :]) | |||
last_end += length | |||
else: | |||
pad_tensor = torch.zeros(self.sent_max_len, hidden_size).cuda() if self._hps.cuda else torch.zeros(self.sent_max_len, hidden_size) | |||
enc_sent_input.append(pad_tensor) | |||
# pad the sentence | |||
enc_sent_input_pad = self.pad_encoder_input(enc_sent_input) # N * [max_len, hidden_state=elmo_dim] | |||
input_pad_list.append(torch.stack(enc_sent_input_pad)) # batch * [N, max_len, hidden_state] | |||
input_pad = torch.stack(input_pad_list) | |||
input_pad = input_pad.view(batch_size * N, self.sent_max_len, -1) | |||
enc_sent_len = enc_sent_len.view(-1) # [batch_size*N] | |||
enc_embed_input = self.embed_proj(input_pad) # [batch_size * N, L, D] | |||
# input_pos = torch.Tensor([np.hstack((np.arange(1, sentlen + 1), np.zeros(self.sent_max_len - sentlen))) for sentlen in input_sent_len]) | |||
sent_pos_list = [] | |||
for sentlen in enc_sent_len: | |||
sent_pos = list(range(1, min(self.sent_max_len, sentlen) + 1)) | |||
for k in range(self.sent_max_len - sentlen): | |||
sent_pos.append(0) | |||
sent_pos_list.append(sent_pos) | |||
input_pos = torch.Tensor(sent_pos_list).long() | |||
if self._hps.cuda: | |||
input_pos = input_pos.cuda() | |||
enc_pos_embed_input = self.position_embedding(input_pos.long()) # [batch_size*N, L, D] | |||
enc_conv_input = enc_embed_input + enc_pos_embed_input | |||
enc_conv_input = enc_conv_input.unsqueeze(1) # (batch * N,Ci,L,D) | |||
enc_conv_output = [F.relu(conv(enc_conv_input)).squeeze(3) for conv in self.convs] # kernel_sizes * (batch*N, Co, W) | |||
enc_maxpool_output = [F.max_pool1d(x, x.size(2)).squeeze(2) for x in enc_conv_output] # kernel_sizes * (batch*N, Co) | |||
sent_embedding = torch.cat(enc_maxpool_output, 1) # (batch*N, Co * kernel_sizes) | |||
sent_embedding = sent_embedding.view(batch_size, N, -1) | |||
return sent_embedding |
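The position indices built inside `forward` can be illustrated in isolation. The following is a minimal sketch in plain Python, assuming sentence lengths arrive as ordinary integers; `build_position_ids` is a hypothetical helper name and not part of the original module. Real tokens get positions 1..n (capped at `sent_max_len`), and padding slots get 0 so they map to the zeroed padding row of the position table.

```python
def build_position_ids(sent_lens, sent_max_len):
    """Return one row of position ids per sentence (hypothetical helper)."""
    rows = []
    for n in sent_lens:
        n = min(n, sent_max_len)                 # truncate overly long sentences
        positions = list(range(1, n + 1))        # real tokens: 1..n
        positions += [0] * (sent_max_len - n)    # padding slots: 0
        rows.append(positions)
    return rows


position_ids = build_position_ids([3, 0, 7], sent_max_len=5)
```

An empty sentence yields an all-zero row, which matches how padded sentences are treated by the encoder above.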
@@ -0,0 +1,41 @@ | |||
#!/usr/bin/python | |||
# -*- coding: utf-8 -*- | |||
# __author__="Danqing Wang" | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================== | |||
import torch | |||
import numpy as np | |||
def get_sinusoid_encoding_table(n_position, d_hid, padding_idx=None): | |||
''' Sinusoid position encoding table ''' | |||
def cal_angle(position, hid_idx): | |||
return position / np.power(10000, 2 * (hid_idx // 2) / d_hid) | |||
def get_posi_angle_vec(position): | |||
return [cal_angle(position, hid_j) for hid_j in range(d_hid)] | |||
sinusoid_table = np.array([get_posi_angle_vec(pos_i) for pos_i in range(n_position)]) | |||
sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i | |||
sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1 | |||
if padding_idx is not None: | |||
# zero vector for padding dimension | |||
sinusoid_table[padding_idx] = 0. | |||
return torch.FloatTensor(sinusoid_table) |
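For readers who want to check the table values by hand, here is a dependency-free sketch of the same sinusoid formula using only the standard `math` module. The name `sinusoid_table` is an assumption made for this illustration; it returns nested lists rather than a tensor.

```python
import math


def sinusoid_table(n_position, d_hid, padding_idx=None):
    """Pure-Python sketch of the sinusoid position table (illustrative only)."""
    table = []
    for pos in range(n_position):
        row = []
        for j in range(d_hid):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** (2 * (j // 2) / d_hid))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    if padding_idx is not None:
        table[padding_idx] = [0.0] * d_hid  # zero vector for the padding position
    return table


table = sinusoid_table(4, 6, padding_idx=0)
```

With `padding_idx=0`, row 0 is all zeros, so padded positions contribute nothing when the position embedding is added to the word embedding.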
@@ -0,0 +1 @@ | |||
@@ -0,0 +1,479 @@ | |||
#!/usr/bin/python | |||
# -*- coding: utf-8 -*- | |||
# __author__="Danqing Wang" | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================== | |||
"""This file contains code to read the train/eval/test data from file and process it, and read the vocab data from file and process it""" | |||
import os | |||
import re | |||
import glob | |||
import copy | |||
import random | |||
import json | |||
import collections | |||
from itertools import combinations | |||
import numpy as np | |||
from random import shuffle | |||
import torch.utils.data | |||
import time | |||
import pickle | |||
from nltk.tokenize import sent_tokenize | |||
import utils | |||
from logger import * | |||
# <s> and </s> are used in the data files to segment the abstracts into sentences. They don't receive vocab ids. | |||
SENTENCE_START = '<s>' | |||
SENTENCE_END = '</s>' | |||
PAD_TOKEN = '[PAD]' # This has a vocab id, which is used to pad the encoder input, decoder input and target sequence | |||
UNKNOWN_TOKEN = '[UNK]' # This has a vocab id, which is used to represent out-of-vocabulary words | |||
START_DECODING = '[START]' # This has a vocab id, which is used at the start of every decoder input sequence | |||
STOP_DECODING = '[STOP]' # This has a vocab id, which is used at the end of untruncated target sequences | |||
# Note: none of <s>, </s>, [PAD], [UNK], [START], [STOP] should appear in the vocab file. | |||
class Vocab(object): | |||
"""Vocabulary class for mapping between words and ids (integers)""" | |||
def __init__(self, vocab_file, max_size): | |||
""" | |||
Creates a vocab of up to max_size words, reading from the vocab_file. If max_size is 0, reads the entire vocab file. | |||
:param vocab_file: string; path to the vocab file, which is assumed to contain "<word> <frequency>" on each line, sorted with most frequent word first. This code doesn't actually use the frequencies, though. | |||
:param max_size: int; The maximum size of the resulting Vocabulary. | |||
""" | |||
self._word_to_id = {} | |||
self._id_to_word = {} | |||
self._count = 0 # keeps track of total number of words in the Vocab | |||
# [PAD], [UNK], [START] and [STOP] get the ids 0,1,2,3. | |||
for w in [PAD_TOKEN, UNKNOWN_TOKEN, START_DECODING, STOP_DECODING]: | |||
self._word_to_id[w] = self._count | |||
self._id_to_word[self._count] = w | |||
self._count += 1 | |||
# Read the vocab file and add words up to max_size | |||
with open(vocab_file, 'r', encoding='utf8') as vocab_f: # read as UTF-8 explicitly to avoid decoding errors | |||
cnt = 0 | |||
for line in vocab_f: | |||
cnt += 1 | |||
pieces = line.split("\t") | |||
# pieces = line.split() | |||
w = pieces[0] | |||
# print(w) | |||
if w in [SENTENCE_START, SENTENCE_END, UNKNOWN_TOKEN, PAD_TOKEN, START_DECODING, STOP_DECODING]: | |||
raise Exception('<s>, </s>, [UNK], [PAD], [START] and [STOP] shouldn\'t be in the vocab file, but %s is' % w) | |||
if w in self._word_to_id: | |||
logger.error('Duplicated word in vocabulary file Line %d : %s' % (cnt, w)) | |||
continue | |||
self._word_to_id[w] = self._count | |||
self._id_to_word[self._count] = w | |||
self._count += 1 | |||
if max_size != 0 and self._count >= max_size: | |||
logger.info("[INFO] max_size of vocab was specified as %i; we now have %i words. Stopping reading." % (max_size, self._count)) | |||
break | |||
logger.info("[INFO] Finished constructing vocabulary of %i total words. Last word added: %s", self._count, self._id_to_word[self._count-1]) | |||
def word2id(self, word): | |||
"""Returns the id (integer) of a word (string). Returns [UNK] id if word is OOV.""" | |||
if word not in self._word_to_id: | |||
return self._word_to_id[UNKNOWN_TOKEN] | |||
return self._word_to_id[word] | |||
def id2word(self, word_id): | |||
"""Returns the word (string) corresponding to an id (integer).""" | |||
if word_id not in self._id_to_word: | |||
raise ValueError('Id not found in vocab: %d' % word_id) | |||
return self._id_to_word[word_id] | |||
def size(self): | |||
"""Returns the total size of the vocabulary""" | |||
return self._count | |||
def word_list(self): | |||
"""Return the word list of the vocabulary""" | |||
return self._word_to_id.keys() | |||
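The id assignment performed by `Vocab` can be sketched without touching the file system. This is a simplified illustration under stated assumptions: `build_vocab` and `lookup` are hypothetical names, the input is an in-memory list of tab-separated lines, and duplicates are silently skipped instead of being logged. Unlike the class above, it does not raise when a reserved token appears in the input.

```python
PAD_TOKEN, UNKNOWN_TOKEN = "[PAD]", "[UNK]"
START_DECODING, STOP_DECODING = "[START]", "[STOP]"


def build_vocab(lines, max_size=0):
    """Assign ids in reading order; reserved tokens take ids 0-3."""
    word_to_id = {}
    for w in [PAD_TOKEN, UNKNOWN_TOKEN, START_DECODING, STOP_DECODING]:
        word_to_id[w] = len(word_to_id)
    for line in lines:
        w = line.split("\t")[0]
        if w in word_to_id:          # skip duplicates (and reserved tokens)
            continue
        word_to_id[w] = len(word_to_id)
        if max_size != 0 and len(word_to_id) >= max_size:
            break
    return word_to_id


def lookup(word_to_id, word):
    """Out-of-vocabulary words fall back to the [UNK] id."""
    return word_to_id.get(word, word_to_id[UNKNOWN_TOKEN])


vocab = build_vocab(["the\t100", "cat\t7", "the\t3"])
```

Because the reserved tokens are inserted first, `[PAD]` is always id 0, which is what the padding code elsewhere in this file relies on.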
class Word_Embedding(object): | |||
def __init__(self, path, vocab): | |||
""" | |||
:param path: string; the path of word embedding | |||
:param vocab: object; | |||
""" | |||
logger.info("[INFO] Loading external word embedding...") | |||
self._path = path | |||
self._vocablist = vocab.word_list() | |||
self._vocab = vocab | |||
def load_my_vecs(self, k=200): | |||
"""Load word embedding""" | |||
word_vecs = {} | |||
with open(self._path, encoding="utf-8") as f: | |||
count = 0 | |||
lines = f.readlines()[1:] | |||
for line in lines: | |||
values = line.split(" ") | |||
word = values[0] | |||
count += 1 | |||
if word in self._vocablist: # keep only words that are in the vocab | |||
vector = [] | |||
for count, val in enumerate(values): | |||
if count == 0: | |||
continue | |||
if count <= k: | |||
vector.append(float(val)) | |||
word_vecs[word] = vector | |||
return word_vecs | |||
def add_unknown_words_by_zero(self, word_vecs, k=200): | |||
"""Solve unknown by zeros""" | |||
zero = [0.0] * k | |||
list_word2vec = [] | |||
oov = 0 | |||
iov = 0 | |||
for i in range(self._vocab.size()): | |||
word = self._vocab.id2word(i) | |||
if word not in word_vecs: | |||
oov += 1 | |||
word_vecs[word] = zero | |||
list_word2vec.append(word_vecs[word]) | |||
else: | |||
iov += 1 | |||
list_word2vec.append(word_vecs[word]) | |||
logger.info("[INFO] oov count %d, iov count %d", oov, iov) | |||
return list_word2vec | |||
def add_unknown_words_by_avg(self, word_vecs, k=200): | |||
"""Solve unknown by avg word embedding""" | |||
# unknown words are replaced by the average of the known word vectors | |||
word_vecs_numpy = [] | |||
for word in self._vocablist: | |||
if word in word_vecs: | |||
word_vecs_numpy.append(word_vecs[word]) | |||
col = [] | |||
for i in range(k): | |||
sum = 0.0 | |||
for j in range(int(len(word_vecs_numpy))): | |||
sum += word_vecs_numpy[j][i] | |||
sum = round(sum, 6) | |||
col.append(sum) | |||
zero = [] | |||
for m in range(k): | |||
avg = col[m] / int(len(word_vecs_numpy)) | |||
avg = round(avg, 6) | |||
zero.append(float(avg)) | |||
list_word2vec = [] | |||
oov = 0 | |||
iov = 0 | |||
for i in range(self._vocab.size()): | |||
word = self._vocab.id2word(i) | |||
if word not in word_vecs: | |||
oov += 1 | |||
word_vecs[word] = zero | |||
list_word2vec.append(word_vecs[word]) | |||
else: | |||
iov += 1 | |||
list_word2vec.append(word_vecs[word]) | |||
logger.info("[INFO] External Word Embedding iov count: %d, oov count: %d", iov, oov) | |||
return list_word2vec | |||
def add_unknown_words_by_uniform(self, word_vecs, uniform=0.25, k=200): | |||
"""Solve unknown word by uniform(-0.25,0.25)""" | |||
list_word2vec = [] | |||
oov = 0 | |||
iov = 0 | |||
for i in range(self._vocab.size()): | |||
word = self._vocab.id2word(i) | |||
if word not in word_vecs: | |||
oov += 1 | |||
word_vecs[word] = np.random.uniform(-1 * uniform, uniform, k).round(6).tolist() | |||
list_word2vec.append(word_vecs[word]) | |||
else: | |||
iov += 1 | |||
list_word2vec.append(word_vecs[word]) | |||
logger.info("[INFO] oov count %d, iov count %d", oov, iov) | |||
return list_word2vec | |||
# load word embedding | |||
def load_my_vecs_freq1(self, freqs, pro): | |||
word_vecs = {} | |||
with open(self._path, encoding="utf-8") as f: | |||
freq = 0 | |||
lines = f.readlines()[1:] | |||
for line in lines: | |||
values = line.split(" ") | |||
word = values[0] | |||
if word in self._vocablist: # keep only words that are in the vocab | |||
if freqs[word] == 1: | |||
a = np.random.uniform(0, 1, 1).round(2) | |||
if pro < a: | |||
continue | |||
vector = [] | |||
for count, val in enumerate(values): | |||
if count == 0: | |||
continue | |||
vector.append(float(val)) | |||
word_vecs[word] = vector | |||
return word_vecs | |||
class DomainDict(object): | |||
"""Domain embedding for Newsroom""" | |||
def __init__(self, path): | |||
self.domain_list = self.readDomainlist(path) | |||
# self.domain_list = ["foxnews.com", "cnn.com", "mashable.com", "nytimes.com", "washingtonpost.com"] | |||
self.domain_number = len(self.domain_list) | |||
self._domain_to_id = {} | |||
self._id_to_domain = {} | |||
self._cnt = 0 | |||
self._domain_to_id["X"] = self._cnt | |||
self._id_to_domain[self._cnt] = "X" | |||
self._cnt += 1 | |||
for i in range(self.domain_number): | |||
domain = self.domain_list[i] | |||
self._domain_to_id[domain] = self._cnt | |||
self._id_to_domain[self._cnt] = domain | |||
self._cnt += 1 | |||
def readDomainlist(self, path): | |||
domain_list = [] | |||
with open(path) as f: | |||
for line in f: | |||
domain_list.append(line.split("\t")[0].strip()) | |||
logger.info(domain_list) | |||
return domain_list | |||
def domain2id(self, domain): | |||
""" Returns the id (integer) of a domain (string). Returns the id of "X" for an unknown domain. | |||
:param domain: string | |||
:return: id; int | |||
""" | |||
if domain in self.domain_list: | |||
return self._domain_to_id[domain] | |||
else: | |||
logger.info(domain) | |||
return self._domain_to_id["X"] | |||
def id2domain(self, domain_id): | |||
""" Returns the domain (string) corresponding to an id (integer). | |||
:param domain_id: int; | |||
:return: domain: string | |||
""" | |||
if domain_id not in self._id_to_domain: | |||
raise ValueError('Id not found in DomainDict: %d' % domain_id) | |||
return self._id_to_domain[domain_id] | |||
def size(self): | |||
return self._cnt | |||
class Example(object): | |||
"""Class representing a train/val/test example for text summarization.""" | |||
def __init__(self, article_sents, abstract_sents, vocab, sent_max_len, label, domainid=None): | |||
""" Initializes the Example, performing tokenization and truncation to produce the encoder, decoder and target sequences, which are stored in self. | |||
:param article_sents: list of strings; one per article sentence. each token is separated by a single space. | |||
:param abstract_sents: list of strings; one per abstract sentence. In each sentence, each token is separated by a single space. | |||
:param domainid: int; publication of the example | |||
:param vocab: Vocabulary object | |||
:param sent_max_len: int; the maximum length of each sentence, padding all sentences to this length | |||
:param label: list of int; the index of selected sentences | |||
""" | |||
self.sent_max_len = sent_max_len | |||
self.enc_sent_len = [] | |||
self.enc_sent_input = [] | |||
self.enc_sent_input_pad = [] | |||
# origin_cnt = len(article_sents) | |||
# article_sents = [re.sub(r"\n+\t+", " ", sent) for sent in article_sents] | |||
# assert origin_cnt == len(article_sents) | |||
# Process the article | |||
for sent in article_sents: | |||
article_words = sent.split() | |||
self.enc_sent_len.append(len(article_words)) # store the original length, before truncation and padding | |||
self.enc_sent_input.append([vocab.word2id(w) for w in article_words]) # list of word ids; OOVs are represented by the id for UNK token | |||
self._pad_encoder_input(vocab.word2id('[PAD]')) | |||
# Store the original strings | |||
self.original_article = " ".join(article_sents) | |||
self.original_article_sents = article_sents | |||
if isinstance(abstract_sents[0], list): | |||
logger.debug("[INFO] Multi Reference summaries!") | |||
self.original_abstract_sents = [] | |||
self.original_abstract = [] | |||
for summary in abstract_sents: | |||
self.original_abstract_sents.append([sent.strip() for sent in summary]) | |||
self.original_abstract.append("\n".join([sent.replace("\n", "") for sent in summary])) | |||
else: | |||
self.original_abstract_sents = [sent.replace("\n", "") for sent in abstract_sents] | |||
self.original_abstract = "\n".join(self.original_abstract_sents) | |||
# Store the label | |||
self.label = np.zeros(len(article_sents), dtype=int) | |||
if label != []: | |||
self.label[np.array(label)] = 1 | |||
self.label = list(self.label) | |||
# Store the publication | |||
if domainid is not None: | |||
if domainid == 0: | |||
logger.debug("domain id = 0!") | |||
self.domain = domainid | |||
def _pad_encoder_input(self, pad_id): | |||
""" | |||
:param pad_id: int; token pad id | |||
:return: | |||
""" | |||
max_len = self.sent_max_len | |||
for i in range(len(self.enc_sent_input)): | |||
article_words = self.enc_sent_input[i] | |||
if len(article_words) > max_len: | |||
article_words = article_words[:max_len] | |||
while len(article_words) < max_len: | |||
article_words.append(pad_id) | |||
self.enc_sent_input_pad.append(article_words) | |||
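The truncate-then-pad step in `_pad_encoder_input` can be sketched as a standalone function. `pad_sentences` is a hypothetical name used only for this illustration. One deliberate difference: this version copies each sentence before padding, whereas the method above appends to the stored list in place when no truncation is needed.

```python
def pad_sentences(sentences, max_len, pad_id):
    """Truncate each id list to max_len, then right-pad with pad_id."""
    padded = []
    for ids in sentences:
        ids = list(ids[:max_len])                    # copy and truncate
        ids.extend([pad_id] * (max_len - len(ids)))  # pad up to max_len
        padded.append(ids)
    return padded


original = [[5, 6], [7, 8, 9, 10]]
padded = pad_sentences(original, max_len=3, pad_id=0)
```

Copying keeps the unpadded id lists intact, which can matter if they are inspected later.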
class ExampleSet(torch.utils.data.Dataset): | |||
""" Dataset of Example objects """ | |||
def __init__(self, data_path, vocab, doc_max_timesteps, sent_max_len, domaindict=None, randomX=False, usetag=False): | |||
""" Initializes the ExampleSet with the path of data | |||
:param data_path: string; the path of data | |||
:param vocab: object; | |||
:param doc_max_timesteps: int; the maximum sentence number of a document, each example should pad sentences to this length | |||
:param sent_max_len: int; the maximum token number of a sentence, each sentence should pad tokens to this length | |||
:param domaindict: object; the domain dict to embed domain | |||
""" | |||
self.domaindict = domaindict | |||
if domaindict: | |||
logger.info("[INFO] Use domain information in the dataset!") | |||
if randomX==True: | |||
logger.info("[INFO] Randomly assign some examples to the unknown domain X!") | |||
self.randomP = 0.1 | |||
logger.info("[INFO] Start reading ExampleSet") | |||
start = time.time() | |||
self.example_list = [] | |||
self.doc_max_timesteps = doc_max_timesteps | |||
cnt = 0 | |||
with open(data_path, 'r') as reader: | |||
for line in reader: | |||
try: | |||
e = json.loads(line) | |||
article_sent = e['text'] | |||
tag = e["tag"][0] if usetag else e['publication'] | |||
# logger.info(tag) | |||
if "duc" in data_path: | |||
abstract_sent = e["summaryList"] if "summaryList" in e.keys() else [e['summary']] | |||
else: | |||
abstract_sent = e['summary'] | |||
if domaindict: | |||
if randomX == True: | |||
p = np.random.rand() | |||
if p <= self.randomP: | |||
domainid = domaindict.domain2id("X") | |||
else: | |||
domainid = domaindict.domain2id(tag) | |||
else: | |||
domainid = domaindict.domain2id(tag) | |||
else: | |||
domainid = None | |||
logger.debug((tag, domainid)) | |||
except (ValueError,EOFError) as e : | |||
logger.debug(e) | |||
break | |||
else: | |||
example = Example(article_sent, abstract_sent, vocab, sent_max_len, e["label"], domainid) # Process into an Example. | |||
self.example_list.append(example) | |||
cnt += 1 | |||
# print(cnt) | |||
logger.info("[INFO] Finish reading ExampleSet. Total time is %f, Total size is %d", time.time() - start, len(self.example_list)) | |||
self.size = len(self.example_list) | |||
# self.example_list.sort(key=lambda ex: ex.domain) | |||
def get_example(self, index): | |||
return self.example_list[index] | |||
def __getitem__(self, index): | |||
""" | |||
:param index: int; the index of the example | |||
:return | |||
input_pad: [N, seq_len] | |||
label: [N] | |||
input_mask: [N] | |||
domain: [1] | |||
""" | |||
item = self.example_list[index] | |||
input = np.array(item.enc_sent_input_pad) | |||
label = np.array(item.label, dtype=int) | |||
# pad input to doc_max_timesteps | |||
if len(input) < self.doc_max_timesteps: | |||
pad_number = self.doc_max_timesteps - len(input) | |||
pad_matrix = np.zeros((pad_number, len(input[0]))) | |||
input_pad = np.vstack((input, pad_matrix)) | |||
label = np.append(label, np.zeros(pad_number, dtype=int)) | |||
input_mask = np.append(np.ones(len(input)), np.zeros(pad_number)) | |||
else: | |||
input_pad = input[:self.doc_max_timesteps] | |||
label = label[:self.doc_max_timesteps] | |||
input_mask = np.ones(self.doc_max_timesteps) | |||
if self.domaindict: | |||
return torch.from_numpy(input_pad).long(), torch.from_numpy(label).long(), torch.from_numpy(input_mask).long(), item.domain | |||
return torch.from_numpy(input_pad).long(), torch.from_numpy(label).long(), torch.from_numpy(input_mask).long() | |||
def __len__(self): | |||
return self.size | |||
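The document-level padding in `__getitem__` produces three aligned sequences: padded sentence rows, padded labels, and a mask marking real sentences. A minimal list-based sketch follows, assuming plain Python lists instead of NumPy arrays; `pad_document` is a hypothetical helper name.

```python
def pad_document(sent_rows, labels, doc_max_timesteps):
    """Pad or truncate a document to exactly doc_max_timesteps sentences."""
    n = len(sent_rows)
    if n < doc_max_timesteps:
        pad = doc_max_timesteps - n
        width = len(sent_rows[0])
        rows = sent_rows + [[0] * width for _ in range(pad)]  # all-zero sentences
        labels = labels + [0] * pad
        mask = [1] * n + [0] * pad                            # 1 = real sentence
    else:
        rows = sent_rows[:doc_max_timesteps]
        labels = labels[:doc_max_timesteps]
        mask = [1] * doc_max_timesteps
    return rows, labels, mask


rows, labels, mask = pad_document([[1, 2], [3, 4]], [1, 0], doc_max_timesteps=4)
```

The mask lets later code ignore padded sentences when computing losses or metrics.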
class MultiExampleSet(): | |||
def __init__(self, data_dir, vocab, doc_max_timesteps, sent_max_len, domaindict=None, randomX=False, usetag=False): | |||
self.datasets = [None] * (domaindict.size() - 1) | |||
data_path_list = [os.path.join(data_dir, s) for s in os.listdir(data_dir) if s.endswith("label.jsonl")] | |||
for data_path in data_path_list: | |||
fname = data_path.split("/")[-1] # e.g. cnn.com.label.jsonl | |||
dataname = ".".join(fname.split(".")[:-2]) | |||
domainid = domaindict.domain2id(dataname) | |||
logger.info("[INFO] domain name: %s, domain id: %d" % (dataname, domainid)) | |||
self.datasets[domainid - 1] = ExampleSet(data_path, vocab, doc_max_timesteps, sent_max_len, domaindict, randomX, usetag) | |||
def get(self, id): | |||
return self.datasets[id] | |||
from torch.utils.data.dataloader import default_collate | |||
def my_collate_fn(batch): | |||
''' | |||
:param batch: (input_pad, label, input_mask, domain) | |||
:return: | |||
''' | |||
start_domain = batch[0][-1] | |||
# for i in range(len(batch)): | |||
# print(batch[i][-1], end=',') | |||
batch = list(filter(lambda x: x[-1] == start_domain, batch)) | |||
print("start_domain %d" % start_domain) | |||
print("batch_len %d" % len(batch)) | |||
if len(batch) == 0: return torch.Tensor() | |||
return default_collate(batch) # collate the filtered batch with the default collate function | |||
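The filtering step of `my_collate_fn` keeps only the items that share the domain of the first item in the batch. It can be sketched without PyTorch as below. `filter_same_domain` is a hypothetical name, and the batch items are assumed to be tuples whose last element is the domain id.

```python
def filter_same_domain(batch):
    """Keep only items whose domain (last field) matches the first item's."""
    start_domain = batch[0][-1]
    return [item for item in batch if item[-1] == start_domain]


batch = [("doc_a", 1), ("doc_b", 2), ("doc_c", 1)]
kept = filter_same_domain(batch)
```

A consequence worth noting is that the effective batch size varies from step to step, since items from other domains are dropped rather than deferred.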
@@ -0,0 +1,27 @@ | |||
# -*- coding: utf-8 -*- | |||
import logging | |||
import sys | |||
# Get a logger instance; if no name is given, the root logger is returned | |||
logger = logging.getLogger("Summarization logger") | |||
# logger = logging.getLogger() | |||
# Specify the logger output format | |||
formatter = logging.Formatter('%(asctime)s %(levelname)-8s: %(message)s') | |||
# # File log | |||
# file_handler = logging.FileHandler("test.log") | |||
# file_handler.setFormatter(formatter) # the output format can be set via setFormatter | |||
# Console log | |||
console_handler = logging.StreamHandler(sys.stdout) | |||
console_handler.formatter = formatter # the formatter attribute can also be assigned directly | |||
console_handler.setLevel(logging.INFO) | |||
# Add the log handlers to the logger | |||
# logger.addHandler(file_handler) | |||
logger.addHandler(console_handler) | |||
# Set the minimum log level of the logger; the default is WARN | |||
logger.setLevel(logging.DEBUG) |
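Two levels are set in this configuration: the logger accepts DEBUG and above, while the console handler only emits INFO and above. The sketch below shows the combined effect using an in-memory stream. The logger name `demo_summarization_logger` and the shortened format string are assumptions made for this illustration.

```python
import io
import logging

stream = io.StringIO()
demo_logger = logging.getLogger("demo_summarization_logger")
demo_logger.propagate = False            # keep the demo isolated from the root logger

handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)-8s: %(message)s'))
handler.setLevel(logging.INFO)           # handler threshold: INFO
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.DEBUG)      # logger threshold: DEBUG

demo_logger.debug("passes the logger, dropped by the handler")
demo_logger.info("reaches the stream")
output = stream.getvalue()
```

Under this setup, the `logger.debug(...)` calls elsewhere in the code would not appear on the console unless the handler level is lowered as well.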
@@ -0,0 +1,297 @@ | |||
#!/usr/bin/python | |||
# -*- coding: utf-8 -*- | |||
import re | |||
import os | |||
import shutil | |||
import copy | |||
import datetime | |||
import numpy as np | |||
from rouge import Rouge | |||
from .logger import * | |||
# from data import * | |||
import sys | |||
sys.setrecursionlimit(10000) | |||
REMAP = {"-lrb-": "(", "-rrb-": ")", "-lcb-": "{", "-rcb-": "}", | |||
"-lsb-": "[", "-rsb-": "]", "``": '"', "''": '"'} | |||
def clean(x): | |||
return re.sub( | |||
r"-lrb-|-rrb-|-lcb-|-rcb-|-lsb-|-rsb-|``|''", | |||
lambda m: REMAP.get(m.group()), x) | |||
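As a usage illustration, the sketch below applies the same PTB-token remapping to a short string. It is a variant rather than the original code: the alternation is built from the dictionary keys so the pattern and the table stay in sync. The names `PTB_PATTERN` and `clean_ptb` are assumptions made for this example.

```python
import re

REMAP = {"-lrb-": "(", "-rrb-": ")", "-lcb-": "{", "-rcb-": "}",
         "-lsb-": "[", "-rsb-": "]", "``": '"', "''": '"'}

# Derive the regex from the keys instead of writing the alternation by hand.
PTB_PATTERN = re.compile("|".join(re.escape(key) for key in REMAP))


def clean_ptb(text):
    """Replace PTB bracket and quote tokens with their literal characters."""
    return PTB_PATTERN.sub(lambda m: REMAP[m.group()], text)


cleaned = clean_ptb("-lrb- hello -rrb- ``quoted''")
```

Only the tokens themselves are replaced; surrounding whitespace is left untouched, so `-lrb- hello -rrb-` becomes `( hello )`.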
def rouge_eval(hyps, refer): | |||
rouge = Rouge() | |||
# print(hyps) | |||
# print(refer) | |||
# print(rouge.get_scores(hyps, refer)) | |||
try: | |||
score = rouge.get_scores(hyps, refer)[0] | |||
mean_score = np.mean([score["rouge-1"]["f"], score["rouge-2"]["f"], score["rouge-l"]["f"]]) | |||
except Exception: | |||
mean_score = 0.0 | |||
return mean_score | |||
def rouge_all(hyps, refer): | |||
rouge = Rouge() | |||
score = rouge.get_scores(hyps, refer)[0] | |||
# mean_score = np.mean([score["rouge-1"]["f"], score["rouge-2"]["f"], score["rouge-l"]["f"]]) | |||
return score | |||
def eval_label(match_true, pred, true, total, match): | |||
match_true, pred, true, match = match_true.float(), pred.float(), true.float(), match.float() | |||
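# Bind every metric up front so the return statement cannot hit an unbound name if a division fails | |||
accu, precision, recall, F = 0.0, 0.0, 0.0, 0.0 | |||
try: | |||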
accu = match / total | |||
precision = match_true / pred | |||
recall = match_true / true | |||
F = 2 * precision * recall / (precision + recall) | |||
except ZeroDivisionError: | |||
F = 0.0 | |||
logger.error("[Error] float division by zero") | |||
return accu, precision, recall, F | |||
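The precision, recall and F1 arithmetic in `eval_label` can be sketched with explicit zero guards instead of exception handling. This is an illustrative variant on plain Python numbers, not the tensor-based function above, and `precision_recall_f1` is a hypothetical name. Explicit guards may be preferable with tensors, where dividing by zero typically yields `inf` or `nan` rather than raising `ZeroDivisionError`.

```python
def precision_recall_f1(match_true, pred, true):
    """Compute P, R and F1 from counts, returning 0.0 where a ratio is undefined."""
    precision = match_true / pred if pred else 0.0   # correct selections / all selections
    recall = match_true / true if true else 0.0      # correct selections / all gold labels
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


p, r, f1 = precision_recall_f1(match_true=2, pred=4, true=8)
```

Here 2 of 4 predicted sentences are correct and 2 of 8 gold sentences are recovered, so precision is 0.5 and recall is 0.25.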
def pyrouge_score(hyps, refer, remap = True): | |||
from pyrouge import Rouge155 | |||
nowTime=datetime.datetime.now().strftime('%Y%m%d_%H%M%S') | |||
PYROUGE_ROOT = os.path.join('/remote-home/dqwang/', nowTime) | |||
SYSTEM_PATH = os.path.join(PYROUGE_ROOT,'gold') | |||
MODEL_PATH = os.path.join(PYROUGE_ROOT,'system') | |||
if os.path.exists(SYSTEM_PATH): | |||
shutil.rmtree(SYSTEM_PATH) | |||
os.makedirs(SYSTEM_PATH) | |||
if os.path.exists(MODEL_PATH): | |||
shutil.rmtree(MODEL_PATH) | |||
os.makedirs(MODEL_PATH) | |||
if remap == True: | |||
refer = clean(refer) | |||
hyps = clean(hyps) | |||
system_file = os.path.join(SYSTEM_PATH, 'Reference.0.txt') | |||
model_file = os.path.join(MODEL_PATH, 'Model.A.0.txt') | |||
with open(system_file, 'wb') as f: | |||
f.write(refer.encode('utf-8')) | |||
with open(model_file, 'wb') as f: | |||
f.write(hyps.encode('utf-8')) | |||
r = Rouge155('/home/dqwang/ROUGE/RELEASE-1.5.5') | |||
r.system_dir = SYSTEM_PATH | |||
r.model_dir = MODEL_PATH | |||
r.system_filename_pattern = r'Reference.(\d+).txt' | |||
r.model_filename_pattern = 'Model.[A-Z].#ID#.txt' | |||
output = r.convert_and_evaluate(rouge_args="-e /home/dqwang/ROUGE/RELEASE-1.5.5/data -a -m -n 2 -d") | |||
output_dict = r.output_to_dict(output) | |||
shutil.rmtree(PYROUGE_ROOT) | |||
scores = {} | |||
scores['rouge-1'], scores['rouge-2'], scores['rouge-l'] = {}, {}, {} | |||
scores['rouge-1']['p'], scores['rouge-1']['r'], scores['rouge-1']['f'] = output_dict['rouge_1_precision'], output_dict['rouge_1_recall'], output_dict['rouge_1_f_score'] | |||
scores['rouge-2']['p'], scores['rouge-2']['r'], scores['rouge-2']['f'] = output_dict['rouge_2_precision'], output_dict['rouge_2_recall'], output_dict['rouge_2_f_score'] | |||
scores['rouge-l']['p'], scores['rouge-l']['r'], scores['rouge-l']['f'] = output_dict['rouge_l_precision'], output_dict['rouge_l_recall'], output_dict['rouge_l_f_score'] | |||
return scores | |||
def pyrouge_score_all(hyps_list, refer_list, remap = True): | |||
from pyrouge import Rouge155 | |||
nowTime=datetime.datetime.now().strftime('%Y%m%d_%H%M%S') | |||
PYROUGE_ROOT = os.path.join('/remote-home/dqwang/', nowTime) | |||
SYSTEM_PATH = os.path.join(PYROUGE_ROOT,'gold') | |||
MODEL_PATH = os.path.join(PYROUGE_ROOT,'system') | |||
if os.path.exists(SYSTEM_PATH): | |||
shutil.rmtree(SYSTEM_PATH) | |||
os.makedirs(SYSTEM_PATH) | |||
if os.path.exists(MODEL_PATH): | |||
shutil.rmtree(MODEL_PATH) | |||
os.makedirs(MODEL_PATH) | |||
assert len(hyps_list) == len(refer_list) | |||
for i in range(len(hyps_list)): | |||
system_file = os.path.join(SYSTEM_PATH, 'Reference.%d.txt' % i) | |||
model_file = os.path.join(MODEL_PATH, 'Model.A.%d.txt' % i) | |||
refer = clean(refer_list[i]) if remap else refer_list[i] | |||
hyps = clean(hyps_list[i]) if remap else hyps_list[i] | |||
with open(system_file, 'wb') as f: | |||
f.write(refer.encode('utf-8')) | |||
with open(model_file, 'wb') as f: | |||
f.write(hyps.encode('utf-8')) | |||
r = Rouge155('/remote-home/dqwang/ROUGE/RELEASE-1.5.5') | |||
r.system_dir = SYSTEM_PATH | |||
r.model_dir = MODEL_PATH | |||
r.system_filename_pattern = r'Reference.(\d+).txt' | |||
r.model_filename_pattern = 'Model.[A-Z].#ID#.txt' | |||
output = r.convert_and_evaluate(rouge_args="-e /remote-home/dqwang/ROUGE/RELEASE-1.5.5/data -a -m -n 2 -d") | |||
output_dict = r.output_to_dict(output) | |||
shutil.rmtree(PYROUGE_ROOT) | |||
scores = {} | |||
scores['rouge-1'], scores['rouge-2'], scores['rouge-l'] = {}, {}, {} | |||
scores['rouge-1']['p'], scores['rouge-1']['r'], scores['rouge-1']['f'] = output_dict['rouge_1_precision'], output_dict['rouge_1_recall'], output_dict['rouge_1_f_score'] | |||
scores['rouge-2']['p'], scores['rouge-2']['r'], scores['rouge-2']['f'] = output_dict['rouge_2_precision'], output_dict['rouge_2_recall'], output_dict['rouge_2_f_score'] | |||
scores['rouge-l']['p'], scores['rouge-l']['r'], scores['rouge-l']['f'] = output_dict['rouge_l_precision'], output_dict['rouge_l_recall'], output_dict['rouge_l_f_score'] | |||
return scores | |||
def pyrouge_score_all_multi(hyps_list, refer_list, remap = True): | |||
from pyrouge import Rouge155 | |||
nowTime = datetime.datetime.now().strftime('%Y%m%d_%H%M%S') | |||
PYROUGE_ROOT = os.path.join('/remote-home/dqwang/', nowTime) | |||
SYSTEM_PATH = os.path.join(PYROUGE_ROOT, 'system') | |||
MODEL_PATH = os.path.join(PYROUGE_ROOT, 'gold') | |||
if os.path.exists(SYSTEM_PATH): | |||
shutil.rmtree(SYSTEM_PATH) | |||
os.makedirs(SYSTEM_PATH) | |||
if os.path.exists(MODEL_PATH): | |||
shutil.rmtree(MODEL_PATH) | |||
os.makedirs(MODEL_PATH) | |||
assert len(hyps_list) == len(refer_list) | |||
for i in range(len(hyps_list)): | |||
system_file = os.path.join(SYSTEM_PATH, 'Model.%d.txt' % i) | |||
# model_file = os.path.join(MODEL_PATH, 'Reference.A.%d.txt' % i) | |||
hyps = clean(hyps_list[i]) if remap else hyps_list[i] | |||
with open(system_file, 'wb') as f: | |||
f.write(hyps.encode('utf-8')) | |||
referType = ["A", "B", "C", "D", "E", "F", "G"] | |||
for j in range(len(refer_list[i])): | |||
model_file = os.path.join(MODEL_PATH, "Reference.%s.%d.txt" % (referType[j], i)) | |||
refer = clean(refer_list[i][j]) if remap else refer_list[i][j] | |||
with open(model_file, 'wb') as f: | |||
f.write(refer.encode('utf-8')) | |||
r = Rouge155('/remote-home/dqwang/ROUGE/RELEASE-1.5.5') | |||
r.system_dir = SYSTEM_PATH | |||
r.model_dir = MODEL_PATH | |||
r.system_filename_pattern = r'Model.(\d+).txt' | |||
r.model_filename_pattern = 'Reference.[A-Z].#ID#.txt' | |||
output = r.convert_and_evaluate(rouge_args="-e /remote-home/dqwang/ROUGE/RELEASE-1.5.5/data -a -m -n 2 -d") | |||
output_dict = r.output_to_dict(output) | |||
shutil.rmtree(PYROUGE_ROOT) | |||
scores = {} | |||
scores['rouge-1'], scores['rouge-2'], scores['rouge-l'] = {}, {}, {} | |||
scores['rouge-1']['p'], scores['rouge-1']['r'], scores['rouge-1']['f'] = output_dict['rouge_1_precision'], output_dict['rouge_1_recall'], output_dict['rouge_1_f_score'] | |||
scores['rouge-2']['p'], scores['rouge-2']['r'], scores['rouge-2']['f'] = output_dict['rouge_2_precision'], output_dict['rouge_2_recall'], output_dict['rouge_2_f_score'] | |||
scores['rouge-l']['p'], scores['rouge-l']['r'], scores['rouge-l']['f'] = output_dict['rouge_l_precision'], output_dict['rouge_l_recall'], output_dict['rouge_l_f_score'] | |||
return scores | |||
def cal_label(article, abstract): | |||
hyps_list = article | |||
refer = abstract | |||
scores = [] | |||
for hyps in hyps_list: | |||
mean_score = rouge_eval(hyps, refer) | |||
scores.append(mean_score) | |||
selected = [] | |||
selected.append(int(np.argmax(scores))) | |||
selected_sent_cnt = 1 | |||
best_rouge = np.max(scores) | |||
while selected_sent_cnt < len(hyps_list): | |||
cur_max_rouge = 0.0 | |||
cur_max_idx = -1 | |||
for i in range(len(hyps_list)): | |||
if i not in selected: | |||
temp = copy.deepcopy(selected) | |||
temp.append(i) | |||
hyps = "\n".join([hyps_list[idx] for idx in np.sort(temp)]) | |||
cur_rouge = rouge_eval(hyps, refer) | |||
if cur_rouge > cur_max_rouge: | |||
cur_max_rouge = cur_rouge | |||
cur_max_idx = i | |||
if cur_max_rouge != 0.0 and cur_max_rouge >= best_rouge: | |||
selected.append(cur_max_idx) | |||
selected_sent_cnt += 1 | |||
best_rouge = cur_max_rouge | |||
else: | |||
break | |||
# label = np.zeros(len(hyps_list), dtype=int) | |||
# label[np.array(selected)] = 1 | |||
# return list(label) | |||
return selected | |||
def cal_label_limited3(article, abstract): | |||
hyps_list = article | |||
refer = abstract | |||
scores = [] | |||
for hyps in hyps_list: | |||
try: | |||
mean_score = rouge_eval(hyps, refer) | |||
scores.append(mean_score) | |||
except ValueError: | |||
scores.append(0.0) | |||
selected = [] | |||
selected.append(int(np.argmax(scores))) | |||
selected_sent_cnt = 1 | |||
best_rouge = np.max(scores) | |||
while selected_sent_cnt < len(hyps_list) and selected_sent_cnt < 3: | |||
cur_max_rouge = 0.0 | |||
cur_max_idx = -1 | |||
for i in range(len(hyps_list)): | |||
if i not in selected: | |||
temp = copy.deepcopy(selected) | |||
temp.append(i) | |||
hyps = "\n".join([hyps_list[idx] for idx in np.sort(temp)]) | |||
cur_rouge = rouge_eval(hyps, refer) | |||
if cur_rouge > cur_max_rouge: | |||
cur_max_rouge = cur_rouge | |||
cur_max_idx = i | |||
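if cur_max_idx == -1: # no remaining sentence overlaps the reference; stop instead of appending -1 | |||
break | |||
selected.append(cur_max_idx) | |||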
selected_sent_cnt += 1 | |||
best_rouge = cur_max_rouge | |||
# logger.info(selected) | |||
# label = np.zeros(len(hyps_list), dtype=int) | |||
# label[np.array(selected)] = 1 | |||
# return list(label) | |||
return selected | |||
import torch | |||
def flip(x, dim): | |||
xsize = x.size() | |||
dim = x.dim() + dim if dim < 0 else dim | |||
x = x.contiguous() | |||
x = x.view(-1, *xsize[dim:]).contiguous() | |||
x = x.view(x.size(0), x.size(1), -1)[:, getattr(torch.arange(x.size(1)-1, | |||
-1, -1), ('cpu','cuda')[x.is_cuda])().long(), :] | |||
return x.view(xsize) | |||
def get_attn_key_pad_mask(seq_k, seq_q): | |||
''' For masking out the padding part of key sequence. ''' | |||
# Expand to fit the shape of key query attention matrix. | |||
len_q = seq_q.size(1) | |||
padding_mask = seq_k.eq(0.0) | |||
padding_mask = padding_mask.unsqueeze(1).expand(-1, len_q, -1) # b x lq x lk | |||
return padding_mask | |||
def get_non_pad_mask(seq): | |||
assert seq.dim() == 2 | |||
return seq.ne(0.0).type(torch.float).unsqueeze(-1) |
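The greedy oracle selection in `cal_label` above can be illustrated without the ROUGE dependency. The sketch below is a simplified stand-in, not the author's implementation: `unigram_f1` is a hypothetical scorer that replaces `rouge_eval`, and `greedy_select` is an illustrative name. It starts from the best single sentence, keeps adding the sentence that yields the highest combined score, and stops once no candidate matches or improves the current best.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Hypothetical stand-in for rouge_eval: unigram-overlap F1 on whitespace tokens."""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def greedy_select(sentences, reference, score=unigram_f1):
    """Greedily pick sentence indices whose concatenation scores best against the reference."""
    if not sentences:
        return []
    single = [score(s, reference) for s in sentences]
    selected = [max(range(len(sentences)), key=single.__getitem__)]
    best = single[selected[0]]
    while len(selected) < len(sentences):
        candidates = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Score the candidate summary with sentences kept in document order.
            text = "\n".join(sentences[j] for j in sorted(selected + [i]))
            candidates.append((score(text, reference), i))
        cur_best, cur_idx = max(candidates, key=lambda c: c[0])
        # Stop as soon as adding a sentence no longer helps (same rule as cal_label).
        if cur_best == 0.0 or cur_best < best:
            break
        selected.append(cur_idx)
        best = cur_best
    return selected


article = ["the cat sat on the mat", "stocks fell sharply today", "a dog barked at the cat"]
abstract = "the cat sat on the mat while a dog barked"
print(greedy_select(article, abstract))
```

Passing the scorer as a parameter keeps the selection loop independent of the metric, so a ROUGE-based scorer could be swapped in without touching the loop.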
@@ -0,0 +1,263 @@ | |||
#!/usr/bin/python | |||
# -*- coding: utf-8 -*- | |||
# __author__="Danqing Wang" | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================== | |||
"""Train Model1: baseline model""" | |||
import os | |||
import sys | |||
import json | |||
import argparse | |||
import datetime | |||
import torch | |||
import torch.nn | |||
os.environ['FASTNLP_BASE_URL'] = 'http://10.141.222.118:8888/file/download/' | |||
os.environ['FASTNLP_CACHE_DIR'] = '/remote-home/hyan01/fastnlp_caches' | |||
sys.path.append('/remote-home/dqwang/FastNLP/fastNLP/') | |||
from fastNLP.core.const import Const | |||
from fastNLP.core.trainer import Trainer, Tester | |||
from fastNLP.io.model_io import ModelLoader, ModelSaver | |||
from fastNLP.io.embed_loader import EmbedLoader | |||
from tools.logger import * | |||
from data.dataloader import SummarizationLoader | |||
# from model.TransformerModel import TransformerModel | |||
from model.TForiginal import TransformerModel | |||
from model.Metric import LabelFMetric, FastRougeMetric, PyRougeMetric | |||
from model.Loss import MyCrossEntropyLoss | |||
from tools.Callback import TrainCallback | |||
def setup_training(model, train_loader, valid_loader, hps): | |||
"""Does setup before starting training (run_training)""" | |||
train_dir = os.path.join(hps.save_root, "train") | |||
if not os.path.exists(train_dir): os.makedirs(train_dir) | |||
if hps.restore_model != 'None': | |||
logger.info("[INFO] Restoring %s for training...", hps.restore_model) | |||
bestmodel_file = os.path.join(train_dir, hps.restore_model) | |||
loader = ModelLoader() | |||
loader.load_pytorch(model, bestmodel_file) | |||
else: | |||
logger.info("[INFO] Create new model for training...") | |||
try: | |||
run_training(model, train_loader, valid_loader, hps) # runs for hps.n_epochs epochs unless interrupted | |||
except KeyboardInterrupt: | |||
logger.error("[Error] Caught keyboard interrupt on worker. Stopping supervisor...") | |||
save_file = os.path.join(train_dir, "earlystop.pkl") | |||
saver = ModelSaver(save_file) | |||
saver.save_pytorch(model) | |||
logger.info('[INFO] Saving early stop model to %s', save_file) | |||
def run_training(model, train_loader, valid_loader, hps): | |||
"""Repeatedly runs training iterations, logging loss to screen and writing summaries""" | |||
logger.info("[INFO] Starting run_training") | |||
train_dir = os.path.join(hps.save_root, "train") | |||
if not os.path.exists(train_dir): os.makedirs(train_dir) | |||
eval_dir = os.path.join(hps.save_root, "eval") # make a subdir of the root dir for eval data | |||
if not os.path.exists(eval_dir): os.makedirs(eval_dir) | |||
lr = hps.lr | |||
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr) | |||
criterion = MyCrossEntropyLoss(pred = "p_sent", target=Const.TARGET, mask=Const.INPUT_LEN, reduce='none') | |||
# criterion = torch.nn.CrossEntropyLoss(reduce="none") | |||
trainer = Trainer(model=model, train_data=train_loader, optimizer=optimizer, loss=criterion, | |||
n_epochs=hps.n_epochs, print_every=100, dev_data=valid_loader, metrics=[LabelFMetric(pred="prediction"), FastRougeMetric(hps, pred="prediction")], | |||
metric_key="f", validate_every=-1, save_path=eval_dir, | |||
callbacks=[TrainCallback(hps, patience=5)], use_tqdm=False) | |||
train_info = trainer.train(load_best_model=True) | |||
logger.info(' | end of Train | time: {:5.2f}s | '.format(train_info["seconds"])) | |||
logger.info('[INFO] best eval model in epoch %d and iter %d', train_info["best_epoch"], train_info["best_step"]) | |||
logger.info(train_info["best_eval"]) | |||
bestmodel_save_path = os.path.join(eval_dir, 'bestmodel.pkl') # this is where checkpoints of best models are saved | |||
saver = ModelSaver(bestmodel_save_path) | |||
saver.save_pytorch(model) | |||
logger.info('[INFO] Saving eval best model to %s', bestmodel_save_path) | |||
def run_test(model, loader, hps, limited=False): | |||
"""Loads the checkpoint selected by hps.test_model and evaluates it on the test set, logging the metrics.""" | |||
test_dir = os.path.join(hps.save_root, "test") # make a subdir of the root dir for test data | |||
eval_dir = os.path.join(hps.save_root, "eval") | |||
if not os.path.exists(test_dir) : os.makedirs(test_dir) | |||
if not os.path.exists(eval_dir) : | |||
logger.error("[Error] eval_dir %s doesn't exist. Run in train mode to create it.", eval_dir) | |||
raise Exception("[Error] eval_dir %s doesn't exist. Run in train mode to create it." % (eval_dir)) | |||
if hps.test_model == "evalbestmodel": | |||
bestmodel_load_path = os.path.join(eval_dir, 'bestmodel.pkl') # this is where checkpoints of best models are saved | |||
elif hps.test_model == "earlystop": | |||
train_dir = os.path.join(hps.save_root, "train") | |||
bestmodel_load_path = os.path.join(train_dir, 'earlystop.pkl') | |||
else: | |||
logger.error("No such model! Must be one of evalbestmodel/earlystop") | |||
raise ValueError("No such model! Must be one of evalbestmodel/earlystop") | |||
logger.info("[INFO] Restoring %s for testing...The path is %s", hps.test_model, bestmodel_load_path) | |||
modelloader = ModelLoader() | |||
modelloader.load_pytorch(model, bestmodel_load_path) | |||
if hps.use_pyrouge: | |||
logger.info("[INFO] Use PyRougeMetric for testing") | |||
tester = Tester(data=loader, model=model, | |||
metrics=[LabelFMetric(pred="prediction"), PyRougeMetric(hps, pred="prediction")], | |||
batch_size=hps.batch_size) | |||
else: | |||
logger.info("[INFO] Use FastRougeMetric for testing") | |||
tester = Tester(data=loader, model=model, | |||
metrics=[LabelFMetric(pred="prediction"), FastRougeMetric(hps, pred="prediction")], | |||
batch_size=hps.batch_size) | |||
test_info = tester.test() | |||
logger.info(test_info) | |||
def main(): | |||
parser = argparse.ArgumentParser(description='Summarization Model') | |||
# Where to find data | |||
parser.add_argument('--data_path', type=str, default='/remote-home/dqwang/Datasets/CNNDM/train.label.jsonl', help='Path to the jsonl data file (training data in train mode, test data in test mode).') | |||
parser.add_argument('--valid_path', type=str, default='/remote-home/dqwang/Datasets/CNNDM/val.label.jsonl', help='Path to the validation jsonl data file.') | |||
parser.add_argument('--vocab_path', type=str, default='/remote-home/dqwang/Datasets/CNNDM/vocab', help='Path expression to text vocabulary file.') | |||
# Important settings | |||
parser.add_argument('--mode', choices=['train', 'test'], default='train', help='must be one of train/test') | |||
parser.add_argument('--embedding', type=str, default='glove', choices=['word2vec', 'glove', 'elmo', 'bert'], help='must be one of word2vec/glove/elmo/bert') | |||
parser.add_argument('--sentence_encoder', type=str, default='transformer', choices=['bilstm', 'deeplstm', 'transformer'], help='must be one of bilstm/deeplstm/transformer') | |||
parser.add_argument('--sentence_decoder', type=str, default='SeqLab', choices=['PN', 'SeqLab'], help='must be one of PN/SeqLab') | |||
parser.add_argument('--restore_model', type=str , default='None', help='Restore model for further training. [bestmodel/bestFmodel/earlystop/None]') | |||
# Where to save output | |||
parser.add_argument('--save_root', type=str, default='save/', help='Root directory for all model.') | |||
parser.add_argument('--log_root', type=str, default='log/', help='Root directory for all logging.') | |||
# Hyperparameters | |||
parser.add_argument('--gpu', type=str, default='0', help='GPU ID to use. For cpu, set -1 [default: 0]') | |||
parser.add_argument('--cuda', action='store_true', default=False, help='use cuda') | |||
parser.add_argument('--vocab_size', type=int, default=100000, help='Size of vocabulary. These will be read from the vocabulary file in order. If the vocabulary file contains fewer words than this number, or if this number is set to 0, will take all words in the vocabulary file.') | |||
parser.add_argument('--n_epochs', type=int, default=20, help='Number of epochs [default: 20]') | |||
parser.add_argument('--batch_size', type=int, default=32, help='Mini batch size [default: 32]') | |||
parser.add_argument('--word_embedding', action='store_true', default=True, help='whether to use Word embedding') | |||
parser.add_argument('--embedding_path', type=str, default='/remote-home/dqwang/Glove/glove.42B.300d.txt', help='Path expression to external word embedding.') | |||
parser.add_argument('--word_emb_dim', type=int, default=300, help='Word embedding size [default: 300]') | |||
parser.add_argument('--embed_train', action='store_true', default=False, help='whether to train Word embedding [default: False]') | |||
parser.add_argument('--min_kernel_size', type=int, default=1, help='kernel min length for CNN [default:1]') | |||
parser.add_argument('--max_kernel_size', type=int, default=7, help='kernel max length for CNN [default:7]') | |||
parser.add_argument('--output_channel', type=int, default=50, help='output channel: repeated times for one kernel') | |||
parser.add_argument('--use_orthnormal_init', action='store_true', default=True, help='use orthonormal init for lstm [default: true]') | |||
parser.add_argument('--sent_max_len', type=int, default=100, help='max length of sentences (max source text sentence tokens)') | |||
parser.add_argument('--doc_max_timesteps', type=int, default=50, help='max length of documents (max timesteps of documents)') | |||
parser.add_argument('--save_label', action='store_true', default=False, help='require multihead attention') | |||
# Training | |||
parser.add_argument('--lr', type=float, default=0.0001, help='learning rate') | |||
parser.add_argument('--lr_descent', action='store_true', default=False, help='learning rate descent') | |||
parser.add_argument('--warmup_steps', type=int, default=4000, help='warmup_steps') | |||
parser.add_argument('--grad_clip', action='store_true', default=False, help='for gradient clipping') | |||
parser.add_argument('--max_grad_norm', type=float, default=10, help='max gradient norm for gradient clipping') | |||
# test | |||
parser.add_argument('-m', type=int, default=3, help='decode summary length') | |||
parser.add_argument('--limited', action='store_true', default=False, help='limited decode summary length') | |||
parser.add_argument('--test_model', type=str, default='evalbestmodel', help='choose different model to test [evalbestmodel/earlystop]') | |||
parser.add_argument('--use_pyrouge', action='store_true', default=False, help='use_pyrouge') | |||
args = parser.parse_args() | |||
os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu | |||
torch.set_printoptions(threshold=50000) | |||
# File paths | |||
DATA_FILE = args.data_path | |||
VALID_FILE = args.valid_path | |||
VOCAL_FILE = args.vocab_path | |||
LOG_PATH = args.log_root | |||
# train_log setting | |||
if not os.path.exists(LOG_PATH): | |||
if args.mode == "train": | |||
os.makedirs(LOG_PATH) | |||
else: | |||
logger.error("[Error] Logdir %s doesn't exist. Run in train mode to create it.", LOG_PATH) | |||
raise Exception("[Error] Logdir %s doesn't exist. Run in train mode to create it." % (LOG_PATH)) | |||
nowTime=datetime.datetime.now().strftime('%Y%m%d_%H%M%S') | |||
log_path = os.path.join(LOG_PATH, args.mode + "_" + nowTime) | |||
file_handler = logging.FileHandler(log_path) | |||
file_handler.setFormatter(formatter) | |||
logger.addHandler(file_handler) | |||
logger.info("Pytorch %s", torch.__version__) | |||
sum_loader = SummarizationLoader() | |||
hps = args | |||
if hps.mode == 'test': | |||
paths = {"test": DATA_FILE} | |||
hps.recurrent_dropout_prob = 0.0 | |||
hps.atten_dropout_prob = 0.0 | |||
hps.ffn_dropout_prob = 0.0 | |||
logger.info(hps) | |||
else: | |||
paths = {"train": DATA_FILE, "valid": VALID_FILE} | |||
dataInfo = sum_loader.process(paths=paths, vocab_size=hps.vocab_size, vocab_path=VOCAL_FILE, sent_max_len=hps.sent_max_len, doc_max_timesteps=hps.doc_max_timesteps, load_vocab=os.path.exists(VOCAL_FILE)) | |||
if args.embedding == "glove": | |||
vocab = dataInfo.vocabs["vocab"] | |||
embed = torch.nn.Embedding(len(vocab), hps.word_emb_dim) | |||
if hps.word_embedding: | |||
embed_loader = EmbedLoader() | |||
pretrained_weight = embed_loader.load_with_vocab(hps.embedding_path, vocab) # unfound with random init | |||
embed.weight.data.copy_(torch.from_numpy(pretrained_weight)) | |||
embed.weight.requires_grad = hps.embed_train | |||
else: | |||
logger.error("[ERROR] embedding To Be Continued!") | |||
sys.exit(1) | |||
if args.sentence_encoder == "transformer" and args.sentence_decoder == "SeqLab": | |||
with open("config/transformer.config", "r") as config_file: # close the config file after reading | |||
model_param = json.load(config_file) | |||
hps.__dict__.update(model_param) | |||
model = TransformerModel(hps, embed) | |||
else: | |||
logger.error("[ERROR] Model To Be Continued!") | |||
sys.exit(1) | |||
logger.info(hps) | |||
if hps.cuda: | |||
model = model.cuda() | |||
logger.info("[INFO] Use cuda") | |||
if hps.mode == 'train': | |||
dataInfo.datasets["valid"].set_target("text", "summary") | |||
setup_training(model, dataInfo.datasets["train"], dataInfo.datasets["valid"], hps) | |||
elif hps.mode == 'test': | |||
logger.info("[INFO] Decoding...") | |||
dataInfo.datasets["test"].set_target("text", "summary") | |||
run_test(model, dataInfo.datasets["test"], hps, limited=hps.limited) | |||
else: | |||
logger.error("The 'mode' flag must be one of train/test") | |||
raise ValueError("The 'mode' flag must be one of train/test") | |||
if __name__ == '__main__': | |||
main() |
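One detail of the argument parser above is worth noting: `--word_embedding` and `--use_orthnormal_init` combine `action='store_true'` with `default=True`, so passing the flag changes nothing and the option cannot be switched off from the command line. The sketch below shows one possible alternative, assuming Python 3.9 or later for `argparse.BooleanOptionalAction`. The `build_parser` helper is hypothetical and not part of the script.

```python
import argparse


def build_parser():
    """Hypothetical parser containing only the two boolean options under discussion."""
    parser = argparse.ArgumentParser(description='Summarization Model')
    # BooleanOptionalAction (Python 3.9+) also registers a --no-<name> variant.
    parser.add_argument('--word_embedding', action=argparse.BooleanOptionalAction,
                        default=True, help='whether to use Word embedding')
    parser.add_argument('--use_orthnormal_init', action=argparse.BooleanOptionalAction,
                        default=True, help='use orthnormal init for lstm')
    return parser


parser = build_parser()
print(parser.parse_args([]).word_embedding)                       # default is kept
print(parser.parse_args(['--no-word_embedding']).word_embedding)  # option can be disabled
```

Keeping the original flag names means existing invocations continue to work, while `--no-word_embedding` and `--no-use_orthnormal_init` become available for ablation runs.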
@@ -0,0 +1,706 @@ | |||
#!/usr/bin/python | |||
# -*- coding: utf-8 -*- | |||
# __author__="Danqing Wang" | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================== | |||
"""Train Model1: baseline model""" | |||
import os | |||
import sys | |||
import time | |||
import copy | |||
import pickle | |||
import datetime | |||
import argparse | |||
import logging | |||
import numpy as np | |||
import torch | |||
import torch.nn as nn | |||
from torch.autograd import Variable | |||
from rouge import Rouge | |||
sys.path.append('/remote-home/dqwang/FastNLP/fastNLP/') | |||
from fastNLP.core.batch import DataSetIter | |||
from fastNLP.core.const import Const | |||
from fastNLP.io.model_io import ModelLoader, ModelSaver | |||
from fastNLP.core.sampler import BucketSampler | |||
from tools import utils | |||
from tools.logger import * | |||
from data.dataloader import SummarizationLoader | |||
from model.TForiginal import TransformerModel | |||
def setup_training(model, train_loader, valid_loader, hps): | |||
"""Does setup before starting training (run_training)""" | |||
train_dir = os.path.join(hps.save_root, "train") | |||
if not os.path.exists(train_dir): os.makedirs(train_dir) | |||
if hps.restore_model != 'None': | |||
logger.info("[INFO] Restoring %s for training...", hps.restore_model) | |||
bestmodel_file = os.path.join(train_dir, hps.restore_model) | |||
loader = ModelLoader() | |||
loader.load_pytorch(model, bestmodel_file) | |||
else: | |||
logger.info("[INFO] Create new model for training...") | |||
try: | |||
run_training(model, train_loader, valid_loader, hps) # runs until hps.n_epochs or early stopping, unless interrupted | |||
except KeyboardInterrupt: | |||
logger.error("[Error] Caught keyboard interrupt on worker. Stopping supervisor...") | |||
save_file = os.path.join(train_dir, "earlystop.pkl") | |||
saver = ModelSaver(save_file) | |||
saver.save_pytorch(model) | |||
logger.info('[INFO] Saving early stop model to %s', save_file) | |||
def run_training(model, train_loader, valid_loader, hps): | |||
"""Repeatedly runs training iterations, logging loss to screen and writing summaries""" | |||
logger.info("[INFO] Starting run_training") | |||
train_dir = os.path.join(hps.save_root, "train") | |||
if not os.path.exists(train_dir): os.makedirs(train_dir) | |||
lr = hps.lr | |||
# optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr, betas=(0.9, 0.98), | |||
# eps=1e-09) | |||
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr) | |||
criterion = torch.nn.CrossEntropyLoss(reduction='none') | |||
best_train_loss = None | |||
best_train_F= None | |||
best_loss = None | |||
best_F = None | |||
step_num = 0 | |||
non_descent_cnt = 0 | |||
for epoch in range(1, hps.n_epochs + 1): | |||
epoch_loss = 0.0 | |||
train_loss = 0.0 | |||
total_example_num = 0 | |||
match, pred, true, match_true = 0.0, 0.0, 0.0, 0.0 | |||
epoch_start_time = time.time() | |||
for i, (batch_x, batch_y) in enumerate(train_loader): | |||
# if i > 10: | |||
# break | |||
model.train() | |||
iter_start_time=time.time() | |||
input, input_len = batch_x[Const.INPUT], batch_x[Const.INPUT_LEN] | |||
label = batch_y[Const.TARGET] | |||
# logger.info(batch_x["text"][0]) | |||
# logger.info(input[0,:,:]) | |||
# logger.info(input_len[0:5,:]) | |||
# logger.info(batch_y["summary"][0:5]) | |||
# logger.info(label[0:5,:]) | |||
# logger.info((len(batch_x["text"][0]), sum(input[0].sum(-1) != 0))) | |||
batch_size, N, seq_len = input.size() | |||
if hps.cuda: | |||
input = input.cuda() # [batch, N, seq_len] | |||
label = label.cuda() | |||
input_len = input_len.cuda() | |||
input = Variable(input) | |||
label = Variable(label) | |||
input_len = Variable(input_len) | |||
model_outputs = model.forward(input, input_len) # [batch, N, 2] | |||
outputs = model_outputs["p_sent"].view(-1, 2) | |||
label = label.view(-1) | |||
loss = criterion(outputs, label) # [batch_size, doc_max_timesteps] | |||
# input_len = input_len.float().view(-1) | |||
loss = loss.view(batch_size, -1) | |||
loss = loss.masked_fill(input_len.eq(0), 0) | |||
loss = loss.sum(1).mean() | |||
logger.debug("loss %f", loss) | |||
if not torch.isfinite(loss).all(): # also works when the loss tensor lives on the GPU | |||
logger.error("train Loss is not finite. Stopping.") | |||
logger.info(loss) | |||
for name, param in model.named_parameters(): | |||
if param.requires_grad: | |||
logger.info(name) | |||
logger.info(param.grad.data.sum()) | |||
raise Exception("train Loss is not finite. Stopping.") | |||
optimizer.zero_grad() | |||
loss.backward() | |||
if hps.grad_clip: | |||
torch.nn.utils.clip_grad_norm_(model.parameters(), hps.max_grad_norm) | |||
optimizer.step() | |||
step_num += 1 | |||
train_loss += float(loss.data) | |||
epoch_loss += float(loss.data) | |||
if i % 100 == 0: | |||
# start debugger | |||
# import pdb; pdb.set_trace() | |||
for name, param in model.named_parameters(): | |||
if param.requires_grad: | |||
logger.debug(name) | |||
logger.debug(param.grad.data.sum()) | |||
logger.info(' | end of iter {:3d} | time: {:5.2f}s | train loss {:5.4f} | ' | |||
.format(i, (time.time() - iter_start_time), | |||
float(train_loss / 100))) | |||
train_loss = 0.0 | |||
# calculate the precision, recall and F | |||
prediction = outputs.max(1)[1] | |||
prediction = prediction.data | |||
label = label.data | |||
pred += prediction.sum() | |||
true += label.sum() | |||
match_true += ((prediction == label) & (prediction == 1)).sum() | |||
match += (prediction == label).sum() | |||
total_example_num += int(batch_size * N) | |||
if hps.lr_descent: | |||
# new_lr = pow(hps.hidden_size, -0.5) * min(pow(step_num, -0.5), | |||
# step_num * pow(hps.warmup_steps, -1.5)) | |||
new_lr = max(5e-6, lr / (epoch + 1)) | |||
for param_group in list(optimizer.param_groups): | |||
param_group['lr'] = new_lr | |||
logger.info("[INFO] The learning rate now is %f", new_lr) | |||
epoch_avg_loss = epoch_loss / len(train_loader) | |||
logger.info(' | end of epoch {:3d} | time: {:5.2f}s | epoch train loss {:5.4f} | ' | |||
.format(epoch, (time.time() - epoch_start_time), | |||
float(epoch_avg_loss))) | |||
logger.info("[INFO] Trainset match_true %d, pred %d, true %d, total %d, match %d", match_true, pred, true, total_example_num, match) | |||
accu, precision, recall, F = utils.eval_label(match_true, pred, true, total_example_num, match) | |||
logger.info("[INFO] The size of totalset is %d, accu is %f, precision is %f, recall is %f, F is %f", total_example_num / hps.doc_max_timesteps, accu, precision, recall, F) | |||
if not best_train_loss or epoch_avg_loss < best_train_loss: | |||
save_file = os.path.join(train_dir, "bestmodel.pkl") | |||
logger.info('[INFO] Found new best model with %.3f running_train_loss. Saving to %s', float(epoch_avg_loss), save_file) | |||
saver = ModelSaver(save_file) | |||
saver.save_pytorch(model) | |||
best_train_loss = epoch_avg_loss | |||
elif epoch_avg_loss > best_train_loss: | |||
logger.error("[Error] training loss did not decrease. Stopping supervisor...") | |||
save_file = os.path.join(train_dir, "earlystop.pkl") | |||
saver = ModelSaver(save_file) | |||
saver.save_pytorch(model) | |||
logger.info('[INFO] Saving early stop model to %s', save_file) | |||
return | |||
if not best_train_F or F > best_train_F: | |||
save_file = os.path.join(train_dir, "bestFmodel.pkl") | |||
logger.info('[INFO] Found new best model with %.3f F score. Saving to %s', float(F), save_file) | |||
saver = ModelSaver(save_file) | |||
saver.save_pytorch(model) | |||
best_train_F = F | |||
best_loss, best_F, non_descent_cnt = run_eval(model, valid_loader, hps, best_loss, best_F, non_descent_cnt) | |||
if non_descent_cnt >= 3: | |||
logger.error("[Error] val loss did not decrease three times. Stopping supervisor...") | |||
save_file = os.path.join(train_dir, "earlystop.pkl") | |||
saver = ModelSaver(save_file) | |||
saver.save_pytorch(model) | |||
logger.info('[INFO] Saving early stop model to %s', save_file) | |||
return | |||
def run_eval(model, loader, hps, best_loss, best_F, non_descent_cnt): | |||
"""Repeatedly runs eval iterations, logging to screen and writing summaries. Saves the model with the best loss seen so far.""" | |||
logger.info("[INFO] Starting eval for this model ...") | |||
eval_dir = os.path.join(hps.save_root, "eval") # make a subdir of the root dir for eval data | |||
if not os.path.exists(eval_dir): os.makedirs(eval_dir) | |||
model.eval() | |||
running_loss = 0.0 | |||
match, pred, true, match_true = 0.0, 0.0, 0.0, 0.0 | |||
pairs = {} | |||
pairs["hyps"] = [] | |||
pairs["refer"] = [] | |||
total_example_num = 0 | |||
criterion = torch.nn.CrossEntropyLoss(reduction='none') | |||
iter_start_time = time.time() | |||
with torch.no_grad(): | |||
for i, (batch_x, batch_y) in enumerate(loader): | |||
# if i > 10: | |||
# break | |||
input, input_len = batch_x[Const.INPUT], batch_x[Const.INPUT_LEN] | |||
label = batch_y[Const.TARGET] | |||
if hps.cuda: | |||
input = input.cuda() # [batch, N, seq_len] | |||
label = label.cuda() | |||
input_len = input_len.cuda() | |||
batch_size, N, _ = input.size() | |||
input = Variable(input, requires_grad=False) | |||
label = Variable(label) | |||
input_len = Variable(input_len, requires_grad=False) | |||
model_outputs = model.forward(input,input_len) # [batch, N, 2] | |||
outputs = model_outputs["p_sent"] | |||
prediction = model_outputs["prediction"] | |||
outputs = outputs.view(-1, 2) # [batch * N, 2] | |||
label = label.view(-1) # [batch * N] | |||
loss = criterion(outputs, label) | |||
loss = loss.view(batch_size, -1) | |||
loss = loss.masked_fill(input_len.eq(0), 0) | |||
loss = loss.sum(1).mean() | |||
logger.debug("loss %f", loss) | |||
running_loss += float(loss.data) | |||
label = label.data.view(batch_size, -1) | |||
pred += prediction.sum() | |||
true += label.sum() | |||
match_true += ((prediction == label) & (prediction == 1)).sum() | |||
match += (prediction == label).sum() | |||
total_example_num += batch_size * N | |||
# rouge | |||
prediction = prediction.view(batch_size, -1) | |||
for j in range(batch_size): | |||
original_article_sents = batch_x["text"][j] | |||
sent_max_number = len(original_article_sents) | |||
refer = "\n".join(batch_x["summary"][j]) | |||
hyps = "\n".join(original_article_sents[id] for id in range(len(prediction[j])) if prediction[j][id]==1 and id < sent_max_number) | |||
if sent_max_number < hps.m and len(hyps) <= 1: | |||
logger.error("sent_max_number is too short %d, Skip!" , sent_max_number) | |||
continue | |||
if len(hyps) >= 1 and hyps != '.': | |||
# logger.debug(prediction[j]) | |||
pairs["hyps"].append(hyps) | |||
pairs["refer"].append(refer) | |||
elif refer == "." or refer == "": | |||
logger.error("Refer is None!") | |||
logger.debug("label:") | |||
logger.debug(label[j]) | |||
logger.debug(refer) | |||
elif hyps == "." or hyps == "": | |||
logger.error("hyps is None!") | |||
logger.debug("sent_max_number:%d", sent_max_number) | |||
logger.debug("prediction:") | |||
logger.debug(prediction[j]) | |||
logger.debug(hyps) | |||
else: | |||
logger.error("Do not select any sentences!") | |||
logger.debug("sent_max_number:%d", sent_max_number) | |||
logger.debug(original_article_sents) | |||
logger.debug("label:") | |||
logger.debug(label[j]) | |||
continue | |||
running_avg_loss = running_loss / len(loader) | |||
if hps.use_pyrouge: | |||
logger.info("The number of pairs is %d", len(pairs["hyps"])) | |||
logging.getLogger('global').setLevel(logging.WARNING) | |||
if not len(pairs["hyps"]): | |||
logger.error("During validation, no hypotheses were selected!") | |||
return best_loss, best_F, non_descent_cnt | |||
if isinstance(pairs["refer"][0], list): | |||
logger.info("Multi Reference summaries!") | |||
scores_all = utils.pyrouge_score_all_multi(pairs["hyps"], pairs["refer"]) | |||
else: | |||
scores_all = utils.pyrouge_score_all(pairs["hyps"], pairs["refer"]) | |||
else: | |||
if len(pairs["hyps"]) == 0 or len(pairs["refer"]) == 0 : | |||
logger.error("During validation, no hypotheses were selected!") | |||
return best_loss, best_F, non_descent_cnt | |||
rouge = Rouge() | |||
scores_all = rouge.get_scores(pairs["hyps"], pairs["refer"], avg=True) | |||
# try: | |||
# scores_all = rouge.get_scores(pairs["hyps"], pairs["refer"], avg=True) | |||
# except ValueError as e: | |||
# logger.error(repr(e)) | |||
# scores_all = [] | |||
# for idx in range(len(pairs["hyps"])): | |||
# try: | |||
# scores = rouge.get_scores(pairs["hyps"][idx], pairs["refer"][idx])[0] | |||
# scores_all.append(scores) | |||
# except ValueError as e: | |||
# logger.error(repr(e)) | |||
# logger.debug("HYPS:\t%s", pairs["hyps"][idx]) | |||
# logger.debug("REFER:\t%s", pairs["refer"][idx]) | |||
# finally: | |||
# logger.error("During testing, some errors happen!") | |||
# logger.error(len(scores_all)) | |||
# exit(1) | |||
logger.info('[INFO] End of valid | time: {:5.2f}s | valid loss {:5.4f} | ' | |||
.format((time.time() - iter_start_time), | |||
float(running_avg_loss))) | |||
logger.info("[INFO] Validset match_true %d, pred %d, true %d, total %d, match %d", match_true, pred, true, total_example_num, match) | |||
accu, precision, recall, F = utils.eval_label(match_true, pred, true, total_example_num, match) | |||
logger.info("[INFO] The size of totalset is %d, accu is %f, precision is %f, recall is %f, F is %f", | |||
total_example_num / hps.doc_max_timesteps, accu, precision, recall, F) | |||
res = "Rouge1:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores_all['rouge-1']['p'], scores_all['rouge-1']['r'], scores_all['rouge-1']['f']) \ | |||
+ "Rouge2:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores_all['rouge-2']['p'], scores_all['rouge-2']['r'], scores_all['rouge-2']['f']) \ | |||
+ "Rougel:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores_all['rouge-l']['p'], scores_all['rouge-l']['r'], scores_all['rouge-l']['f']) | |||
logger.info(res) | |||
# If running_avg_loss is the best so far, save this checkpoint. | |||
# The checkpoint is written to bestmodel.pkl in the eval dir, overwriting the previous best. | |||
if best_loss is None or running_avg_loss < best_loss: | |||
bestmodel_save_path = os.path.join(eval_dir, 'bestmodel.pkl') # this is where checkpoints of best models are saved | |||
if best_loss is not None: | |||
logger.info('[INFO] Found new best model with %.6f running_avg_loss. The original loss is %.6f, Saving to %s', float(running_avg_loss), float(best_loss), bestmodel_save_path) | |||
else: | |||
logger.info('[INFO] Found new best model with %.6f running_avg_loss. The original loss is None, Saving to %s', float(running_avg_loss), bestmodel_save_path) | |||
saver = ModelSaver(bestmodel_save_path) | |||
saver.save_pytorch(model) | |||
best_loss = running_avg_loss | |||
non_descent_cnt = 0 | |||
else: | |||
non_descent_cnt += 1 | |||
if best_F is None or best_F < F: | |||
bestmodel_save_path = os.path.join(eval_dir, 'bestFmodel.pkl') # this is where checkpoints of best models are saved | |||
if best_F is not None: | |||
logger.info('[INFO] Found new best model with %.6f F. The original F is %.6f, Saving to %s', float(F), float(best_F), bestmodel_save_path) | |||
else: | |||
logger.info('[INFO] Found new best model with %.6f F. The original F is None, Saving to %s', float(F), bestmodel_save_path) | |||
saver = ModelSaver(bestmodel_save_path) | |||
saver.save_pytorch(model) | |||
best_F = F | |||
return best_loss, best_F, non_descent_cnt | |||
def run_test(model, loader, hps, limited=False): | |||
"""Loads the checkpoint selected by hps.test_model, runs it over the test set, and writes the hypotheses and ROUGE / label scores to a result file.""" | |||
test_dir = os.path.join(hps.save_root, "test") # make a subdir of the root dir for test output | |||
eval_dir = os.path.join(hps.save_root, "eval") | |||
if not os.path.exists(test_dir) : os.makedirs(test_dir) | |||
if not os.path.exists(eval_dir) : | |||
logger.error("[Error] eval_dir %s doesn't exist. Run in train mode to create it.", eval_dir) | |||
raise Exception("[Error] eval_dir %s doesn't exist. Run in train mode to create it." % (eval_dir)) | |||
if hps.test_model == "evalbestmodel": | |||
bestmodel_load_path = os.path.join(eval_dir, 'bestmodel.pkl') # this is where checkpoints of best models are saved | |||
elif hps.test_model == "evalbestFmodel": | |||
bestmodel_load_path = os.path.join(eval_dir, 'bestFmodel.pkl') | |||
elif hps.test_model == "trainbestmodel": | |||
train_dir = os.path.join(hps.save_root, "train") | |||
bestmodel_load_path = os.path.join(train_dir, 'bestmodel.pkl') | |||
elif hps.test_model == "trainbestFmodel": | |||
train_dir = os.path.join(hps.save_root, "train") | |||
bestmodel_load_path = os.path.join(train_dir, 'bestFmodel.pkl') | |||
elif hps.test_model == "earlystop": | |||
train_dir = os.path.join(hps.save_root, "train") | |||
bestmodel_load_path = os.path.join(train_dir, 'earlystop.pkl') | |||
else: | |||
logger.error("No such model! Must be one of evalbestmodel/evalbestFmodel/trainbestmodel/trainbestFmodel/earlystop") | |||
raise ValueError("No such model! Must be one of evalbestmodel/evalbestFmodel/trainbestmodel/trainbestFmodel/earlystop") | |||
logger.info("[INFO] Restoring %s for testing...The path is %s", hps.test_model, bestmodel_load_path) | |||
modelloader = ModelLoader() | |||
modelloader.load_pytorch(model, bestmodel_load_path) | |||
import datetime | |||
nowTime = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')  # current timestamp | |||
if hps.save_label: | |||
log_dir = os.path.join(test_dir, hps.data_path.split("/")[-1]) | |||
resfile = open(log_dir, "w") | |||
else: | |||
log_dir = os.path.join(test_dir, nowTime) | |||
resfile = open(log_dir, "wb") | |||
logger.info("[INFO] Write the Evaluation into %s", log_dir) | |||
model.eval() | |||
match, pred, true, match_true = 0.0, 0.0, 0.0, 0.0 | |||
total_example_num = 0.0 | |||
pairs = {} | |||
pairs["hyps"] = [] | |||
pairs["refer"] = [] | |||
pred_list = [] | |||
iter_start_time=time.time() | |||
with torch.no_grad(): | |||
for i, (batch_x, batch_y) in enumerate(loader): | |||
input, input_len = batch_x[Const.INPUT], batch_x[Const.INPUT_LEN] | |||
label = batch_y[Const.TARGET] | |||
if hps.cuda: | |||
input = input.cuda() # [batch, N, seq_len] | |||
label = label.cuda() | |||
input_len = input_len.cuda() | |||
batch_size, N, _ = input.size() | |||
input = Variable(input) | |||
input_len = Variable(input_len, requires_grad=False) | |||
model_outputs = model.forward(input, input_len) # [batch, N, 2] | |||
prediction = model_outputs["prediction"] | |||
if hps.save_label: | |||
pred_list.extend(model_outputs["pred_idx"].data.cpu().view(-1).tolist()) | |||
continue | |||
pred += prediction.sum() | |||
true += label.sum() | |||
match_true += ((prediction == label) & (prediction == 1)).sum() | |||
match += (prediction == label).sum() | |||
total_example_num += batch_size * N | |||
for j in range(batch_size): | |||
original_article_sents = batch_x["text"][j] | |||
sent_max_number = len(original_article_sents) | |||
refer = "\n".join(batch_x["summary"][j]) | |||
hyps = "\n".join(original_article_sents[id].replace("\n", "") for id in range(len(prediction[j])) if prediction[j][id]==1 and id < sent_max_number) | |||
if limited: | |||
k = len(refer.split()) | |||
hyps = " ".join(hyps.split()[:k]) | |||
logger.info((len(refer.split()),len(hyps.split()))) | |||
resfile.write(b"Original_article:") | |||
resfile.write("\n".join(batch_x["text"][j]).encode('utf-8')) | |||
resfile.write(b"\n") | |||
resfile.write(b"Reference:") | |||
if isinstance(refer, list): | |||
for ref in refer: | |||
resfile.write(ref.encode('utf-8')) | |||
resfile.write(b"\n") | |||
resfile.write(b'*' * 40) | |||
resfile.write(b"\n") | |||
else: | |||
resfile.write(refer.encode('utf-8')) | |||
resfile.write(b"\n") | |||
resfile.write(b"hypothesis:") | |||
resfile.write(hyps.encode('utf-8')) | |||
resfile.write(b"\n") | |||
if hps.use_pyrouge: | |||
pairs["hyps"].append(hyps) | |||
pairs["refer"].append(refer) | |||
else: | |||
try: | |||
scores = utils.rouge_all(hyps, refer) | |||
pairs["hyps"].append(hyps) | |||
pairs["refer"].append(refer) | |||
except ValueError: | |||
logger.error("Do not select any sentences!") | |||
logger.debug("sent_max_number:%d", sent_max_number) | |||
logger.debug(original_article_sents) | |||
logger.debug("label:") | |||
logger.debug(label[j]) | |||
continue | |||
# single example res writer | |||
res = "Rouge1:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores['rouge-1']['p'], scores['rouge-1']['r'], scores['rouge-1']['f']) \ | |||
+ "Rouge2:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores['rouge-2']['p'], scores['rouge-2']['r'], scores['rouge-2']['f']) \ | |||
+ "Rougel:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores['rouge-l']['p'], scores['rouge-l']['r'], scores['rouge-l']['f']) | |||
resfile.write(res.encode('utf-8')) | |||
resfile.write(b'-' * 89) | |||
resfile.write(b"\n") | |||
if hps.save_label: | |||
import json | |||
json.dump(pred_list, resfile) | |||
logger.info(' | end of test | time: {:5.2f}s | '.format((time.time() - iter_start_time))) | |||
return | |||
resfile.write(b"\n") | |||
resfile.write(b'=' * 89) | |||
resfile.write(b"\n") | |||
if hps.use_pyrouge: | |||
logger.info("The number of pairs is %d", len(pairs["hyps"])) | |||
if not len(pairs["hyps"]): | |||
logger.error("During testing, no hyps is selected!") | |||
return | |||
if isinstance(pairs["refer"][0], list): | |||
logger.info("Multi Reference summaries!") | |||
scores_all = utils.pyrouge_score_all_multi(pairs["hyps"], pairs["refer"]) | |||
else: | |||
scores_all = utils.pyrouge_score_all(pairs["hyps"], pairs["refer"]) | |||
else: | |||
logger.info("The number of pairs is %d", len(pairs["hyps"])) | |||
if not len(pairs["hyps"]): | |||
logger.error("During testing, no hyps is selected!") | |||
return | |||
rouge = Rouge() | |||
scores_all = rouge.get_scores(pairs["hyps"], pairs["refer"], avg=True) | |||
# the whole model res writer | |||
resfile.write(b"The total testset is:") | |||
res = "Rouge1:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores_all['rouge-1']['p'], scores_all['rouge-1']['r'], scores_all['rouge-1']['f']) \ | |||
+ "Rouge2:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores_all['rouge-2']['p'], scores_all['rouge-2']['r'], scores_all['rouge-2']['f']) \ | |||
+ "Rougel:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores_all['rouge-l']['p'], scores_all['rouge-l']['r'], scores_all['rouge-l']['f']) | |||
resfile.write(res.encode("utf-8")) | |||
logger.info(res) | |||
logger.info(' | end of test | time: {:5.2f}s | ' | |||
.format((time.time() - iter_start_time))) | |||
# label prediction | |||
logger.info("match_true %d, pred %d, true %d, total %d, match %d", match_true, pred, true, total_example_num, match) | |||
accu, precision, recall, F = utils.eval_label(match_true, pred, true, total_example_num, match) | |||
res = "The size of totalset is %d, accu is %f, precision is %f, recall is %f, F is %f" % (total_example_num / hps.doc_max_timesteps, accu, precision, recall, F) | |||
resfile.write(res.encode('utf-8')) | |||
logger.info(res) | |||
def main(): | |||
parser = argparse.ArgumentParser(description='Transformer Model') | |||
# Where to find data | |||
parser.add_argument('--data_path', type=str, default='/remote-home/dqwang/Datasets/CNNDM/train.label.jsonl', help='Path to the data file (JSON lines); the training set in train mode, the test set in test mode.') | |||
parser.add_argument('--valid_path', type=str, default='/remote-home/dqwang/Datasets/CNNDM/val.label.jsonl', help='Path to the validation data file (JSON lines).') | |||
parser.add_argument('--vocab_path', type=str, default='/remote-home/dqwang/Datasets/CNNDM/vocab', help='Path expression to text vocabulary file.') | |||
parser.add_argument('--embedding_path', type=str, default='/remote-home/dqwang/Glove/glove.42B.300d.txt', help='Path expression to external word embedding.') | |||
# Important settings | |||
parser.add_argument('--mode', type=str, default='train', help='must be one of train/test') | |||
parser.add_argument('--restore_model', type=str , default='None', help='Restore model for further training. [bestmodel/bestFmodel/earlystop/None]') | |||
parser.add_argument('--test_model', type=str, default='evalbestmodel', help='choose different model to test [evalbestmodel/evalbestFmodel/trainbestmodel/trainbestFmodel/earlystop]') | |||
parser.add_argument('--use_pyrouge', action='store_true', default=False, help='use_pyrouge') | |||
# Where to save output | |||
parser.add_argument('--save_root', type=str, default='save/', help='Root directory for all model.') | |||
parser.add_argument('--log_root', type=str, default='log/', help='Root directory for all logging.') | |||
# Hyperparameters | |||
parser.add_argument('--gpu', type=str, default='0', help='GPU ID to use. For cpu, set -1 [default: 0]') | |||
parser.add_argument('--cuda', action='store_true', default=False, help='use cuda') | |||
parser.add_argument('--vocab_size', type=int, default=100000, help='Size of vocabulary. These will be read from the vocabulary file in order. If the vocabulary file contains fewer words than this number, or if this number is set to 0, will take all words in the vocabulary file.') | |||
parser.add_argument('--n_epochs', type=int, default=20, help='Number of epochs [default: 20]') | |||
parser.add_argument('--batch_size', type=int, default=32, help='Mini batch size [default: 32]') | |||
parser.add_argument('--word_embedding', action='store_true', default=True, help='whether to use Word embedding') | |||
parser.add_argument('--word_emb_dim', type=int, default=300, help='Word embedding size [default: 300]') | |||
parser.add_argument('--embed_train', action='store_true', default=False, help='whether to train Word embedding [default: False]') | |||
parser.add_argument('--min_kernel_size', type=int, default=1, help='kernel min length for CNN [default:1]') | |||
parser.add_argument('--max_kernel_size', type=int, default=7, help='kernel max length for CNN [default:7]') | |||
parser.add_argument('--output_channel', type=int, default=50, help='output channel: repeated times for one kernel') | |||
parser.add_argument('--n_layers', type=int, default=12, help='Number of deeplstm layers') | |||
parser.add_argument('--hidden_size', type=int, default=512, help='hidden size [default: 512]') | |||
parser.add_argument('--ffn_inner_hidden_size', type=int, default=2048, help='PositionwiseFeedForward inner hidden size [default: 2048]') | |||
parser.add_argument('--n_head', type=int, default=8, help='multihead attention number [default: 8]') | |||
parser.add_argument('--recurrent_dropout_prob', type=float, default=0.1, help='recurrent dropout prob [default: 0.1]') | |||
parser.add_argument('--atten_dropout_prob', type=float, default=0.1,help='attention dropout prob [default: 0.1]') | |||
parser.add_argument('--ffn_dropout_prob', type=float, default=0.1, help='PositionwiseFeedForward dropout prob [default: 0.1]') | |||
parser.add_argument('--use_orthnormal_init', action='store_true', default=True, help='use orthonormal init for lstm [default: true]') | |||
parser.add_argument('--sent_max_len', type=int, default=100, help='max length of sentences (max source text sentence tokens)') | |||
parser.add_argument('--doc_max_timesteps', type=int, default=50, help='max length of documents (max timesteps of documents)') | |||
parser.add_argument('--save_label', action='store_true', default=False, help='in test mode, dump the predicted sentence indices as JSON instead of computing scores') | |||
# Training | |||
parser.add_argument('--lr', type=float, default=0.0001, help='learning rate') | |||
parser.add_argument('--lr_descent', action='store_true', default=False, help='learning rate descent') | |||
parser.add_argument('--warmup_steps', type=int, default=4000, help='warmup_steps') | |||
parser.add_argument('--grad_clip', action='store_true', default=False, help='for gradient clipping') | |||
parser.add_argument('--max_grad_norm', type=float, default=1.0, help='for gradient clipping max gradient normalization') | |||
parser.add_argument('-m', type=int, default=3, help='decode summary length') | |||
parser.add_argument('--limited', action='store_true', default=False, help='limited decode summary length') | |||
args = parser.parse_args() | |||
os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu | |||
torch.set_printoptions(threshold=50000) | |||
hps = args | |||
# File paths | |||
DATA_FILE = args.data_path | |||
VALID_FILE = args.valid_path | |||
VOCAL_FILE = args.vocab_path | |||
LOG_PATH = args.log_root | |||
# train_log setting | |||
if not os.path.exists(LOG_PATH): | |||
if hps.mode == "train": | |||
os.makedirs(LOG_PATH) | |||
else: | |||
logger.error("[Error] Logdir %s doesn't exist. Run in train mode to create it.", LOG_PATH) | |||
raise Exception("[Error] Logdir %s doesn't exist. Run in train mode to create it." % (LOG_PATH)) | |||
nowTime=datetime.datetime.now().strftime('%Y%m%d_%H%M%S') | |||
log_path = os.path.join(LOG_PATH, hps.mode + "_" + nowTime) | |||
file_handler = logging.FileHandler(log_path) | |||
file_handler.setFormatter(formatter) | |||
logger.addHandler(file_handler) | |||
logger.info("Pytorch %s", torch.__version__) | |||
logger.info(args) | |||
sum_loader = SummarizationLoader() | |||
if hps.mode == 'test': | |||
paths = {"test": DATA_FILE} | |||
hps.recurrent_dropout_prob = 0.0 | |||
hps.atten_dropout_prob = 0.0 | |||
hps.ffn_dropout_prob = 0.0 | |||
logger.info(hps) | |||
else: | |||
paths = {"train": DATA_FILE, "valid": VALID_FILE} | |||
dataInfo = sum_loader.process(paths=paths, vocab_size=hps.vocab_size, vocab_path=VOCAL_FILE, sent_max_len=hps.sent_max_len, doc_max_timesteps=hps.doc_max_timesteps, load_vocab=os.path.exists(VOCAL_FILE)) | |||
vocab = dataInfo.vocabs["vocab"] | |||
model = TransformerModel(hps, vocab) | |||
if len(hps.gpu) > 1: | |||
gpuid = hps.gpu.split(',') | |||
gpuid = [int(s) for s in gpuid] | |||
model = nn.DataParallel(model,device_ids=gpuid) | |||
logger.info("[INFO] Use Multi-gpu: %s", hps.gpu) | |||
if hps.cuda: | |||
model = model.cuda() | |||
logger.info("[INFO] Use cuda") | |||
if hps.mode == 'train': | |||
trainset = dataInfo.datasets["train"] | |||
train_sampler = BucketSampler(batch_size=hps.batch_size, seq_len_field_name=Const.INPUT) | |||
train_batch = DataSetIter(batch_size=hps.batch_size, dataset=trainset, sampler=train_sampler) | |||
validset = dataInfo.datasets["valid"] | |||
validset.set_input("text", "summary") | |||
valid_batch = DataSetIter(batch_size=hps.batch_size, dataset=validset) | |||
setup_training(model, train_batch, valid_batch, hps) | |||
elif hps.mode == 'test': | |||
logger.info("[INFO] Decoding...") | |||
testset = dataInfo.datasets["test"] | |||
testset.set_input("text", "summary") | |||
test_batch = DataSetIter(batch_size=hps.batch_size, dataset=testset) | |||
run_test(model, test_batch, hps, limited=hps.limited) | |||
else: | |||
logger.error("The 'mode' flag must be one of train/test") | |||
raise ValueError("The 'mode' flag must be one of train/test") | |||
if __name__ == '__main__': | |||
main() |
@@ -0,0 +1,705 @@ | |||
#!/usr/bin/python | |||
# -*- coding: utf-8 -*- | |||
# __author__="Danqing Wang" | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================== | |||
"""Train Model1: baseline model""" | |||
import os | |||
import sys | |||
import time | |||
import copy | |||
import pickle | |||
import datetime | |||
import argparse | |||
import logging | |||
import numpy as np | |||
import torch | |||
import torch.nn as nn | |||
from torch.autograd import Variable | |||
from rouge import Rouge | |||
sys.path.append('/remote-home/dqwang/FastNLP/fastNLP/') | |||
from fastNLP.core.batch import Batch | |||
from fastNLP.core.const import Const | |||
from fastNLP.io.model_io import ModelLoader, ModelSaver | |||
from fastNLP.core.sampler import BucketSampler | |||
from tools import utils | |||
from tools.logger import * | |||
from data.dataloader import SummarizationLoader | |||
from model.TransformerModel import TransformerModel | |||
def setup_training(model, train_loader, valid_loader, hps): | |||
"""Does setup before starting training (run_training)""" | |||
train_dir = os.path.join(hps.save_root, "train") | |||
if not os.path.exists(train_dir): os.makedirs(train_dir) | |||
if hps.restore_model != 'None': | |||
logger.info("[INFO] Restoring %s for training...", hps.restore_model) | |||
bestmodel_file = os.path.join(train_dir, hps.restore_model) | |||
loader = ModelLoader() | |||
loader.load_pytorch(model, bestmodel_file) | |||
else: | |||
logger.info("[INFO] Create new model for training...") | |||
try: | |||
run_training(model, train_loader, valid_loader, hps) # runs for up to hps.n_epochs epochs unless interrupted or stopped early | |||
except KeyboardInterrupt: | |||
logger.error("[Error] Caught keyboard interrupt on worker. Stopping supervisor...") | |||
save_file = os.path.join(train_dir, "earlystop.pkl") | |||
saver = ModelSaver(save_file) | |||
saver.save_pytorch(model) | |||
logger.info('[INFO] Saving early stop model to %s', save_file) | |||
def run_training(model, train_loader, valid_loader, hps): | |||
"""Repeatedly runs training iterations, logging loss to screen and writing summaries""" | |||
logger.info("[INFO] Starting run_training") | |||
train_dir = os.path.join(hps.save_root, "train") | |||
if not os.path.exists(train_dir): os.makedirs(train_dir) | |||
lr = hps.lr | |||
# optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr, betas=(0.9, 0.98), | |||
# eps=1e-09) | |||
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr) | |||
criterion = torch.nn.CrossEntropyLoss(reduction='none') | |||
best_train_loss = None | |||
best_train_F= None | |||
best_loss = None | |||
best_F = None | |||
step_num = 0 | |||
non_descent_cnt = 0 | |||
for epoch in range(1, hps.n_epochs + 1): | |||
epoch_loss = 0.0 | |||
train_loss = 0.0 | |||
total_example_num = 0 | |||
match, pred, true, match_true = 0.0, 0.0, 0.0, 0.0 | |||
epoch_start_time = time.time() | |||
for i, (batch_x, batch_y) in enumerate(train_loader): | |||
# if i > 10: | |||
# break | |||
model.train() | |||
iter_start_time=time.time() | |||
input, input_len = batch_x[Const.INPUT], batch_x[Const.INPUT_LEN] | |||
label = batch_y[Const.TARGET] | |||
# logger.info(batch_x["text"][0]) | |||
# logger.info(input[0,:,:]) | |||
# logger.info(input_len[0:5,:]) | |||
# logger.info(batch_y["summary"][0:5]) | |||
# logger.info(label[0:5,:]) | |||
# logger.info((len(batch_x["text"][0]), sum(input[0].sum(-1) != 0))) | |||
batch_size, N, seq_len = input.size() | |||
if hps.cuda: | |||
input = input.cuda() # [batch, N, seq_len] | |||
label = label.cuda() | |||
input_len = input_len.cuda() | |||
input = Variable(input) | |||
label = Variable(label) | |||
input_len = Variable(input_len) | |||
model_outputs = model.forward(input, input_len) # [batch, N, 2] | |||
outputs = model_outputs[Const.OUTPUT].view(-1, 2) | |||
label = label.view(-1) | |||
loss = criterion(outputs, label) # [batch_size, doc_max_timesteps] | |||
input_len = input_len.float().view(-1) | |||
loss = loss * input_len | |||
loss = loss.view(batch_size, -1) | |||
loss = loss.sum(1).mean() | |||
if not torch.isfinite(loss).all():  # tensor-native check; also works when the loss is on the GPU | |||
logger.error("train Loss is not finite. Stopping.") | |||
logger.info(loss) | |||
for name, param in model.named_parameters(): | |||
if param.requires_grad: | |||
logger.info(name) | |||
logger.info(param.grad.data.sum()) | |||
raise Exception("train Loss is not finite. Stopping.") | |||
optimizer.zero_grad() | |||
loss.backward() | |||
if hps.grad_clip: | |||
torch.nn.utils.clip_grad_norm_(model.parameters(), hps.max_grad_norm) | |||
optimizer.step() | |||
step_num += 1 | |||
train_loss += float(loss.data) | |||
epoch_loss += float(loss.data) | |||
if i % 100 == 0: | |||
# start debugger | |||
# import pdb; pdb.set_trace() | |||
for name, param in model.named_parameters(): | |||
if param.requires_grad: | |||
logger.debug(name) | |||
logger.debug(param.grad.data.sum()) | |||
logger.info(' | end of iter {:3d} | time: {:5.2f}s | train loss {:5.4f} | ' | |||
.format(i, (time.time() - iter_start_time), | |||
float(train_loss / 100))) | |||
train_loss = 0.0 | |||
# calculate the precision, recall and F | |||
prediction = outputs.max(1)[1] | |||
prediction = prediction.data | |||
label = label.data | |||
pred += prediction.sum() | |||
true += label.sum() | |||
match_true += ((prediction == label) & (prediction == 1)).sum() | |||
match += (prediction == label).sum() | |||
total_example_num += int(batch_size * N) | |||
if hps.lr_descent: | |||
# new_lr = pow(hps.hidden_size, -0.5) * min(pow(step_num, -0.5), | |||
# step_num * pow(hps.warmup_steps, -1.5)) | |||
new_lr = max(5e-6, lr / (epoch + 1)) | |||
for param_group in list(optimizer.param_groups): | |||
param_group['lr'] = new_lr | |||
logger.info("[INFO] The learning rate now is %f", new_lr) | |||
epoch_avg_loss = epoch_loss / len(train_loader) | |||
logger.info(' | end of epoch {:3d} | time: {:5.2f}s | epoch train loss {:5.4f} | ' | |||
.format(epoch, (time.time() - epoch_start_time), | |||
float(epoch_avg_loss))) | |||
logger.info("[INFO] Trainset match_true %d, pred %d, true %d, total %d, match %d", match_true, pred, true, total_example_num, match) | |||
accu, precision, recall, F = utils.eval_label(match_true, pred, true, total_example_num, match) | |||
logger.info("[INFO] The size of totalset is %d, accu is %f, precision is %f, recall is %f, F is %f", total_example_num / hps.doc_max_timesteps, accu, precision, recall, F) | |||
if not best_train_loss or epoch_avg_loss < best_train_loss: | |||
save_file = os.path.join(train_dir, "bestmodel.pkl") | |||
logger.info('[INFO] Found new best model with %.3f running_train_loss. Saving to %s', float(epoch_avg_loss), save_file) | |||
saver = ModelSaver(save_file) | |||
saver.save_pytorch(model) | |||
best_train_loss = epoch_avg_loss | |||
elif epoch_avg_loss > best_train_loss: | |||
logger.error("[Error] training loss did not decrease. Stopping supervisor...") | |||
save_file = os.path.join(train_dir, "earlystop.pkl") | |||
saver = ModelSaver(save_file) | |||
saver.save_pytorch(model) | |||
logger.info('[INFO] Saving early stop model to %s', save_file) | |||
return | |||
if not best_train_F or F > best_train_F: | |||
save_file = os.path.join(train_dir, "bestFmodel.pkl") | |||
logger.info('[INFO] Found new best model with %.3f F score. Saving to %s', float(F), save_file) | |||
saver = ModelSaver(save_file) | |||
saver.save_pytorch(model) | |||
best_train_F = F | |||
best_loss, best_F, non_descent_cnt = run_eval(model, valid_loader, hps, best_loss, best_F, non_descent_cnt) | |||
if non_descent_cnt >= 3: | |||
logger.error("[Error] val loss did not decrease three times in a row. Stopping supervisor...") | |||
save_file = os.path.join(train_dir, "earlystop.pkl") | |||
saver = ModelSaver(save_file) | |||
saver.save_pytorch(model) | |||
logger.info('[INFO] Saving early stop model to %s', save_file) | |||
return | |||
def run_eval(model, loader, hps, best_loss, best_F, non_descent_cnt): | |||
"""Repeatedly runs eval iterations, logging to screen and writing summaries. Saves the model with the best loss seen so far.""" | |||
logger.info("[INFO] Starting eval for this model ...") | |||
eval_dir = os.path.join(hps.save_root, "eval") # make a subdir of the root dir for eval data | |||
if not os.path.exists(eval_dir): os.makedirs(eval_dir) | |||
model.eval() | |||
running_loss = 0.0 | |||
match, pred, true, match_true = 0.0, 0.0, 0.0, 0.0 | |||
pairs = {} | |||
pairs["hyps"] = [] | |||
pairs["refer"] = [] | |||
total_example_num = 0 | |||
criterion = torch.nn.CrossEntropyLoss(reduction='none') | |||
iter_start_time = time.time() | |||
with torch.no_grad(): | |||
for i, (batch_x, batch_y) in enumerate(loader): | |||
# if i > 10: | |||
# break | |||
input, input_len = batch_x[Const.INPUT], batch_x[Const.INPUT_LEN] | |||
label = batch_y[Const.TARGET] | |||
if hps.cuda: | |||
input = input.cuda() # [batch, N, seq_len] | |||
label = label.cuda() | |||
input_len = input_len.cuda() | |||
batch_size, N, _ = input.size() | |||
input = Variable(input, requires_grad=False) | |||
label = Variable(label) | |||
input_len = Variable(input_len, requires_grad=False) | |||
model_outputs = model.forward(input,input_len) # [batch, N, 2] | |||
outputs = model_outputs[Const.OUTPUT] | |||
prediction = model_outputs["prediction"] | |||
outputs = outputs.view(-1, 2) # [batch * N, 2] | |||
label = label.view(-1) # [batch * N] | |||
loss = criterion(outputs, label) | |||
input_len = input_len.float().view(-1) | |||
loss = loss * input_len | |||
loss = loss.view(batch_size, -1) | |||
loss = loss.sum(1).mean() | |||
running_loss += float(loss.data) | |||
label = label.data | |||
pred += prediction.sum() | |||
true += label.sum() | |||
match_true += ((prediction == label) & (prediction == 1)).sum() | |||
match += (prediction == label).sum() | |||
total_example_num += batch_size * N | |||
# rouge | |||
prediction = prediction.view(batch_size, -1) | |||
for j in range(batch_size): | |||
original_article_sents = batch_x["text"][j] | |||
sent_max_number = len(original_article_sents) | |||
refer = "\n".join(batch_x["summary"][j]) | |||
hyps = "\n".join(original_article_sents[id] for id in range(len(prediction[j])) if prediction[j][id]==1 and id < sent_max_number) | |||
if sent_max_number < hps.m and len(hyps) <= 1: | |||
logger.error("sent_max_number is too short %d, Skip!" , sent_max_number) | |||
continue | |||
if len(hyps) >= 1 and hyps != '.': | |||
# logger.debug(prediction[j]) | |||
pairs["hyps"].append(hyps) | |||
pairs["refer"].append(refer) | |||
elif refer == "." or refer == "": | |||
logger.error("Refer is None!") | |||
logger.debug("label:") | |||
logger.debug(label[j]) | |||
logger.debug(refer) | |||
elif hyps == "." or hyps == "": | |||
logger.error("hyps is None!") | |||
logger.debug("sent_max_number:%d", sent_max_number) | |||
logger.debug("prediction:") | |||
logger.debug(prediction[j]) | |||
logger.debug(hyps) | |||
else: | |||
logger.error("Do not select any sentences!") | |||
logger.debug("sent_max_number:%d", sent_max_number) | |||
logger.debug(original_article_sents) | |||
logger.debug("label:") | |||
logger.debug(label[j]) | |||
continue | |||
running_avg_loss = running_loss / len(loader) | |||
if hps.use_pyrouge: | |||
logger.info("The number of pairs is %d", len(pairs["hyps"])) | |||
logging.getLogger('global').setLevel(logging.WARNING) | |||
if not len(pairs["hyps"]): | |||
logger.error("During validation, no hypotheses were selected!") | |||
return best_loss, best_F, non_descent_cnt | |||
if isinstance(pairs["refer"][0], list): | |||
logger.info("Multi Reference summaries!") | |||
scores_all = utils.pyrouge_score_all_multi(pairs["hyps"], pairs["refer"]) | |||
else: | |||
scores_all = utils.pyrouge_score_all(pairs["hyps"], pairs["refer"]) | |||
else: | |||
if len(pairs["hyps"]) == 0 or len(pairs["refer"]) == 0 : | |||
logger.error("During testing, no hyps is selected!") | |||
return | |||
rouge = Rouge() | |||
scores_all = rouge.get_scores(pairs["hyps"], pairs["refer"], avg=True) | |||
# try: | |||
# scores_all = rouge.get_scores(pairs["hyps"], pairs["refer"], avg=True) | |||
# except ValueError as e: | |||
# logger.error(repr(e)) | |||
# scores_all = [] | |||
# for idx in range(len(pairs["hyps"])): | |||
# try: | |||
# scores = rouge.get_scores(pairs["hyps"][idx], pairs["refer"][idx])[0] | |||
# scores_all.append(scores) | |||
# except ValueError as e: | |||
# logger.error(repr(e)) | |||
# logger.debug("HYPS:\t%s", pairs["hyps"][idx]) | |||
# logger.debug("REFER:\t%s", pairs["refer"][idx]) | |||
# finally: | |||
# logger.error("During testing, some errors happen!") | |||
# logger.error(len(scores_all)) | |||
# exit(1) | |||
logger.info('[INFO] End of valid | time: {:5.2f}s | valid loss {:5.4f} | ' | |||
.format((time.time() - iter_start_time), | |||
float(running_avg_loss))) | |||
logger.info("[INFO] Validset match_true %d, pred %d, true %d, total %d, match %d", match_true, pred, true, total_example_num, match) | |||
accu, precision, recall, F = utils.eval_label(match_true, pred, true, total_example_num, match) | |||
logger.info("[INFO] The size of totalset is %d, accu is %f, precision is %f, recall is %f, F is %f", | |||
total_example_num / hps.doc_max_timesteps, accu, precision, recall, F) | |||
res = "Rouge1:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores_all['rouge-1']['p'], scores_all['rouge-1']['r'], scores_all['rouge-1']['f']) \ | |||
+ "Rouge2:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores_all['rouge-2']['p'], scores_all['rouge-2']['r'], scores_all['rouge-2']['f']) \ | |||
+ "Rougel:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores_all['rouge-l']['p'], scores_all['rouge-l']['r'], scores_all['rouge-l']['f']) | |||
logger.info(res) | |||
# If running_avg_loss is best so far, save this checkpoint (early stopping). | |||
# The checkpoint is written to bestmodel.pkl in the eval dir, overwriting the previous best | |||
if best_loss is None or running_avg_loss < best_loss: | |||
bestmodel_save_path = os.path.join(eval_dir, 'bestmodel.pkl') # this is where checkpoints of best models are saved | |||
if best_loss is not None: | |||
logger.info('[INFO] Found new best model with %.6f running_avg_loss. The original loss is %.6f, Saving to %s', float(running_avg_loss), float(best_loss), bestmodel_save_path) | |||
else: | |||
logger.info('[INFO] Found new best model with %.6f running_avg_loss. The original loss is None, Saving to %s', float(running_avg_loss), bestmodel_save_path) | |||
saver = ModelSaver(bestmodel_save_path) | |||
saver.save_pytorch(model) | |||
best_loss = running_avg_loss | |||
non_descent_cnt = 0 | |||
else: | |||
non_descent_cnt += 1 | |||
if best_F is None or best_F < F: | |||
bestmodel_save_path = os.path.join(eval_dir, 'bestFmodel.pkl') # this is where checkpoints of best models are saved | |||
if best_F is not None: | |||
logger.info('[INFO] Found new best model with %.6f F. The original F is %.6f, Saving to %s', float(F), float(best_F), bestmodel_save_path) | |||
else: | |||
logger.info('[INFO] Found new best model with %.6f F. The original F is None, Saving to %s', float(F), bestmodel_save_path) | |||
saver = ModelSaver(bestmodel_save_path) | |||
saver.save_pytorch(model) | |||
best_F = F | |||
return best_loss, best_F, non_descent_cnt | |||
def run_test(model, loader, hps, limited=False): | |||
"""Restore the checkpoint selected by hps.test_model, run it on the test set, and write the results (predicted labels, or ROUGE and label metrics) under save_root/test.""" | |||
test_dir = os.path.join(hps.save_root, "test") # make a subdir of the root dir for eval data | |||
eval_dir = os.path.join(hps.save_root, "eval") | |||
if not os.path.exists(test_dir) : os.makedirs(test_dir) | |||
if not os.path.exists(eval_dir) : | |||
logger.error("[Error] eval_dir %s doesn't exist. Run in train mode to create it.", eval_dir) | |||
raise Exception("[Error] eval_dir %s doesn't exist. Run in train mode to create it." % (eval_dir)) | |||
if hps.test_model == "evalbestmodel": | |||
bestmodel_load_path = os.path.join(eval_dir, 'bestmodel.pkl') # this is where checkpoints of best models are saved | |||
elif hps.test_model == "evalbestFmodel": | |||
bestmodel_load_path = os.path.join(eval_dir, 'bestFmodel.pkl') | |||
elif hps.test_model == "trainbestmodel": | |||
train_dir = os.path.join(hps.save_root, "train") | |||
bestmodel_load_path = os.path.join(train_dir, 'bestmodel.pkl') | |||
elif hps.test_model == "trainbestFmodel": | |||
train_dir = os.path.join(hps.save_root, "train") | |||
bestmodel_load_path = os.path.join(train_dir, 'bestFmodel.pkl') | |||
elif hps.test_model == "earlystop": | |||
train_dir = os.path.join(hps.save_root, "train") | |||
bestmodel_load_path = os.path.join(train_dir, 'earlystop.pkl') | |||
else: | |||
logger.error("No such model! Must be one of evalbestmodel/evalbestFmodel/trainbestmodel/trainbestFmodel/earlystop") | |||
raise ValueError("No such model! Must be one of evalbestmodel/evalbestFmodel/trainbestmodel/trainbestFmodel/earlystop") | |||
logger.info("[INFO] Restoring %s for testing...The path is %s", hps.test_model, bestmodel_load_path) | |||
modelloader = ModelLoader() | |||
modelloader.load_pytorch(model, bestmodel_load_path) | |||
import datetime | |||
nowTime=datetime.datetime.now().strftime('%Y%m%d_%H%M%S')  # current timestamp | |||
if hps.save_label: | |||
log_dir = os.path.join(test_dir, hps.data_path.split("/")[-1]) | |||
resfile = open(log_dir, "w") | |||
else: | |||
log_dir = os.path.join(test_dir, nowTime) | |||
resfile = open(log_dir, "wb") | |||
logger.info("[INFO] Write the Evaluation into %s", log_dir) | |||
model.eval() | |||
match, pred, true, match_true = 0.0, 0.0, 0.0, 0.0 | |||
total_example_num = 0.0 | |||
pairs = {} | |||
pairs["hyps"] = [] | |||
pairs["refer"] = [] | |||
pred_list = [] | |||
iter_start_time=time.time() | |||
with torch.no_grad(): | |||
for i, (batch_x, batch_y) in enumerate(loader): | |||
input, input_len = batch_x[Const.INPUT], batch_x[Const.INPUT_LEN] | |||
label = batch_y[Const.TARGET] | |||
if hps.cuda: | |||
input = input.cuda() # [batch, N, seq_len] | |||
label = label.cuda() | |||
input_len = input_len.cuda() | |||
batch_size, N, _ = input.size() | |||
input = Variable(input) | |||
input_len = Variable(input_len, requires_grad=False) | |||
model_outputs = model.forward(input, input_len) # [batch, N, 2] | |||
prediction = model_outputs["pred"] | |||
if hps.save_label: | |||
pred_list.extend(model_outputs["pred_idx"].data.cpu().view(-1).tolist()) | |||
continue | |||
pred += prediction.sum() | |||
true += label.sum() | |||
match_true += ((prediction == label) & (prediction == 1)).sum() | |||
match += (prediction == label).sum() | |||
total_example_num += batch_size * N | |||
for j in range(batch_size): | |||
original_article_sents = batch_x["text"][j] | |||
sent_max_number = len(original_article_sents) | |||
refer = "\n".join(batch_x["summary"][j]) | |||
hyps = "\n".join(original_article_sents[id].replace("\n", "") for id in range(len(prediction[j])) if prediction[j][id]==1 and id < sent_max_number) | |||
if limited: | |||
k = len(refer.split()) | |||
hyps = " ".join(hyps.split()[:k]) | |||
logger.info((len(refer.split()),len(hyps.split()))) | |||
resfile.write(b"Original_article:") | |||
resfile.write("\n".join(batch_x["text"][j]).encode('utf-8')) | |||
resfile.write(b"\n") | |||
resfile.write(b"Reference:") | |||
if isinstance(refer, list): | |||
for ref in refer: | |||
resfile.write(ref.encode('utf-8')) | |||
resfile.write(b"\n") | |||
resfile.write(b'*' * 40) | |||
resfile.write(b"\n") | |||
else: | |||
resfile.write(refer.encode('utf-8')) | |||
resfile.write(b"\n") | |||
resfile.write(b"hypothesis:") | |||
resfile.write(hyps.encode('utf-8')) | |||
resfile.write(b"\n") | |||
if hps.use_pyrouge: | |||
pairs["hyps"].append(hyps) | |||
pairs["refer"].append(refer) | |||
else: | |||
try: | |||
scores = utils.rouge_all(hyps, refer) | |||
pairs["hyps"].append(hyps) | |||
pairs["refer"].append(refer) | |||
except ValueError: | |||
logger.error("Do not select any sentences!") | |||
logger.debug("sent_max_number:%d", sent_max_number) | |||
logger.debug(original_article_sents) | |||
logger.debug("label:") | |||
logger.debug(label[j]) | |||
continue | |||
# single example res writer | |||
res = "Rouge1:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores['rouge-1']['p'], scores['rouge-1']['r'], scores['rouge-1']['f']) \ | |||
+ "Rouge2:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores['rouge-2']['p'], scores['rouge-2']['r'], scores['rouge-2']['f']) \ | |||
+ "Rougel:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores['rouge-l']['p'], scores['rouge-l']['r'], scores['rouge-l']['f']) | |||
resfile.write(res.encode('utf-8')) | |||
resfile.write(b'-' * 89) | |||
resfile.write(b"\n") | |||
if hps.save_label: | |||
import json | |||
json.dump(pred_list, resfile) | |||
logger.info(' | end of test | time: {:5.2f}s | '.format((time.time() - iter_start_time))) | |||
return | |||
resfile.write(b"\n") | |||
resfile.write(b'=' * 89) | |||
resfile.write(b"\n") | |||
if hps.use_pyrouge: | |||
logger.info("The number of pairs is %d", len(pairs["hyps"])) | |||
if not len(pairs["hyps"]): | |||
logger.error("During testing, no hyps is selected!") | |||
return | |||
if isinstance(pairs["refer"][0], list): | |||
logger.info("Multi Reference summaries!") | |||
scores_all = utils.pyrouge_score_all_multi(pairs["hyps"], pairs["refer"]) | |||
else: | |||
scores_all = utils.pyrouge_score_all(pairs["hyps"], pairs["refer"]) | |||
else: | |||
logger.info("The number of pairs is %d", len(pairs["hyps"])) | |||
if not len(pairs["hyps"]): | |||
logger.error("During testing, no hyps is selected!") | |||
return | |||
rouge = Rouge() | |||
scores_all = rouge.get_scores(pairs["hyps"], pairs["refer"], avg=True) | |||
# the whole model res writer | |||
resfile.write(b"The total testset is:") | |||
res = "Rouge1:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores_all['rouge-1']['p'], scores_all['rouge-1']['r'], scores_all['rouge-1']['f']) \ | |||
+ "Rouge2:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores_all['rouge-2']['p'], scores_all['rouge-2']['r'], scores_all['rouge-2']['f']) \ | |||
+ "Rougel:\n\tp:%.6f, r:%.6f, f:%.6f\n" % (scores_all['rouge-l']['p'], scores_all['rouge-l']['r'], scores_all['rouge-l']['f']) | |||
resfile.write(res.encode("utf-8")) | |||
logger.info(res) | |||
logger.info(' | end of test | time: {:5.2f}s | ' | |||
.format((time.time() - iter_start_time))) | |||
# label prediction | |||
logger.info("match_true %d, pred %d, true %d, total %d, match %d", match_true, pred, true, total_example_num, match) | |||
accu, precision, recall, F = utils.eval_label(match_true, pred, true, total_example_num, match) | |||
res = "The size of totalset is %d, accu is %f, precision is %f, recall is %f, F is %f" % (total_example_num / hps.doc_max_timesteps, accu, precision, recall, F) | |||
resfile.write(res.encode('utf-8')) | |||
logger.info(res) | |||
def main(): | |||
parser = argparse.ArgumentParser(description='Transformer Model') | |||
# Where to find data | |||
parser.add_argument('--data_path', type=str, default='/remote-home/dqwang/Datasets/CNNDM/train.label.jsonl', help='Path expression to pickle datafiles.') | |||
parser.add_argument('--valid_path', type=str, default='/remote-home/dqwang/Datasets/CNNDM/val.label.jsonl', help='Path expression to pickle valid datafiles.') | |||
parser.add_argument('--vocab_path', type=str, default='/remote-home/dqwang/Datasets/CNNDM/vocab', help='Path expression to text vocabulary file.') | |||
parser.add_argument('--embedding_path', type=str, default='/remote-home/dqwang/Glove/glove.42B.300d.txt', help='Path expression to external word embedding.') | |||
# Important settings | |||
parser.add_argument('--mode', type=str, default='train', help='must be one of train/test') | |||
parser.add_argument('--restore_model', type=str , default='None', help='Restore model for further training. [bestmodel/bestFmodel/earlystop/None]') | |||
parser.add_argument('--test_model', type=str, default='evalbestmodel', help='choose different model to test [evalbestmodel/evalbestFmodel/trainbestmodel/trainbestFmodel/earlystop]') | |||
parser.add_argument('--use_pyrouge', action='store_true', default=False, help='use_pyrouge') | |||
# Where to save output | |||
parser.add_argument('--save_root', type=str, default='save/', help='Root directory for all model.') | |||
parser.add_argument('--log_root', type=str, default='log/', help='Root directory for all logging.') | |||
# Hyperparameters | |||
parser.add_argument('--gpu', type=str, default='0', help='GPU ID to use. For cpu, set -1 [default: 0]') | |||
parser.add_argument('--cuda', action='store_true', default=False, help='use cuda') | |||
parser.add_argument('--vocab_size', type=int, default=100000, help='Size of vocabulary. These will be read from the vocabulary file in order. If the vocabulary file contains fewer words than this number, or if this number is set to 0, will take all words in the vocabulary file.') | |||
parser.add_argument('--n_epochs', type=int, default=20, help='Number of epochs [default: 20]') | |||
parser.add_argument('--batch_size', type=int, default=32, help='Mini batch size [default: 32]') | |||
parser.add_argument('--word_embedding', action='store_true', default=True, help='whether to use Word embedding') | |||
parser.add_argument('--word_emb_dim', type=int, default=300, help='Word embedding size [default: 300]') | |||
parser.add_argument('--embed_train', action='store_true', default=False, help='whether to train Word embedding [default: False]') | |||
parser.add_argument('--min_kernel_size', type=int, default=1, help='kernel min length for CNN [default:1]') | |||
parser.add_argument('--max_kernel_size', type=int, default=7, help='kernel max length for CNN [default:7]') | |||
parser.add_argument('--output_channel', type=int, default=50, help='output channel: repeated times for one kernel') | |||
parser.add_argument('--n_layers', type=int, default=12, help='Number of deeplstm layers') | |||
parser.add_argument('--hidden_size', type=int, default=512, help='hidden size [default: 512]') | |||
parser.add_argument('--ffn_inner_hidden_size', type=int, default=2048, help='PositionwiseFeedForward inner hidden size [default: 2048]') | |||
parser.add_argument('--n_head', type=int, default=8, help='multihead attention number [default: 8]') | |||
parser.add_argument('--recurrent_dropout_prob', type=float, default=0.1, help='recurrent dropout prob [default: 0.1]') | |||
parser.add_argument('--atten_dropout_prob', type=float, default=0.1,help='attention dropout prob [default: 0.1]') | |||
parser.add_argument('--ffn_dropout_prob', type=float, default=0.1, help='PositionwiseFeedForward dropout prob [default: 0.1]') | |||
parser.add_argument('--use_orthnormal_init', action='store_true', default=True, help='use orthnormal init for lstm [default: true]') | |||
parser.add_argument('--sent_max_len', type=int, default=100, help='max length of sentences (max source text sentence tokens)') | |||
parser.add_argument('--doc_max_timesteps', type=int, default=50, help='max length of documents (max timesteps of documents)') | |||
parser.add_argument('--save_label', action='store_true', default=False, help='in test mode, save the predicted labels as JSON instead of computing ROUGE') | |||
# Training | |||
parser.add_argument('--lr', type=float, default=0.0001, help='learning rate') | |||
parser.add_argument('--lr_descent', action='store_true', default=False, help='learning rate descent') | |||
parser.add_argument('--warmup_steps', type=int, default=4000, help='warmup_steps') | |||
parser.add_argument('--grad_clip', action='store_true', default=False, help='for gradient clipping') | |||
parser.add_argument('--max_grad_norm', type=float, default=1.0, help='for gradient clipping max gradient normalization') | |||
parser.add_argument('-m', type=int, default=3, help='decode summary length') | |||
parser.add_argument('--limited', action='store_true', default=False, help='limited decode summary length') | |||
args = parser.parse_args() | |||
os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu | |||
torch.set_printoptions(threshold=50000) | |||
hps = args | |||
# File paths | |||
DATA_FILE = args.data_path | |||
VALID_FILE = args.valid_path | |||
VOCAL_FILE = args.vocab_path | |||
LOG_PATH = args.log_root | |||
# train_log setting | |||
if not os.path.exists(LOG_PATH): | |||
if hps.mode == "train": | |||
os.makedirs(LOG_PATH) | |||
else: | |||
logger.error("[Error] Logdir %s doesn't exist. Run in train mode to create it.", LOG_PATH) | |||
raise Exception("[Error] Logdir %s doesn't exist. Run in train mode to create it." % (LOG_PATH)) | |||
nowTime=datetime.datetime.now().strftime('%Y%m%d_%H%M%S') | |||
log_path = os.path.join(LOG_PATH, hps.mode + "_" + nowTime) | |||
file_handler = logging.FileHandler(log_path) | |||
file_handler.setFormatter(formatter) | |||
logger.addHandler(file_handler) | |||
logger.info("Pytorch %s", torch.__version__) | |||
logger.info(args) | |||
sum_loader = SummarizationLoader() | |||
if hps.mode == 'test': | |||
paths = {"test": DATA_FILE} | |||
hps.recurrent_dropout_prob = 0.0 | |||
hps.atten_dropout_prob = 0.0 | |||
hps.ffn_dropout_prob = 0.0 | |||
logger.info(hps) | |||
else: | |||
paths = {"train": DATA_FILE, "valid": VALID_FILE} | |||
dataInfo = sum_loader.process(paths=paths, vocab_size=hps.vocab_size, vocab_path=VOCAL_FILE, sent_max_len=hps.sent_max_len, doc_max_timesteps=hps.doc_max_timesteps, load_vocab=os.path.exists(VOCAL_FILE)) | |||
vocab = dataInfo.vocabs["vocab"] | |||
model = TransformerModel(hps, vocab) | |||
if len(hps.gpu) > 1: | |||
gpuid = hps.gpu.split(',') | |||
gpuid = [int(s) for s in gpuid] | |||
model = nn.DataParallel(model,device_ids=gpuid) | |||
logger.info("[INFO] Use Multi-gpu: %s", hps.gpu) | |||
if hps.cuda: | |||
model = model.cuda() | |||
logger.info("[INFO] Use cuda") | |||
if hps.mode == 'train': | |||
trainset = dataInfo.datasets["train"] | |||
train_sampler = BucketSampler(batch_size=hps.batch_size, seq_len_field_name=Const.INPUT) | |||
train_batch = Batch(batch_size=hps.batch_size, dataset=trainset, sampler=train_sampler) | |||
validset = dataInfo.datasets["valid"] | |||
validset.set_input("text", "summary") | |||
valid_batch = Batch(batch_size=hps.batch_size, dataset=validset) | |||
setup_training(model, train_batch, valid_batch, hps) | |||
elif hps.mode == 'test': | |||
logger.info("[INFO] Decoding...") | |||
testset = dataInfo.datasets["test"] | |||
testset.set_input("text", "summary") | |||
test_batch = Batch(batch_size=hps.batch_size, dataset=testset) | |||
run_test(model, test_batch, hps, limited=hps.limited) | |||
else: | |||
logger.error("The 'mode' flag must be one of train/test") | |||
raise ValueError("The 'mode' flag must be one of train/test") | |||
if __name__ == '__main__': | |||
main() |
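
The label-level bookkeeping in `run_eval` and `run_test` (the `pred`, `true`, `match_true`, and `match` counters, plus the newline-joined hypothesis) can be illustrated on plain Python lists. The sketch below is not the repository's `utils.eval_label`, whose body is not part of this diff; `label_prf` is a hypothetical stand-in that assumes the usual definitions of accuracy, precision, recall, and F1.

```python
def label_prf(match_true, pred, true, total, match):
    """Hypothetical stand-in for utils.eval_label (assumed standard definitions)."""
    accu = match / total if total else 0.0
    precision = match_true / pred if pred else 0.0
    recall = match_true / true if true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accu, precision, recall, f1


# Toy batch: 2 documents x 4 sentences; 1 means "sentence selected for the summary".
prediction = [[1, 0, 1, 0], [0, 1, 0, 0]]
label = [[1, 0, 0, 0], [0, 1, 1, 0]]
text = [["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"]]

pred = true = match_true = match = total = 0
hyps = []
for sents, p_row, l_row in zip(text, prediction, label):
    pred += sum(p_row)                      # sentences the model selected
    true += sum(l_row)                      # sentences the gold label selected
    match_true += sum(1 for p, l in zip(p_row, l_row) if p == l == 1)
    match += sum(1 for p, l in zip(p_row, l_row) if p == l)
    total += len(p_row)
    # The hypothesis is the newline-joined selected sentences, as in the script.
    hyps.append("\n".join(s for s, p in zip(sents, p_row) if p == 1))

accu, precision, recall, f1 = label_prf(match_true, pred, true, total, match)
print(hyps)
print(accu, precision, recall, f1)
```

Guarding each division against a zero denominator keeps the sketch usable on batches where nothing is selected; whether the real helper does the same is an assumption.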
@@ -0,0 +1,103 @@ | |||
""" Manage beam search info structure. | |||
Heavily borrowed from OpenNMT-py. | |||
For code in OpenNMT-py, please check the following link: | |||
https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/Beam.py | |||
""" | |||
import torch | |||
import numpy as np | |||
import transformer.Constants as Constants | |||
class Beam(): | |||
''' Beam search ''' | |||
def __init__(self, size, device=None): | |||
self.size = size | |||
self._done = False | |||
# The score for each translation on the beam. | |||
self.scores = torch.zeros((size,), dtype=torch.float, device=device) | |||
self.all_scores = [] | |||
# The backpointers at each time-step. | |||
self.prev_ks = [] | |||
# The outputs at each time-step. | |||
self.next_ys = [torch.full((size,), Constants.PAD, dtype=torch.long, device=device)] | |||
self.next_ys[0][0] = Constants.BOS | |||
def get_current_state(self): | |||
"Get the outputs for the current timestep." | |||
return self.get_tentative_hypothesis() | |||
def get_current_origin(self): | |||
"Get the backpointers for the current timestep." | |||
return self.prev_ks[-1] | |||
@property | |||
def done(self): | |||
return self._done | |||
def advance(self, word_prob): | |||
"Update beam status and check if finished or not." | |||
num_words = word_prob.size(1) | |||
# Sum the previous scores. | |||
if len(self.prev_ks) > 0: | |||
beam_lk = word_prob + self.scores.unsqueeze(1).expand_as(word_prob) | |||
else: | |||
beam_lk = word_prob[0] | |||
flat_beam_lk = beam_lk.view(-1) | |||
best_scores, best_scores_id = flat_beam_lk.topk(self.size, 0, True, True) | |||
self.all_scores.append(self.scores) | |||
self.scores = best_scores | |||
# best_scores_id indexes a flattened (beam x word) array, | |||
# so we need to calculate which word and beam each score came from. | |||
# Floor division keeps the backpointers as integer indices. | |||
prev_k = best_scores_id // num_words | |||
self.prev_ks.append(prev_k) | |||
self.next_ys.append(best_scores_id - prev_k * num_words) | |||
# End condition is when top-of-beam is EOS. | |||
if self.next_ys[-1][0].item() == Constants.EOS: | |||
self._done = True | |||
self.all_scores.append(self.scores) | |||
return self._done | |||
def sort_scores(self): | |||
"Sort the scores." | |||
return torch.sort(self.scores, 0, True) | |||
def get_the_best_score_and_idx(self): | |||
"Get the score of the best in the beam." | |||
scores, ids = self.sort_scores() | |||
return scores[0], ids[0] | |||
def get_tentative_hypothesis(self): | |||
"Get the decoded sequence for the current timestep." | |||
if len(self.next_ys) == 1: | |||
dec_seq = self.next_ys[0].unsqueeze(1) | |||
else: | |||
_, keys = self.sort_scores() | |||
hyps = [self.get_hypothesis(k) for k in keys] | |||
hyps = [[Constants.BOS] + h for h in hyps] | |||
dec_seq = torch.LongTensor(hyps) | |||
return dec_seq | |||
def get_hypothesis(self, k): | |||
""" Walk back to construct the full hypothesis. """ | |||
hyp = [] | |||
for j in range(len(self.prev_ks) - 1, -1, -1): | |||
hyp.append(self.next_ys[j+1][k]) | |||
k = self.prev_ks[j][k] | |||
return list(map(lambda x: x.item(), hyp[::-1])) |
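
The index arithmetic in `Beam.advance` (take the top-k of a flattened beam-by-word score matrix, then recover which beam and which word each winner came from) is easier to follow without tensors. This is a minimal pure-Python sketch of that step only; `top_k_with_backpointers` is a name introduced here for illustration and is not part of the Beam class.

```python
import heapq


def top_k_with_backpointers(beam_lk, k):
    """Return (scores, prev_ks, next_ys) for the k best entries of a beam x word matrix."""
    num_words = len(beam_lk[0])
    flat = [score for row in beam_lk for score in row]
    # Indices of the k largest flattened scores, best first.
    best_ids = heapq.nlargest(k, range(len(flat)), key=flat.__getitem__)
    # Floor division recovers the beam (row); the remainder is the word (column).
    prev_ks = [i // num_words for i in best_ids]
    next_ys = [i - p * num_words for i, p in zip(best_ids, prev_ks)]
    scores = [flat[i] for i in best_ids]
    return scores, prev_ks, next_ys


# 3 beams, 5 candidate words each.
beam_lk = [
    [0.10, 0.90, 0.20, 0.00, 0.30],
    [0.80, 0.05, 0.40, 0.60, 0.10],
    [0.70, 0.20, 0.00, 0.50, 0.15],
]
scores, prev_ks, next_ys = top_k_with_backpointers(beam_lk, k=3)
print(scores, prev_ks, next_ys)
```

True division at the `prev_ks` step would produce fractional backpointers, which cannot be used to index earlier time steps in `get_hypothesis`.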
@@ -0,0 +1,10 @@ | |||
PAD = 0 | |||
UNK = 1 | |||
BOS = 2 | |||
EOS = 3 | |||
PAD_WORD = '<blank>' | |||
UNK_WORD = '<unk>' | |||
BOS_WORD = '<s>' | |||
EOS_WORD = '</s>' |
@@ -0,0 +1,49 @@ | |||
''' Define the Layers ''' | |||
import torch.nn as nn | |||
from transformer.SubLayers import MultiHeadAttention, PositionwiseFeedForward | |||
__author__ = "Yu-Hsiang Huang" | |||
class EncoderLayer(nn.Module): | |||
''' Compose with two layers ''' | |||
def __init__(self, d_model, d_inner, n_head, d_k, d_v, dropout=0.1): | |||
super(EncoderLayer, self).__init__() | |||
self.slf_attn = MultiHeadAttention( | |||
n_head, d_model, d_k, d_v, dropout=dropout) | |||
self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, dropout=dropout) | |||
def forward(self, enc_input, non_pad_mask=None, slf_attn_mask=None): | |||
enc_output, enc_slf_attn = self.slf_attn( | |||
enc_input, enc_input, enc_input, mask=slf_attn_mask) | |||
enc_output *= non_pad_mask | |||
enc_output = self.pos_ffn(enc_output) | |||
enc_output *= non_pad_mask | |||
return enc_output, enc_slf_attn | |||
class DecoderLayer(nn.Module): | |||
''' Compose with three layers ''' | |||
def __init__(self, d_model, d_inner, n_head, d_k, d_v, dropout=0.1): | |||
super(DecoderLayer, self).__init__() | |||
self.slf_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout) | |||
self.enc_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout) | |||
self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, dropout=dropout) | |||
def forward(self, dec_input, enc_output, non_pad_mask=None, slf_attn_mask=None, dec_enc_attn_mask=None): | |||
dec_output, dec_slf_attn = self.slf_attn( | |||
dec_input, dec_input, dec_input, mask=slf_attn_mask) | |||
dec_output *= non_pad_mask | |||
dec_output, dec_enc_attn = self.enc_attn( | |||
dec_output, enc_output, enc_output, mask=dec_enc_attn_mask) | |||
dec_output *= non_pad_mask | |||
dec_output = self.pos_ffn(dec_output) | |||
dec_output *= non_pad_mask | |||
return dec_output, dec_slf_attn, dec_enc_attn |
@@ -0,0 +1,208 @@ | |||
''' Define the Transformer model ''' | |||
import torch | |||
import torch.nn as nn | |||
import numpy as np | |||
import transformer.Constants as Constants | |||
from transformer.Layers import EncoderLayer, DecoderLayer | |||
__author__ = "Yu-Hsiang Huang" | |||
def get_non_pad_mask(seq): | |||
assert seq.dim() == 2 | |||
return seq.ne(Constants.PAD).type(torch.float).unsqueeze(-1) | |||
def get_sinusoid_encoding_table(n_position, d_hid, padding_idx=None): | |||
''' Sinusoid position encoding table ''' | |||
def cal_angle(position, hid_idx): | |||
return position / np.power(10000, 2 * (hid_idx // 2) / d_hid) | |||
def get_posi_angle_vec(position): | |||
return [cal_angle(position, hid_j) for hid_j in range(d_hid)] | |||
sinusoid_table = np.array([get_posi_angle_vec(pos_i) for pos_i in range(n_position)]) | |||
sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i | |||
sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1 | |||
if padding_idx is not None: | |||
# zero vector for padding dimension | |||
sinusoid_table[padding_idx] = 0. | |||
return torch.FloatTensor(sinusoid_table) | |||
def get_attn_key_pad_mask(seq_k, seq_q): | |||
''' For masking out the padding part of key sequence. ''' | |||
# Expand to fit the shape of key query attention matrix. | |||
len_q = seq_q.size(1) | |||
padding_mask = seq_k.eq(Constants.PAD) | |||
padding_mask = padding_mask.unsqueeze(1).expand(-1, len_q, -1) # b x lq x lk | |||
return padding_mask | |||
def get_subsequent_mask(seq): | |||
''' For masking out the subsequent info. ''' | |||
sz_b, len_s = seq.size() | |||
subsequent_mask = torch.triu( | |||
torch.ones((len_s, len_s), device=seq.device, dtype=torch.uint8), diagonal=1) | |||
subsequent_mask = subsequent_mask.unsqueeze(0).expand(sz_b, -1, -1) # b x ls x ls | |||
return subsequent_mask | |||
class Encoder(nn.Module): | |||
''' An encoder model with self attention mechanism. ''' | |||
def __init__( | |||
self, | |||
n_src_vocab, len_max_seq, d_word_vec, | |||
n_layers, n_head, d_k, d_v, | |||
d_model, d_inner, dropout=0.1): | |||
super().__init__() | |||
n_position = len_max_seq + 1 | |||
self.src_word_emb = nn.Embedding( | |||
n_src_vocab, d_word_vec, padding_idx=Constants.PAD) | |||
self.position_enc = nn.Embedding.from_pretrained( | |||
get_sinusoid_encoding_table(n_position, d_word_vec, padding_idx=0), | |||
freeze=True) | |||
self.layer_stack = nn.ModuleList([ | |||
EncoderLayer(d_model, d_inner, n_head, d_k, d_v, dropout=dropout) | |||
for _ in range(n_layers)]) | |||
def forward(self, src_seq, src_pos, return_attns=False): | |||
enc_slf_attn_list = [] | |||
# -- Prepare masks | |||
slf_attn_mask = get_attn_key_pad_mask(seq_k=src_seq, seq_q=src_seq) | |||
non_pad_mask = get_non_pad_mask(src_seq) | |||
# -- Forward | |||
enc_output = self.src_word_emb(src_seq) + self.position_enc(src_pos) | |||
for enc_layer in self.layer_stack: | |||
enc_output, enc_slf_attn = enc_layer( | |||
enc_output, | |||
non_pad_mask=non_pad_mask, | |||
slf_attn_mask=slf_attn_mask) | |||
if return_attns: | |||
enc_slf_attn_list += [enc_slf_attn] | |||
if return_attns: | |||
return enc_output, enc_slf_attn_list | |||
return enc_output, | |||
class Decoder(nn.Module): | |||
''' A decoder model with self attention mechanism. ''' | |||
def __init__( | |||
self, | |||
n_tgt_vocab, len_max_seq, d_word_vec, | |||
n_layers, n_head, d_k, d_v, | |||
d_model, d_inner, dropout=0.1): | |||
super().__init__() | |||
n_position = len_max_seq + 1 | |||
self.tgt_word_emb = nn.Embedding( | |||
n_tgt_vocab, d_word_vec, padding_idx=Constants.PAD) | |||
self.position_enc = nn.Embedding.from_pretrained( | |||
get_sinusoid_encoding_table(n_position, d_word_vec, padding_idx=0), | |||
freeze=True) | |||
self.layer_stack = nn.ModuleList([ | |||
DecoderLayer(d_model, d_inner, n_head, d_k, d_v, dropout=dropout) | |||
for _ in range(n_layers)]) | |||
def forward(self, tgt_seq, tgt_pos, src_seq, enc_output, return_attns=False): | |||
dec_slf_attn_list, dec_enc_attn_list = [], [] | |||
# -- Prepare masks | |||
non_pad_mask = get_non_pad_mask(tgt_seq) | |||
slf_attn_mask_subseq = get_subsequent_mask(tgt_seq) | |||
slf_attn_mask_keypad = get_attn_key_pad_mask(seq_k=tgt_seq, seq_q=tgt_seq) | |||
slf_attn_mask = (slf_attn_mask_keypad + slf_attn_mask_subseq).gt(0) | |||
dec_enc_attn_mask = get_attn_key_pad_mask(seq_k=src_seq, seq_q=tgt_seq) | |||
# -- Forward | |||
dec_output = self.tgt_word_emb(tgt_seq) + self.position_enc(tgt_pos) | |||
for dec_layer in self.layer_stack: | |||
dec_output, dec_slf_attn, dec_enc_attn = dec_layer( | |||
dec_output, enc_output, | |||
non_pad_mask=non_pad_mask, | |||
slf_attn_mask=slf_attn_mask, | |||
dec_enc_attn_mask=dec_enc_attn_mask) | |||
if return_attns: | |||
dec_slf_attn_list += [dec_slf_attn] | |||
dec_enc_attn_list += [dec_enc_attn] | |||
if return_attns: | |||
return dec_output, dec_slf_attn_list, dec_enc_attn_list | |||
return dec_output, | |||
class Transformer(nn.Module): | |||
''' A sequence to sequence model with attention mechanism. ''' | |||
def __init__( | |||
self, | |||
n_src_vocab, n_tgt_vocab, len_max_seq, | |||
d_word_vec=512, d_model=512, d_inner=2048, | |||
n_layers=6, n_head=8, d_k=64, d_v=64, dropout=0.1, | |||
tgt_emb_prj_weight_sharing=True, | |||
emb_src_tgt_weight_sharing=True): | |||
super().__init__() | |||
self.encoder = Encoder( | |||
n_src_vocab=n_src_vocab, len_max_seq=len_max_seq, | |||
d_word_vec=d_word_vec, d_model=d_model, d_inner=d_inner, | |||
n_layers=n_layers, n_head=n_head, d_k=d_k, d_v=d_v, | |||
dropout=dropout) | |||
self.decoder = Decoder( | |||
n_tgt_vocab=n_tgt_vocab, len_max_seq=len_max_seq, | |||
d_word_vec=d_word_vec, d_model=d_model, d_inner=d_inner, | |||
n_layers=n_layers, n_head=n_head, d_k=d_k, d_v=d_v, | |||
dropout=dropout) | |||
self.tgt_word_prj = nn.Linear(d_model, n_tgt_vocab, bias=False) | |||
nn.init.xavier_normal_(self.tgt_word_prj.weight) | |||
assert d_model == d_word_vec, \ | |||
'To facilitate the residual connections, \ | |||
the dimensions of all module outputs shall be the same.' | |||
if tgt_emb_prj_weight_sharing: | |||
# Share the weight matrix between target word embedding & the final logit dense layer | |||
self.tgt_word_prj.weight = self.decoder.tgt_word_emb.weight | |||
self.x_logit_scale = (d_model ** -0.5) | |||
else: | |||
self.x_logit_scale = 1. | |||
if emb_src_tgt_weight_sharing: | |||
# Share the weight matrix between source & target word embeddings | |||
assert n_src_vocab == n_tgt_vocab, \ | |||
"To share word embedding table, the vocabulary size of src/tgt shall be the same." | |||
self.encoder.src_word_emb.weight = self.decoder.tgt_word_emb.weight | |||
def forward(self, src_seq, src_pos, tgt_seq, tgt_pos): | |||
tgt_seq, tgt_pos = tgt_seq[:, :-1], tgt_pos[:, :-1] | |||
enc_output, *_ = self.encoder(src_seq, src_pos) | |||
dec_output, *_ = self.decoder(tgt_seq, tgt_pos, src_seq, enc_output) | |||
seq_logit = self.tgt_word_prj(dec_output) * self.x_logit_scale | |||
return seq_logit.view(-1, seq_logit.size(2)) |
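
Two helpers in this file, `get_sinusoid_encoding_table` and `get_subsequent_mask`, are compact enough to restate without NumPy or PyTorch. The sketch below mirrors their arithmetic on nested lists so the values can be inspected directly; the function names are reused for readability, but this is an illustration and not a drop-in replacement.

```python
import math


def sinusoid_table(n_position, d_hid, padding_idx=None):
    """Sinusoid position encodings: sin on even dims, cos on odd dims."""
    table = []
    for pos in range(n_position):
        row = []
        for j in range(d_hid):
            # Dims 2i and 2i+1 share the same frequency.
            angle = pos / math.pow(10000, 2 * (j // 2) / d_hid)
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    if padding_idx is not None:
        table[padding_idx] = [0.0] * d_hid  # zero vector for the padding position
    return table


def subsequent_mask(len_s):
    """1 where a query position must not attend (key index > query index)."""
    return [[1 if key > query else 0 for key in range(len_s)] for query in range(len_s)]


table = sinusoid_table(3, 4)
padded = sinusoid_table(3, 4, padding_idx=0)
mask = subsequent_mask(3)
print(table[1])
print(mask)
```

Because the encoder and decoder pass `padding_idx=0` and size the table as `len_max_seq + 1`, position index 0 appears to be reserved for padding, with real tokens numbered from 1; that reading is an inference from the code above.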
@@ -0,0 +1,28 @@ | |||
import torch | |||
import torch.nn as nn | |||
import numpy as np | |||
__author__ = "Yu-Hsiang Huang" | |||
class ScaledDotProductAttention(nn.Module): | |||
''' Scaled Dot-Product Attention ''' | |||
def __init__(self, temperature, attn_dropout=0.1): | |||
super().__init__() | |||
self.temperature = temperature | |||
self.dropout = nn.Dropout(attn_dropout) | |||
self.softmax = nn.Softmax(dim=2) | |||
def forward(self, q, k, v, mask=None): | |||
attn = torch.bmm(q, k.transpose(1, 2)) | |||
attn = attn / self.temperature | |||
if mask is not None: | |||
attn = attn.masked_fill(mask, -np.inf) | |||
attn = self.softmax(attn) | |||
attn = self.dropout(attn) | |||
output = torch.bmm(attn, v) | |||
return output, attn |
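
The computation in `ScaledDotProductAttention.forward` can be traced by hand on a single unbatched example. The following pure-Python sketch assumes the temperature is the square root of the key dimension, which is what `MultiHeadAttention` passes in; it omits dropout and batching, so it only illustrates the score, mask, softmax, and weighted-sum steps.

```python
import math


def scaled_dot_product_attention(q, k, v, mask=None):
    """q: [len_q][d_k], k: [len_k][d_k], v: [len_k][d_v]; mask[i][j] True = hide key j from query i."""
    temperature = math.sqrt(len(k[0]))
    output, attn = [], []
    for i, q_i in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / temperature for k_j in k]
        if mask is not None:
            scores = [-math.inf if mask[i][j] else s for j, s in enumerate(scores)]
        # Numerically stable softmax over the key axis.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        attn.append(weights)
        output.append([sum(w * v_j[d] for w, v_j in zip(weights, v))
                       for d in range(len(v[0]))])
    return output, attn


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
masked_out, masked_attn = scaled_dot_product_attention(q, k, v, mask=[[False, True]])
free_out, free_attn = scaled_dot_product_attention(q, k, v)
print(masked_attn, masked_out)
print(free_attn, free_out)
```

A row whose keys are all masked would produce NaN weights in this sketch, so it assumes each query can see at least one key.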
@@ -0,0 +1,35 @@ | |||
'''A wrapper class for optimizer ''' | |||
import numpy as np | |||
class ScheduledOptim(): | |||
'''A simple wrapper class for learning rate scheduling''' | |||
def __init__(self, optimizer, d_model, n_warmup_steps): | |||
self._optimizer = optimizer | |||
self.n_warmup_steps = n_warmup_steps | |||
self.n_current_steps = 0 | |||
self.init_lr = np.power(d_model, -0.5) | |||
def step_and_update_lr(self): | |||
"Step with the inner optimizer" | |||
self._update_learning_rate() | |||
self._optimizer.step() | |||
def zero_grad(self): | |||
"Zero out the gradients by the inner optimizer" | |||
self._optimizer.zero_grad() | |||
def _get_lr_scale(self): | |||
return np.min([ | |||
np.power(self.n_current_steps, -0.5), | |||
np.power(self.n_warmup_steps, -1.5) * self.n_current_steps]) | |||
def _update_learning_rate(self): | |||
''' Learning rate scheduling per step ''' | |||
self.n_current_steps += 1 | |||
lr = self.init_lr * self._get_lr_scale() | |||
for param_group in self._optimizer.param_groups: | |||
param_group['lr'] = lr | |||
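To see what `ScheduledOptim` does to the learning rate over time, the schedule can be evaluated on its own. This is a small sketch that mirrors `_get_lr_scale` and `_update_learning_rate` without needing an optimizer; the helper name `lr_at_step` and the default values `d_model=512` and `n_warmup_steps=4000` are assumptions chosen for illustration, not values taken from this code.

```python
def lr_at_step(step, d_model=512, n_warmup_steps=4000):
    """Learning rate after `step` updates (step >= 1).

    Same formula as ScheduledOptim:
    d_model^-0.5 * min(step^-0.5, n_warmup_steps^-1.5 * step).
    """
    init_lr = d_model ** -0.5
    scale = min(step ** -0.5, n_warmup_steps ** -1.5 * step)
    return init_lr * scale


# The rate grows linearly during warmup, peaks at n_warmup_steps,
# then decays with the inverse square root of the step count.
for step in (1, 1000, 4000, 16000):
    print(step, lr_at_step(step))
```

Because the two branches of `min` are equal exactly at `step == n_warmup_steps`, that step is where the learning rate peaks.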
@@ -0,0 +1,82 @@ | |||
''' Define the sublayers in encoder/decoder layer ''' | |||
import numpy as np | |||
import torch.nn as nn | |||
import torch.nn.functional as F | |||
from transformer.Modules import ScaledDotProductAttention | |||
__author__ = "Yu-Hsiang Huang" | |||
class MultiHeadAttention(nn.Module): | |||
''' Multi-Head Attention module ''' | |||
def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1): | |||
super().__init__() | |||
self.n_head = n_head | |||
self.d_k = d_k | |||
self.d_v = d_v | |||
self.w_qs = nn.Linear(d_model, n_head * d_k) | |||
self.w_ks = nn.Linear(d_model, n_head * d_k) | |||
self.w_vs = nn.Linear(d_model, n_head * d_v) | |||
nn.init.xavier_normal_(self.w_qs.weight) | |||
nn.init.xavier_normal_(self.w_ks.weight) | |||
nn.init.xavier_normal_(self.w_vs.weight) | |||
self.attention = ScaledDotProductAttention(temperature=np.power(d_k, 0.5)) | |||
self.layer_norm = nn.LayerNorm(d_model) | |||
self.fc = nn.Linear(n_head * d_v, d_model) | |||
nn.init.xavier_normal_(self.fc.weight) | |||
self.dropout = nn.Dropout(dropout) | |||
def forward(self, q, k, v, mask=None): | |||
d_k, d_v, n_head = self.d_k, self.d_v, self.n_head | |||
sz_b, len_q, _ = q.size() | |||
sz_b, len_k, _ = k.size() | |||
sz_b, len_v, _ = v.size() | |||
residual = q | |||
q = self.w_qs(q).view(sz_b, len_q, n_head, d_k) | |||
k = self.w_ks(k).view(sz_b, len_k, n_head, d_k) | |||
v = self.w_vs(v).view(sz_b, len_v, n_head, d_v) | |||
q = q.permute(2, 0, 1, 3).contiguous().view(-1, len_q, d_k) # (n*b) x lq x dk | |||
k = k.permute(2, 0, 1, 3).contiguous().view(-1, len_k, d_k) # (n*b) x lk x dk | |||
v = v.permute(2, 0, 1, 3).contiguous().view(-1, len_v, d_v) # (n*b) x lv x dv | |||
if mask is not None: | |||
mask = mask.repeat(n_head, 1, 1) # (n*b) x .. x .. | |||
output, attn = self.attention(q, k, v, mask=mask) | |||
output = output.view(n_head, sz_b, len_q, d_v) | |||
output = output.permute(1, 2, 0, 3).contiguous().view(sz_b, len_q, -1) # b x lq x (n*dv) | |||
output = self.dropout(self.fc(output)) | |||
output = self.layer_norm(output + residual) | |||
return output, attn | |||
class PositionwiseFeedForward(nn.Module): | |||
''' A two-feed-forward-layer module ''' | |||
def __init__(self, d_in, d_hid, dropout=0.1): | |||
super().__init__() | |||
self.w_1 = nn.Conv1d(d_in, d_hid, 1) # position-wise | |||
self.w_2 = nn.Conv1d(d_hid, d_in, 1) # position-wise | |||
self.layer_norm = nn.LayerNorm(d_in) | |||
self.dropout = nn.Dropout(dropout) | |||
def forward(self, x): | |||
residual = x | |||
output = x.transpose(1, 2) | |||
output = self.w_2(F.relu(self.w_1(output))) | |||
output = output.transpose(1, 2) | |||
output = self.dropout(output) | |||
output = self.layer_norm(output + residual) | |||
return output |
@@ -0,0 +1,166 @@ | |||
''' This module will handle the text generation with beam search. ''' | |||
import torch | |||
import torch.nn as nn | |||
import torch.nn.functional as F | |||
from transformer.Models import Transformer | |||
from transformer.Beam import Beam | |||
class Translator(object): | |||
''' Load with trained model and handle the beam search ''' | |||
def __init__(self, opt): | |||
self.opt = opt | |||
self.device = torch.device('cuda' if opt.cuda else 'cpu') | |||
checkpoint = torch.load(opt.model) | |||
model_opt = checkpoint['settings'] | |||
self.model_opt = model_opt | |||
model = Transformer( | |||
model_opt.src_vocab_size, | |||
model_opt.tgt_vocab_size, | |||
model_opt.max_token_seq_len, | |||
tgt_emb_prj_weight_sharing=model_opt.proj_share_weight, | |||
emb_src_tgt_weight_sharing=model_opt.embs_share_weight, | |||
d_k=model_opt.d_k, | |||
d_v=model_opt.d_v, | |||
d_model=model_opt.d_model, | |||
d_word_vec=model_opt.d_word_vec, | |||
d_inner=model_opt.d_inner_hid, | |||
n_layers=model_opt.n_layers, | |||
n_head=model_opt.n_head, | |||
dropout=model_opt.dropout) | |||
model.load_state_dict(checkpoint['model']) | |||
print('[Info] Trained model state loaded.') | |||
model.word_prob_prj = nn.LogSoftmax(dim=1) | |||
model = model.to(self.device) | |||
self.model = model | |||
self.model.eval() | |||
def translate_batch(self, src_seq, src_pos): | |||
''' Translation work in one batch ''' | |||
def get_inst_idx_to_tensor_position_map(inst_idx_list): | |||
''' Indicate the position of an instance in a tensor. ''' | |||
return {inst_idx: tensor_position for tensor_position, inst_idx in enumerate(inst_idx_list)} | |||
def collect_active_part(beamed_tensor, curr_active_inst_idx, n_prev_active_inst, n_bm): | |||
''' Collect tensor parts associated to active instances. ''' | |||
_, *d_hs = beamed_tensor.size() | |||
n_curr_active_inst = len(curr_active_inst_idx) | |||
new_shape = (n_curr_active_inst * n_bm, *d_hs) | |||
beamed_tensor = beamed_tensor.view(n_prev_active_inst, -1) | |||
beamed_tensor = beamed_tensor.index_select(0, curr_active_inst_idx) | |||
beamed_tensor = beamed_tensor.view(*new_shape) | |||
return beamed_tensor | |||
def collate_active_info( | |||
src_seq, src_enc, inst_idx_to_position_map, active_inst_idx_list): | |||
# Sentences which are still active are collected, | |||
# so the decoder will not run on completed sentences. | |||
n_prev_active_inst = len(inst_idx_to_position_map) | |||
active_inst_idx = [inst_idx_to_position_map[k] for k in active_inst_idx_list] | |||
active_inst_idx = torch.LongTensor(active_inst_idx).to(self.device) | |||
active_src_seq = collect_active_part(src_seq, active_inst_idx, n_prev_active_inst, n_bm) | |||
active_src_enc = collect_active_part(src_enc, active_inst_idx, n_prev_active_inst, n_bm) | |||
active_inst_idx_to_position_map = get_inst_idx_to_tensor_position_map(active_inst_idx_list) | |||
return active_src_seq, active_src_enc, active_inst_idx_to_position_map | |||
def beam_decode_step( | |||
inst_dec_beams, len_dec_seq, src_seq, enc_output, inst_idx_to_position_map, n_bm): | |||
''' Decode and update beam status, and then return active beam idx ''' | |||
def prepare_beam_dec_seq(inst_dec_beams, len_dec_seq): | |||
dec_partial_seq = [b.get_current_state() for b in inst_dec_beams if not b.done] | |||
dec_partial_seq = torch.stack(dec_partial_seq).to(self.device) | |||
dec_partial_seq = dec_partial_seq.view(-1, len_dec_seq) | |||
return dec_partial_seq | |||
def prepare_beam_dec_pos(len_dec_seq, n_active_inst, n_bm): | |||
dec_partial_pos = torch.arange(1, len_dec_seq + 1, dtype=torch.long, device=self.device) | |||
dec_partial_pos = dec_partial_pos.unsqueeze(0).repeat(n_active_inst * n_bm, 1) | |||
return dec_partial_pos | |||
def predict_word(dec_seq, dec_pos, src_seq, enc_output, n_active_inst, n_bm): | |||
dec_output, *_ = self.model.decoder(dec_seq, dec_pos, src_seq, enc_output) | |||
dec_output = dec_output[:, -1, :] # Pick the last step: (bh * bm) * d_h | |||
word_prob = F.log_softmax(self.model.tgt_word_prj(dec_output), dim=1) | |||
word_prob = word_prob.view(n_active_inst, n_bm, -1) | |||
return word_prob | |||
def collect_active_inst_idx_list(inst_beams, word_prob, inst_idx_to_position_map): | |||
active_inst_idx_list = [] | |||
for inst_idx, inst_position in inst_idx_to_position_map.items(): | |||
is_inst_complete = inst_beams[inst_idx].advance(word_prob[inst_position]) | |||
if not is_inst_complete: | |||
active_inst_idx_list += [inst_idx] | |||
return active_inst_idx_list | |||
n_active_inst = len(inst_idx_to_position_map) | |||
dec_seq = prepare_beam_dec_seq(inst_dec_beams, len_dec_seq) | |||
dec_pos = prepare_beam_dec_pos(len_dec_seq, n_active_inst, n_bm) | |||
word_prob = predict_word(dec_seq, dec_pos, src_seq, enc_output, n_active_inst, n_bm) | |||
# Update the beam with predicted word prob information and collect incomplete instances | |||
active_inst_idx_list = collect_active_inst_idx_list( | |||
inst_dec_beams, word_prob, inst_idx_to_position_map) | |||
return active_inst_idx_list | |||
def collect_hypothesis_and_scores(inst_dec_beams, n_best): | |||
all_hyp, all_scores = [], [] | |||
for inst_idx in range(len(inst_dec_beams)): | |||
scores, tail_idxs = inst_dec_beams[inst_idx].sort_scores() | |||
all_scores += [scores[:n_best]] | |||
hyps = [inst_dec_beams[inst_idx].get_hypothesis(i) for i in tail_idxs[:n_best]] | |||
all_hyp += [hyps] | |||
return all_hyp, all_scores | |||
with torch.no_grad(): | |||
#-- Encode | |||
src_seq, src_pos = src_seq.to(self.device), src_pos.to(self.device) | |||
src_enc, *_ = self.model.encoder(src_seq, src_pos) | |||
#-- Repeat data for beam search | |||
n_bm = self.opt.beam_size | |||
n_inst, len_s, d_h = src_enc.size() | |||
src_seq = src_seq.repeat(1, n_bm).view(n_inst * n_bm, len_s) | |||
src_enc = src_enc.repeat(1, n_bm, 1).view(n_inst * n_bm, len_s, d_h) | |||
#-- Prepare beams | |||
inst_dec_beams = [Beam(n_bm, device=self.device) for _ in range(n_inst)] | |||
#-- Bookkeeping for active or not | |||
active_inst_idx_list = list(range(n_inst)) | |||
inst_idx_to_position_map = get_inst_idx_to_tensor_position_map(active_inst_idx_list) | |||
#-- Decode | |||
for len_dec_seq in range(1, self.model_opt.max_token_seq_len + 1): | |||
active_inst_idx_list = beam_decode_step( | |||
inst_dec_beams, len_dec_seq, src_seq, src_enc, inst_idx_to_position_map, n_bm) | |||
if not active_inst_idx_list: | |||
break # all instances have finished their path to <EOS> | |||
src_seq, src_enc, inst_idx_to_position_map = collate_active_info( | |||
src_seq, src_enc, inst_idx_to_position_map, active_inst_idx_list) | |||
batch_hyp, batch_scores = collect_hypothesis_and_scores(inst_dec_beams, self.opt.n_best) | |||
return batch_hyp, batch_scores |
@@ -0,0 +1,13 @@ | |||
import transformer.Constants | |||
import transformer.Modules | |||
import transformer.Layers | |||
import transformer.SubLayers | |||
import transformer.Models | |||
import transformer.Translator | |||
import transformer.Beam | |||
import transformer.Optim | |||
# __all__ must contain names as strings, not the module objects themselves.
__all__ = [
"Constants", "Modules", "Layers",
"SubLayers", "Models", "Optim",
"Translator", "Beam"]
@@ -3,7 +3,7 @@ from datetime import timedelta | |||
from fastNLP.io.dataset_loader import JsonLoader | |||
from fastNLP.modules.encoder._bert import BertTokenizer | |||
from fastNLP.io.base_loader import DataInfo | |||
from fastNLP.io.base_loader import DataBundle | |||
from fastNLP.core.const import Const | |||
class BertData(JsonLoader): | |||
@@ -110,7 +110,7 @@ class BertData(JsonLoader): | |||
# set padding value
datasets[name].set_pad_val('article', 0) | |||
return DataInfo(datasets=datasets) | |||
return DataBundle(datasets=datasets) | |||
class BertSumLoader(JsonLoader): | |||
@@ -154,4 +154,4 @@ class BertSumLoader(JsonLoader): | |||
print('Finished in {}'.format(timedelta(seconds=time()-start))) | |||
return DataInfo(datasets=datasets) | |||
return DataBundle(datasets=datasets) |
@@ -0,0 +1,78 @@ | |||
# Summarization | |||
## Extractive Summarization | |||
### Models | |||
Models implemented in fastNLP include:
1. Get To The Point: Summarization with Pointer-Generator Networks (See et al. 2017)
2. Searching for Effective Neural Extractive Summarization: What Works and What's Next (Zhong et al. 2019)
3. Fine-tune BERT for Extractive Summarization (Liu et al. 2019)
### Dataset | |||
The summarization datasets provided here include:
- CNN/DailyMail
- Newsroom
- The New York Times Annotated Corpus
    - NYT
    - NYT50
- DUC
    - 2002 Task4
    - 2003/2004 Task1
- arXiv
- PubMed

Preprocessed versions of the public datasets (CNN/DailyMail, Newsroom, arXiv, PubMed) can be downloaded from:
- [Baidu Cloud](https://pan.baidu.com/s/11qWnDjK9lb33mFZ9vuYlzA) (extraction code: h1px)
- [Google Drive](https://drive.google.com/file/d/1uzeSdcLk5ilHaUTeJRNrf-_j59CQGe6r/view?usp=drivesdk)

For the non-public datasets (NYT, NYT50, DUC), part of the data processing scripts are placed in the data folder.
### Dataset_loader | |||
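- SummarizationLoader: reads a preprocessed dataset in jsonl format and returns the following fields:
    - text: the article body
    - summary: the summary
    - domain: optional, the website that published the article
    - tag: optional, content tags of the article
    - labels: extractive sentence labels
- BertSumLoader: reads the dataset used as input to BertSum (Liu 2019) and returns the following fields:
    - article: vocabulary IDs of each article after truncation to 512
    - segment_id: the 0/1 segment each sentence belongs to
    - cls_id: the positions of '[CLS]' in the input
    - label: extractive sentence labels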
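
The jsonl layout read by SummarizationLoader can be inspected without fastNLP. The sketch below is not the loader's implementation; it assumes one JSON object per line whose keys match the field names listed above (`text`, `summary`, `labels`, plus the optional `domain` and `tag`), and the function name `read_summarization_jsonl` is hypothetical.

```python
import json


def read_summarization_jsonl(path):
    """Yield one record per non-empty line of a preprocessed jsonl file.

    Assumes each line is a JSON object keyed by the field names from the
    list above; optional fields default to None when absent.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            yield {
                "text": record["text"],
                "summary": record["summary"],
                "labels": record.get("labels"),
                "domain": record.get("domain"),  # optional
                "tag": record.get("tag"),        # optional
            }
```

A quick way to use it is `next(read_summarization_jsonl("path/to/train.jsonl"))`, which shows the first record; the path here is a placeholder.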
### Performance and Hyperparameters | |||
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Paper | | |||
| :-----------------------------: | :-----: | :-----: | :-----: | :-----------------------------------------: | | |||
| LEAD 3 | 40.11 | 17.64 | 36.32 | our data pre-process | | |||
| ORACLE | 55.24 | 31.14 | 50.96 | our data pre-process | | |||
| LSTM + Sequence Labeling | 40.72 | 18.27 | 36.98 | | | |||
| Transformer + Sequence Labeling | 40.86 | 18.38 | 37.18 | | | |||
| LSTM + Pointer Network | - | - | - | | | |||
| Transformer + Pointer Network | - | - | - | | | |||
| BERTSUM | 42.71 | 19.76 | 39.03 | Fine-tune BERT for Extractive Summarization | | |||
| LSTM+PN+BERT+RL | - | - | - | | | |||
## Abstractive Summarization | |||
Still in Progress... |
@@ -11,7 +11,7 @@ Coreference resolution是查找文本中指向同一现实实体的所有表达 | |||
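Due to copyright restrictions, the dataset cannot be provided for download here; please download it yourself.
The original dataset is in CoNLL format; for details, see the official introduction page of the dataset.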
代码实现采用了论文作者Lee的预处理方法,具体细节参加[链接](https://github.com/kentonl/e2e-coref/blob/e2e/setup_training.sh)。 | |||
代码实现采用了论文作者Lee的预处理方法,具体细节参见[链接](https://github.com/kentonl/e2e-coref/blob/e2e/setup_training.sh)。 | |||
处理之后的数据集为json格式,例子: | |||
``` | |||
{ | |||
@@ -25,12 +25,12 @@ Coreference resolution是查找文本中指向同一现实实体的所有表达 | |||
### embedding 数据集下载 | |||
[turian embedding](https://lil.cs.washington.edu/coref/turian.50d.txt)
[glove embedding]( https://nlp.stanford.edu/data/glove.840B.300d.zip) | |||
[glove embedding](https://nlp.stanford.edu/data/glove.840B.300d.zip) | |||
## 运行 | |||
```python | |||
```shell | |||
# 训练代码 | |||
CUDA_VISIBLE_DEVICES=0 python train.py | |||
# 测试代码 | |||
@@ -39,9 +39,9 @@ CUDA_VISIBLE_DEVICES=0 python valid.py | |||
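## Results
The authors of the original paper achieved 67.2% on the test set, and the AllenNLP reproduction reached [63.0%](https://allennlp.org/models).
Note that allenNLP was trained without speaker information, without variational dropout, and with only 100 antecedents instead of 250.
Note that AllenNLP was trained without speaker information, without variational dropout, and with only 100 antecedents instead of 250.
With the same hyperparameters and configuration as allenNLP, this reproduction achieves an F1 score of 63.6%.
With the same hyperparameters and configuration as AllenNLP, this reproduction achieves an F1 score of 63.6%.
## Issues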
@@ -1,7 +1,7 @@ | |||
from fastNLP.io.dataset_loader import JsonLoader,DataSet,Instance | |||
from fastNLP.io.file_reader import _read_json | |||
from fastNLP.core.vocabulary import Vocabulary | |||
from fastNLP.io.base_loader import DataInfo | |||
from fastNLP.io.base_loader import DataBundle | |||
from reproduction.coreference_resolution.model.config import Config | |||
import reproduction.coreference_resolution.model.preprocess as preprocess | |||
@@ -26,7 +26,7 @@ class CRLoader(JsonLoader): | |||
return dataset | |||
def process(self, paths, **kwargs): | |||
data_info = DataInfo() | |||
data_info = DataBundle() | |||
for name in ['train', 'test', 'dev']: | |||
data_info.datasets[name] = self.load(paths[name]) | |||
@@ -1,7 +1,7 @@ | |||
from fastNLP.io.base_loader import DataSetLoader, DataInfo | |||
from fastNLP.io.dataset_loader import ConllLoader | |||
from fastNLP.io.base_loader import DataSetLoader, DataBundle | |||
from fastNLP.io.data_loader import ConllLoader | |||
import numpy as np | |||
from itertools import chain | |||
@@ -76,7 +76,7 @@ class CTBxJointLoader(DataSetLoader): | |||
gold_label_word_pairs: | |||
""" | |||
paths = check_dataloader_paths(paths) | |||
data = DataInfo() | |||
data = DataBundle() | |||
for name, path in paths.items(): | |||
dataset = self.load(path) | |||
@@ -2,13 +2,13 @@ | |||
Here fastNLP is used to reproduce several well-known models for Matching tasks, aiming to match the performance reported in the papers. The evaluation metric for all of these tasks is accuracy (%).
Reproduced models (ordered by paper publication date):
- CNTN: model code (still in progress)[](); training code (still in progress)[]().
- CNTN: [model code](model/cntn.py); [training code](matching_cntn.py).
Paper: [Convolutional Neural Tensor Network Architecture for Community-based Question Answering](https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/11401/10844).
- ESIM: [model code](model/esim.py); [training code](matching_esim.py).
Paper: [Enhanced LSTM for Natural Language Inference](https://arxiv.org/pdf/1609.06038.pdf).
- DIIN: model code (still in progress)[](); training code (still in progress)[]().
Paper: [Natural Language Inference over Interaction Space](https://arxiv.org/pdf/1709.04348.pdf).
- MwAN: model code (still in progress)[](); training code (still in progress)[]().
- MwAN: [model code](model/mwan.py); [training code](matching_mwan.py).
Paper: [Multiway Attention Networks for Modeling Sentence Pairs](https://www.ijcai.org/proceedings/2018/0613.pdf).
- BERT: [model code](model/bert.py); [training code](matching_bert.py).
Paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf).
@@ -21,10 +21,10 @@ | |||
model name | SNLI | MNLI | RTE | QNLI | Quora | |||
:---: | :---: | :---: | :---: | :---: | :---: | |||
CNTN [](); [paper](https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/11401/10844) | 74.53 vs - | 60.84/-(dev) vs - | 57.4(dev) vs - | 62.53(dev) vs - | - |
CNTN [code](model/cntn.py); [paper](https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/11401/10844) | 77.79 vs - | 63.29/63.16(dev) vs - | 57.04(dev) vs - | 62.38(dev) vs - | - |
ESIM [code](model/esim.py); [paper](https://arxiv.org/pdf/1609.06038.pdf) | 88.13(glove) vs 88.0(glove)/88.7(elmo) | 77.78/76.49 vs 72.4/72.1* | 59.21(dev) vs - | 76.97(dev) vs - | - |
DIIN [](); [paper](https://arxiv.org/pdf/1709.04348.pdf) | - vs 88.0 | - vs 78.8/77.8 | - | - | - vs 89.06 |
MwAN [](); [paper](https://www.ijcai.org/proceedings/2018/0613.pdf) | 87.9 vs 88.3 | 77.3/76.7(dev) vs 78.5/77.7 | - | 74.6(dev) vs - | 85.6 vs 89.12 |
MwAN [code](model/mwan.py); [paper](https://www.ijcai.org/proceedings/2018/0613.pdf) | 87.9 vs 88.3 | 77.3/76.7(dev) vs 78.5/77.7 | - | 74.6(dev) vs - | 85.6 vs 89.12 |
BERT (BASE version) [code](model/bert.py); [paper](https://arxiv.org/pdf/1810.04805.pdf) | 90.6 vs - | - vs 84.6/83.4| 67.87(dev) vs 66.4 | 90.97(dev) vs 90.5 | - |
*The 72.4/72.1 result for ESIM is the official MNLI reproduction; the original ESIM paper does not report results on the MNLI dataset.
@@ -44,7 +44,7 @@ Performance on Test set: | |||
model name | CNTN | ESIM | DIIN | MwAN | BERT-Base | BERT-Large | |||
:---: | :---: | :---: | :---: | :---: | :---: | :---: | |||
__performance__ | - | 88.13 | - | 87.9 | 90.6 | 91.16 | |||
__performance__ | 77.79 | 88.13 | - | 87.9 | 90.6 | 91.16 | |||
## MNLI | |||
[Link to MNLI main page](https://www.nyu.edu/projects/bowman/multinli/) | |||
@@ -60,7 +60,7 @@ Performance on Test set(matched/mismatched): | |||
model name | CNTN | ESIM | DIIN | MwAN | BERT-Base | |||
:---: | :---: | :---: | :---: | :---: | :---: | | |||
__performance__ | - | 77.78/76.49 | - | 77.3/76.7(dev) | - | | |||
__performance__ | 63.29/63.16(dev) | 77.78/76.49 | - | 77.3/76.7(dev) | - | | |||
## RTE | |||
@@ -92,7 +92,7 @@ Performance on __Dev__ set: | |||
model name | CNTN | ESIM | DIIN | MwAN | BERT | |||
:---: | :---: | :---: | :---: | :---: | :---: | |||
__performance__ | - | 76.97 | - | 74.6 | - | |||
__performance__ | 62.38 | 76.97 | - | 74.6 | - | |||
## Quora | |||
@@ -5,7 +5,7 @@ from typing import Union, Dict | |||
from fastNLP.core.const import Const | |||
from fastNLP.core.vocabulary import Vocabulary | |||
from fastNLP.io.base_loader import DataInfo, DataSetLoader | |||
from fastNLP.io.base_loader import DataBundle, DataSetLoader | |||
from fastNLP.io.dataset_loader import JsonLoader, CSVLoader | |||
from fastNLP.io.file_utils import _get_base_url, cached_path, PRETRAINED_BERT_MODEL_DIR | |||
from fastNLP.modules.encoder._bert import BertTokenizer | |||
@@ -35,7 +35,7 @@ class MatchingLoader(DataSetLoader): | |||
to_lower=False, seq_len_type: str=None, bert_tokenizer: str=None, | |||
cut_text: int = None, get_index=True, auto_pad_length: int=None, | |||
auto_pad_token: str='<pad>', set_input: Union[list, str, bool]=True, | |||
set_target: Union[list, str, bool] = True, concat: Union[str, list, bool]=None, ) -> DataInfo: | |||
set_target: Union[list, str, bool] = True, concat: Union[str, list, bool]=None, ) -> DataBundle: | |||
""" | |||
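:param paths: a str or a Dict[str, str]. If a str, it is the folder containing the dataset or a full file path: if it is a folder,
the dataset names and file names are looked up in self.paths. If a Dict, it maps dataset names (such as train, dev, test) and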
@@ -80,7 +80,7 @@ class MatchingLoader(DataSetLoader): | |||
else: | |||
path = paths | |||
data_info = DataInfo() | |||
data_info = DataBundle() | |||
for data_name in path.keys(): | |||
data_info.datasets[data_name] = self._load(path[data_name]) | |||
@@ -0,0 +1,145 @@ | |||
import sys | |||
import os | |||
import random | |||
import numpy as np | |||
import torch | |||
from torch.optim import Adadelta, SGD | |||
from torch.optim.lr_scheduler import StepLR | |||
from tqdm import tqdm | |||
from fastNLP import CrossEntropyLoss | |||
from fastNLP import cache_results | |||
from fastNLP.core import Trainer, Tester, Adam, AccuracyMetric, Const | |||
from fastNLP.core.predictor import Predictor | |||
from fastNLP.core.callback import GradientClipCallback, LRScheduler, FitlogCallback | |||
from fastNLP.modules.encoder.embedding import ElmoEmbedding, StaticEmbedding | |||
from fastNLP.io.data_loader import MNLILoader, QNLILoader, QuoraLoader, SNLILoader, RTELoader | |||
from reproduction.matching.model.mwan import MwanModel | |||
import fitlog | |||
fitlog.debug() | |||
import argparse | |||
argument = argparse.ArgumentParser() | |||
argument.add_argument('--task' , choices = ['snli', 'rte', 'qnli', 'mnli'],default = 'snli') | |||
argument.add_argument('--batch-size' , type = int , default = 128) | |||
argument.add_argument('--n-epochs' , type = int , default = 50) | |||
argument.add_argument('--lr' , type = float , default = 1) | |||
argument.add_argument('--testset-name' , type = str , default = 'test') | |||
argument.add_argument('--devset-name' , type = str , default = 'dev') | |||
argument.add_argument('--seed' , type = int , default = 42) | |||
argument.add_argument('--hidden-size' , type = int , default = 150) | |||
argument.add_argument('--dropout' , type = float , default = 0.3) | |||
arg = argument.parse_args() | |||
random.seed(arg.seed) | |||
np.random.seed(arg.seed) | |||
torch.manual_seed(arg.seed) | |||
n_gpu = torch.cuda.device_count() | |||
if n_gpu > 0: | |||
torch.cuda.manual_seed_all(arg.seed) | |||
print (n_gpu) | |||
for k in arg.__dict__: | |||
print(k, arg.__dict__[k], type(arg.__dict__[k])) | |||
# load data set | |||
if arg.task == 'snli': | |||
@cache_results(f'snli_mwan.pkl') | |||
def read_snli(): | |||
data_info = SNLILoader().process( | |||
paths='path/to/snli/data', to_lower=True, seq_len_type=None, bert_tokenizer=None, | |||
get_index=True, concat=False, extra_split=['/','%','-'], | |||
) | |||
return data_info | |||
data_info = read_snli() | |||
elif arg.task == 'rte': | |||
@cache_results(f'rte_mwan.pkl') | |||
def read_rte(): | |||
data_info = RTELoader().process( | |||
paths='path/to/rte/data', to_lower=True, seq_len_type=None, bert_tokenizer=None, | |||
get_index=True, concat=False, extra_split=['/','%','-'], | |||
) | |||
return data_info | |||
data_info = read_rte() | |||
elif arg.task == 'qnli': | |||
data_info = QNLILoader().process( | |||
paths='path/to/qnli/data', to_lower=True, seq_len_type=None, bert_tokenizer=None, | |||
get_index=True, concat=False , cut_text=512, extra_split=['/','%','-'], | |||
) | |||
elif arg.task == 'mnli': | |||
@cache_results(f'mnli_v0.9_mwan.pkl') | |||
def read_mnli(): | |||
data_info = MNLILoader().process( | |||
paths='path/to/mnli/data', to_lower=True, seq_len_type=None, bert_tokenizer=None, | |||
get_index=True, concat=False, extra_split=['/','%','-'], | |||
) | |||
return data_info | |||
data_info = read_mnli() | |||
else: | |||
raise RuntimeError(f'NOT support {arg.task} task yet!') | |||
print(data_info) | |||
print(len(data_info.vocabs['words'])) | |||
model = MwanModel( | |||
num_class = len(data_info.vocabs[Const.TARGET]), | |||
EmbLayer = StaticEmbedding(data_info.vocabs[Const.INPUT], requires_grad=False, normalize=False), | |||
ElmoLayer = None, | |||
args_of_imm = { | |||
"input_size" : 300 , | |||
"hidden_size" : arg.hidden_size , | |||
"dropout" : arg.dropout , | |||
"use_allennlp" : False , | |||
} , | |||
) | |||
optimizer = Adadelta(lr=arg.lr, params=model.parameters()) | |||
scheduler = StepLR(optimizer, step_size=10, gamma=0.5) | |||
callbacks = [ | |||
LRScheduler(scheduler), | |||
] | |||
if arg.task in ['snli']: | |||
callbacks.append(FitlogCallback(data_info.datasets[arg.testset_name], verbose=1)) | |||
elif arg.task == 'mnli': | |||
callbacks.append(FitlogCallback({'dev_matched': data_info.datasets['dev_matched'], | |||
'dev_mismatched': data_info.datasets['dev_mismatched']}, | |||
verbose=1)) | |||
trainer = Trainer( | |||
train_data = data_info.datasets['train'], | |||
model = model, | |||
optimizer = optimizer, | |||
num_workers = 0, | |||
batch_size = arg.batch_size, | |||
n_epochs = arg.n_epochs, | |||
print_every = -1, | |||
dev_data = data_info.datasets[arg.devset_name], | |||
metrics = AccuracyMetric(pred = "pred" , target = "target"), | |||
metric_key = 'acc', | |||
device = [i for i in range(torch.cuda.device_count())], | |||
check_code_level = -1, | |||
callbacks = callbacks, | |||
loss = CrossEntropyLoss(pred = "pred" , target = "target") | |||
) | |||
trainer.train(load_best_model=True) | |||
tester = Tester( | |||
data=data_info.datasets[arg.testset_name], | |||
model=model, | |||
metrics=AccuracyMetric(), | |||
batch_size=arg.batch_size, | |||
device=[i for i in range(torch.cuda.device_count())], | |||
) | |||
tester.test() |
@@ -0,0 +1,455 @@ | |||
import torch as tc | |||
import torch.nn as nn | |||
import torch.nn.functional as F | |||
import sys | |||
import os | |||
import math | |||
from fastNLP.core.const import Const | |||
class RNNModel(nn.Module): | |||
def __init__(self, input_size, hidden_size, num_layers, bidrect, dropout): | |||
super(RNNModel, self).__init__() | |||
if num_layers <= 1: | |||
dropout = 0.0 | |||
self.rnn = nn.GRU(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, | |||
batch_first=True, dropout=dropout, bidirectional=bidrect) | |||
self.number = (2 if bidrect else 1) * num_layers | |||
def forward(self, x, mask): | |||
''' | |||
mask: (batch_size, seq_len) | |||
x: (batch_size, seq_len, input_size) | |||
''' | |||
lens = (mask).long().sum(dim=1) | |||
lens, idx_sort = tc.sort(lens, descending=True) | |||
_, idx_unsort = tc.sort(idx_sort) | |||
x = x[idx_sort] | |||
x = nn.utils.rnn.pack_padded_sequence(x, lens, batch_first=True) | |||
self.rnn.flatten_parameters() | |||
y, h = self.rnn(x) | |||
y, lens = nn.utils.rnn.pad_packed_sequence(y, batch_first=True) | |||
h = h.transpose(0,1).contiguous() #make batch size first | |||
y = y[idx_unsort] #(batch_size, seq_len, bid * hid_size) | |||
h = h[idx_unsort] #(batch_size, number, hid_size) | |||
return y, h | |||
class Contexualizer(nn.Module): | |||
def __init__(self, input_size, hidden_size, num_layers=1, dropout=0.3): | |||
super(Contexualizer, self).__init__() | |||
self.rnn = RNNModel(input_size, hidden_size, num_layers, True, dropout) | |||
self.output_size = hidden_size * 2 | |||
self.reset_parameters() | |||
def reset_parameters(self): | |||
weights = self.rnn.rnn.all_weights | |||
for w1 in weights: | |||
for w2 in w1: | |||
if len(list(w2.size())) <= 1: | |||
w2.data.fill_(0) | |||
else: nn.init.xavier_normal_(w2.data, gain=1.414) | |||
def forward(self, s, mask): | |||
y = self.rnn(s, mask)[0] # (batch_size, seq_len, 2 * hidden_size) | |||
return y | |||
class ConcatAttention_Param(nn.Module): | |||
def __init__(self, input_size, hidden_size, dropout=0.2): | |||
super(ConcatAttention_Param, self).__init__() | |||
self.ln = nn.Linear(input_size + hidden_size, hidden_size) | |||
self.v = nn.Linear(hidden_size, 1, bias=False) | |||
self.vq = nn.Parameter(tc.rand(hidden_size)) | |||
self.drop = nn.Dropout(dropout) | |||
self.output_size = input_size | |||
self.reset_parameters() | |||
def reset_parameters(self): | |||
nn.init.xavier_uniform_(self.v.weight.data) | |||
nn.init.xavier_uniform_(self.ln.weight.data) | |||
self.ln.bias.data.fill_(0) | |||
def forward(self, h, mask): | |||
''' | |||
h: (batch_size, len, input_size) | |||
mask: (batch_size, len) | |||
''' | |||
vq = self.vq.view(1,1,-1).expand(h.size(0), h.size(1), self.vq.size(0)) | |||
s = self.v(tc.tanh(self.ln(tc.cat([h,vq],-1)))).squeeze(-1) # (batch_size, len) | |||
s = s - ((mask == 0).float() * 10000) | |||
a = tc.softmax(s, dim=1) | |||
r = a.unsqueeze(-1) * h # (batch_size, len, input_size) | |||
r = tc.sum(r, dim=1) # (batch_size, input_size) | |||
return self.drop(r) | |||
def get_2dmask(mask_hq, mask_hp, siz=None): | |||
if siz is None: | |||
siz = (mask_hq.size(0), mask_hq.size(1), mask_hp.size(1)) | |||
mask_mat = 1 | |||
if mask_hq is not None: | |||
mask_mat = mask_mat * mask_hq.unsqueeze(2).expand(siz) | |||
if mask_hp is not None: | |||
mask_mat = mask_mat * mask_hp.unsqueeze(1).expand(siz) | |||
return mask_mat | |||
def Attention(hq, hp, mask_hq, mask_hp, my_method): | |||
standard_size = (hq.size(0), hq.size(1), hp.size(1), hq.size(-1)) | |||
mask_mat = get_2dmask(mask_hq, mask_hp, standard_size[:-1]) | |||
hq_mat = hq.unsqueeze(2).expand(standard_size) | |||
hp_mat = hp.unsqueeze(1).expand(standard_size) | |||
s = my_method(hq_mat, hp_mat) # (batch_size, len_q, len_p) | |||
s = s - ((mask_mat == 0).float() * 10000) | |||
a = tc.softmax(s, dim=1) | |||
q = a.unsqueeze(-1) * hq_mat #(batch_size, len_q, len_p, input_size) | |||
q = tc.sum(q, dim=1) #(batch_size, len_p, input_size) | |||
return q | |||
class ConcatAttention(nn.Module): | |||
def __init__(self, input_size, hidden_size, dropout=0.2, input_size_2=-1): | |||
super(ConcatAttention, self).__init__() | |||
if input_size_2 < 0: | |||
input_size_2 = input_size | |||
self.ln = nn.Linear(input_size + input_size_2, hidden_size) | |||
self.v = nn.Linear(hidden_size, 1, bias=False) | |||
self.drop = nn.Dropout(dropout) | |||
self.output_size = input_size | |||
self.reset_parameters() | |||
def reset_parameters(self): | |||
nn.init.xavier_uniform_(self.v.weight.data) | |||
nn.init.xavier_uniform_(self.ln.weight.data) | |||
self.ln.bias.data.fill_(0) | |||
def my_method(self, hq_mat, hp_mat): | |||
s = tc.cat([hq_mat, hp_mat], dim=-1) | |||
s = self.v(tc.tanh(self.ln(s))).squeeze(-1) #(batch_size, len_q, len_p) | |||
return s | |||
def forward(self, hq, hp, mask_hq=None, mask_hp=None): | |||
''' | |||
hq: (batch_size, len_q, input_size) | |||
mask_hq: (batch_size, len_q) | |||
''' | |||
return self.drop(Attention(hq, hp, mask_hq, mask_hp, self.my_method)) | |||
class MinusAttention(nn.Module): | |||
def __init__(self, input_size, hidden_size, dropout=0.2): | |||
super(MinusAttention, self).__init__() | |||
self.ln = nn.Linear(input_size, hidden_size) | |||
self.v = nn.Linear(hidden_size, 1, bias=False) | |||
self.drop = nn.Dropout(dropout) | |||
self.output_size = input_size | |||
self.reset_parameters() | |||
def reset_parameters(self): | |||
nn.init.xavier_uniform_(self.v.weight.data) | |||
nn.init.xavier_uniform_(self.ln.weight.data) | |||
self.ln.bias.data.fill_(0) | |||
def my_method(self, hq_mat, hp_mat): | |||
s = hq_mat - hp_mat | |||
s = self.v(tc.tanh(self.ln(s))).squeeze(-1) #(batch_size, len_q, len_p) s[j,t] | |||
return s | |||
def forward(self, hq, hp, mask_hq=None, mask_hp=None): | |||
return self.drop(Attention(hq, hp, mask_hq, mask_hp, self.my_method)) | |||
class DotProductAttention(nn.Module): | |||
def __init__(self, input_size, hidden_size, dropout=0.2): | |||
super(DotProductAttention, self).__init__() | |||
self.ln = nn.Linear(input_size, hidden_size) | |||
self.v = nn.Linear(hidden_size, 1, bias=False) | |||
self.drop = nn.Dropout(dropout) | |||
self.output_size = input_size | |||
self.reset_parameters() | |||
def reset_parameters(self): | |||
nn.init.xavier_uniform_(self.v.weight.data) | |||
nn.init.xavier_uniform_(self.ln.weight.data) | |||
self.ln.bias.data.fill_(0) | |||
def my_method(self, hq_mat, hp_mat): | |||
s = hq_mat * hp_mat | |||
s = self.v(tc.tanh(self.ln(s))).squeeze(-1) #(batch_size, len_q, len_p) s[j,t] | |||
return s | |||
def forward(self, hq, hp, mask_hq=None, mask_hp=None): | |||
return self.drop(Attention(hq, hp, mask_hq, mask_hp, self.my_method)) | |||
class BiLinearAttention(nn.Module): | |||
def __init__(self, input_size, hidden_size, dropout=0.2, input_size_2=-1): | |||
super(BiLinearAttention, self).__init__() | |||
input_size_2 = input_size if input_size_2 < 0 else input_size_2 | |||
self.ln = nn.Linear(input_size_2, input_size) | |||
self.drop = nn.Dropout(dropout) | |||
self.output_size = input_size | |||
self.reset_parameters() | |||
def reset_parameters(self): | |||
nn.init.xavier_uniform_(self.ln.weight.data) | |||
self.ln.bias.data.fill_(0) | |||
def my_method(self, hq, hp, mask_p): | |||
# (bs, len, input_size) | |||
hp = self.ln(hp) | |||
hp = hp * mask_p.unsqueeze(-1) | |||
s = tc.matmul(hq, hp.transpose(-1,-2)) | |||
return s | |||
def forward(self, hq, hp, mask_hq=None, mask_hp=None): | |||
standard_size = (hq.size(0), hq.size(1), hp.size(1), hq.size(-1)) | |||
mask_mat = get_2dmask(mask_hq, mask_hp, standard_size[:-1]) | |||
s = self.my_method(hq, hp, mask_hp) # (batch_size, len_q, len_p) | |||
s = s - ((mask_mat == 0).float() * 10000) | |||
a = tc.softmax(s, dim=1) | |||
hq_mat = hq.unsqueeze(2).expand(standard_size) | |||
q = a.unsqueeze(-1) * hq_mat #(batch_size, len_q, len_p, input_size) | |||
q = tc.sum(q, dim=1) #(batch_size, len_p, input_size) | |||
return self.drop(q) | |||
class AggAttention(nn.Module): | |||
def __init__(self, input_size, hidden_size, dropout=0.2): | |||
super(AggAttention, self).__init__() | |||
self.ln = nn.Linear(input_size + hidden_size, hidden_size) | |||
self.v = nn.Linear(hidden_size, 1, bias=False) | |||
self.vq = nn.Parameter(tc.rand(hidden_size, 1)) | |||
self.drop = nn.Dropout(dropout) | |||
self.output_size = input_size | |||
self.reset_parameters() | |||
def reset_parameters(self): | |||
nn.init.xavier_uniform_(self.vq.data) | |||
nn.init.xavier_uniform_(self.v.weight.data) | |||
nn.init.xavier_uniform_(self.ln.weight.data) | |||
self.ln.bias.data.fill_(0) | |||
self.vq.data = self.vq.data[:,0] | |||
def forward(self, hs, mask): | |||
''' | |||
hs: [(batch_size, len_q, input_size), ...] | |||
mask: (batch_size, len_q) | |||
''' | |||
hs = tc.cat([h.unsqueeze(0) for h in hs], dim=0)# (4, batch_size, len_q, input_size) | |||
vq = self.vq.view(1,1,1,-1).expand(hs.size(0), hs.size(1), hs.size(2), self.vq.size(0)) | |||
s = self.v(tc.tanh(self.ln(tc.cat([hs,vq],-1)))).squeeze(-1)# (4, batch_size, len_q) | |||
s = s - ((mask.unsqueeze(0) == 0).float() * 10000) | |||
a = tc.softmax(s, dim=0) | |||
x = a.unsqueeze(-1) * hs | |||
x = tc.sum(x, dim=0)#(batch_size, len_q, input_size) | |||
return self.drop(x) | |||
class Aggragator(nn.Module): | |||
def __init__(self, input_size, hidden_size, dropout=0.3): | |||
super(Aggragator, self).__init__() | |||
now_size = input_size | |||
self.ln = nn.Linear(2 * input_size, 2 * input_size) | |||
now_size = 2 * input_size | |||
self.rnn = Contexualizer(now_size, hidden_size, 2, dropout) | |||
now_size = self.rnn.output_size | |||
self.agg_att = AggAttention(now_size, now_size, dropout) | |||
now_size = self.agg_att.output_size | |||
self.agg_rnn = Contexualizer(now_size, hidden_size, 2, dropout) | |||
self.drop = nn.Dropout(dropout) | |||
self.output_size = self.agg_rnn.output_size | |||
def forward(self, qs, hp, mask): | |||
''' | |||
qs: [ (batch_size, len_p, input_size), ...] | |||
hp: (batch_size, len_p, input_size) | |||
mask is the same as hp's mask | |||
''' | |||
hs = [0 for _ in range(len(qs))] | |||
for i in range(len(qs)): | |||
q = qs[i] | |||
x = tc.cat([q, hp], dim=-1) | |||
g = tc.sigmoid(self.ln(x)) | |||
x_star = x * g | |||
h = self.rnn(x_star, mask) | |||
hs[i] = h | |||
x = self.agg_att(hs, mask) #(batch_size, len_p, output_size) | |||
h = self.agg_rnn(x, mask) #(batch_size, len_p, output_size) | |||
return self.drop(h) | |||
class Mwan_Imm(nn.Module): | |||
def __init__(self, input_size, hidden_size, num_class=3, dropout=0.2, use_allennlp=False): | |||
super(Mwan_Imm, self).__init__() | |||
now_size = input_size | |||
self.enc_s1 = Contexualizer(now_size, hidden_size, 2, dropout) | |||
self.enc_s2 = Contexualizer(now_size, hidden_size, 2, dropout) | |||
now_size = self.enc_s1.output_size | |||
self.att_c = ConcatAttention(now_size, hidden_size, dropout) | |||
self.att_b = BiLinearAttention(now_size, hidden_size, dropout) | |||
self.att_d = DotProductAttention(now_size, hidden_size, dropout) | |||
self.att_m = MinusAttention(now_size, hidden_size, dropout) | |||
now_size = self.att_c.output_size | |||
self.agg = Aggragator(now_size, hidden_size, dropout) | |||
now_size = self.enc_s1.output_size | |||
self.pred_1 = ConcatAttention_Param(now_size, hidden_size, dropout) | |||
now_size = self.agg.output_size | |||
self.pred_2 = ConcatAttention(now_size, hidden_size, dropout, | |||
input_size_2=self.pred_1.output_size) | |||
now_size = self.pred_2.output_size | |||
self.ln1 = nn.Linear(now_size, hidden_size) | |||
self.ln2 = nn.Linear(hidden_size, num_class) | |||
self.reset_parameters() | |||
def reset_parameters(self): | |||
nn.init.xavier_uniform_(self.ln1.weight.data) | |||
nn.init.xavier_uniform_(self.ln2.weight.data) | |||
self.ln1.bias.data.fill_(0) | |||
self.ln2.bias.data.fill_(0) | |||
def forward(self, s1, s2, mas_s1, mas_s2): | |||
hq = self.enc_s1(s1, mas_s1) #(batch_size, len_q, output_size) | |||
hp = self.enc_s1(s2, mas_s2) | |||
mas_s1 = mas_s1[:,:hq.size(1)] | |||
mas_s2 = mas_s2[:,:hp.size(1)] | |||
mas_q, mas_p = mas_s1, mas_s2 | |||
qc = self.att_c(hq, hp, mas_s1, mas_s2) #(batch_size, len_p, output_size) | |||
qb = self.att_b(hq, hp, mas_s1, mas_s2) | |||
qd = self.att_d(hq, hp, mas_s1, mas_s2) | |||
qm = self.att_m(hq, hp, mas_s1, mas_s2) | |||
ho = self.agg([qc,qb,qd,qm], hp, mas_s2) #(batch_size, len_p, output_size) | |||
rq = self.pred_1(hq, mas_q) #(batch_size, output_size) | |||
rp = self.pred_2(ho, rq.unsqueeze(1), mas_p)#(batch_size, 1, output_size) | |||
rp = rp.squeeze(1) #(batch_size, output_size) | |||
rp = F.relu(self.ln1(rp)) | |||
rp = self.ln2(rp) | |||
return rp | |||
class MwanModel(nn.Module): | |||
def __init__(self, num_class, EmbLayer, args_of_imm={}, ElmoLayer=None): | |||
super(MwanModel, self).__init__() | |||
self.emb = EmbLayer | |||
if ElmoLayer is not None: | |||
self.elmo = ElmoLayer | |||
self.elmo_preln = nn.Linear(3 * self.elmo.emb_size, self.elmo.emb_size) | |||
self.elmo_ln = nn.Linear(args_of_imm["input_size"] + | |||
self.elmo.emb_size, args_of_imm["input_size"]) | |||
else: | |||
self.elmo = None | |||
self.imm = Mwan_Imm(num_class=num_class, **args_of_imm) | |||
self.drop = nn.Dropout(args_of_imm["dropout"]) | |||
def forward(self, words1, words2, str_s1=None, str_s2=None, *pargs, **kwargs): | |||
''' | |||
str_s is for ELMo; however, ELMo is not used here | |||
str_s: (batch_size, seq_len, word_len) | |||
''' | |||
s1, s2 = words1, words2 | |||
mas_s1 = (s1 != 0).float() # mas: (batch_size, seq_len) | |||
mas_s2 = (s2 != 0).float() # mas: (batch_size, seq_len) | |||
mas_s1.requires_grad = False | |||
mas_s2.requires_grad = False | |||
s1_emb = self.emb(s1) | |||
s2_emb = self.emb(s2) | |||
if self.elmo is not None: | |||
s1_elmo = self.elmo(str_s1) | |||
s2_elmo = self.elmo(str_s2) | |||
s1_elmo = tc.tanh(self.elmo_preln(tc.cat(s1_elmo, dim=-1))) | |||
s2_elmo = tc.tanh(self.elmo_preln(tc.cat(s2_elmo, dim=-1))) | |||
s1_emb = tc.cat([s1_emb, s1_elmo], dim=-1) | |||
s2_emb = tc.cat([s2_emb, s2_elmo], dim=-1) | |||
s1_emb = tc.tanh(self.elmo_ln(s1_emb)) | |||
s2_emb = tc.tanh(self.elmo_ln(s2_emb)) | |||
s1_emb = self.drop(s1_emb) | |||
s2_emb = self.drop(s2_emb) | |||
y = self.imm(s1_emb, s2_emb, mas_s1, mas_s2) | |||
return { | |||
Const.OUTPUT: y, | |||
} |
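The attention modules above all mask padded positions the same way: they subtract 10000 from the score wherever the mask is 0, then apply a softmax over the query axis and take a weighted sum of the query vectors. The sketch below illustrates that additive-mask trick for a single passage position in plain Python, so it runs without PyTorch. The names `masked_softmax` and `attend` are hypothetical helpers written for this illustration and are not part of the repository.

```python
import math
from typing import List


def masked_softmax(scores: List[float], mask: List[int],
                   penalty: float = 10000.0) -> List[float]:
    """Softmax over scores after subtracting `penalty` where mask == 0."""
    shifted = [s - penalty * (1 if m == 0 else 0) for s, m in zip(scores, mask)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def attend(weights: List[float], vectors: List[List[float]]) -> List[float]:
    """Weighted sum of the vectors, one weight per vector."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


# Three query positions; the last one is padding (mask == 0).
weights = masked_softmax([1.0, 2.0, 3.0], [1, 1, 0])
summary = attend(weights, [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(weights)
print(summary)
```

The padded position has the highest raw score, yet its weight is effectively zero after the penalty, so its vector does not contribute to the summary.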
@@ -1,7 +1,7 @@ | |||
from fastNLP.io.embed_loader import EmbeddingOption, EmbedLoader | |||
from fastNLP.core.vocabulary import VocabularyOption | |||
from fastNLP.io.base_loader import DataSetLoader, DataInfo | |||
from fastNLP.io.base_loader import DataSetLoader, DataBundle | |||
from typing import Union, Dict, List, Iterator | |||
from fastNLP import DataSet | |||
from fastNLP import Instance | |||
@@ -161,7 +161,7 @@ class SigHanLoader(DataSetLoader): | |||
# We recommend using check_dataloader_paths to validate paths | |||
paths = check_dataloader_paths(paths) | |||
datasets = {} | |||
data = DataInfo() | |||
data = DataBundle() | |||
bigram = bigram_vocab_opt is not None | |||
for name, path in paths.items(): | |||
dataset = self.load(path, bigram=bigram) | |||
@@ -0,0 +1,93 @@ | |||
from fastNLP.core.vocabulary import VocabularyOption | |||
from fastNLP.io.base_loader import DataSetLoader, DataBundle | |||
from typing import Union, Dict | |||
from fastNLP import Vocabulary | |||
from fastNLP import Const | |||
from reproduction.utils import check_dataloader_paths | |||
from fastNLP.io import ConllLoader | |||
from reproduction.seqence_labelling.ner.data.utils import iob2bioes, iob2 | |||
class Conll2003DataLoader(DataSetLoader): | |||
def __init__(self, task:str='ner', encoding_type:str='bioes'): | |||
""" | |||
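Load an English corpus in CoNLL-2003 format; information about the dataset is available at | |||
https://www.clips.uantwerpen.be/conll2003/ner/. When task is pos, the target of the returned DataSet is taken from | |||
column 2; when task is chunk, from column 3; when task is ner, from column 4. All "-DOCSTART- -X- O O" lines are | |||
ignored, so the number of instances is lower than the figures reported in much of the literature. Since | |||
"-DOCSTART- -X- O O" is only a document separator and should not be a prediction target, lines starting with | |||
-DOCSTART- are skipped. For the ner and chunk tasks the loaded target uses the encoding_type scheme; for the pos | |||
task it is simply the POS column. | |||
:param task: the tagging task to load; one of ner, pos, chunk | |||
:param encoding_type: tag scheme for ner and chunk targets; 'bioes' converts to BIOES, any other value keeps IOB2 | |||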
""" | |||
assert task in ('ner', 'pos', 'chunk') | |||
index = {'ner':3, 'pos':1, 'chunk':2}[task] | |||
self._loader = ConllLoader(headers=['raw_words', 'target'], indexes=[0, index]) | |||
self._tag_converters = [] | |||
if task in ('ner', 'chunk'): | |||
self._tag_converters = [iob2] | |||
if encoding_type == 'bioes': | |||
self._tag_converters.append(iob2bioes) | |||
def load(self, path: str): | |||
dataset = self._loader.load(path) | |||
def convert_tag_schema(tags): | |||
for converter in self._tag_converters: | |||
tags = converter(tags) | |||
return tags | |||
if self._tag_converters: | |||
dataset.apply_field(convert_tag_schema, field_name=Const.TARGET, new_field_name=Const.TARGET) | |||
return dataset | |||
def process(self, paths: Union[str, Dict[str, str]], word_vocab_opt:VocabularyOption=None, lower:bool=False): | |||
""" | |||
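Read and process the data. Lines starting with '-DOCSTART-' are ignored. | |||
:param paths: | |||
:param word_vocab_opt: initialization options for the word vocabulary | |||
:param lower: whether to convert all letters to lowercase. | |||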
:return: | |||
""" | |||
# read the data | |||
paths = check_dataloader_paths(paths) | |||
data = DataBundle() | |||
input_fields = [Const.TARGET, Const.INPUT, Const.INPUT_LEN] | |||
target_fields = [Const.TARGET, Const.INPUT_LEN] | |||
for name, path in paths.items(): | |||
dataset = self.load(path) | |||
dataset.apply_field(lambda words: words, field_name='raw_words', new_field_name=Const.INPUT) | |||
if lower: | |||
dataset.words.lower() | |||
data.datasets[name] = dataset | |||
# build the word vocab | |||
word_vocab = Vocabulary(min_freq=2) if word_vocab_opt is None else Vocabulary(**word_vocab_opt) | |||
word_vocab.from_dataset(data.datasets['train'], field_name=Const.INPUT, | |||
no_create_entry_dataset=[dataset for name, dataset in data.datasets.items() if name!='train']) | |||
word_vocab.index_dataset(*data.datasets.values(), field_name=Const.INPUT, new_field_name=Const.INPUT) | |||
data.vocabs[Const.INPUT] = word_vocab | |||
# cap words | |||
cap_word_vocab = Vocabulary() | |||
cap_word_vocab.from_dataset(data.datasets['train'], field_name='raw_words', | |||
no_create_entry_dataset=[dataset for name, dataset in data.datasets.items() if name!='train']) | |||
cap_word_vocab.index_dataset(*data.datasets.values(), field_name='raw_words', new_field_name='cap_words') | |||
input_fields.append('cap_words') | |||
data.vocabs['cap_words'] = cap_word_vocab | |||
# build the target vocab | |||
target_vocab = Vocabulary(unknown=None, padding=None) | |||
target_vocab.from_dataset(*data.datasets.values(), field_name=Const.TARGET) | |||
target_vocab.index_dataset(*data.datasets.values(), field_name=Const.TARGET) | |||
data.vocabs[Const.TARGET] = target_vocab | |||
for name, dataset in data.datasets.items(): | |||
dataset.add_seq_len(Const.INPUT, new_field_name=Const.INPUT_LEN) | |||
dataset.set_input(*input_fields) | |||
dataset.set_target(*target_fields) | |||
return data | |||
if __name__ == '__main__': | |||
pass |
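The loader above picks the target column by task (`pos` is index 1, `chunk` is index 2, `ner` is index 3) and skips `-DOCSTART-` lines. The sketch below shows that column selection on a few in-memory lines. `read_conll_columns` is a hypothetical helper written for this illustration; it is not how `ConllLoader` is implemented, and it assumes whitespace-separated columns with blank lines between sentences.

```python
from typing import Iterable, List, Tuple

# Zero-based column index per task, as in Conll2003DataLoader.__init__.
COLUMN_OF_TASK = {"pos": 1, "chunk": 2, "ner": 3}


def read_conll_columns(lines: Iterable[str],
                       task: str = "ner") -> List[Tuple[List[str], List[str]]]:
    """Return one (words, tags) pair per sentence, skipping -DOCSTART- lines."""
    index = COLUMN_OF_TASK[task]
    sentences = []
    words, tags = [], []
    for line in lines:
        line = line.strip()
        if line.startswith("-DOCSTART-"):
            continue  # document separator, not a token
        if not line:
            if words:  # a blank line closes the current sentence
                sentences.append((words, tags))
                words, tags = [], []
            continue
        parts = line.split()
        words.append(parts[0])
        tags.append(parts[index])
    if words:  # the last sentence may not be followed by a blank line
        sentences.append((words, tags))
    return sentences


SAMPLE = [
    "-DOCSTART- -X- O O",
    "",
    "EU NNP I-NP I-ORG",
    "rejects VBZ I-VP O",
    "German JJ I-NP I-MISC",
    "",
    "Peter NNP I-NP I-PER",
]

print(read_conll_columns(SAMPLE, task="ner"))
print(read_conll_columns(SAMPLE, task="pos"))
```

Because the separator line is dropped before sentences are assembled, it never becomes a one-token instance, which is why the instance counts come out lower than some published figures.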
@@ -0,0 +1,152 @@ | |||
from fastNLP.core.vocabulary import VocabularyOption | |||
from fastNLP.io.base_loader import DataSetLoader, DataBundle | |||
from typing import Union, Dict | |||
from fastNLP import DataSet | |||
from fastNLP import Vocabulary | |||
from fastNLP import Const | |||
from reproduction.utils import check_dataloader_paths | |||
from fastNLP.io.dataset_loader import ConllLoader | |||
from reproduction.seqence_labelling.ner.data.utils import iob2bioes, iob2 | |||
class OntoNoteNERDataLoader(DataSetLoader): | |||
""" | |||
Loads OntoNotes data that has been converted to CoNLL format. For the conversion procedure, see https://github.com/yhcc/OntoNotes-5.0-NER. | |||
""" | |||
def __init__(self, encoding_type:str='bioes'): | |||
assert encoding_type in ('bioes', 'bio') | |||
self.encoding_type = encoding_type | |||
if encoding_type=='bioes': | |||
self.encoding_method = iob2bioes | |||
else: | |||
self.encoding_method = iob2 | |||
def load(self, path:str)->DataSet: | |||
""" | |||
Read data from the given file path. The returned DataSet contains the following fields: | |||
raw_words: List[str] | |||
target: List[str] | |||
:param path: | |||
:return: | |||
""" | |||
dataset = ConllLoader(headers=['raw_words', 'target'], indexes=[3, 10]).load(path) | |||
def convert_to_bio(tags): | |||
bio_tags = [] | |||
flag = None | |||
for tag in tags: | |||
label = tag.strip("()*") | |||
if '(' in tag: | |||
bio_label = 'B-' + label | |||
flag = label | |||
elif flag: | |||
bio_label = 'I-' + flag | |||
else: | |||
bio_label = 'O' | |||
if ')' in tag: | |||
flag = None | |||
bio_tags.append(bio_label) | |||
return self.encoding_method(bio_tags) | |||
def convert_word(words): | |||
converted_words = [] | |||
for word in words: | |||
word = word.replace('/.', '.') # some sentence-final periods appear as '/.' | |||
if not word.startswith('-'): | |||
converted_words.append(word) | |||
continue | |||
# these bracket symbols were escaped in the data; map them back | |||
tfrs = {'-LRB-':'(', | |||
'-RRB-': ')', | |||
'-LSB-': '[', | |||
'-RSB-': ']', | |||
'-LCB-': '{', | |||
'-RCB-': '}' | |||
} | |||
if word in tfrs: | |||
converted_words.append(tfrs[word]) | |||
else: | |||
converted_words.append(word) | |||
return converted_words | |||
dataset.apply_field(convert_word, field_name='raw_words', new_field_name='raw_words') | |||
dataset.apply_field(convert_to_bio, field_name='target', new_field_name='target') | |||
return dataset | |||
def process(self, paths: Union[str, Dict[str, str]], word_vocab_opt:VocabularyOption=None, | |||
lower:bool=True)->DataBundle: | |||
""" | |||
Read and process the data. The returned DataBundle contains the following: | |||
vocabs: | |||
word: Vocabulary | |||
target: Vocabulary | |||
datasets: | |||
train: DataSet | |||
words: List[int], set as input | |||
target: List[int], the label indices; set as both input and target | |||
seq_len: int, the sentence length; set as both input and target | |||
raw_words: List[str] | |||
xxx (other splits, depending on the paths passed in) | |||
:param paths: | |||
:param word_vocab_opt: initialization options for the word vocabulary | |||
:param lower: whether to lowercase the words | |||
:return: | |||
""" | |||
paths = check_dataloader_paths(paths) | |||
data = DataBundle() | |||
input_fields = [Const.TARGET, Const.INPUT, Const.INPUT_LEN] | |||
target_fields = [Const.TARGET, Const.INPUT_LEN] | |||
for name, path in paths.items(): | |||
dataset = self.load(path) | |||
dataset.apply_field(lambda words: words, field_name='raw_words', new_field_name=Const.INPUT) | |||
if lower: | |||
dataset.words.lower() | |||
data.datasets[name] = dataset | |||
# build the word vocab | |||
word_vocab = Vocabulary(min_freq=2) if word_vocab_opt is None else Vocabulary(**word_vocab_opt) | |||
word_vocab.from_dataset(data.datasets['train'], field_name=Const.INPUT, | |||
no_create_entry_dataset=[dataset for name, dataset in data.datasets.items() if name!='train']) | |||
word_vocab.index_dataset(*data.datasets.values(), field_name=Const.INPUT, new_field_name=Const.INPUT) | |||
data.vocabs[Const.INPUT] = word_vocab | |||
# cap words | |||
cap_word_vocab = Vocabulary() | |||
cap_word_vocab.from_dataset(*data.datasets.values(), field_name='raw_words') | |||
cap_word_vocab.index_dataset(*data.datasets.values(), field_name='raw_words', new_field_name='cap_words') | |||
input_fields.append('cap_words') | |||
data.vocabs['cap_words'] = cap_word_vocab | |||
# build the target vocab | |||
target_vocab = Vocabulary(unknown=None, padding=None) | |||
target_vocab.from_dataset(*data.datasets.values(), field_name=Const.TARGET) | |||
target_vocab.index_dataset(*data.datasets.values(), field_name=Const.TARGET) | |||
data.vocabs[Const.TARGET] = target_vocab | |||
for name, dataset in data.datasets.items(): | |||
dataset.add_seq_len(Const.INPUT, new_field_name=Const.INPUT_LEN) | |||
dataset.set_input(*input_fields) | |||
dataset.set_target(*target_fields) | |||
return data | |||
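The `convert_to_bio` helper inside `load` turns the bracketed OntoNotes NER column (for example `(PERSON*`, `*)`, `*`) into BIO tags before the encoding method is applied. The sketch below restates that logic as a standalone function so it can be tried without the repository. The name `brackets_to_bio` is hypothetical, and the sample column is made up for illustration.

```python
from typing import List, Optional


def brackets_to_bio(tags: List[str]) -> List[str]:
    """Convert a bracketed NER column such as '(PERSON*' / '*)' to BIO tags."""
    bio_tags = []
    open_label: Optional[str] = None  # label of the span that is currently open
    for tag in tags:
        label = tag.strip("()*")
        if "(" in tag:  # a new span starts on this token
            bio_label = "B-" + label
            open_label = label
        elif open_label:  # still inside the span opened earlier
            bio_label = "I-" + open_label
        else:
            bio_label = "O"
        if ")" in tag:  # the span ends on this token
            open_label = None
        bio_tags.append(bio_label)
    return bio_tags


column = ["(PERSON*", "*)", "*", "(GPE)"]
print(brackets_to_bio(column))  # ['B-PERSON', 'I-PERSON', 'O', 'B-GPE']
```

A single-token entity such as `(GPE)` opens and closes on the same token, so it yields a lone `B-` tag and leaves no span open for the next token.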
if __name__ == '__main__': | |||
loader = OntoNoteNERDataLoader() | |||
dataset = loader.load('/hdd/fudanNLP/fastNLP/others/data/v4/english/test.txt') | |||
print(dataset.target.value_count()) | |||
print(dataset[:4]) | |||
""" | |||
train 115812 2200752 | |||
development 15680 304684 | |||
test 12217 230111 | |||
train 92403 1901772 | |||
valid 13606 279180 | |||
test 10258 204135 | |||
""" |
@@ -0,0 +1,49 @@ | |||
from typing import List | |||
def iob2(tags:List[str])->List[str]: | |||
""" | |||
Check that the tags are valid IOB data; IOB1 tags are converted to IOB2 automatically. | |||
:param tags: the tags to convert (modified in place and returned) | |||
""" | |||
for i, tag in enumerate(tags): | |||
if tag == "O": | |||
continue | |||
split = tag.split("-") | |||
if len(split) != 2 or split[0] not in ["I", "B"]: | |||
raise TypeError("The encoding schema is not a valid IOB type.") | |||
if split[0] == "B": | |||
continue | |||
elif i == 0 or tags[i - 1] == "O": # conversion IOB1 to IOB2 | |||
tags[i] = "B" + tag[1:] | |||
elif tags[i - 1][1:] == tag[1:]: | |||
continue | |||
else: # conversion IOB1 to IOB2 | |||
tags[i] = "B" + tag[1:] | |||
return tags | |||
def iob2bioes(tags:List[str])->List[str]: | |||
""" | |||
Convert IOB tags to BIOES encoding | |||
:param tags: | |||
:return: | |||
""" | |||
new_tags = [] | |||
for i, tag in enumerate(tags): | |||
if tag == 'O': | |||
new_tags.append(tag) | |||
else: | |||
split = tag.split('-')[0] | |||
if split == 'B': | |||
if i+1!=len(tags) and tags[i+1].split('-')[0] == 'I': | |||
new_tags.append(tag) | |||
else: | |||
new_tags.append(tag.replace('B-', 'S-')) | |||
elif split == 'I': | |||
if i + 1<len(tags) and tags[i+1].split('-')[0] == 'I': | |||
new_tags.append(tag) | |||
else: | |||
new_tags.append(tag.replace('I-', 'E-')) | |||
else: | |||
raise TypeError("Invalid IOB format.") | |||
return new_tags |
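A short worked example may make the two conversions above easier to follow. The sketch below restates `iob2` and `iob2bioes` in condensed standalone form under the hypothetical names `to_iob2` and `to_bioes`, so it runs without the repository. One deliberate difference: `to_iob2` works on a copy, whereas the `iob2` above edits its argument in place.

```python
from typing import List


def to_iob2(tags: List[str]) -> List[str]:
    """IOB1 -> IOB2: every entity must start with a B- tag."""
    tags = list(tags)  # copy; the repository's iob2 mutates its input instead
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        split = tag.split("-")
        if len(split) != 2 or split[0] not in ("I", "B"):
            raise TypeError("The encoding schema is not a valid IOB type.")
        # An I- tag starts a new entity if nothing precedes it, the previous
        # tag is O, or the previous tag has a different label.
        starts_entity = i == 0 or tags[i - 1] == "O" or tags[i - 1][1:] != tag[1:]
        if split[0] == "I" and starts_entity:
            tags[i] = "B" + tag[1:]
    return tags


def to_bioes(tags: List[str]) -> List[str]:
    """IOB2 -> BIOES: mark single-token entities (S-) and entity ends (E-)."""
    new_tags = []
    for i, tag in enumerate(tags):
        if tag == "O":
            new_tags.append(tag)
            continue
        prefix = tag.split("-")[0]
        next_is_inside = i + 1 < len(tags) and tags[i + 1].split("-")[0] == "I"
        if prefix == "B":
            new_tags.append(tag if next_is_inside else tag.replace("B-", "S-"))
        elif prefix == "I":
            new_tags.append(tag if next_is_inside else tag.replace("I-", "E-"))
        else:
            raise TypeError("Invalid IOB format.")
    return new_tags


iob1 = ["I-PER", "I-PER", "O", "I-LOC", "I-ORG"]
iob2_tags = to_iob2(iob1)
print(iob2_tags)            # ['B-PER', 'I-PER', 'O', 'B-LOC', 'B-ORG']
print(to_bioes(iob2_tags))  # ['B-PER', 'E-PER', 'O', 'S-LOC', 'S-ORG']
```

Note that the BIOES step only checks whether the next tag begins with `I`, not whether its label matches, so it relies on the input already being valid IOB2.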