diff --git a/README.md b/README.md
index 9d949482..b294e54b 100644
--- a/README.md
+++ b/README.md
@@ -6,48 +6,59 @@
![Hex.pm](https://img.shields.io/hexpm/l/plug.svg)
[![Documentation Status](https://readthedocs.org/projects/fastnlp/badge/?version=latest)](http://fastnlp.readthedocs.io/?badge=latest)
-fastNLP is a lightweight NLP toolkit. You can use it to quickly build a named entity recognition (NER), Chinese word segmentation, or text classification system; you can also use it to construct complex network models for research. Its features include:
+fastNLP is a lightweight NLP toolkit. You can use it to quickly complete tasks such as sequence labelling ([NER](reproduction/seqence_labelling/ner), POS tagging, etc.), Chinese word segmentation, [text classification](reproduction/text_classification), [Matching](reproduction/matching), [coreference resolution](reproduction/coreference_resolution) and [summarization](reproduction/Summarization); you can also use it to construct complex network models for research. Its features include:
-- A unified tabular data container that keeps preprocessing simple and clear. Built-in DataSet Loaders for many datasets remove the need for preprocessing code.
-- Handy NLP utilities, such as pretrained embedding loading and caching of intermediate data;
-- Thorough Chinese documentation for reference;
+- A unified tabular data container that keeps preprocessing simple and clear. Built-in DataSet Loaders for many datasets remove the need for preprocessing code;
+- Various components for training and testing, such as the Trainer, the Tester, and assorted evaluation metrics;
+- Handy NLP utilities, such as pretrained embedding loading (including ELMo and BERT) and caching of intermediate data;
+- Thorough Chinese [documentation](https://fastnlp.readthedocs.io/) and [tutorials](https://fastnlp.readthedocs.io/zh/latest/user/tutorials.html) for reference;
- Many advanced modules, such as Variational LSTM, Transformer, and CRF;
-- Ready-to-use packaged models such as CNNText and Biaffine;
+- Ready-to-use packaged models for sequence labelling, Chinese word segmentation, text classification, Matching, coreference resolution, summarization and more; see [reproduction](reproduction) for details;
- A convenient and extensible Trainer, with many built-in callbacks for experiment logging, exception capture, and more.
## Installation
-fastNLP depends on the following packages:
+fastNLP depends on these packages:
-+ numpy
-+ torch>=0.4.0
-+ tqdm
-+ nltk
++ numpy>=1.14.2
++ torch>=1.0.0
++ tqdm>=4.28.1
++ nltk>=3.4.1
++ requests
++ spacy
-The installation of torch may depend on your operating system and CUDA version; see the PyTorch website .
-Once the dependency packages are installed, you can run the following command to complete the installation
+The installation of torch may depend on your operating system and CUDA version; see the [PyTorch website](https://pytorch.org/) .
+Once the dependencies are installed, you can run the following commands to complete the installation
```shell
pip install fastNLP
+python -m spacy download en
```
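To sanity-check the installation, here is a minimal sketch of the tabular DataSet workflow described above (the toy sentences and field names are illustrative placeholders, not part of this PR):

```python
from fastNLP import DataSet, Instance, Vocabulary

# Build a tiny DataSet by hand; real projects would use a DataSet Loader.
ds = DataSet()
ds.append(Instance(raw_words="I like this movie".split(), target="positive"))
ds.append(Instance(raw_words="boring and slow".split(), target="negative"))

# Index the words with a Vocabulary, then mark input/target fields.
vocab = Vocabulary()
vocab.from_dataset(ds, field_name="raw_words")
vocab.index_dataset(ds, field_name="raw_words", new_field_name="words")
ds.set_input("words")
ds.set_target("target")
print(ds[0])
```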
-## Resources
+## fastNLP Tutorials
-- [Documentation](https://fastnlp.readthedocs.io/zh/latest/)
-- [Source code](https://github.com/fastnlp/fastNLP)
+- [1. Preprocessing text with DataSet](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_1_data_preprocess.html)
+- [2. Loading data sets with DataSetLoader](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_2_load_dataset.html)
+- [3. Turning text into vectors with the Embedding module](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_3_embedding.html)
+- [4. Building a text classifier I: quick training and testing with Trainer and Tester](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_4_loss_optimizer.html)
+- [5. Building a text classifier II: a custom training loop with DataSetIter](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_5_datasetiter.html)
+- [6. A quick sequence labelling model](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_6_seq_labeling.html)
+- [7. Building custom models quickly with Modules and Models](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_7_modules_models.html)
+- [8. Evaluating your model quickly with Metric](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_8_metrics.html)
+- [9. Customizing your training loop with Callback](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_9_callback.html)
## Built-in Components
-Most neural networks for NLP tasks can be viewed as being composed of three kinds of modules: encoders, aggregators, and decoders.
+Most neural networks for NLP tasks can be viewed as being composed of two kinds of modules: encoders and decoders.
![](./docs/source/figures/text_classification.png)
-fastNLP ships many components of these three module types in its modules package, helping users quickly assemble the networks they need. The roles and common components of the three module types are:
+fastNLP ships many components of these two module types in its modules package, helping users quickly assemble the networks they need. The roles and common components of the two module types are:
@@ -57,29 +68,17 @@ fastNLP ships many components of these module types in its modules package, helping
encoder |
-            Encode the input into vectors with with representational capacity |
+            Encode the input into vectors with representational capacity |
embedding, RNN, CNN, transformer
|
-
- aggregator |
-            Aggregate information from multiple vectors |
- self-attention, max-pooling |
-
decoder |
-            Decode vectors carrying some representation into the required out put format |
+            Decode vectors carrying some representation into the required output format |
MLP, CRF |
-## Complete Models
-fastNLP implements many complete models for different NLP tasks, all of them trained and tested.
-
-You can find more information in the following two places
-- [Overview](reproduction/)
-- [Source code](fastNLP/models/)
-
## Project Structure
![](./docs/source/figures/workflow.png)
@@ -93,7 +92,7 @@ The overall workflow of fastNLP is shown above; the project structure is:
fastNLP.core |
-        Implements the core functionality, including data processing components, the trainer, the speed tester, etc. |
+        Implements the core functionality, including data processing components, the trainer, the tester, etc. |
fastNLP.models |
diff --git a/fastNLP/__init__.py b/fastNLP/__init__.py
index e666f65f..12d421a2 100644
--- a/fastNLP/__init__.py
+++ b/fastNLP/__init__.py
@@ -37,7 +37,7 @@ __all__ = [
"AccuracyMetric",
"SpanFPreRecMetric",
- "SQuADMetric",
+ "ExtractiveQAMetric",
"Optimizer",
"SGD",
@@ -61,3 +61,4 @@ __version__ = '0.4.0'
from .core import *
from . import models
from . import modules
+from .io import data_loader
diff --git a/fastNLP/core/__init__.py b/fastNLP/core/__init__.py
index 792bff66..efc83017 100644
--- a/fastNLP/core/__init__.py
+++ b/fastNLP/core/__init__.py
@@ -21,7 +21,7 @@ from .dataset import DataSet
from .field import FieldArray, Padder, AutoPadder, EngChar2DPadder
from .instance import Instance
from .losses import LossFunc, CrossEntropyLoss, L1Loss, BCELoss, NLLLoss, LossInForward
-from .metrics import AccuracyMetric, SpanFPreRecMetric, SQuADMetric
+from .metrics import AccuracyMetric, SpanFPreRecMetric, ExtractiveQAMetric
from .optimizer import Optimizer, SGD, Adam
from .sampler import SequentialSampler, BucketSampler, RandomSampler, Sampler
from .tester import Tester
diff --git a/fastNLP/core/batch.py b/fastNLP/core/batch.py
index ca48a8e1..2d8c1a80 100644
--- a/fastNLP/core/batch.py
+++ b/fastNLP/core/batch.py
@@ -3,7 +3,6 @@ The batch module implements the Batch classes required by fastNLP.
"""
__all__ = [
- "BatchIter",
"DataSetIter",
"TorchLoaderIter",
]
@@ -50,6 +49,7 @@ class DataSetGetter:
return len(self.dataset)
def collate_fn(self, batch: list):
+        # TODO support defining a collate_fn on the DataSet, since fields sometimes need to be combined, e.g. for BERT
batch_x = {n:[] for n in self.inputs.keys()}
batch_y = {n:[] for n in self.targets.keys()}
indices = []
@@ -136,6 +136,31 @@ class BatchIter:
class DataSetIter(BatchIter):
+ """
+    Alias: :class:`fastNLP.DataSetIter` :class:`fastNLP.core.batch.DataSetIter`
+
+    DataSetIter draws data from a `DataSet` in a given order, ``batch_size`` examples at a time,
+    assembling them into `x` and `y`::
+
+        batch = DataSetIter(data_set, batch_size=16, sampler=SequentialSampler())
+        num_batch = len(batch)
+        for batch_x, batch_y in batch:
+            # do stuff ...
+
+    :param dataset: a :class:`~fastNLP.DataSet` object, the data set to iterate over
+    :param int batch_size: the size of each batch
+    :param sampler: the :class:`~fastNLP.Sampler` to use; if ``None``, a :class:`~fastNLP.SequentialSampler` is used.
+
+        Default: ``None``
+    :param bool as_numpy: if ``True``, batches are returned as numpy.array, otherwise as :class:`torch.Tensor`.
+
+        Default: ``False``
+    :param int num_workers: number of worker processes used to preprocess the data
+    :param bool pin_memory: whether to place the produced tensors in pinned memory, which may speed things up.
+    :param bool drop_last: if the last batch has fewer than batch_size samples, drop it
+    :param timeout:
+    :param worker_init_fn: called once in each worker at startup, with the worker's index as its argument.
+ """
def __init__(self, dataset, batch_size=1, sampler=None, as_numpy=False,
num_workers=0, pin_memory=False, drop_last=False,
timeout=0, worker_init_fn=None):
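A slightly fuller sketch of the docstring's usage pattern (hedged: it assumes `ds` is a DataSet whose input and target fields are already set, as in the README quick-start sketch earlier in this diff):

```python
from fastNLP import SequentialSampler
from fastNLP.core.batch import DataSetIter

batch_iter = DataSetIter(ds, batch_size=16, sampler=SequentialSampler(),
                         as_numpy=False, num_workers=0, drop_last=False)
print(len(batch_iter))              # number of batches per epoch
for batch_x, batch_y in batch_iter:
    # batch_x and batch_y are dicts mapping field names to padded tensors
    pass
```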
diff --git a/fastNLP/core/callback.py b/fastNLP/core/callback.py
index 9c6b01d6..bbe2f325 100644
--- a/fastNLP/core/callback.py
+++ b/fastNLP/core/callback.py
@@ -66,6 +66,8 @@ import os
import torch
from copy import deepcopy
+import sys
+from .utils import _save_model
try:
from tensorboardX import SummaryWriter
@@ -399,10 +401,11 @@ class GradientClipCallback(Callback):
self.clip_value = clip_value
def on_backward_end(self):
- if self.parameters is None:
- self.clip_fun(self.model.parameters(), self.clip_value)
- else:
- self.clip_fun(self.parameters, self.clip_value)
+ if self.step%self.update_every==0:
+ if self.parameters is None:
+ self.clip_fun(self.model.parameters(), self.clip_value)
+ else:
+ self.clip_fun(self.parameters, self.clip_value)
class EarlyStopCallback(Callback):
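For context, the guard added above means clipping now fires once per optimizer update rather than after every backward pass when gradients are accumulated. A sketch (the `model` and `train_ds` names are placeholders):

```python
from fastNLP import Trainer
from fastNLP.core.callback import GradientClipCallback

# Accumulate gradients over 4 batches; clip only on the step that
# actually updates the parameters.
trainer = Trainer(train_data=train_ds, model=model, update_every=4,
                  callbacks=[GradientClipCallback(clip_type='norm', clip_value=5.0)])
trainer.train()
```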
@@ -736,6 +739,132 @@ class TensorboardCallback(Callback):
del self._summary_writer
+class WarmupCallback(Callback):
+ """
+    Adjusts the learning rate over the course of training.
+
+    :param int,float warmup: if warmup is an int, the learning rate follows the schedule until that step; if warmup
+        is a float such as 0.1, the first 10% of the steps form the warmup phase of the schedule.
+    :param str schedule: how to adjust the learning rate. linear: rise to the learning rate specified in the
+        Trainer's optimizer over the warmup steps, then decay to 0 over the remaining steps; constant: rise to the
+        specified learning rate over the warmup steps, then hold it for the remaining steps.
+ """
+ def __init__(self, warmup=0.1, schedule='constant'):
+ super().__init__()
+ self.warmup = max(warmup, 0.)
+
+        self.initial_lrs = []  # stores the learning rate of each param_group
+ if schedule == 'constant':
+ self.get_lr = self._get_constant_lr
+ elif schedule == 'linear':
+ self.get_lr = self._get_linear_lr
+ else:
+ raise RuntimeError("Only support 'linear', 'constant'.")
+
+ def _get_constant_lr(self, progress):
+        if progress < self.warmup:
+            return progress / self.warmup
+        return 1
+
+    def _get_linear_lr(self, progress):
+        if progress < self.warmup:
+            return progress / self.warmup
+        return max((progress - 1.) / (self.warmup - 1.), 0.)
+
+    def on_train_begin(self):
+        self.t_steps = (len(self.trainer.train_data) // (self.trainer.batch_size * self.trainer.update_every) +
+                        int(len(self.trainer.train_data) % (self.trainer.batch_size * self.trainer.update_every) != 0)) * self.trainer.n_epochs
+        if self.warmup > 1:
+ self.warmup = self.warmup/self.t_steps
+        self.t_steps = max(2, self.t_steps)  # must be at least 2
+        # record the initial learning rate of each param_group
+ for group in self.optimizer.param_groups:
+ self.initial_lrs.append(group['lr'])
+
+ def on_backward_end(self):
+ if self.step%self.update_every==0:
+ progress = (self.step/self.update_every)/self.t_steps
+ for lr, group in zip(self.initial_lrs, self.optimizer.param_groups):
+ group['lr'] = lr * self.get_lr(progress)
+
+
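A usage sketch for WarmupCallback (the `model`, `train_ds`, and `optimizer` names are placeholders):

```python
from fastNLP import Trainer
from fastNLP.core.callback import WarmupCallback

# Rise to the optimizer's learning rate over the first 10% of update
# steps, then decay linearly to 0 over the rest of training.
warmup = WarmupCallback(warmup=0.1, schedule='linear')
trainer = Trainer(train_data=train_ds, model=model, optimizer=optimizer,
                  n_epochs=3, callbacks=[warmup])
trainer.train()
```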
+class SaveModelCallback(Callback):
+ """
+    Since the Trainer only keeps the best model during training, this callback enables saving checkpoints in
+    several ways. A folder named after the timestamp at which training started is created under save_dir, and
+    the models are stored inside it:
+        -save_dir
+            -2019-07-03-15-06-36
+                -epoch:0_step:20_{metric_key}:{evaluate_performance}.pt   # metric_key is the given metric key, evaluate_performance its value
+                -epoch:1_step:40_{metric_key}:{evaluate_performance}.pt
+            -2019-07-03-15-10-00
+                -epoch:0_step:20_{metric_key}:{evaluate_performance}.pt
+    :param str save_dir: the directory to store models in; a subdirectory named after the timestamp is created inside it
+    :param int top: keep the top-k models by dev performance; -1 keeps every model.
+    :param bool only_param: whether to save only the model weights rather than the whole model.
+    :param save_on_exception: whether to save the model when an exception occurs, named epoch:x_step:x_Exception:{exception_name}.
+ """
+ def __init__(self, save_dir, top=3, only_param=False, save_on_exception=False):
+ super().__init__()
+
+ if not os.path.isdir(save_dir):
+            raise NotADirectoryError("{} is not a directory.".format(save_dir))
+ self.save_dir = save_dir
+ if top < 0:
+ self.top = sys.maxsize
+ else:
+ self.top = top
+        self._ordered_save_models = []  # List[Tuple]; Tuple[0] is the metric value, Tuple[1] the path. Metrics improve towards the tail, so deletion starts at the head
+
+ self.only_param = only_param
+ self.save_on_exception = save_on_exception
+
+ def on_train_begin(self):
+ self.save_dir = os.path.join(self.save_dir, self.trainer.start_time)
+
+ def on_valid_end(self, eval_result, metric_key, optimizer, is_better_eval):
+ metric_value = list(eval_result.values())[0][metric_key]
+ self._save_this_model(metric_value)
+
+ def _insert_into_ordered_save_models(self, pair):
+ # pair:(metric_value, model_name)
+        # returns the saved pair and the deleted pair; the first element of each pair is the metric value, the second the model name
+ index = -1
+ for _pair in self._ordered_save_models:
+ if _pair[0]>=pair[0] and self.trainer.increase_better:
+ break
+ if not self.trainer.increase_better and _pair[0]<=pair[0]:
+ break
+ index += 1
+ save_pair = None
+        if len(self._ordered_save_models) < self.top or (len(self._ordered_save_models) >= self.top and index != -1):
+ save_pair = pair
+ self._ordered_save_models.insert(index+1, pair)
+ delete_pair = None
+ if len(self._ordered_save_models)>self.top:
+ delete_pair = self._ordered_save_models.pop(0)
+ return save_pair, delete_pair
+
+ def _save_this_model(self, metric_value):
+ name = "epoch:{}_step:{}_{}:{:.6f}.pt".format(self.epoch, self.step, self.trainer.metric_key, metric_value)
+ save_pair, delete_pair = self._insert_into_ordered_save_models((metric_value, name))
+ if save_pair:
+ try:
+ _save_model(self.model, model_name=name, save_dir=self.save_dir, only_param=self.only_param)
+ except Exception as e:
+ print(f"The following exception:{e} happens when save model to {self.save_dir}.")
+ if delete_pair:
+ try:
+ delete_model_path = os.path.join(self.save_dir, delete_pair[1])
+ if os.path.exists(delete_model_path):
+ os.remove(delete_model_path)
+ except Exception as e:
+ print(f"Fail to delete model {name} at {self.save_dir} caused by exception:{e}.")
+
+ def on_exception(self, exception):
+ if self.save_on_exception:
+ name = "epoch:{}_step:{}_Exception:{}.pt".format(self.epoch, self.step, exception.__class__.__name__)
+ _save_model(self.model, model_name=name, save_dir=self.save_dir, only_param=self.only_param)
+
+
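A usage sketch for SaveModelCallback (placeholder names again; note that `save_dir` must already exist, since the constructor above raises otherwise):

```python
import os

from fastNLP import Trainer
from fastNLP.core.callback import SaveModelCallback

os.makedirs('checkpoints', exist_ok=True)   # the directory must exist
saver = SaveModelCallback('checkpoints', top=3, save_on_exception=True)
trainer = Trainer(train_data=train_ds, model=model, dev_data=dev_ds,
                  metrics=metric, callbacks=[saver])
trainer.train()   # keeps the 3 best checkpoints in a timestamped subfolder
```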
class CallbackException(BaseException):
"""
    When a callback needs to break out of the training loop, raise a CallbackException and catch it in on_exception.
diff --git a/fastNLP/core/losses.py b/fastNLP/core/losses.py
index 46a72802..14aacef0 100644
--- a/fastNLP/core/losses.py
+++ b/fastNLP/core/losses.py
@@ -20,6 +20,7 @@ from collections import defaultdict
import torch
import torch.nn.functional as F
+from ..core.const import Const
from .utils import _CheckError
from .utils import _CheckRes
from .utils import _build_args
@@ -28,6 +29,7 @@ from .utils import _check_function_or_method
from .utils import _get_func_signature
from .utils import seq_len_to_mask
+
class LossBase(object):
"""
所有loss的基类。如果想了解其中的原理,请查看源码。
@@ -95,22 +97,7 @@ class LossBase(object):
# if func_spect.varargs:
# raise NameError(f"Delete `*{func_spect.varargs}` in {get_func_signature(self.get_loss)}(Do not use "
# f"positional argument.).")
-
- def _fast_param_map(self, pred_dict, target_dict):
- """Only used as inner function. When the pred_dict, target is unequivocal. Don't need users to pass key_map.
- such as pred_dict has one element, target_dict has one element
- :param pred_dict:
- :param target_dict:
- :return: dict, if dict is not {}, pass it to self.evaluate. Otherwise do mapping.
- """
- fast_param = {}
- if len(self._param_map) == 2 and len(pred_dict) == 1 and len(target_dict) == 1:
- fast_param['pred'] = list(pred_dict.values())[0]
- fast_param['target'] = list(target_dict.values())[0]
- return fast_param
- return fast_param
-
def __call__(self, pred_dict, target_dict, check=False):
"""
        :param dict pred_dict: the dict returned by the model's forward function
@@ -118,11 +105,7 @@ class LossBase(object):
        :param Boolean check: whether to verify the mapping against the parameter map on every call; by default, no check
:return:
"""
- fast_param = self._fast_param_map(pred_dict, target_dict)
- if fast_param:
- loss = self.get_loss(**fast_param)
- return loss
-
+
if not self._checked:
# 1. check consistence between signature and _param_map
func_spect = inspect.getfullargspec(self.get_loss)
@@ -212,7 +195,6 @@ class LossFunc(LossBase):
if not isinstance(key_map, dict):
            raise RuntimeError(f"Loss error: key_map expects a {type({})} but got a {type(key_map)}")
self._init_param_map(key_map, **kwargs)
-
class CrossEntropyLoss(LossBase):
@@ -226,7 +208,7 @@ class CrossEntropyLoss(LossBase):
    :param seq_len: sentence lengths; tokens beyond the length do not contribute to the loss.
    :param padding_idx: the padding index; targets labelled padding_idx are ignored when computing the loss.
        This can be used instead of passing seq_len.
-    :param str reduction: supports 'elementwise_mean' and 'sum'.
+    :param str reduction: supports 'mean', 'sum' and 'none'.
Example::
@@ -234,16 +216,16 @@ class CrossEntropyLoss(LossBase):
"""
- def __init__(self, pred=None, target=None, seq_len=None, padding_idx=-100, reduction='elementwise_mean'):
+ def __init__(self, pred=None, target=None, seq_len=None, padding_idx=-100, reduction='mean'):
super(CrossEntropyLoss, self).__init__()
self._init_param_map(pred=pred, target=target, seq_len=seq_len)
self.padding_idx = padding_idx
- assert reduction in ('elementwise_mean', 'sum')
+ assert reduction in ('mean', 'sum', 'none')
self.reduction = reduction
def get_loss(self, pred, target, seq_len=None):
- if pred.dim()>2:
- if pred.size(1)!=target.size(1):
+ if pred.dim() > 2:
+ if pred.size(1) != target.size(1):
pred = pred.transpose(1, 2)
pred = pred.reshape(-1, pred.size(-1))
target = target.reshape(-1)
@@ -263,15 +245,18 @@ class L1Loss(LossBase):
    :param pred: the mapping for `pred` in the parameter map; None means the mapping `pred` -> `pred`
    :param target: the mapping for `target` in the parameter map; None means the mapping `target` -> `target`
+    :param str reduction: supports 'mean', 'sum' and 'none'.
"""
- def __init__(self, pred=None, target=None):
+ def __init__(self, pred=None, target=None, reduction='mean'):
super(L1Loss, self).__init__()
self._init_param_map(pred=pred, target=target)
+ assert reduction in ('mean', 'sum', 'none')
+ self.reduction = reduction
def get_loss(self, pred, target):
- return F.l1_loss(input=pred, target=target)
+ return F.l1_loss(input=pred, target=target, reduction=self.reduction)
class BCELoss(LossBase):
@@ -282,14 +267,17 @@ class BCELoss(LossBase):
    :param pred: the mapping for `pred` in the parameter map; None means the mapping `pred` -> `pred`
    :param target: the mapping for `target` in the parameter map; None means the mapping `target` -> `target`
+    :param str reduction: supports 'mean', 'sum' and 'none'.
"""
- def __init__(self, pred=None, target=None):
+ def __init__(self, pred=None, target=None, reduction='mean'):
super(BCELoss, self).__init__()
self._init_param_map(pred=pred, target=target)
+ assert reduction in ('mean', 'sum', 'none')
+ self.reduction = reduction
def get_loss(self, pred, target):
- return F.binary_cross_entropy(input=pred, target=target)
+ return F.binary_cross_entropy(input=pred, target=target, reduction=self.reduction)
class NLLLoss(LossBase):
@@ -300,14 +288,20 @@ class NLLLoss(LossBase):
    :param pred: the mapping for `pred` in the parameter map; None means the mapping `pred` -> `pred`
    :param target: the mapping for `target` in the parameter map; None means the mapping `target` -> `target`
+    :param ignore_idx: the index to ignore; targets labelled ignore_idx are skipped when computing the loss.
+        This can be used instead of passing seq_len.
+    :param str reduction: supports 'mean', 'sum' and 'none'.
"""
- def __init__(self, pred=None, target=None):
+ def __init__(self, pred=None, target=None, ignore_idx=-100, reduction='mean'):
super(NLLLoss, self).__init__()
self._init_param_map(pred=pred, target=target)
+ assert reduction in ('mean', 'sum', 'none')
+ self.reduction = reduction
+ self.ignore_idx = ignore_idx
def get_loss(self, pred, target):
- return F.nll_loss(input=pred, target=target)
+ return F.nll_loss(input=pred, target=target, ignore_index=self.ignore_idx, reduction=self.reduction)
class LossInForward(LossBase):
@@ -319,7 +313,7 @@ class LossInForward(LossBase):
    :param str loss_key: the key of the loss in the dict returned by forward; defaults to loss
"""
- def __init__(self, loss_key='loss'):
+ def __init__(self, loss_key=Const.LOSS):
super().__init__()
if not isinstance(loss_key, str):
raise TypeError(f"Only str allowed for loss_key, got {type(loss_key)}.")
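A sketch of the revised loss construction (illustrative only; `get_loss` is called directly here just to show the reduction behaviour):

```python
import torch
from fastNLP import CrossEntropyLoss, NLLLoss

# reduction now follows the PyTorch naming: 'mean', 'sum' or 'none'.
loss_fn = CrossEntropyLoss(pred='pred', target='target', reduction='sum')
pred = torch.randn(4, 5)               # (batch, num_classes)
target = torch.tensor([0, 1, 2, 3])
print(loss_fn.get_loss(pred, target))  # a scalar, summed over the batch

# NLLLoss gained ignore_idx, mirroring F.nll_loss's ignore_index.
nll = NLLLoss(ignore_idx=-100, reduction='mean')
```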
diff --git a/fastNLP/core/metrics.py b/fastNLP/core/metrics.py
index d54bf8ec..f75b6c90 100644
--- a/fastNLP/core/metrics.py
+++ b/fastNLP/core/metrics.py
@@ -6,7 +6,7 @@ __all__ = [
"MetricBase",
"AccuracyMetric",
"SpanFPreRecMetric",
- "SQuADMetric"
+ "ExtractiveQAMetric"
]
import inspect
@@ -24,6 +24,7 @@ from .utils import seq_len_to_mask
from .vocabulary import Vocabulary
from abc import abstractmethod
+
class MetricBase(object):
"""
    Base class for all metrics. Every Metric passed to a Trainer or Tester must inherit from this class and override the evaluate() and get_metric() methods.
@@ -735,11 +736,11 @@ def _pred_topk(y_prob, k=1):
return y_pred_topk, y_prob_topk
-class SQuADMetric(MetricBase):
+class ExtractiveQAMetric(MetricBase):
r"""
-    Alias: :class:`fastNLP.SQuADMetric` :class:`fastNLP.core.metrics.SQuADMetric`
+    Alias: :class:`fastNLP.ExtractiveQAMetric` :class:`fastNLP.core.metrics.ExtractiveQAMetric`
-    Metric for the SQuAD dataset
+    Metric for extractive QA tasks (such as SQuAD).
    :param pred1: the mapping for `pred1` in the parameter map; None means the mapping `pred1` -> `pred1`
    :param pred2: the mapping for `pred2` in the parameter map; None means the mapping `pred2` -> `pred2`
@@ -755,7 +756,7 @@ class SQuADMetric(MetricBase):
def __init__(self, pred1=None, pred2=None, target1=None, target2=None,
beta=1, right_open=True, print_predict_stat=False):
- super(SQuADMetric, self).__init__()
+ super(ExtractiveQAMetric, self).__init__()
self._init_param_map(pred1=pred1, pred2=pred2, target1=target1, target2=target2)
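A usage sketch for the renamed metric; the field names `pred_start`, `pred_end`, `start`, and `end` are hypothetical and should be mapped to whatever your model and DataSet actually produce:

```python
from fastNLP import ExtractiveQAMetric, Tester

# Formerly SQuADMetric; only the name changed. pred1/pred2 map to the
# predicted span boundaries, target1/target2 to the gold ones.
metric = ExtractiveQAMetric(pred1='pred_start', pred2='pred_end',
                            target1='start', target2='end')
tester = Tester(data=test_ds, model=model, metrics=metric)
tester.test()
```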
diff --git a/fastNLP/core/utils.py b/fastNLP/core/utils.py
index 490f9f8f..9b23240c 100644
--- a/fastNLP/core/utils.py
+++ b/fastNLP/core/utils.py
@@ -16,6 +16,7 @@ from collections import Counter, namedtuple
import numpy as np
import torch
import torch.nn as nn
+from typing import List
_CheckRes = namedtuple('_CheckRes', ['missing', 'unused', 'duplicated', 'required', 'all_needed',
'varargs'])
@@ -162,6 +163,30 @@ def cache_results(_cache_fp, _refresh=False, _verbose=1):
return wrapper_
+def _save_model(model, model_name, save_dir, only_param=False):
+    """ Save a model, or its state_dict, stripped of device (GPU) information
+    :param model:
+    :param model_name:
+    :param save_dir: directory to save into
+    :param only_param:
+    :return:
+ """
+ model_path = os.path.join(save_dir, model_name)
+ if not os.path.isdir(save_dir):
+ os.makedirs(save_dir, exist_ok=True)
+ if isinstance(model, nn.DataParallel):
+ model = model.module
+ if only_param:
+ state_dict = model.state_dict()
+ for key in state_dict:
+ state_dict[key] = state_dict[key].cpu()
+ torch.save(state_dict, model_path)
+ else:
+ _model_device = _get_model_device(model)
+ model.cpu()
+ torch.save(model, model_path)
+ model.to(_model_device)
+
# def save_pickle(obj, pickle_path, file_name):
# """Save an object into a pickle file.
@@ -277,7 +302,6 @@ def _move_model_to_device(model, device):
return model
-
def _get_model_device(model):
"""
传入一个nn.Module的模型,获取它所在的device
@@ -285,7 +309,7 @@ def _get_model_device(model):
:param model: nn.Module
    :return: torch.device or None; if None, the model has no parameters.
"""
-    # TODO this function is somewhat risky, since some parameters of a model may not live on the GPU, e.g. BertEmbedding
+    # TODO this function is somewhat risky, since some parameters of a model may not live on the GPU (e.g. BertEmbedding), or may be spread across GPUs
assert isinstance(model, nn.Module)
parameters = list(model.parameters())
@@ -712,3 +736,52 @@ class _pseudo_tqdm:
def __exit__(self, exc_type, exc_val, exc_tb):
del self
+
+def iob2(tags:List[str])->List[str]:
+ """
+    Check that the tags are valid IOB tags; IOB1 tags are automatically converted to IOB2. For the difference
+    between the two schemes, see
+    https://datascience.stackexchange.com/questions/37824/difference-between-iob-and-iob2-format
+
+    :param tags: the tags to convert; they must be uppercase BIO tags.
+ """
+ for i, tag in enumerate(tags):
+ if tag == "O":
+ continue
+ split = tag.split("-")
+ if len(split) != 2 or split[0] not in ["I", "B"]:
+ raise TypeError("The encoding schema is not a valid IOB type.")
+ if split[0] == "B":
+ continue
+ elif i == 0 or tags[i - 1] == "O": # conversion IOB1 to IOB2
+ tags[i] = "B" + tag[1:]
+ elif tags[i - 1][1:] == tag[1:]:
+ continue
+ else: # conversion IOB1 to IOB2
+ tags[i] = "B" + tag[1:]
+ return tags
+
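A quick check of iob2, together with iob2bioes (defined just below), assuming both are imported from fastNLP.core.utils once this diff is applied:

```python
from fastNLP.core.utils import iob2, iob2bioes

# IOB1 opens an entity with I- unless it follows an entity of the same
# type; iob2 rewrites such tags to the explicit B- prefix of IOB2.
print(iob2(["I-PER", "I-PER", "O", "I-LOC"]))
# -> ['B-PER', 'I-PER', 'O', 'B-LOC']

# iob2bioes then adds E- (entity end) and S- (single-token entity) tags.
print(iob2bioes(['B-PER', 'I-PER', 'O', 'B-LOC']))
# -> ['B-PER', 'E-PER', 'O', 'S-LOC']
```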
+def iob2bioes(tags:List[str])->List[str]:
+ """
+    Convert IOB tags to the BIOES encoding.
+    :param tags: List[str]; the tags must be uppercase.
+ :return:
+ """
+ new_tags = []
+ for i, tag in enumerate(tags):
+ if tag == 'O':
+ new_tags.append(tag)
+ else:
+ split = tag.split('-')[0]
+ if split == 'B':
+ if i+1!=len(tags) and tags[i+1].split('-')[0] == 'I':
+ new_tags.append(tag)
+ else:
+ new_tags.append(tag.replace('B-', 'S-'))
+ elif split == 'I':
+                if i + 1 < len(tags) and tags[i + 1].split('-')[0] == 'I':
+                    new_tags.append(tag)
+                else:
+                    new_tags.append(tag.replace('I-', 'E-'))
+            else:
+                raise TypeError("Invalid IOB format.")
+    return new_tags
diff --git a/fastNLP/io/base_loader.py b/fastNLP/io/base_loader.py
--- a/fastNLP/io/base_loader.py
+++ b/fastNLP/io/base_loader.py
-    def process(self, paths: Union[str, Dict[str, str]], **options) -> DataInfo:
+ def process(self, paths: Union[str, Dict[str, str]], **options) -> DataBundle:
"""
        For a specific task and dataset, read and process the data, returning a processed DataInfo-style object or a dict.
        Data is read from the file(s) at one or more given paths; the returned object may contain one or more data sets.
        If multiple paths are processed, the keys of the dict passed in match the keys of the dict in the returned object.
-        The returned :class:`DataInfo` object has the following attributes:
+        The returned :class:`DataBundle` object has the following attributes:
        - vocabs: a dict of the vocabularies built from the data sets
-        - embeddings: (optional) the word embeddings corresponding to the data sets
        - datasets: a dict holding a number of :class:`~fastNLP.DataSet` objects, whose field names follow :mod:`~fastNLP.core.const`
        :param paths: path(s) to read the raw data from
        :param options: parameters of your own design, depending on the task and data set
-        :return: a DataInfo object
+        :return: a DataBundle object
"""
raise NotImplementedError
diff --git a/fastNLP/io/data_loader/__init__.py b/fastNLP/io/data_loader/__init__.py
new file mode 100644
index 00000000..d4777ff8
--- /dev/null
+++ b/fastNLP/io/data_loader/__init__.py
@@ -0,0 +1,35 @@
+"""
+Modules for reading data sets. The available loaders are listed in ``__all__`` below.
+"""
+__all__ = [
+ 'ConllLoader',
+ 'Conll2003Loader',
+ 'IMDBLoader',
+ 'MatchingLoader',
+ 'MNLILoader',
+ 'MTL16Loader',
+ 'PeopleDailyCorpusLoader',
+ 'QNLILoader',
+ 'QuoraLoader',
+ 'RTELoader',
+ 'SSTLoader',
+ 'SST2Loader',
+ 'SNLILoader',
+ 'YelpLoader',
+]
+
+
+from .conll import ConllLoader, Conll2003Loader
+from .imdb import IMDBLoader
+from .matching import MatchingLoader
+from .mnli import MNLILoader
+from .mtl import MTL16Loader
+from .people_daily import PeopleDailyCorpusLoader
+from .qnli import QNLILoader
+from .quora import QuoraLoader
+from .rte import RTELoader
+from .snli import SNLILoader
+from .sst import SSTLoader, SST2Loader
+from .yelp import YelpLoader
diff --git a/fastNLP/io/data_loader/conll.py b/fastNLP/io/data_loader/conll.py
new file mode 100644
index 00000000..61f4f61b
--- /dev/null
+++ b/fastNLP/io/data_loader/conll.py
@@ -0,0 +1,73 @@
+
+from ...core.dataset import DataSet
+from ...core.instance import Instance
+from ..base_loader import DataSetLoader
+from ..file_reader import _read_conll
+
+
+class ConllLoader(DataSetLoader):
+ """
+    Alias: :class:`fastNLP.io.ConllLoader` :class:`fastNLP.io.data_loader.ConllLoader`
+
+    Reads data in CoNLL format; see http://conll.cemantix.org/2012/data.html for the format. Lines starting with
+    "-DOCSTART-" are ignored, since that token is used as a document separator in CoNLL 2003.
+
+    Columns are numbered from 0 and contain::
+
+ Column Type
+ 0 Document ID
+ 1 Part number
+ 2 Word number
+ 3 Word itself
+ 4 Part-of-Speech
+ 5 Parse bit
+ 6 Predicate lemma
+ 7 Predicate Frameset ID
+ 8 Word sense
+ 9 Speaker/Author
+ 10 Named Entities
+ 11:N Predicate Arguments
+ N Coreference
+
+    :param headers: the name of each column; must be a list or tuple of str. ``headers`` corresponds one-to-one with ``indexes``
+    :param indexes: indices of the columns to keep, counted from 0. If ``None``, all columns are kept. Default: ``None``
+    :param dropna: whether to skip invalid data; if ``False``, a ``ValueError`` is raised on invalid data. Default: ``False``
+ """
+
+ def __init__(self, headers, indexes=None, dropna=False):
+ super(ConllLoader, self).__init__()
+ if not isinstance(headers, (list, tuple)):
+ raise TypeError(
+ 'invalid headers: {}, should be list of strings'.format(headers))
+ self.headers = headers
+ self.dropna = dropna
+ if indexes is None:
+ self.indexes = list(range(len(self.headers)))
+ else:
+ if len(indexes) != len(headers):
+ raise ValueError
+ self.indexes = indexes
+
+ def _load(self, path):
+ ds = DataSet()
+ for idx, data in _read_conll(path, indexes=self.indexes, dropna=self.dropna):
+ ins = {h: data[i] for i, h in enumerate(self.headers)}
+ ds.append(Instance(**ins))
+ return ds
+
+
+class Conll2003Loader(ConllLoader):
+ """
+    Alias: :class:`fastNLP.io.Conll2003Loader` :class:`fastNLP.io.dataset_loader.Conll2003Loader`
+
+    Reads the Conll2003 dataset
+
+    For more information about the dataset, see:
+ https://sites.google.com/site/ermasoftware/getting-started/ne-tagging-conll2003-data
+ """
+
+ def __init__(self):
+ headers = [
+ 'tokens', 'pos', 'chunks', 'ner',
+ ]
+ super(Conll2003Loader, self).__init__(headers=headers)
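A usage sketch for the CoNLL loaders (the path is a placeholder; `load` is the public wrapper around `_load` inherited from DataSetLoader):

```python
from fastNLP.io.data_loader import Conll2003Loader

loader = Conll2003Loader()
ds = loader.load('data/conll2003/train.txt')
print(ds[0]['tokens'], ds[0]['ner'])   # fields follow the headers above
```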
diff --git a/fastNLP/io/data_loader/imdb.py b/fastNLP/io/data_loader/imdb.py
new file mode 100644
index 00000000..bf53c5be
--- /dev/null
+++ b/fastNLP/io/data_loader/imdb.py
@@ -0,0 +1,96 @@
+
+from typing import Union, Dict
+
+from ..embed_loader import EmbeddingOption, EmbedLoader
+from ..base_loader import DataSetLoader, DataBundle
+from ...core.vocabulary import VocabularyOption, Vocabulary
+from ...core.dataset import DataSet
+from ...core.instance import Instance
+from ...core.const import Const
+
+from ..utils import get_tokenizer
+
+
+class IMDBLoader(DataSetLoader):
+ """
+    Reads the IMDB dataset. The DataSet contains the fields:
+
+        words: list(str), the text to classify
+        target: str, the label of the text
+
+ """
+
+ def __init__(self):
+ super(IMDBLoader, self).__init__()
+ self.tokenizer = get_tokenizer()
+
+ def _load(self, path):
+ dataset = DataSet()
+ with open(path, 'r', encoding="utf-8") as f:
+ for line in f:
+ line = line.strip()
+ if not line:
+ continue
+ parts = line.split('\t')
+ target = parts[0]
+ words = self.tokenizer(parts[1].lower())
+ dataset.append(Instance(words=words, target=target))
+
+ if len(dataset) == 0:
+ raise RuntimeError(f"{path} has no valid data.")
+
+ return dataset
+
+ def process(self,
+ paths: Union[str, Dict[str, str]],
+ src_vocab_opt: VocabularyOption = None,
+ tgt_vocab_opt: VocabularyOption = None,
+ char_level_op=False):
+
+ datasets = {}
+ info = DataBundle()
+ for name, path in paths.items():
+ dataset = self.load(path)
+ datasets[name] = dataset
+
+ def wordtochar(words):
+ chars = []
+ for word in words:
+ word = word.lower()
+ for char in word:
+ chars.append(char)
+ chars.append('')
+ chars.pop()
+ return chars
+
+ if char_level_op:
+ for dataset in datasets.values():
+ dataset.apply_field(wordtochar, field_name="words", new_field_name='chars')
+
+ datasets["train"], datasets["dev"] = datasets["train"].split(0.1, shuffle=False)
+
+ src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt)
+ src_vocab.from_dataset(datasets['train'], field_name='words')
+
+ src_vocab.index_dataset(*datasets.values(), field_name='words')
+
+ tgt_vocab = Vocabulary(unknown=None, padding=None) \
+ if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt)
+ tgt_vocab.from_dataset(datasets['train'], field_name='target')
+ tgt_vocab.index_dataset(*datasets.values(), field_name='target')
+
+ info.vocabs = {
+ Const.INPUT: src_vocab,
+ Const.TARGET: tgt_vocab
+ }
+
+ info.datasets = datasets
+
+ for name, dataset in info.datasets.items():
+ dataset.set_input(Const.INPUT)
+ dataset.set_target(Const.TARGET)
+
+ return info
+
+
+
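A usage sketch for IMDBLoader.process (placeholder paths; note that a dev set is carved out of train by the split(0.1) call above):

```python
from fastNLP.io.data_loader import IMDBLoader

data = IMDBLoader().process(paths={'train': 'imdb/train.txt',
                                   'test': 'imdb/test.txt'})
print(data.datasets.keys())        # train, test, plus the derived dev split
print(len(data.vocabs['words']))   # Const.INPUT resolves to 'words'
```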
diff --git a/fastNLP/io/data_loader/matching.py b/fastNLP/io/data_loader/matching.py
new file mode 100644
index 00000000..cecaee96
--- /dev/null
+++ b/fastNLP/io/data_loader/matching.py
@@ -0,0 +1,248 @@
+import os
+
+from typing import Union, Dict, List
+
+from ...core.const import Const
+from ...core.vocabulary import Vocabulary
+from ..base_loader import DataBundle, DataSetLoader
+from ..file_utils import _get_base_url, cached_path, PRETRAINED_BERT_MODEL_DIR
+from ...modules.encoder._bert import BertTokenizer
+
+
+class MatchingLoader(DataSetLoader):
+ """
+    Alias: :class:`fastNLP.io.MatchingLoader` :class:`fastNLP.io.data_loader.MatchingLoader`
+
+    Reads data sets for Matching tasks
+
+    :param dict paths: keys are data set names (e.g. train, dev, test), values are the corresponding file names
+ """
+
+ def __init__(self, paths: dict=None):
+ self.paths = paths
+
+ def _load(self, path):
+ """
+        :param str path: path of the data set file to read
+        :return: fastNLP.DataSet ds: a DataSet object that must contain 3 fields: two holding the raw text
+            of the two sentences, and a third holding the label
+ """
+ raise NotImplementedError
+
+ def process(self, paths: Union[str, Dict[str, str]], dataset_name: str=None,
+ to_lower=False, seq_len_type: str=None, bert_tokenizer: str=None,
+ cut_text: int = None, get_index=True, auto_pad_length: int=None,
+                auto_pad_token: str='<pad>', set_input: Union[list, str, bool]=True,
+ set_target: Union[list, str, bool]=True, concat: Union[str, list, bool]=None,
+ extra_split: List[str]=None, ) -> DataBundle:
+ """
+        :param paths: str or Dict[str, str]. If a str, it is either the folder holding the data sets or the full
+            path of a single file: for a folder, the data set names and file names are looked up in self.paths;
+            for a Dict, the keys are data set names (e.g. train, dev, test) and the values are full file paths.
+        :param str dataset_name: if paths is the full path of a single data set file, dataset_name names that
+            data set; if not given, it defaults to train.
+        :param bool to_lower: whether to lowercase the text automatically. Default: False.
+        :param str seq_len_type: the kind of seq_len information to provide. ``seq_len``: a single number as the
+            sentence length; ``mask``: a 0/1 mask matrix as the sentence length; ``bert``: segment_type_ids (0 for
+            the first sentence, 1 for the second) plus a 0/1 attention-mask matrix. Default: None, i.e. no seq_len.
+        :param str bert_tokenizer: path of the folder holding the vocabulary used by the BERT tokenizer
+        :param int cut_text: truncate content longer than cut_text. Default: None, i.e. no truncation.
+        :param bool get_index: whether to convert the text to indices using the vocabulary
+        :param int auto_pad_length: pad the text automatically to this length (longer text is truncated).
+            Default: no automatic padding.
+        :param str auto_pad_token: the token used for automatic padding
+        :param set_input: if True, every field whose name contains Const.INPUT is automatically set as input; if
+            False, no field is set as input. If a str or List[str] is passed, exactly the named fields are set as
+            input and no others. Default: True.
+        :param set_target: controls which fields may be set as target; used the same way as set_input. Default: True.
+        :param concat: whether to concatenate the two sentences. If False, no concatenation is done. If True, a
+            ``<sep>`` token is inserted between the two sentences. If a list of length 4 is passed, its elements are
+            the markers placed before the first sentence, after the first sentence, before the second sentence, and
+            after the second sentence. If the string ``bert`` is passed, BERT-style concatenation is used, which is
+            equivalent to ['[CLS]', '[SEP]', '', '[SEP]'].
+        :param extra_split: extra separators, i.e. characters other than whitespace on which to split tokens.
+ :return:
+ """
+ if isinstance(set_input, str):
+ set_input = [set_input]
+ if isinstance(set_target, str):
+ set_target = [set_target]
+ if isinstance(set_input, bool):
+ auto_set_input = set_input
+ else:
+ auto_set_input = False
+ if isinstance(set_target, bool):
+ auto_set_target = set_target
+ else:
+ auto_set_target = False
+ if isinstance(paths, str):
+ if os.path.isdir(paths):
+ path = {n: os.path.join(paths, self.paths[n]) for n in self.paths.keys()}
+ else:
+ path = {dataset_name if dataset_name is not None else 'train': paths}
+ else:
+ path = paths
+
+ data_info = DataBundle()
+ for data_name in path.keys():
+ data_info.datasets[data_name] = self._load(path[data_name])
+
+ for data_name, data_set in data_info.datasets.items():
+ if auto_set_input:
+ data_set.set_input(Const.INPUTS(0), Const.INPUTS(1))
+ if auto_set_target:
+ if Const.TARGET in data_set.get_field_names():
+ data_set.set_target(Const.TARGET)
+
+ if extra_split is not None:
+ for data_name, data_set in data_info.datasets.items():
+ data_set.apply(lambda x: ' '.join(x[Const.INPUTS(0)]), new_field_name=Const.INPUTS(0))
+ data_set.apply(lambda x: ' '.join(x[Const.INPUTS(1)]), new_field_name=Const.INPUTS(1))
+
+ for s in extra_split:
+ data_set.apply(lambda x: x[Const.INPUTS(0)].replace(s, ' ' + s + ' '),
+ new_field_name=Const.INPUTS(0))
+ data_set.apply(lambda x: x[Const.INPUTS(0)].replace(s, ' ' + s + ' '),
+ new_field_name=Const.INPUTS(0))
+
+ _filt = lambda x: x
+ data_set.apply(lambda x: list(filter(_filt, x[Const.INPUTS(0)].split(' '))),
+ new_field_name=Const.INPUTS(0), is_input=auto_set_input)
+ data_set.apply(lambda x: list(filter(_filt, x[Const.INPUTS(1)].split(' '))),
+ new_field_name=Const.INPUTS(1), is_input=auto_set_input)
+ _filt = None
+
+ if to_lower:
+ for data_name, data_set in data_info.datasets.items():
+ data_set.apply(lambda x: [w.lower() for w in x[Const.INPUTS(0)]], new_field_name=Const.INPUTS(0),
+ is_input=auto_set_input)
+ data_set.apply(lambda x: [w.lower() for w in x[Const.INPUTS(1)]], new_field_name=Const.INPUTS(1),
+ is_input=auto_set_input)
+
+ if bert_tokenizer is not None:
+ if bert_tokenizer.lower() in PRETRAINED_BERT_MODEL_DIR:
+ PRETRAIN_URL = _get_base_url('bert')
+ model_name = PRETRAINED_BERT_MODEL_DIR[bert_tokenizer]
+ model_url = PRETRAIN_URL + model_name
+ model_dir = cached_path(model_url)
+            # otherwise, check whether a local directory was given
+ elif os.path.isdir(bert_tokenizer):
+ model_dir = bert_tokenizer
+ else:
+ raise ValueError(f"Cannot recognize BERT tokenizer from {bert_tokenizer}.")
+
+ words_vocab = Vocabulary(padding='[PAD]', unknown='[UNK]')
+ with open(os.path.join(model_dir, 'vocab.txt'), 'r') as f:
+ lines = f.readlines()
+ lines = [line.strip() for line in lines]
+ words_vocab.add_word_lst(lines)
+ words_vocab.build_vocab()
+
+ tokenizer = BertTokenizer.from_pretrained(model_dir)
+
+ for data_name, data_set in data_info.datasets.items():
+ for fields in data_set.get_field_names():
+ if Const.INPUT in fields:
+ data_set.apply(lambda x: tokenizer.tokenize(' '.join(x[fields])), new_field_name=fields,
+ is_input=auto_set_input)
+
+ if isinstance(concat, bool):
+ concat = 'default' if concat else None
+ if concat is not None:
+ if isinstance(concat, str):
+                CONCAT_MAP = {'bert': ['[CLS]', '[SEP]', '', '[SEP]'],
+                              'default': ['', '<sep>', '', '']}  # assumed: '<sep>' marks the end of the first sentence
+ if concat.lower() in CONCAT_MAP:
+ concat = CONCAT_MAP[concat]
+ else:
+ concat = 4 * [concat]
+ assert len(concat) == 4, \
+ f'Please choose a list with 4 symbols which at the beginning of first sentence ' \
+ f'the end of first sentence, the begin of second sentence, and the end of second' \
+ f'sentence. Your input is {concat}'
+
+ for data_name, data_set in data_info.datasets.items():
+ data_set.apply(lambda x: [concat[0]] + x[Const.INPUTS(0)] + [concat[1]] + [concat[2]] +
+ x[Const.INPUTS(1)] + [concat[3]], new_field_name=Const.INPUT)
+ data_set.apply(lambda x: [w for w in x[Const.INPUT] if len(w) > 0], new_field_name=Const.INPUT,
+ is_input=auto_set_input)
+
+ if seq_len_type is not None:
+            if seq_len_type == 'seq_len':
+ for data_name, data_set in data_info.datasets.items():
+ for fields in data_set.get_field_names():
+ if Const.INPUT in fields:
+ data_set.apply(lambda x: len(x[fields]),
+ new_field_name=fields.replace(Const.INPUT, Const.INPUT_LEN),
+ is_input=auto_set_input)
+ elif seq_len_type == 'mask':
+ for data_name, data_set in data_info.datasets.items():
+ for fields in data_set.get_field_names():
+ if Const.INPUT in fields:
+ data_set.apply(lambda x: [1] * len(x[fields]),
+ new_field_name=fields.replace(Const.INPUT, Const.INPUT_LEN),
+ is_input=auto_set_input)
+ elif seq_len_type == 'bert':
+ for data_name, data_set in data_info.datasets.items():
+ if Const.INPUT not in data_set.get_field_names():
+ raise KeyError(f'Field ``{Const.INPUT}`` not in {data_name} data set: '
+ f'got {data_set.get_field_names()}')
+ data_set.apply(lambda x: [0] * (len(x[Const.INPUTS(0)]) + 2) + [1] * (len(x[Const.INPUTS(1)]) + 1),
+ new_field_name=Const.INPUT_LENS(0), is_input=auto_set_input)
+ data_set.apply(lambda x: [1] * len(x[Const.INPUT_LENS(0)]),
+ new_field_name=Const.INPUT_LENS(1), is_input=auto_set_input)
+
+ if auto_pad_length is not None:
+ cut_text = min(auto_pad_length, cut_text if cut_text is not None else auto_pad_length)
+
+ if cut_text is not None:
+ for data_name, data_set in data_info.datasets.items():
+ for fields in data_set.get_field_names():
+ if (Const.INPUT in fields) or ((Const.INPUT_LEN in fields) and (seq_len_type != 'seq_len')):
+ data_set.apply(lambda x: x[fields][: cut_text], new_field_name=fields,
+ is_input=auto_set_input)
+
+ data_set_list = [d for n, d in data_info.datasets.items()]
+ assert len(data_set_list) > 0, f'There are NO data sets in data info!'
+
+ if bert_tokenizer is None:
+ words_vocab = Vocabulary(padding=auto_pad_token)
+ words_vocab = words_vocab.from_dataset(*[d for n, d in data_info.datasets.items() if 'train' in n],
+ field_name=[n for n in data_set_list[0].get_field_names()
+ if (Const.INPUT in n)],
+ no_create_entry_dataset=[d for n, d in data_info.datasets.items()
+ if 'train' not in n])
+ target_vocab = Vocabulary(padding=None, unknown=None)
+ target_vocab = target_vocab.from_dataset(*[d for n, d in data_info.datasets.items() if 'train' in n],
+ field_name=Const.TARGET)
+ data_info.vocabs = {Const.INPUT: words_vocab, Const.TARGET: target_vocab}
+
+ if get_index:
+ for data_name, data_set in data_info.datasets.items():
+ for fields in data_set.get_field_names():
+ if Const.INPUT in fields:
+ data_set.apply(lambda x: [words_vocab.to_index(w) for w in x[fields]], new_field_name=fields,
+ is_input=auto_set_input)
+
+ if Const.TARGET in data_set.get_field_names():
+ data_set.apply(lambda x: target_vocab.to_index(x[Const.TARGET]), new_field_name=Const.TARGET,
+ is_input=auto_set_input, is_target=auto_set_target)
+
+ if auto_pad_length is not None:
+ if seq_len_type == 'seq_len':
+ raise RuntimeError(f'the sequence will be padded with the length {auto_pad_length}, '
+ f'so the seq_len_type cannot be `{seq_len_type}`!')
+ for data_name, data_set in data_info.datasets.items():
+ for fields in data_set.get_field_names():
+ if Const.INPUT in fields:
+ data_set.apply(lambda x: x[fields] + [words_vocab.to_index(words_vocab.padding)] *
+ (auto_pad_length - len(x[fields])), new_field_name=fields,
+ is_input=auto_set_input)
+ elif (Const.INPUT_LEN in fields) and (seq_len_type != 'seq_len'):
+ data_set.apply(lambda x: x[fields] + [0] * (auto_pad_length - len(x[fields])),
+ new_field_name=fields, is_input=auto_set_input)
+
+ for data_name, data_set in data_info.datasets.items():
+ if isinstance(set_input, list):
+ data_set.set_input(*[inputs for inputs in set_input if inputs in data_set.get_field_names()])
+ if isinstance(set_target, list):
+ data_set.set_target(*[target for target in set_target if target in data_set.get_field_names()])
+
+ return data_info
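The options above compose; a sketch using the SNLILoader subclass added later in this diff (the data directory and BERT vocabulary directory are placeholders):

```python
from fastNLP.io.data_loader import SNLILoader

# Word-level setup: lowercase, attach a seq_len field, convert to indices.
data = SNLILoader().process(paths='data/snli_1.0', to_lower=True,
                            seq_len_type='seq_len', get_index=True)

# BERT-style setup: WordPiece tokenization, [CLS]/[SEP] concatenation and
# the segment-id / attention-mask fields described in the docstring.
bert_data = SNLILoader().process(paths='data/snli_1.0', concat='bert',
                                 bert_tokenizer='path/to/bert/vocab_dir',
                                 seq_len_type='bert')
```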
diff --git a/fastNLP/io/data_loader/mnli.py b/fastNLP/io/data_loader/mnli.py
new file mode 100644
index 00000000..5d857533
--- /dev/null
+++ b/fastNLP/io/data_loader/mnli.py
@@ -0,0 +1,60 @@
+
+from ...core.const import Const
+
+from .matching import MatchingLoader
+from ..dataset_loader import CSVLoader
+
+
+class MNLILoader(MatchingLoader, CSVLoader):
+ """
+    Alias: :class:`fastNLP.io.MNLILoader` :class:`fastNLP.io.data_loader.MNLILoader`
+
+    Reads the MNLI dataset. The resulting DataSet contains the fields::
+
+        words1: list(str), the first sentence (premise)
+        words2: list(str), the second sentence (hypothesis)
+        target: str, the gold label
+
+    Data source:
+ """
+
+ def __init__(self, paths: dict=None):
+ paths = paths if paths is not None else {
+ 'train': 'train.tsv',
+ 'dev_matched': 'dev_matched.tsv',
+ 'dev_mismatched': 'dev_mismatched.tsv',
+ 'test_matched': 'test_matched.tsv',
+ 'test_mismatched': 'test_mismatched.tsv',
+ # 'test_0.9_matched': 'multinli_0.9_test_matched_unlabeled.txt',
+ # 'test_0.9_mismatched': 'multinli_0.9_test_mismatched_unlabeled.txt',
+
+        # test_0.9_matched and test_0.9_mismatched belong to MNLI version 0.9 (data source: Kaggle)
+ }
+ MatchingLoader.__init__(self, paths=paths)
+ CSVLoader.__init__(self, sep='\t')
+ self.fields = {
+ 'sentence1_binary_parse': Const.INPUTS(0),
+ 'sentence2_binary_parse': Const.INPUTS(1),
+ 'gold_label': Const.TARGET,
+ }
+
+ def _load(self, path):
+ ds = CSVLoader._load(self, path)
+
+ for k, v in self.fields.items():
+ if k in ds.get_field_names():
+ ds.rename_field(k, v)
+
+ if Const.TARGET in ds.get_field_names():
+ if ds[0][Const.TARGET] == 'hidden':
+ ds.delete_field(Const.TARGET)
+
+ parentheses_table = str.maketrans({'(': None, ')': None})
+
+ ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
+ new_field_name=Const.INPUTS(0))
+ ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
+ new_field_name=Const.INPUTS(1))
+ if Const.TARGET in ds.get_field_names():
+ ds.drop(lambda x: x[Const.TARGET] == '-')
+ return ds
diff --git a/fastNLP/io/data_loader/mtl.py b/fastNLP/io/data_loader/mtl.py
new file mode 100644
index 00000000..940ece51
--- /dev/null
+++ b/fastNLP/io/data_loader/mtl.py
@@ -0,0 +1,65 @@
+
+from typing import Union, Dict
+
+from ..base_loader import DataBundle
+from ..dataset_loader import CSVLoader
+from ...core.vocabulary import Vocabulary, VocabularyOption
+from ...core.const import Const
+from ..utils import check_dataloader_paths
+
+
+class MTL16Loader(CSVLoader):
+ """
+    Reads the MTL16 dataset. The DataSet contains the fields:
+
+        words: list(str), the text to classify
+        target: str, the label of the text
+
+    Data source: https://pan.baidu.com/s/1c2L6vdA
+
+ """
+
+ def __init__(self):
+ super(MTL16Loader, self).__init__(headers=(Const.TARGET, Const.INPUT), sep='\t')
+
+ def _load(self, path):
+ dataset = super(MTL16Loader, self)._load(path)
+ dataset.apply(lambda x: x[Const.INPUT].lower().split(), new_field_name=Const.INPUT)
+ if len(dataset) == 0:
+ raise RuntimeError(f"{path} has no valid data.")
+
+ return dataset
+
+ def process(self,
+ paths: Union[str, Dict[str, str]],
+ src_vocab_opt: VocabularyOption = None,
+ tgt_vocab_opt: VocabularyOption = None,):
+
+ paths = check_dataloader_paths(paths)
+ datasets = {}
+ info = DataBundle()
+ for name, path in paths.items():
+ dataset = self.load(path)
+ datasets[name] = dataset
+
+ src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt)
+ src_vocab.from_dataset(datasets['train'], field_name=Const.INPUT)
+ src_vocab.index_dataset(*datasets.values(), field_name=Const.INPUT)
+
+ tgt_vocab = Vocabulary(unknown=None, padding=None) \
+ if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt)
+ tgt_vocab.from_dataset(datasets['train'], field_name=Const.TARGET)
+ tgt_vocab.index_dataset(*datasets.values(), field_name=Const.TARGET)
+
+ info.vocabs = {
+ Const.INPUT: src_vocab,
+ Const.TARGET: tgt_vocab
+ }
+
+ info.datasets = datasets
+
+ for name, dataset in info.datasets.items():
+ dataset.set_input(Const.INPUT)
+ dataset.set_target(Const.TARGET)
+
+ return info
diff --git a/fastNLP/io/data_loader/people_daily.py b/fastNLP/io/data_loader/people_daily.py
new file mode 100644
index 00000000..d8c55aef
--- /dev/null
+++ b/fastNLP/io/data_loader/people_daily.py
@@ -0,0 +1,85 @@
+
+from ..base_loader import DataSetLoader
+from ...core.dataset import DataSet
+from ...core.instance import Instance
+from ...core.const import Const
+
+
+class PeopleDailyCorpusLoader(DataSetLoader):
+ """
+    Alias: :class:`fastNLP.io.PeopleDailyCorpusLoader` :class:`fastNLP.io.dataset_loader.PeopleDailyCorpusLoader`
+
+    Reads the People's Daily corpus
+ """
+
+ def __init__(self, pos=True, ner=True):
+ super(PeopleDailyCorpusLoader, self).__init__()
+ self.pos = pos
+ self.ner = ner
+
+ def _load(self, data_path):
+ with open(data_path, "r", encoding="utf-8") as f:
+ sents = f.readlines()
+ examples = []
+ for sent in sents:
+ if len(sent) <= 2:
+ continue
+ inside_ne = False
+ sent_pos_tag = []
+ sent_words = []
+ sent_ner = []
+ words = sent.strip().split()[1:]
+ for word in words:
+ if "[" in word and "]" in word:
+ ner_tag = "U"
+ print(word)
+ elif "[" in word:
+ inside_ne = True
+ ner_tag = "B"
+ word = word[1:]
+ elif "]" in word:
+ ner_tag = "L"
+ word = word[:word.index("]")]
+ if inside_ne is True:
+ inside_ne = False
+ else:
+ raise RuntimeError("only ] appears!")
+ else:
+ if inside_ne is True:
+ ner_tag = "I"
+ else:
+ ner_tag = "O"
+ tmp = word.split("/")
+ token, pos = tmp[0], tmp[1]
+ sent_ner.append(ner_tag)
+ sent_pos_tag.append(pos)
+ sent_words.append(token)
+ example = [sent_words]
+ if self.pos is True:
+ example.append(sent_pos_tag)
+ if self.ner is True:
+ example.append(sent_ner)
+ examples.append(example)
+ return self.convert(examples)
+
+ def convert(self, data):
+ """
+
+        :param data: built-in Python objects
+        :return: a :class:`~fastNLP.DataSet` object
+ """
+ data_set = DataSet()
+ for item in data:
+ sent_words = item[0]
+ if self.pos is True and self.ner is True:
+ instance = Instance(
+ words=sent_words, pos_tags=item[1], ner=item[2])
+ elif self.pos is True:
+ instance = Instance(words=sent_words, pos_tags=item[1])
+ elif self.ner is True:
+ instance = Instance(words=sent_words, ner=item[1])
+ else:
+ instance = Instance(words=sent_words)
+ data_set.append(instance)
+ data_set.apply(lambda ins: len(ins[Const.INPUT]), new_field_name=Const.INPUT_LEN)
+ return data_set
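A usage sketch (the corpus path is a placeholder; the available fields follow `convert` above):

```python
from fastNLP.io.data_loader import PeopleDailyCorpusLoader

loader = PeopleDailyCorpusLoader(pos=True, ner=True)
ds = loader.load('people_daily/199801.txt')
print(ds[0]['words'], ds[0]['pos_tags'], ds[0]['ner'])
```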
diff --git a/fastNLP/io/data_loader/qnli.py b/fastNLP/io/data_loader/qnli.py
new file mode 100644
index 00000000..ff6302b2
--- /dev/null
+++ b/fastNLP/io/data_loader/qnli.py
@@ -0,0 +1,45 @@
+
+from ...core.const import Const
+
+from .matching import MatchingLoader
+from ..dataset_loader import CSVLoader
+
+
+class QNLILoader(MatchingLoader, CSVLoader):
+ """
+    Alias: :class:`fastNLP.io.QNLILoader` :class:`fastNLP.io.data_loader.QNLILoader`
+
+    Reads the QNLI dataset. The resulting DataSet contains the fields::
+
+        words1: list(str), the first sentence (premise)
+        words2: list(str), the second sentence (hypothesis)
+        target: str, the gold label
+
+    Data source:
+ """
+
+ def __init__(self, paths: dict=None):
+ paths = paths if paths is not None else {
+ 'train': 'train.tsv',
+ 'dev': 'dev.tsv',
+            'test': 'test.tsv'    # the test set has no labels
+ }
+ MatchingLoader.__init__(self, paths=paths)
+ self.fields = {
+ 'question': Const.INPUTS(0),
+ 'sentence': Const.INPUTS(1),
+ 'label': Const.TARGET,
+ }
+ CSVLoader.__init__(self, sep='\t')
+
+ def _load(self, path):
+ ds = CSVLoader._load(self, path)
+
+ for k, v in self.fields.items():
+ if k in ds.get_field_names():
+ ds.rename_field(k, v)
+ for fields in ds.get_all_fields():
+ if Const.INPUT in fields:
+ ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields)
+
+ return ds
diff --git a/fastNLP/io/data_loader/quora.py b/fastNLP/io/data_loader/quora.py
new file mode 100644
index 00000000..12cc42ce
--- /dev/null
+++ b/fastNLP/io/data_loader/quora.py
@@ -0,0 +1,32 @@
+
+from ...core.const import Const
+
+from .matching import MatchingLoader
+from ..dataset_loader import CSVLoader
+
+
+class QuoraLoader(MatchingLoader, CSVLoader):
+ """
+    Alias: :class:`fastNLP.io.QuoraLoader` :class:`fastNLP.io.data_loader.QuoraLoader`
+
+    Reads the Quora question-pairs dataset. The resulting DataSet contains the fields::
+
+        words1: list(str), the first sentence (premise)
+        words2: list(str), the second sentence (hypothesis)
+        target: str, the gold label
+
+    Data source:
+ """
+
+ def __init__(self, paths: dict=None):
+ paths = paths if paths is not None else {
+ 'train': 'train.tsv',
+ 'dev': 'dev.tsv',
+ 'test': 'test.tsv',
+ }
+ MatchingLoader.__init__(self, paths=paths)
+ CSVLoader.__init__(self, sep='\t', headers=(Const.TARGET, Const.INPUTS(0), Const.INPUTS(1), 'pairID'))
+
+ def _load(self, path):
+ ds = CSVLoader._load(self, path)
+ return ds
diff --git a/fastNLP/io/data_loader/rte.py b/fastNLP/io/data_loader/rte.py
new file mode 100644
index 00000000..c6c64ef8
--- /dev/null
+++ b/fastNLP/io/data_loader/rte.py
@@ -0,0 +1,45 @@
+
+from ...core.const import Const
+
+from .matching import MatchingLoader
+from ..dataset_loader import CSVLoader
+
+
+class RTELoader(MatchingLoader, CSVLoader):
+ """
+    Alias: :class:`fastNLP.io.RTELoader` :class:`fastNLP.io.data_loader.RTELoader`
+
+    Reads the RTE dataset. The resulting DataSet contains the fields::
+
+        words1: list(str), the first sentence (premise)
+        words2: list(str), the second sentence (hypothesis)
+        target: str, the gold label
+
+    Data source:
+ """
+
+ def __init__(self, paths: dict=None):
+ paths = paths if paths is not None else {
+ 'train': 'train.tsv',
+ 'dev': 'dev.tsv',
+            'test': 'test.tsv'    # the test set has no labels
+ }
+ MatchingLoader.__init__(self, paths=paths)
+ self.fields = {
+ 'sentence1': Const.INPUTS(0),
+ 'sentence2': Const.INPUTS(1),
+ 'label': Const.TARGET,
+ }
+ CSVLoader.__init__(self, sep='\t')
+
+ def _load(self, path):
+ ds = CSVLoader._load(self, path)
+
+ for k, v in self.fields.items():
+ if k in ds.get_field_names():
+ ds.rename_field(k, v)
+ for fields in ds.get_all_fields():
+ if Const.INPUT in fields:
+ ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields)
+
+ return ds
diff --git a/fastNLP/io/data_loader/snli.py b/fastNLP/io/data_loader/snli.py
new file mode 100644
index 00000000..8334fcfd
--- /dev/null
+++ b/fastNLP/io/data_loader/snli.py
@@ -0,0 +1,44 @@
+
+from ...core.const import Const
+
+from .matching import MatchingLoader
+from ..dataset_loader import JsonLoader
+
+
+class SNLILoader(MatchingLoader, JsonLoader):
+ """
+    Alias: :class:`fastNLP.io.SNLILoader` :class:`fastNLP.io.data_loader.SNLILoader`
+
+    Reads the SNLI dataset. The resulting DataSet contains the fields::
+
+        words1: list(str), the first sentence (premise)
+        words2: list(str), the second sentence (hypothesis)
+        target: str, the gold label
+
+    Data source: https://nlp.stanford.edu/projects/snli/snli_1.0.zip
+ """
+
+ def __init__(self, paths: dict=None):
+ fields = {
+ 'sentence1_binary_parse': Const.INPUTS(0),
+ 'sentence2_binary_parse': Const.INPUTS(1),
+ 'gold_label': Const.TARGET,
+ }
+ paths = paths if paths is not None else {
+ 'train': 'snli_1.0_train.jsonl',
+ 'dev': 'snli_1.0_dev.jsonl',
+ 'test': 'snli_1.0_test.jsonl'}
+ MatchingLoader.__init__(self, paths=paths)
+ JsonLoader.__init__(self, fields=fields)
+
+ def _load(self, path):
+ ds = JsonLoader._load(self, path)
+
+ parentheses_table = str.maketrans({'(': None, ')': None})
+
+ ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
+ new_field_name=Const.INPUTS(0))
+ ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
+ new_field_name=Const.INPUTS(1))
+ ds.drop(lambda x: x[Const.TARGET] == '-')
+ return ds
diff --git a/fastNLP/io/data_loader/sst.py b/fastNLP/io/data_loader/sst.py
index 1e1b8bef..df46b47f 100644
--- a/fastNLP/io/data_loader/sst.py
+++ b/fastNLP/io/data_loader/sst.py
@@ -1,18 +1,19 @@
-from typing import Iterable
+
+from typing import Union, Dict
from nltk import Tree
-from ..base_loader import DataInfo, DataSetLoader
+
+from ..base_loader import DataBundle, DataSetLoader
+from ..dataset_loader import CSVLoader
from ...core.vocabulary import VocabularyOption, Vocabulary
from ...core.dataset import DataSet
+from ...core.const import Const
from ...core.instance import Instance
-from ..embed_loader import EmbeddingOption, EmbedLoader
+from ..utils import check_dataloader_paths, get_tokenizer
class SSTLoader(DataSetLoader):
- URL = 'https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip'
- DATA_DIR = 'sst/'
-
"""
-    Alias: :class:`fastNLP.io.SSTLoader` :class:`fastNLP.io.dataset_loader.SSTLoader`
+    Alias: :class:`fastNLP.io.SSTLoader` :class:`fastNLP.io.data_loader.SSTLoader`
    Reads the SST dataset. The DataSet contains the fields::
@@ -25,6 +26,9 @@ class SSTLoader(DataSetLoader):
    :param fine_grained: whether to use the SST-5 standard; if ``False``, SST-2 is used. Default: ``False``
"""
+ URL = 'https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip'
+ DATA_DIR = 'sst/'
+
def __init__(self, subtree=False, fine_grained=False):
self.subtree = subtree
@@ -34,6 +38,7 @@ class SSTLoader(DataSetLoader):
tag_v['0'] = tag_v['1']
tag_v['4'] = tag_v['3']
self.tag_v = tag_v
+ self.tokenizer = get_tokenizer()
def _load(self, path):
"""
@@ -52,29 +57,37 @@ class SSTLoader(DataSetLoader):
ds.append(Instance(words=words, target=tag))
return ds
- @staticmethod
- def _get_one(data, subtree):
+ def _get_one(self, data, subtree):
tree = Tree.fromstring(data)
if subtree:
- return [(t.leaves(), t.label()) for t in tree.subtrees()]
- return [(tree.leaves(), tree.label())]
+ return [(self.tokenizer(' '.join(t.leaves())), t.label()) for t in tree.subtrees() ]
+ return [(self.tokenizer(' '.join(tree.leaves())), tree.label())]
def process(self,
- paths,
- train_ds: Iterable[str] = None,
+ paths, train_subtree=True,
src_vocab_op: VocabularyOption = None,
- tgt_vocab_op: VocabularyOption = None,
- src_embed_op: EmbeddingOption = None):
+ tgt_vocab_op: VocabularyOption = None,):
+ paths = check_dataloader_paths(paths)
input_name, target_name = 'words', 'target'
src_vocab = Vocabulary() if src_vocab_op is None else Vocabulary(**src_vocab_op)
tgt_vocab = Vocabulary(unknown=None, padding=None) \
if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op)
- info = DataInfo(datasets=self.load(paths))
- _train_ds = [info.datasets[name]
- for name in train_ds] if train_ds else info.datasets.values()
- src_vocab.from_dataset(*_train_ds, field_name=input_name)
- tgt_vocab.from_dataset(*_train_ds, field_name=target_name)
+ info = DataBundle()
+ origin_subtree = self.subtree
+ self.subtree = train_subtree
+ info.datasets['train'] = self._load(paths['train'])
+ self.subtree = origin_subtree
+ for n, p in paths.items():
+ if n != 'train':
+ info.datasets[n] = self._load(p)
+
+ src_vocab.from_dataset(
+ info.datasets['train'],
+ field_name=input_name,
+ no_create_entry_dataset=[ds for n, ds in info.datasets.items() if n != 'train'])
+ tgt_vocab.from_dataset(info.datasets['train'], field_name=target_name)
+
src_vocab.index_dataset(
*info.datasets.values(),
field_name=input_name, new_field_name=input_name)
@@ -86,10 +99,77 @@ class SSTLoader(DataSetLoader):
target_name: tgt_vocab
}
- if src_embed_op is not None:
- src_embed_op.vocab = src_vocab
- init_emb = EmbedLoader.load_with_vocab(**src_embed_op)
- info.embeddings[input_name] = init_emb
+ return info
+
+
+class SST2Loader(CSVLoader):
+ """
+    Data source "SST-2": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8'
+ """
+
+ def __init__(self):
+ super(SST2Loader, self).__init__(sep='\t')
+ self.tokenizer = get_tokenizer()
+ self.field = {'sentence': Const.INPUT, 'label': Const.TARGET}
+
+ def _load(self, path: str) -> DataSet:
+ ds = super(SST2Loader, self)._load(path)
+ for k, v in self.field.items():
+ if k in ds.get_field_names():
+ ds.rename_field(k, v)
+ ds.apply(lambda x: self.tokenizer(x[Const.INPUT]), new_field_name=Const.INPUT)
+ print("all count:", len(ds))
+ return ds
+
+ def process(self,
+ paths: Union[str, Dict[str, str]],
+ src_vocab_opt: VocabularyOption = None,
+ tgt_vocab_opt: VocabularyOption = None,
+ char_level_op=False):
+
+ paths = check_dataloader_paths(paths)
+ datasets = {}
+ info = DataBundle()
+ for name, path in paths.items():
+ dataset = self.load(path)
+ datasets[name] = dataset
+
+ def wordtochar(words):
+ chars = []
+ for word in words:
+ word = word.lower()
+ for char in word:
+ chars.append(char)
+ chars.append('')
+ chars.pop()
+ return chars
+
+ input_name, target_name = Const.INPUT, Const.TARGET
+ info.vocabs={}
+
+        # split the input into characters
+ if char_level_op:
+ for dataset in datasets.values():
+ dataset.apply_field(wordtochar, field_name=Const.INPUT, new_field_name=Const.CHAR_INPUT)
+ src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt)
+ src_vocab.from_dataset(datasets['train'], field_name=Const.INPUT)
+ src_vocab.index_dataset(*datasets.values(), field_name=Const.INPUT)
+
+ tgt_vocab = Vocabulary(unknown=None, padding=None) \
+ if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt)
+ tgt_vocab.from_dataset(datasets['train'], field_name=Const.TARGET)
+ tgt_vocab.index_dataset(*datasets.values(), field_name=Const.TARGET)
+
+ info.vocabs = {
+ Const.INPUT: src_vocab,
+ Const.TARGET: tgt_vocab
+ }
+
+ info.datasets = datasets
+
+ for name, dataset in info.datasets.items():
+ dataset.set_input(Const.INPUT)
+ dataset.set_target(Const.TARGET)
return info
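Usage sketches for the two loaders (all paths are placeholders):

```python
from fastNLP.io.data_loader import SSTLoader, SST2Loader

# train_subtree=True keeps every subtree of the training trees as an extra
# example (the usual SST trick); dev/test use the loader's subtree setting.
sst = SSTLoader(fine_grained=True).process(
    paths={'train': 'sst/train.txt', 'dev': 'sst/dev.txt', 'test': 'sst/test.txt'},
    train_subtree=True)

# SST-2 ships as TSV with `sentence` and `label` columns.
sst2 = SST2Loader().process(paths={'train': 'SST-2/train.tsv',
                                   'dev': 'SST-2/dev.tsv'})
```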
diff --git a/fastNLP/io/data_loader/yelp.py b/fastNLP/io/data_loader/yelp.py
new file mode 100644
index 00000000..c287a90c
--- /dev/null
+++ b/fastNLP/io/data_loader/yelp.py
@@ -0,0 +1,127 @@
+
+import csv
+from typing import Iterable
+
+from ...core.const import Const
+from ...core.dataset import DataSet
+from ...core.instance import Instance
+from ...core.vocabulary import VocabularyOption, Vocabulary
+from ..base_loader import DataBundle, DataSetLoader
+from typing import Union, Dict
+from ..utils import check_dataloader_paths, get_tokenizer
+
+
+class YelpLoader(DataSetLoader):
+ """
+    Reads the yelp_full / yelp_polarity datasets. The DataSet contains the fields:
+        words: list(str), the text to classify
+        target: str, the label of the text
+        chars: list(str), the un-indexed character list
+
+    Datasets: yelp_full / yelp_polarity
+    :param fine_grained: whether to use the five-class (fine-grained) standard; if ``False``, the binary version is used. Default: ``False``
+    :param lower: whether to lowercase automatically. Default: False.
+ """
+
+ def __init__(self, fine_grained=False, lower=False):
+ super(YelpLoader, self).__init__()
+ tag_v = {'1.0': 'very negative', '2.0': 'negative', '3.0': 'neutral',
+ '4.0': 'positive', '5.0': 'very positive'}
+ if not fine_grained:
+ tag_v['1.0'] = tag_v['2.0']
+ tag_v['5.0'] = tag_v['4.0']
+ self.fine_grained = fine_grained
+ self.tag_v = tag_v
+ self.lower = lower
+ self.tokenizer = get_tokenizer()
+
+ def _load(self, path):
+ ds = DataSet()
+ csv_reader = csv.reader(open(path, encoding='utf-8'))
+ all_count = 0
+ real_count = 0
+ for row in csv_reader:
+ all_count += 1
+ if len(row) == 2:
+ target = self.tag_v[row[0] + ".0"]
+ words = clean_str(row[1], self.tokenizer, self.lower)
+ if len(words) != 0:
+ ds.append(Instance(words=words, target=target))
+ real_count += 1
+ print("all count:", all_count)
+ print("real count:", real_count)
+ return ds
+
+ def process(self, paths: Union[str, Dict[str, str]],
+ train_ds: Iterable[str] = None,
+ src_vocab_op: VocabularyOption = None,
+ tgt_vocab_op: VocabularyOption = None,
+ char_level_op=False):
+ paths = check_dataloader_paths(paths)
+ info = DataBundle(datasets=self.load(paths))
+ src_vocab = Vocabulary() if src_vocab_op is None else Vocabulary(**src_vocab_op)
+ tgt_vocab = Vocabulary(unknown=None, padding=None) \
+ if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op)
+ _train_ds = [info.datasets[name]
+ for name in train_ds] if train_ds else info.datasets.values()
+
+ def wordtochar(words):
+ chars = []
+ for word in words:
+ word = word.lower()
+ for char in word:
+ chars.append(char)
+ chars.append('')
+ chars.pop()
+ return chars
+
+ input_name, target_name = Const.INPUT, Const.TARGET
+ info.vocabs = {}
+        # optionally split the words into a character sequence
+ if char_level_op:
+ for dataset in info.datasets.values():
+ dataset.apply_field(wordtochar, field_name=Const.INPUT, new_field_name=Const.CHAR_INPUT)
+ else:
+ src_vocab.from_dataset(*_train_ds, field_name=input_name)
+ src_vocab.index_dataset(*info.datasets.values(), field_name=input_name, new_field_name=input_name)
+ info.vocabs[input_name] = src_vocab
+
+ tgt_vocab.from_dataset(*_train_ds, field_name=target_name)
+ tgt_vocab.index_dataset(
+ *info.datasets.values(),
+ field_name=target_name, new_field_name=target_name)
+
+ info.vocabs[target_name] = tgt_vocab
+
+ info.datasets['train'], info.datasets['dev'] = info.datasets['train'].split(0.1, shuffle=False)
+
+ for name, dataset in info.datasets.items():
+ dataset.set_input(Const.INPUT)
+ dataset.set_target(Const.TARGET)
+
+ return info
+
+
+def clean_str(sentence, tokenizer, char_lower=False):
+ """
+    Heavily borrowed from
+    https://github.com/LukeZhuang/Hierarchical-Attention-Network/blob/master/yelp-preprocess.ipynb
+    :param sentence: str
+    :param tokenizer: callable that splits a sentence into tokens
+    :param char_lower: whether to lowercase the sentence before tokenizing
+    :return: list of cleaned tokens
+ """
+ if char_lower:
+ sentence = sentence.lower()
+ nonalpnum = re.compile('[^0-9a-zA-Z?!\']+')
+ words = tokenizer(sentence)
+ words_collection = []
+ for word in words:
+ if word in ['-lrb-', '-rrb-', '', '-r', '-l', 'b-']:
+ continue
+ tt = nonalpnum.split(word)
+ t = ''.join(tt)
+ if t != '':
+ words_collection.append(t)
+
+ return words_collection
+
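
To make the cleaning rules above concrete, here is a small sketch of what `clean_str` keeps and drops; a plain whitespace tokenizer stands in for the spaCy one:

```python
from fastNLP.io.data_loader.yelp import clean_str

tokenize = lambda s: s.split()
print(clean_str("The fries -LRB- and shakes ! -RRB- were A+ ...", tokenize, char_lower=True))
# ['the', 'fries', 'and', 'shakes', '!', 'were', 'a']
```

Bracket artifacts such as `-lrb-` are skipped outright, and any character outside `[0-9a-zA-Z?!']` is stripped from the remaining tokens.
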
diff --git a/fastNLP/io/dataset_loader.py b/fastNLP/io/dataset_loader.py
index 558fe20e..ad6bbdc1 100644
--- a/fastNLP/io/dataset_loader.py
+++ b/fastNLP/io/dataset_loader.py
@@ -15,202 +15,13 @@ dataset_loader模块实现了许多 DataSetLoader, 用于读取不同格式的
__all__ = [
'CSVLoader',
'JsonLoader',
- 'ConllLoader',
- 'SNLILoader',
- 'SSTLoader',
- 'PeopleDailyCorpusLoader',
- 'Conll2003Loader',
]
-import os
-from nltk import Tree
-from typing import Union, Dict
-from ..core.vocabulary import Vocabulary
+
from ..core.dataset import DataSet
from ..core.instance import Instance
-from .file_reader import _read_csv, _read_json, _read_conll
-from .base_loader import DataSetLoader, DataInfo
-from .data_loader.sst import SSTLoader
-from ..core.const import Const
-from ..modules.encoder._bert import BertTokenizer
-
-
-class PeopleDailyCorpusLoader(DataSetLoader):
- """
- 别名::class:`fastNLP.io.PeopleDailyCorpusLoader` :class:`fastNLP.io.dataset_loader.PeopleDailyCorpusLoader`
-
- 读取人民日报数据集
- """
-
- def __init__(self, pos=True, ner=True):
- super(PeopleDailyCorpusLoader, self).__init__()
- self.pos = pos
- self.ner = ner
-
- def _load(self, data_path):
- with open(data_path, "r", encoding="utf-8") as f:
- sents = f.readlines()
- examples = []
- for sent in sents:
- if len(sent) <= 2:
- continue
- inside_ne = False
- sent_pos_tag = []
- sent_words = []
- sent_ner = []
- words = sent.strip().split()[1:]
- for word in words:
- if "[" in word and "]" in word:
- ner_tag = "U"
- print(word)
- elif "[" in word:
- inside_ne = True
- ner_tag = "B"
- word = word[1:]
- elif "]" in word:
- ner_tag = "L"
- word = word[:word.index("]")]
- if inside_ne is True:
- inside_ne = False
- else:
- raise RuntimeError("only ] appears!")
- else:
- if inside_ne is True:
- ner_tag = "I"
- else:
- ner_tag = "O"
- tmp = word.split("/")
- token, pos = tmp[0], tmp[1]
- sent_ner.append(ner_tag)
- sent_pos_tag.append(pos)
- sent_words.append(token)
- example = [sent_words]
- if self.pos is True:
- example.append(sent_pos_tag)
- if self.ner is True:
- example.append(sent_ner)
- examples.append(example)
- return self.convert(examples)
-
- def convert(self, data):
- """
-
- :param data: python 内置对象
- :return: 一个 :class:`~fastNLP.DataSet` 类型的对象
- """
- data_set = DataSet()
- for item in data:
- sent_words = item[0]
- if self.pos is True and self.ner is True:
- instance = Instance(
- words=sent_words, pos_tags=item[1], ner=item[2])
- elif self.pos is True:
- instance = Instance(words=sent_words, pos_tags=item[1])
- elif self.ner is True:
- instance = Instance(words=sent_words, ner=item[1])
- else:
- instance = Instance(words=sent_words)
- data_set.append(instance)
- data_set.apply(lambda ins: len(ins["words"]), new_field_name="seq_len")
- return data_set
-
-
-class ConllLoader(DataSetLoader):
- """
- 别名::class:`fastNLP.io.ConllLoader` :class:`fastNLP.io.dataset_loader.ConllLoader`
-
- 读取Conll格式的数据. 数据格式详见 http://conll.cemantix.org/2012/data.html. 数据中以"-DOCSTART-"开头的行将被忽略,因为
- 该符号在conll 2003中被用为文档分割符。
-
- 列号从0开始, 每列对应内容为::
-
- Column Type
- 0 Document ID
- 1 Part number
- 2 Word number
- 3 Word itself
- 4 Part-of-Speech
- 5 Parse bit
- 6 Predicate lemma
- 7 Predicate Frameset ID
- 8 Word sense
- 9 Speaker/Author
- 10 Named Entities
- 11:N Predicate Arguments
- N Coreference
-
- :param headers: 每一列数据的名称,需为List or Tuple of str。``header`` 与 ``indexes`` 一一对应
- :param indexes: 需要保留的数据列下标,从0开始。若为 ``None`` ,则所有列都保留。Default: ``None``
- :param dropna: 是否忽略非法数据,若 ``False`` ,遇到非法数据时抛出 ``ValueError`` 。Default: ``False``
- """
-
- def __init__(self, headers, indexes=None, dropna=False):
- super(ConllLoader, self).__init__()
- if not isinstance(headers, (list, tuple)):
- raise TypeError(
- 'invalid headers: {}, should be list of strings'.format(headers))
- self.headers = headers
- self.dropna = dropna
- if indexes is None:
- self.indexes = list(range(len(self.headers)))
- else:
- if len(indexes) != len(headers):
- raise ValueError
- self.indexes = indexes
-
- def _load(self, path):
- ds = DataSet()
- for idx, data in _read_conll(path, indexes=self.indexes, dropna=self.dropna):
- ins = {h: data[i] for i, h in enumerate(self.headers)}
- ds.append(Instance(**ins))
- return ds
-
-
-class Conll2003Loader(ConllLoader):
- """
- 别名::class:`fastNLP.io.Conll2003Loader` :class:`fastNLP.io.dataset_loader.Conll2003Loader`
-
- 读取Conll2003数据
-
- 关于数据集的更多信息,参考:
- https://sites.google.com/site/ermasoftware/getting-started/ne-tagging-conll2003-data
- """
-
- def __init__(self):
- headers = [
- 'tokens', 'pos', 'chunks', 'ner',
- ]
- super(Conll2003Loader, self).__init__(headers=headers)
-
-
-def _cut_long_sentence(sent, max_sample_length=200):
- """
- 将长于max_sample_length的sentence截成多段,只会在有空格的地方发生截断。
- 所以截取的句子可能长于或者短于max_sample_length
-
- :param sent: str.
- :param max_sample_length: int.
- :return: list of str.
- """
- sent_no_space = sent.replace(' ', '')
- cutted_sentence = []
- if len(sent_no_space) > max_sample_length:
- parts = sent.strip().split()
- new_line = ''
- length = 0
- for part in parts:
- length += len(part)
- new_line += part + ' '
- if length > max_sample_length:
- new_line = new_line[:-1]
- cutted_sentence.append(new_line)
- length = 0
- new_line = ''
- if new_line != '':
- cutted_sentence.append(new_line[:-1])
- else:
- cutted_sentence.append(sent)
- return cutted_sentence
+from .file_reader import _read_csv, _read_json
+from .base_loader import DataSetLoader
class JsonLoader(DataSetLoader):
@@ -249,42 +60,6 @@ class JsonLoader(DataSetLoader):
return ds
-class SNLILoader(JsonLoader):
- """
- 别名::class:`fastNLP.io.SNLILoader` :class:`fastNLP.io.dataset_loader.SNLILoader`
-
- 读取SNLI数据集,读取的DataSet包含fields::
-
- words1: list(str),第一句文本, premise
- words2: list(str), 第二句文本, hypothesis
- target: str, 真实标签
-
- 数据来源: https://nlp.stanford.edu/projects/snli/snli_1.0.zip
- """
-
- def __init__(self):
- fields = {
- 'sentence1_parse': Const.INPUTS(0),
- 'sentence2_parse': Const.INPUTS(1),
- 'gold_label': Const.TARGET,
- }
- super(SNLILoader, self).__init__(fields=fields)
-
- def _load(self, path):
- ds = super(SNLILoader, self)._load(path)
-
- def parse_tree(x):
- t = Tree.fromstring(x)
- return t.leaves()
-
- ds.apply(lambda ins: parse_tree(
- ins[Const.INPUTS(0)]), new_field_name=Const.INPUTS(0))
- ds.apply(lambda ins: parse_tree(
- ins[Const.INPUTS(1)]), new_field_name=Const.INPUTS(1))
- ds.drop(lambda x: x[Const.TARGET] == '-')
- return ds
-
-
class CSVLoader(DataSetLoader):
"""
别名::class:`fastNLP.io.CSVLoader` :class:`fastNLP.io.dataset_loader.CSVLoader`
@@ -311,6 +86,36 @@ class CSVLoader(DataSetLoader):
return ds
+def _cut_long_sentence(sent, max_sample_length=200):
+ """
+    Split a sentence longer than max_sample_length into several pieces; cuts happen only at
+    spaces, so each piece may end up longer or shorter than max_sample_length.
+
+ :param sent: str.
+ :param max_sample_length: int.
+ :return: list of str.
+ """
+ sent_no_space = sent.replace(' ', '')
+ cutted_sentence = []
+ if len(sent_no_space) > max_sample_length:
+ parts = sent.strip().split()
+ new_line = ''
+ length = 0
+ for part in parts:
+ length += len(part)
+ new_line += part + ' '
+ if length > max_sample_length:
+ new_line = new_line[:-1]
+ cutted_sentence.append(new_line)
+ length = 0
+ new_line = ''
+ if new_line != '':
+ cutted_sentence.append(new_line[:-1])
+ else:
+ cutted_sentence.append(sent)
+ return cutted_sentence
+
+
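
For illustration, running `_cut_long_sentence` on a sentence of 100 five-character tokens (500 non-space characters) cuts at spaces once the accumulated length exceeds the limit:

```python
from fastNLP.io.dataset_loader import _cut_long_sentence

sent = ' '.join(['token'] * 100)                  # 500 non-space characters
pieces = _cut_long_sentence(sent, max_sample_length=200)
print(len(pieces))                                # 3
print([len(p.replace(' ', '')) for p in pieces])  # [205, 205, 90]
```
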
def _add_seg_tag(data):
"""
diff --git a/fastNLP/io/file_utils.py b/fastNLP/io/file_utils.py
index 04970cb3..cb762eb7 100644
--- a/fastNLP/io/file_utils.py
+++ b/fastNLP/io/file_utils.py
@@ -17,6 +17,10 @@ PRETRAINED_BERT_MODEL_DIR = {
'en-large-uncased': 'bert-large-uncased-20939f45.zip',
'en-large-cased': 'bert-large-cased-e0cf90fc.zip',
+ 'en-large-cased-wwm': 'bert-large-cased-wwm-a457f118.zip',
+ 'en-large-uncased-wwm': 'bert-large-uncased-wwm-92a50aeb.zip',
+ 'en-base-cased-mrpc': 'bert-base-cased-finetuned-mrpc-c7099855.zip',
+
'cn': 'bert-base-chinese-29d0a84a.zip',
'cn-base': 'bert-base-chinese-29d0a84a.zip',
@@ -68,6 +72,7 @@ def cached_path(url_or_filename: str, cache_dir: Path=None) -> Path:
"unable to parse {} as a URL or as a local path".format(url_or_filename)
)
+
def get_filepath(filepath):
"""
如果filepath中只有一个文件,则直接返回对应的全路径
@@ -82,6 +87,7 @@ def get_filepath(filepath):
return filepath
return filepath
+
def get_defalt_path():
"""
获取默认的fastNLP存放路径, 如果将FASTNLP_CACHE_PATH设置在了环境变量中,将使用环境变量的值,使得不用每个用户都去下载。
@@ -98,6 +104,7 @@ def get_defalt_path():
fastnlp_cache_dir = os.path.expanduser(os.path.join("~", ".fastNLP"))
return fastnlp_cache_dir
+
def _get_base_url(name):
# 返回的URL结尾必须是/
if 'FASTNLP_BASE_URL' in os.environ:
@@ -105,6 +112,7 @@ def _get_base_url(name):
return fastnlp_base_url
raise RuntimeError("There function is not available right now.")
+
def split_filename_suffix(filepath):
"""
给定filepath返回对应的name和suffix
@@ -116,6 +124,7 @@ def split_filename_suffix(filepath):
return filename[:-7], '.tar.gz'
return os.path.splitext(filename)
+
def get_from_cache(url: str, cache_dir: Path = None) -> Path:
"""
尝试在cache_dir中寻找url定义的资源; 如果没有找到。则从url下载并将结果放在cache_dir下,缓存的名称由url的结果推断而来。
@@ -226,6 +235,7 @@ def get_from_cache(url: str, cache_dir: Path = None) -> Path:
return get_filepath(cache_path)
+
def unzip_file(file: Path, to: Path):
# unpack and write out in CoNLL column-like format
from zipfile import ZipFile
@@ -234,13 +244,15 @@ def unzip_file(file: Path, to: Path):
# Extract all the contents of zip file in current directory
zipObj.extractall(to)
+
def untar_gz_file(file:Path, to:Path):
import tarfile
with tarfile.open(file, 'r:gz') as tar:
tar.extractall(to)
-def match_file(dir_name:str, cache_dir:str)->str:
+
+def match_file(dir_name: str, cache_dir: str) -> str:
"""
匹配的原则是,在cache_dir下的文件: (1) 与dir_name完全一致; (2) 除了后缀以外和dir_name完全一致。
如果找到了两个匹配的结果将报错. 如果找到了则返回匹配的文件的名称; 没有找到返回空字符串
@@ -261,6 +273,7 @@ def match_file(dir_name:str, cache_dir:str)->str:
else:
raise RuntimeError(f"Duplicate matched files:{matched_filenames}, this should be caused by a bug.")
+
if __name__ == '__main__':
cache_dir = Path('caches')
cache_dir = None
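
As a quick reference for the suffix handling above, the same logic can be exercised standalone; `.tar.gz` is treated as a single suffix, and everything else falls back to `os.path.splitext`:

```python
import os

def split_filename_suffix(filepath):  # mirrors the helper in fastNLP/io/file_utils.py
    filename = os.path.basename(filepath)
    if filename.endswith('.tar.gz'):
        return filename[:-7], '.tar.gz'
    return os.path.splitext(filename)

print(split_filename_suffix('caches/bert-base-chinese.tar.gz'))  # ('bert-base-chinese', '.tar.gz')
print(split_filename_suffix('caches/elmo.zip'))                  # ('elmo', '.zip')
```
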
diff --git a/fastNLP/io/utils.py b/fastNLP/io/utils.py
new file mode 100644
index 00000000..a7d2de85
--- /dev/null
+++ b/fastNLP/io/utils.py
@@ -0,0 +1,69 @@
+import os
+
+from typing import Union, Dict
+
+
+def check_dataloader_paths(paths: Union[str, Dict[str, str]]) -> Dict[str, str]:
+    """
+    Validate the paths passed to a dataloader. For valid input, return a dict that contains at
+    least the key 'train', similar to
+    {
+        'train': '/some/path/to/',  # always present; the vocabulary should be built on this
+                                    # split, the remaining files only need to be processed and indexed.
+        'test': 'xxx'               # may or may not be present
+        ...
+    }
+    If paths is invalid, the corresponding error is raised directly.
+
+    :param paths: a single file path (taken to be the train file); or a directory, in which files
+        whose names contain 'train', 'dev' or 'test' are searched for; or a dict whose keys are
+        user-defined names for the files and whose values are the corresponding file paths.
+    :return:
+    """
+ if isinstance(paths, str):
+ if os.path.isfile(paths):
+ return {'train': paths}
+ elif os.path.isdir(paths):
+ filenames = os.listdir(paths)
+ files = {}
+ for filename in filenames:
+ path_pair = None
+ if 'train' in filename:
+ path_pair = ('train', filename)
+ if 'dev' in filename:
+ if path_pair:
+                        raise Exception("File:{} in {} contains both `{}` and `dev`.".format(filename, paths, path_pair[0]))
+ path_pair = ('dev', filename)
+ if 'test' in filename:
+ if path_pair:
+                        raise Exception("File:{} in {} contains both `{}` and `test`.".format(filename, paths, path_pair[0]))
+ path_pair = ('test', filename)
+ if path_pair:
+ files[path_pair[0]] = os.path.join(paths, path_pair[1])
+ return files
+ else:
+ raise FileNotFoundError(f"{paths} is not a valid file path.")
+
+ elif isinstance(paths, dict):
+ if paths:
+ if 'train' not in paths:
+ raise KeyError("You have to include `train` in your dict.")
+ for key, value in paths.items():
+ if isinstance(key, str) and isinstance(value, str):
+ if not os.path.isfile(value):
+                    raise FileNotFoundError(f"{value} is not a valid file.")
+ else:
+ raise TypeError("All keys and values in paths should be str.")
+ return paths
+ else:
+ raise ValueError("Empty paths is not allowed.")
+ else:
+ raise TypeError(f"paths only supports str and dict. not {type(paths)}.")
+
+
+def get_tokenizer():
+ try:
+ import spacy
+ spacy.prefer_gpu()
+ en = spacy.load('en')
+ print('use spacy tokenizer')
+ return lambda x: [w.text for w in en.tokenizer(x)]
+ except Exception as e:
+ print('use raw tokenizer')
+ return lambda x: x.split()
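
A runnable sketch of the two new helpers; the files are created in a temporary directory purely for the demo:

```python
import os
import tempfile

from fastNLP.io.utils import check_dataloader_paths, get_tokenizer

tmp = tempfile.mkdtemp()
for name in ('train.txt', 'dev.txt', 'test.txt'):
    open(os.path.join(tmp, name), 'w').close()

print(sorted(check_dataloader_paths(tmp)))                     # ['dev', 'test', 'train']
print(check_dataloader_paths(os.path.join(tmp, 'train.txt')))  # {'train': '<tmp>/train.txt'}

tokenize = get_tokenizer()  # spaCy if available, otherwise str.split
print(tokenize("fastNLP makes NLP easy."))
```
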
diff --git a/fastNLP/models/bert.py b/fastNLP/models/bert.py
index 4846c7fa..fb186ce4 100644
--- a/fastNLP/models/bert.py
+++ b/fastNLP/models/bert.py
@@ -8,35 +8,7 @@ from torch import nn
from .base_model import BaseModel
from ..core.const import Const
from ..modules.encoder import BertModel
-
-
-class BertConfig:
-
- def __init__(
- self,
- vocab_size=30522,
- hidden_size=768,
- num_hidden_layers=12,
- num_attention_heads=12,
- intermediate_size=3072,
- hidden_act="gelu",
- hidden_dropout_prob=0.1,
- attention_probs_dropout_prob=0.1,
- max_position_embeddings=512,
- type_vocab_size=2,
- initializer_range=0.02
- ):
- self.vocab_size = vocab_size
- self.hidden_size = hidden_size
- self.num_hidden_layers = num_hidden_layers
- self.num_attention_heads = num_attention_heads
- self.intermediate_size = intermediate_size
- self.hidden_act = hidden_act
- self.hidden_dropout_prob = hidden_dropout_prob
- self.attention_probs_dropout_prob = attention_probs_dropout_prob
- self.max_position_embeddings = max_position_embeddings
- self.type_vocab_size = type_vocab_size
- self.initializer_range = initializer_range
+from ..modules.encoder._bert import BertConfig
class BertForSequenceClassification(BaseModel):
@@ -84,11 +56,17 @@ class BertForSequenceClassification(BaseModel):
self.bert = BertModel.from_pretrained(bert_dir)
else:
if config is None:
- config = BertConfig()
- self.bert = BertModel(**config.__dict__)
+ config = BertConfig(30522)
+ self.bert = BertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, num_labels)
+ @classmethod
+ def from_pretrained(cls, num_labels, pretrained_model_dir):
+ config = BertConfig(pretrained_model_dir)
+ model = cls(num_labels=num_labels, config=config, bert_dir=pretrained_model_dir)
+ return model
+
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
_, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
pooled_output = self.dropout(pooled_output)
@@ -151,11 +129,17 @@ class BertForMultipleChoice(BaseModel):
self.bert = BertModel.from_pretrained(bert_dir)
else:
if config is None:
- config = BertConfig()
- self.bert = BertModel(**config.__dict__)
+ config = BertConfig(30522)
+ self.bert = BertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, 1)
+ @classmethod
+ def from_pretrained(cls, num_choices, pretrained_model_dir):
+ config = BertConfig(pretrained_model_dir)
+ model = cls(num_choices=num_choices, config=config, bert_dir=pretrained_model_dir)
+ return model
+
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
flat_input_ids = input_ids.view(-1, input_ids.size(-1))
flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1))
@@ -224,11 +208,17 @@ class BertForTokenClassification(BaseModel):
self.bert = BertModel.from_pretrained(bert_dir)
else:
if config is None:
- config = BertConfig()
- self.bert = BertModel(**config.__dict__)
+ config = BertConfig(30522)
+ self.bert = BertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, num_labels)
+ @classmethod
+ def from_pretrained(cls, num_labels, pretrained_model_dir):
+ config = BertConfig(pretrained_model_dir)
+ model = cls(num_labels=num_labels, config=config, bert_dir=pretrained_model_dir)
+ return model
+
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
sequence_output = self.dropout(sequence_output)
@@ -302,12 +292,18 @@ class BertForQuestionAnswering(BaseModel):
self.bert = BertModel.from_pretrained(bert_dir)
else:
if config is None:
- config = BertConfig()
- self.bert = BertModel(**config.__dict__)
+ config = BertConfig(30522)
+ self.bert = BertModel(config)
# TODO check with Google if it's normal there is no dropout on the token classifier of SQuAD in the TF version
# self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.qa_outputs = nn.Linear(config.hidden_size, 2)
+ @classmethod
+ def from_pretrained(cls, pretrained_model_dir):
+ config = BertConfig(pretrained_model_dir)
+ model = cls(config=config, bert_dir=pretrained_model_dir)
+ return model
+
def forward(self, input_ids, token_type_ids=None, attention_mask=None, start_positions=None, end_positions=None):
sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
logits = self.qa_outputs(sequence_output)
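
The four new `from_pretrained` classmethods all follow the same pattern; a hypothetical call (the checkpoint directory is a placeholder and must contain the config and weight files):

```python
from fastNLP.models.bert import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    num_labels=2,
    pretrained_model_dir='/path/to/bert-base-uncased',  # placeholder path
)
```
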
diff --git a/fastNLP/models/snli.py b/fastNLP/models/snli.py
index 395a9bbf..d12524cc 100644
--- a/fastNLP/models/snli.py
+++ b/fastNLP/models/snli.py
@@ -4,149 +4,209 @@ __all__ = [
import torch
import torch.nn as nn
+import torch.nn.functional as F
-from .base_model import BaseModel
-from ..core.const import Const
-from ..modules import decoder as Decoder
-from ..modules import encoder as Encoder
-from ..modules import aggregator as Aggregator
-from ..core.utils import seq_len_to_mask
+from torch.nn import CrossEntropyLoss
-my_inf = 10e12
+from fastNLP.models import BaseModel
+from fastNLP.modules.encoder.embedding import TokenEmbedding
+from fastNLP.modules.encoder.lstm import LSTM
+from fastNLP.core.const import Const
+from fastNLP.core.utils import seq_len_to_mask
class ESIM(BaseModel):
- """
- 别名::class:`fastNLP.models.ESIM` :class:`fastNLP.models.snli.ESIM`
-
- ESIM模型的一个PyTorch实现。
- ESIM模型的论文: Enhanced LSTM for Natural Language Inference (arXiv: 1609.06038)
+    """A PyTorch implementation of the ESIM model.
+    Paper: https://arxiv.org/pdf/1609.06038.pdf
- :param int vocab_size: 词表大小
- :param int embed_dim: 词嵌入维度
- :param int hidden_size: LSTM隐层大小
- :param float dropout: dropout大小,默认为0
- :param int num_classes: 标签数目,默认为3
- :param numpy.array init_embedding: 初始词嵌入矩阵,形状为(vocab_size, embed_dim),默认为None,即随机初始化词嵌入矩阵
+    :param fastNLP.TokenEmbedding init_embedding: the TokenEmbedding to start from
+    :param int hidden_size: hidden size; defaults to the dimension of the Embedding
+    :param int num_labels: number of target label classes, default 3
+    :param float dropout_rate: dropout rate, default 0.3
+    :param float dropout_embed: dropout rate applied to the Embedding, default 0.1
"""
-
- def __init__(self, vocab_size, embed_dim, hidden_size, dropout=0.0, num_classes=3, init_embedding=None):
-
+
+ def __init__(self, init_embedding: TokenEmbedding, hidden_size=None, num_labels=3, dropout_rate=0.3,
+ dropout_embed=0.1):
super(ESIM, self).__init__()
- self.vocab_size = vocab_size
- self.embed_dim = embed_dim
- self.hidden_size = hidden_size
- self.dropout = dropout
- self.n_labels = num_classes
-
- self.drop = nn.Dropout(self.dropout)
-
- self.embedding = Encoder.Embedding(
- (self.vocab_size, self.embed_dim), dropout=self.dropout,
- )
-
- self.embedding_layer = nn.Linear(self.embed_dim, self.hidden_size)
-
- self.encoder = Encoder.LSTM(
- input_size=self.embed_dim, hidden_size=self.hidden_size, num_layers=1, bias=True,
- batch_first=True, bidirectional=True
- )
-
- self.bi_attention = Aggregator.BiAttention()
- self.mean_pooling = Aggregator.AvgPoolWithMask()
- self.max_pooling = Aggregator.MaxPoolWithMask()
-
- self.inference_layer = nn.Linear(self.hidden_size * 4, self.hidden_size)
-
- self.decoder = Encoder.LSTM(
- input_size=self.hidden_size, hidden_size=self.hidden_size, num_layers=1, bias=True,
- batch_first=True, bidirectional=True
- )
-
- self.output = Decoder.MLP([4 * self.hidden_size, self.hidden_size, self.n_labels], 'tanh', dropout=self.dropout)
-
- def forward(self, words1, words2, seq_len1=None, seq_len2=None, target=None):
- """ Forward function
-
- :param torch.Tensor words1: [batch size(B), premise seq len(PL)] premise的token表示
- :param torch.Tensor words2: [B, hypothesis seq len(HL)] hypothesis的token表示
- :param torch.LongTensor seq_len1: [B] premise的长度
- :param torch.LongTensor seq_len2: [B] hypothesis的长度
- :param torch.LongTensor target: [B] 真实目标值
- :return: dict prediction: [B, n_labels(N)] 预测结果
+
+ self.embedding = init_embedding
+ self.dropout_embed = EmbedDropout(p=dropout_embed)
+ if hidden_size is None:
+ hidden_size = self.embedding.embed_size
+ self.rnn = BiRNN(self.embedding.embed_size, hidden_size, dropout_rate=dropout_rate)
+ # self.rnn = LSTM(self.embedding.embed_size, hidden_size, dropout=dropout_rate, bidirectional=True)
+
+ self.interfere = nn.Sequential(nn.Dropout(p=dropout_rate),
+ nn.Linear(8 * hidden_size, hidden_size),
+ nn.ReLU())
+ nn.init.xavier_uniform_(self.interfere[1].weight.data)
+ self.bi_attention = SoftmaxAttention()
+
+ self.rnn_high = BiRNN(self.embedding.embed_size, hidden_size, dropout_rate=dropout_rate)
+ # self.rnn_high = LSTM(hidden_size, hidden_size, dropout=dropout_rate, bidirectional=True,)
+
+ self.classifier = nn.Sequential(nn.Dropout(p=dropout_rate),
+ nn.Linear(8 * hidden_size, hidden_size),
+ nn.Tanh(),
+ nn.Dropout(p=dropout_rate),
+ nn.Linear(hidden_size, num_labels))
+
+ self.dropout_rnn = nn.Dropout(p=dropout_rate)
+
+ nn.init.xavier_uniform_(self.classifier[1].weight.data)
+ nn.init.xavier_uniform_(self.classifier[4].weight.data)
+
+ def forward(self, words1, words2, seq_len1, seq_len2, target=None):
"""
-
- premise0 = self.embedding_layer(self.embedding(words1))
- hypothesis0 = self.embedding_layer(self.embedding(words2))
-
- if seq_len1 is not None:
- seq_len1 = seq_len_to_mask(seq_len1)
- else:
- seq_len1 = torch.ones(premise0.size(0), premise0.size(1))
- seq_len1 = (seq_len1.long()).to(device=premise0.device)
- if seq_len2 is not None:
- seq_len2 = seq_len_to_mask(seq_len2)
- else:
- seq_len2 = torch.ones(hypothesis0.size(0), hypothesis0.size(1))
- seq_len2 = (seq_len2.long()).to(device=hypothesis0.device)
-
- _BP, _PSL, _HP = premise0.size()
- _BH, _HSL, _HH = hypothesis0.size()
- _BPL, _PLL = seq_len1.size()
- _HPL, _HLL = seq_len2.size()
-
- assert _BP == _BH and _BPL == _HPL and _BP == _BPL
- assert _HP == _HH
- assert _PSL == _PLL and _HSL == _HLL
-
- B, PL, H = premise0.size()
- B, HL, H = hypothesis0.size()
-
- a0 = self.encoder(self.drop(premise0)) # a0: [B, PL, H * 2]
- b0 = self.encoder(self.drop(hypothesis0)) # b0: [B, HL, H * 2]
-
- a = torch.mean(a0.view(B, PL, -1, H), dim=2) # a: [B, PL, H]
- b = torch.mean(b0.view(B, HL, -1, H), dim=2) # b: [B, HL, H]
-
- ai, bi = self.bi_attention(a, b, seq_len1, seq_len2)
-
- ma = torch.cat((a, ai, a - ai, a * ai), dim=2) # ma: [B, PL, 4 * H]
- mb = torch.cat((b, bi, b - bi, b * bi), dim=2) # mb: [B, HL, 4 * H]
-
- f_ma = self.inference_layer(ma)
- f_mb = self.inference_layer(mb)
-
- vat = self.decoder(self.drop(f_ma))
- vbt = self.decoder(self.drop(f_mb))
-
- va = torch.mean(vat.view(B, PL, -1, H), dim=2) # va: [B, PL, H]
- vb = torch.mean(vbt.view(B, HL, -1, H), dim=2) # vb: [B, HL, H]
-
- va_ave = self.mean_pooling(va, seq_len1, dim=1) # va_ave: [B, H]
- va_max, va_arg_max = self.max_pooling(va, seq_len1, dim=1) # va_max: [B, H]
- vb_ave = self.mean_pooling(vb, seq_len2, dim=1) # vb_ave: [B, H]
- vb_max, vb_arg_max = self.max_pooling(vb, seq_len2, dim=1) # vb_max: [B, H]
-
- v = torch.cat((va_ave, va_max, vb_ave, vb_max), dim=1) # v: [B, 4 * H]
-
- prediction = torch.tanh(self.output(v)) # prediction: [B, N]
-
- if target is not None:
- func = nn.CrossEntropyLoss()
- loss = func(prediction, target)
- return {Const.OUTPUT: prediction, Const.LOSS: loss}
-
- return {Const.OUTPUT: prediction}
-
- def predict(self, words1, words2, seq_len1=None, seq_len2=None, target=None):
- """ Predict function
-
- :param torch.Tensor words1: [batch size(B), premise seq len(PL)] premise的token表示
- :param torch.Tensor words2: [B, hypothesis seq len(HL)] hypothesis的token表示
- :param torch.LongTensor seq_len1: [B] premise的长度
- :param torch.LongTensor seq_len2: [B] hypothesis的长度
- :param torch.LongTensor target: [B] 真实目标值
- :return: dict prediction: [B, n_labels(N)] 预测结果
+        :param words1: [batch, seq_len] token ids of the premise
+        :param words2: [batch, seq_len] token ids of the hypothesis
+        :param seq_len1: [batch] true lengths of words1
+        :param seq_len2: [batch] true lengths of words2
+        :param target: [batch] gold labels, optional
+        :return: dict containing Const.OUTPUT logits, plus Const.LOSS when target is given
"""
- prediction = self.forward(words1, words2, seq_len1, seq_len2)[Const.OUTPUT]
- return {Const.OUTPUT: torch.argmax(prediction, dim=-1)}
+ mask1 = seq_len_to_mask(seq_len1, words1.size(1))
+ mask2 = seq_len_to_mask(seq_len2, words2.size(1))
+ a0 = self.embedding(words1) # B * len * emb_dim
+ b0 = self.embedding(words2)
+ a0, b0 = self.dropout_embed(a0), self.dropout_embed(b0)
+ a = self.rnn(a0, mask1.byte()) # a: [B, PL, 2 * H]
+ b = self.rnn(b0, mask2.byte())
+ # a = self.dropout_rnn(self.rnn(a0, seq_len1)[0]) # a: [B, PL, 2 * H]
+ # b = self.dropout_rnn(self.rnn(b0, seq_len2)[0])
+
+ ai, bi = self.bi_attention(a, mask1, b, mask2)
+
+ a_ = torch.cat((a, ai, a - ai, a * ai), dim=2) # ma: [B, PL, 8 * H]
+ b_ = torch.cat((b, bi, b - bi, b * bi), dim=2)
+ a_f = self.interfere(a_)
+ b_f = self.interfere(b_)
+
+ a_h = self.rnn_high(a_f, mask1.byte()) # ma: [B, PL, 2 * H]
+ b_h = self.rnn_high(b_f, mask2.byte())
+ # a_h = self.dropout_rnn(self.rnn_high(a_f, seq_len1)[0]) # ma: [B, PL, 2 * H]
+ # b_h = self.dropout_rnn(self.rnn_high(b_f, seq_len2)[0])
+
+ a_avg = self.mean_pooling(a_h, mask1, dim=1)
+ a_max, _ = self.max_pooling(a_h, mask1, dim=1)
+ b_avg = self.mean_pooling(b_h, mask2, dim=1)
+ b_max, _ = self.max_pooling(b_h, mask2, dim=1)
+
+ out = torch.cat((a_avg, a_max, b_avg, b_max), dim=1) # v: [B, 8 * H]
+ logits = torch.tanh(self.classifier(out))
+
+ if target is not None:
+ loss_fct = CrossEntropyLoss()
+ loss = loss_fct(logits, target)
+
+ return {Const.LOSS: loss, Const.OUTPUT: logits}
+ else:
+ return {Const.OUTPUT: logits}
+
+ def predict(self, **kwargs):
+ pred = self.forward(**kwargs)[Const.OUTPUT].argmax(-1)
+ return {Const.OUTPUT: pred}
+
+    # input: [batch_size, len, hidden]
+    # mask:  [batch_size, len] (1s for real tokens followed by 0s)
+ @staticmethod
+ def mean_pooling(input, mask, dim=1):
+ masks = mask.view(mask.size(0), mask.size(1), -1).float()
+ return torch.sum(input * masks, dim=dim) / torch.sum(masks, dim=1)
+
+ @staticmethod
+ def max_pooling(input, mask, dim=1):
+ my_inf = 10e12
+ masks = mask.view(mask.size(0), mask.size(1), -1)
+ masks = masks.expand(-1, -1, input.size(2)).float()
+ return torch.max(input + masks.le(0.5).float() * -my_inf, dim=dim)
+
+
+class EmbedDropout(nn.Dropout):
+
+ def forward(self, sequences_batch):
+ ones = sequences_batch.data.new_ones(sequences_batch.shape[0], sequences_batch.shape[-1])
+ dropout_mask = nn.functional.dropout(ones, self.p, self.training, inplace=False)
+ return dropout_mask.unsqueeze(1) * sequences_batch
+
+
+class BiRNN(nn.Module):
+ def __init__(self, input_size, hidden_size, dropout_rate=0.3):
+ super(BiRNN, self).__init__()
+ self.dropout_rate = dropout_rate
+ self.rnn = nn.LSTM(input_size, hidden_size,
+ num_layers=1,
+ bidirectional=True,
+ batch_first=True)
+
+ def forward(self, x, x_mask):
+ # Sort x
+ lengths = x_mask.data.eq(1).long().sum(1)
+ _, idx_sort = torch.sort(lengths, dim=0, descending=True)
+ _, idx_unsort = torch.sort(idx_sort, dim=0)
+ lengths = list(lengths[idx_sort])
+
+ x = x.index_select(0, idx_sort)
+ # Pack it up
+ rnn_input = nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True)
+ # Apply dropout to input
+ if self.dropout_rate > 0:
+ dropout_input = F.dropout(rnn_input.data, p=self.dropout_rate, training=self.training)
+ rnn_input = nn.utils.rnn.PackedSequence(dropout_input, rnn_input.batch_sizes)
+ output = self.rnn(rnn_input)[0]
+ # Unpack everything
+ output = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)[0]
+ output = output.index_select(0, idx_unsort)
+ if output.size(1) != x_mask.size(1):
+ padding = torch.zeros(output.size(0),
+ x_mask.size(1) - output.size(1),
+ output.size(2)).type(output.data.type())
+ output = torch.cat([output, padding], 1)
+ return output
+
+
+def masked_softmax(tensor, mask):
+ tensor_shape = tensor.size()
+ reshaped_tensor = tensor.view(-1, tensor_shape[-1])
+
+ # Reshape the mask so it matches the size of the input tensor.
+ while mask.dim() < tensor.dim():
+ mask = mask.unsqueeze(1)
+ mask = mask.expand_as(tensor).contiguous().float()
+ reshaped_mask = mask.view(-1, mask.size()[-1])
+ result = F.softmax(reshaped_tensor * reshaped_mask, dim=-1)
+ result = result * reshaped_mask
+ # 1e-13 is added to avoid divisions by zero.
+ result = result / (result.sum(dim=-1, keepdim=True) + 1e-13)
+ return result.view(*tensor_shape)
+
+
+def weighted_sum(tensor, weights, mask):
+ w_sum = weights.bmm(tensor)
+ while mask.dim() < w_sum.dim():
+ mask = mask.unsqueeze(1)
+ mask = mask.transpose(-1, -2)
+ mask = mask.expand_as(w_sum).contiguous().float()
+ return w_sum * mask
+
+
+class SoftmaxAttention(nn.Module):
+
+ def forward(self, premise_batch, premise_mask, hypothesis_batch, hypothesis_mask):
+ similarity_matrix = premise_batch.bmm(hypothesis_batch.transpose(2, 1)
+ .contiguous())
+
+ prem_hyp_attn = masked_softmax(similarity_matrix, hypothesis_mask)
+ hyp_prem_attn = masked_softmax(similarity_matrix.transpose(1, 2)
+ .contiguous(),
+ premise_mask)
+
+ attended_premises = weighted_sum(hypothesis_batch,
+ prem_hyp_attn,
+ premise_mask)
+ attended_hypotheses = weighted_sum(premise_batch,
+ hyp_prem_attn,
+ hypothesis_mask)
+
+ return attended_premises, attended_hypotheses
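
The attention helpers are plain tensor ops, so their effect is easy to check in isolation. The snippet below applies the same softmax-mask-renormalize recipe as `masked_softmax` to a single row of scores:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[1.0, 2.0, 3.0]])
mask = torch.tensor([[1.0, 1.0, 0.0]])  # the third position is padding

result = F.softmax(scores * mask, dim=-1)
result = result * mask
result = result / (result.sum(dim=-1, keepdim=True) + 1e-13)
print(result)  # tensor([[0.2689, 0.7311, 0.0000]]), the softmax over the two real positions
```
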
diff --git a/fastNLP/models/star_transformer.py b/fastNLP/models/star_transformer.py
index 4c944a54..bb91a5b6 100644
--- a/fastNLP/models/star_transformer.py
+++ b/fastNLP/models/star_transformer.py
@@ -47,7 +47,7 @@ class StarTransEnc(nn.Module):
self.embedding = get_embeddings(init_embed)
emb_dim = self.embedding.embedding_dim
self.emb_fc = nn.Linear(emb_dim, hidden_size)
- self.emb_drop = nn.Dropout(emb_dropout)
+ # self.emb_drop = nn.Dropout(emb_dropout)
self.encoder = StarTransformer(hidden_size=hidden_size,
num_layers=num_layers,
num_head=num_head,
@@ -65,7 +65,7 @@ class StarTransEnc(nn.Module):
[batch, hidden] 全局 relay 节点, 详见论文
"""
x = self.embedding(x)
- x = self.emb_fc(self.emb_drop(x))
+ x = self.emb_fc(x)
nodes, relay = self.encoder(x, mask)
return nodes, relay
@@ -205,7 +205,7 @@ class STSeqCls(nn.Module):
max_len=max_len,
emb_dropout=emb_dropout,
dropout=dropout)
- self.cls = _Cls(hidden_size, num_cls, cls_hidden_size)
+ self.cls = _Cls(hidden_size, num_cls, cls_hidden_size, dropout=dropout)
def forward(self, words, seq_len):
"""
diff --git a/fastNLP/modules/__init__.py b/fastNLP/modules/__init__.py
index 194fda4e..2cd2216c 100644
--- a/fastNLP/modules/__init__.py
+++ b/fastNLP/modules/__init__.py
@@ -1,11 +1,11 @@
"""
大部分用于的 NLP 任务神经网络都可以看做由编码 :mod:`~fastNLP.modules.encoder` 、
-聚合 :mod:`~fastNLP.modules.aggregator` 、解码 :mod:`~fastNLP.modules.decoder` 三种模块组成。
+解码 :mod:`~fastNLP.modules.decoder` 两种模块组成。
.. image:: figures/text_classification.png
:mod:`~fastNLP.modules` 中实现了 fastNLP 提供的诸多模块组件,可以帮助用户快速搭建自己所需的网络。
-三种模块的功能和常见组件如下:
+两种模块的功能和常见组件如下:
+-----------------------+-----------------------+-----------------------+
| module type | functionality | example |
@@ -13,9 +13,6 @@
| encoder | 将输入编码为具有具 | embedding, RNN, CNN, |
| | 有表示能力的向量 | transformer |
+-----------------------+-----------------------+-----------------------+
-| aggregator | 从多个向量中聚合信息 | self-attention, |
-| | | max-pooling |
-+-----------------------+-----------------------+-----------------------+
| decoder | 将具有某种表示意义的 | MLP, CRF |
| | 向量解码为需要的输出 | |
| | 形式 | |
@@ -46,10 +43,8 @@ __all__ = [
"allowed_transitions",
]
-from . import aggregator
from . import decoder
from . import encoder
-from .aggregator import *
from .decoder import *
from .dropout import TimestepDropout
from .encoder import *
diff --git a/fastNLP/modules/aggregator/__init__.py b/fastNLP/modules/aggregator/__init__.py
deleted file mode 100644
index a82138e7..00000000
--- a/fastNLP/modules/aggregator/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
-__all__ = [
- "MaxPool",
- "MaxPoolWithMask",
- "AvgPool",
-
- "MultiHeadAttention",
-]
-
-from .pooling import MaxPool
-from .pooling import MaxPoolWithMask
-from .pooling import AvgPool
-from .pooling import AvgPoolWithMask
-
-from .attention import MultiHeadAttention
diff --git a/fastNLP/modules/decoder/mlp.py b/fastNLP/modules/decoder/mlp.py
index c1579224..418b3a77 100644
--- a/fastNLP/modules/decoder/mlp.py
+++ b/fastNLP/modules/decoder/mlp.py
@@ -15,7 +15,8 @@ class MLP(nn.Module):
多层感知器
:param List[int] size_layer: 一个int的列表,用来定义MLP的层数,列表中的数字为每一层是hidden数目。MLP的层数为 len(size_layer) - 1
- :param Union[str,func,List[str]] activation: 一个字符串或者函数的列表,用来定义每一个隐层的激活函数,字符串包括relu,tanh和sigmoid,默认值为relu
+ :param Union[str,func,List[str]] activation: 一个字符串或者函数的列表,用来定义每一个隐层的激活函数,字符串包括relu,tanh和
+ sigmoid,默认值为relu
:param Union[str,func] output_activation: 字符串或者函数,用来定义输出层的激活函数,默认值为None,表示输出层没有激活函数
:param str initial_method: 参数初始化方式
:param float dropout: dropout概率,默认值为0
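
A minimal usage sketch of the decoder, assuming the constructor parameters documented above:

```python
import torch
from fastNLP.modules import MLP

mlp = MLP([256, 128, 10], activation='relu', dropout=0.1)  # 256 -> 128 -> 10
x = torch.randn(32, 256)
print(mlp(x).shape)  # torch.Size([32, 10])
```
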
diff --git a/fastNLP/modules/encoder/__init__.py b/fastNLP/modules/encoder/__init__.py
index 349bce69..7b5bc070 100644
--- a/fastNLP/modules/encoder/__init__.py
+++ b/fastNLP/modules/encoder/__init__.py
@@ -22,7 +22,14 @@ __all__ = [
"VarRNN",
"VarLSTM",
- "VarGRU"
+ "VarGRU",
+
+ "MaxPool",
+ "MaxPoolWithMask",
+ "AvgPool",
+ "AvgPoolWithMask",
+
+ "MultiHeadAttention",
]
from ._bert import BertModel
from .bert import BertWordPieceEncoder
@@ -34,3 +41,6 @@ from .lstm import LSTM
from .star_transformer import StarTransformer
from .transformer import TransformerEncoder
from .variational_rnn import VarRNN, VarLSTM, VarGRU
+
+from .pooling import MaxPool, MaxPoolWithMask, AvgPool, AvgPoolWithMask
+from .attention import MultiHeadAttention
diff --git a/fastNLP/modules/encoder/_bert.py b/fastNLP/modules/encoder/_bert.py
index 4669b511..61a5d7d1 100644
--- a/fastNLP/modules/encoder/_bert.py
+++ b/fastNLP/modules/encoder/_bert.py
@@ -26,6 +26,7 @@ import sys
CONFIG_FILE = 'bert_config.json'
+
class BertConfig(object):
"""Configuration class to store the configuration of a `BertModel`.
"""
@@ -339,13 +340,19 @@ class BertModel(nn.Module):
如果你想使用预训练好的权重矩阵,请在以下网址下载.
sources::
- 'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz",
- 'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz",
- 'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz",
- 'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz",
- 'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz",
- 'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz",
- 'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz",
+ 'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin",
+ 'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-pytorch_model.bin",
+ 'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin",
+ 'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-pytorch_model.bin",
+ 'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-pytorch_model.bin",
+ 'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-pytorch_model.bin",
+ 'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-pytorch_model.bin",
+ 'bert-base-german-cased': "https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-pytorch_model.bin",
+ 'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-pytorch_model.bin",
+ 'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-pytorch_model.bin",
+ 'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin",
+ 'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-pytorch_model.bin",
+ 'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-pytorch_model.bin"
用预训练权重矩阵来建立BERT模型::
@@ -562,6 +569,7 @@ class WordpieceTokenizer(object):
output_tokens.extend(sub_tokens)
return output_tokens
+
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
@@ -692,6 +700,7 @@ class BasicTokenizer(object):
output.append(char)
return "".join(output)
+
def _is_whitespace(char):
"""Checks whether `chars` is a whitespace character."""
# \t, \n, and \r are technically contorl characters but we treat them
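
For reference, `load_vocab` maps each line's token to its line index; roughly equivalent logic on a toy vocab file (the file contents here are made up):

```python
import collections
import tempfile

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('[PAD]\n[UNK]\nhello\nworld\n')
    vocab_file = f.name

vocab = collections.OrderedDict()
with open(vocab_file, encoding='utf-8') as reader:
    for index, token in enumerate(reader):
        vocab[token.strip()] = index
print(vocab)  # OrderedDict([('[PAD]', 0), ('[UNK]', 1), ('hello', 2), ('world', 3)])
```
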
diff --git a/fastNLP/modules/encoder/_elmo.py b/fastNLP/modules/encoder/_elmo.py
index 4ebee819..b887c6b1 100644
--- a/fastNLP/modules/encoder/_elmo.py
+++ b/fastNLP/modules/encoder/_elmo.py
@@ -1,22 +1,21 @@
-
"""
-这个页面的代码大量参考了https://github.com/HIT-SCIR/ELMoForManyLangs/tree/master/elmoformanylangs
+The code in this module borrows heavily from allenNLP.
"""
-
from typing import Optional, Tuple, List, Callable
import os
+
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import PackedSequence, pad_packed_sequence
from ...core.vocabulary import Vocabulary
import json
+import pickle
from ..utils import get_dropout_mask
import codecs
-from torch import autograd
class LstmCellWithProjection(torch.nn.Module):
"""
@@ -58,6 +57,7 @@ class LstmCellWithProjection(torch.nn.Module):
respectively. The first dimension is 1 in order to match the Pytorch
API for returning stacked LSTM states.
"""
+
def __init__(self,
input_size: int,
hidden_size: int,
@@ -129,13 +129,13 @@ class LstmCellWithProjection(torch.nn.Module):
# We have to use this '.data.new().fill_' pattern to create tensors with the correct
# type - forward has no knowledge of whether these are torch.Tensors or torch.cuda.Tensors.
output_accumulator = inputs.data.new(batch_size,
- total_timesteps,
- self.hidden_size).fill_(0)
+ total_timesteps,
+ self.hidden_size).fill_(0)
if initial_state is None:
full_batch_previous_memory = inputs.data.new(batch_size,
- self.cell_size).fill_(0)
+ self.cell_size).fill_(0)
full_batch_previous_state = inputs.data.new(batch_size,
- self.hidden_size).fill_(0)
+ self.hidden_size).fill_(0)
else:
full_batch_previous_state = initial_state[0].squeeze(0)
full_batch_previous_memory = initial_state[1].squeeze(0)
@@ -169,7 +169,7 @@ class LstmCellWithProjection(torch.nn.Module):
# Second conditional: Does the next shortest sequence beyond the current batch
# index require computation use this timestep?
while current_length_index < (len(batch_lengths) - 1) and \
- batch_lengths[current_length_index + 1] > index:
+ batch_lengths[current_length_index + 1] > index:
current_length_index += 1
# Actually get the slices of the batch which we
@@ -243,23 +243,23 @@ class LstmbiLm(nn.Module):
def __init__(self, config):
super(LstmbiLm, self).__init__()
self.config = config
- self.encoder = nn.LSTM(self.config['encoder']['projection_dim'],
- self.config['encoder']['dim'],
- num_layers=self.config['encoder']['n_layers'],
+ self.encoder = nn.LSTM(self.config['lstm']['projection_dim'],
+ self.config['lstm']['dim'],
+ num_layers=self.config['lstm']['n_layers'],
bidirectional=True,
batch_first=True,
dropout=self.config['dropout'])
- self.projection = nn.Linear(self.config['encoder']['dim'], self.config['encoder']['projection_dim'], bias=True)
+ self.projection = nn.Linear(self.config['lstm']['dim'], self.config['lstm']['projection_dim'], bias=True)
def forward(self, inputs, seq_len):
sort_lens, sort_idx = torch.sort(seq_len, dim=0, descending=True)
inputs = inputs[sort_idx]
inputs = nn.utils.rnn.pack_padded_sequence(inputs, sort_lens, batch_first=self.batch_first)
output, hx = self.encoder(inputs, None) # -> [N,L,C]
- output, _ = nn.util.rnn.pad_packed_sequence(output, batch_first=self.batch_first)
+ output, _ = nn.utils.rnn.pad_packed_sequence(output, batch_first=self.batch_first)
_, unsort_idx = torch.sort(sort_idx, dim=0, descending=False)
output = output[unsort_idx]
- forward, backward = output.split(self.config['encoder']['dim'], 2)
+ forward, backward = output.split(self.config['lstm']['dim'], 2)
return torch.cat([self.projection(forward), self.projection(backward)], dim=2)
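
Both `LstmbiLm` here and the `BiRNN` in `snli.py` rely on the same sort, pack, run, unpack, unsort idiom for variable-length batches; a self-contained version with a plain `nn.LSTM`:

```python
import torch
import torch.nn as nn

x = torch.randn(3, 6, 8)  # (batch, max_len, dim)
seq_len = torch.tensor([4, 6, 2])

sort_lens, sort_idx = torch.sort(seq_len, descending=True)
packed = nn.utils.rnn.pack_padded_sequence(x[sort_idx], sort_lens, batch_first=True)
out, _ = nn.LSTM(8, 5, batch_first=True)(packed)
out, _ = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
_, unsort_idx = torch.sort(sort_idx)
out = out[unsort_idx]  # restore the original batch order
print(out.shape)       # torch.Size([3, 6, 5])
```
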
@@ -267,13 +267,13 @@ class ElmobiLm(torch.nn.Module):
def __init__(self, config):
super(ElmobiLm, self).__init__()
self.config = config
- input_size = config['encoder']['projection_dim']
- hidden_size = config['encoder']['projection_dim']
- cell_size = config['encoder']['dim']
- num_layers = config['encoder']['n_layers']
- memory_cell_clip_value = config['encoder']['cell_clip']
- state_projection_clip_value = config['encoder']['proj_clip']
- recurrent_dropout_probability = config['dropout']
+ input_size = config['lstm']['projection_dim']
+ hidden_size = config['lstm']['projection_dim']
+ cell_size = config['lstm']['dim']
+ num_layers = config['lstm']['n_layers']
+ memory_cell_clip_value = config['lstm']['cell_clip']
+ state_projection_clip_value = config['lstm']['proj_clip']
+ recurrent_dropout_probability = 0.0
self.input_size = input_size
self.hidden_size = hidden_size
@@ -316,13 +316,13 @@ class ElmobiLm(torch.nn.Module):
:param seq_len: batch_size
:return: torch.FloatTensor. num_layers x batch_size x max_len x hidden_size
"""
+ max_len = inputs.size(1)
sort_lens, sort_idx = torch.sort(seq_len, dim=0, descending=True)
inputs = inputs[sort_idx]
inputs = nn.utils.rnn.pack_padded_sequence(inputs, sort_lens, batch_first=True)
output, _ = self._lstm_forward(inputs, None)
_, unsort_idx = torch.sort(sort_idx, dim=0, descending=False)
output = output[:, unsort_idx]
-
return output
def _lstm_forward(self,
@@ -399,7 +399,7 @@ class ElmobiLm(torch.nn.Module):
torch.cat([forward_state[1], backward_state[1]], -1)))
stacked_sequence_outputs: torch.FloatTensor = torch.stack(sequence_outputs)
- # Stack the hidden state and memory for each layer into 2 tensors of shape
+        # Stack the hidden state and memory for each layer into 2 tensors of shape
# (num_layers, batch_size, hidden_size) and (num_layers, batch_size, cell_size)
# respectively.
final_hidden_states, final_memory_states = zip(*final_states)
@@ -409,63 +409,30 @@ class ElmobiLm(torch.nn.Module):
return stacked_sequence_outputs, final_state_tuple
-class LstmTokenEmbedder(nn.Module):
- def __init__(self, config, word_emb_layer, char_emb_layer):
- super(LstmTokenEmbedder, self).__init__()
- self.config = config
- self.word_emb_layer = word_emb_layer
- self.char_emb_layer = char_emb_layer
- self.output_dim = config['encoder']['projection_dim']
- emb_dim = 0
- if word_emb_layer is not None:
- emb_dim += word_emb_layer.n_d
-
- if char_emb_layer is not None:
- emb_dim += char_emb_layer.n_d * 2
- self.char_lstm = nn.LSTM(char_emb_layer.n_d, char_emb_layer.n_d, num_layers=1, bidirectional=True,
- batch_first=True, dropout=config['dropout'])
-
- self.projection = nn.Linear(emb_dim, self.output_dim, bias=True)
-
- def forward(self, words, chars):
- embs = []
- if self.word_emb_layer is not None:
- if hasattr(self, 'words_to_words'):
- words = self.words_to_words[words]
- word_emb = self.word_emb_layer(words)
- embs.append(word_emb)
-
- if self.char_emb_layer is not None:
- batch_size, seq_len, _ = chars.shape
- chars = chars.view(batch_size * seq_len, -1)
- chars_emb = self.char_emb_layer(chars)
- # TODO 这里应该要考虑seq_len的问题
- _, (chars_outputs, __) = self.char_lstm(chars_emb)
- chars_outputs = chars_outputs.contiguous().view(-1, self.config['token_embedder']['char_dim'] * 2)
- embs.append(chars_outputs)
-
- token_embedding = torch.cat(embs, dim=2)
-
- return self.projection(token_embedding)
-
-
class ConvTokenEmbedder(nn.Module):
- def __init__(self, config, word_emb_layer, char_emb_layer):
+ def __init__(self, config, weight_file, word_emb_layer, char_emb_layer):
super(ConvTokenEmbedder, self).__init__()
- self.config = config
+ self.weight_file = weight_file
self.word_emb_layer = word_emb_layer
self.char_emb_layer = char_emb_layer
- self.output_dim = config['encoder']['projection_dim']
- self.emb_dim = 0
- if word_emb_layer is not None:
- self.emb_dim += word_emb_layer.weight.size(1)
+ self.output_dim = config['lstm']['projection_dim']
+ self._options = config
+
+ char_cnn_options = self._options['char_cnn']
+ if char_cnn_options['activation'] == 'tanh':
+ self.activation = torch.tanh
+ elif char_cnn_options['activation'] == 'relu':
+ self.activation = torch.nn.functional.relu
+ else:
+ raise Exception("Unknown activation")
if char_emb_layer is not None:
- self.convolutions = []
- cnn_config = config['token_embedder']
+ self.char_conv = []
+ cnn_config = config['char_cnn']
filters = cnn_config['filters']
- char_embed_dim = cnn_config['char_dim']
+ char_embed_dim = cnn_config['embedding']['dim']
+ convolutions = []
for i, (width, num) in enumerate(filters):
conv = torch.nn.Conv1d(
@@ -474,55 +441,56 @@ class ConvTokenEmbedder(nn.Module):
kernel_size=width,
bias=True
)
- self.convolutions.append(conv)
+ convolutions.append(conv)
+ self.add_module('char_conv_{}'.format(i), conv)
- self.convolutions = nn.ModuleList(self.convolutions)
+ self._convolutions = convolutions
- self.n_filters = sum(f[1] for f in filters)
- self.n_highway = cnn_config['n_highway']
+ n_filters = sum(f[1] for f in filters)
+ n_highway = cnn_config['n_highway']
- self.highways = Highway(self.n_filters, self.n_highway, activation=torch.nn.functional.relu)
- self.emb_dim += self.n_filters
+ self._highways = Highway(n_filters, n_highway, activation=torch.nn.functional.relu)
- self.projection = nn.Linear(self.emb_dim, self.output_dim, bias=True)
+ self._projection = torch.nn.Linear(n_filters, self.output_dim, bias=True)
def forward(self, words, chars):
- embs = []
- if self.word_emb_layer is not None:
- if hasattr(self, 'words_to_words'):
- words = self.words_to_words[words]
- word_emb = self.word_emb_layer(words)
- embs.append(word_emb)
-
- if self.char_emb_layer is not None:
- batch_size, seq_len, _ = chars.size()
- chars = chars.view(batch_size * seq_len, -1)
- character_embedding = self.char_emb_layer(chars)
- character_embedding = torch.transpose(character_embedding, 1, 2)
-
- cnn_config = self.config['token_embedder']
- if cnn_config['activation'] == 'tanh':
- activation = torch.nn.functional.tanh
- elif cnn_config['activation'] == 'relu':
- activation = torch.nn.functional.relu
- else:
- raise Exception("Unknown activation")
-
- convs = []
- for i in range(len(self.convolutions)):
- convolved = self.convolutions[i](character_embedding)
- # (batch_size * sequence_length, n_filters for this width)
- convolved, _ = torch.max(convolved, dim=-1)
- convolved = activation(convolved)
- convs.append(convolved)
- char_emb = torch.cat(convs, dim=-1)
- char_emb = self.highways(char_emb)
-
- embs.append(char_emb.view(batch_size, -1, self.n_filters))
-
- token_embedding = torch.cat(embs, dim=2)
-
- return self.projection(token_embedding)
+ """
+        :param words: unused here; kept for interface compatibility
+        :param chars: Tensor of shape ``(batch_size, sequence_length, 50)``
+        :return: Tensor of shape ``(batch_size, sequence_length, embedding_dim)``
+ """
+ # the character id embedding
+ # (batch_size * sequence_length, max_chars_per_token, embed_dim)
+ # character_embedding = torch.nn.functional.embedding(
+ # chars.view(-1, max_chars_per_token),
+ # self._char_embedding_weights
+ # )
+ batch_size, sequence_length, max_char_len = chars.size()
+ character_embedding = self.char_emb_layer(chars).reshape(batch_size * sequence_length, max_char_len, -1)
+ # run convolutions
+
+ # (batch_size * sequence_length, embed_dim, max_chars_per_token)
+ character_embedding = torch.transpose(character_embedding, 1, 2)
+ convs = []
+ for i in range(len(self._convolutions)):
+ conv = getattr(self, 'char_conv_{}'.format(i))
+ convolved = conv(character_embedding)
+ # (batch_size * sequence_length, n_filters for this width)
+ convolved, _ = torch.max(convolved, dim=-1)
+ convolved = self.activation(convolved)
+ convs.append(convolved)
+
+ # (batch_size * sequence_length, n_filters)
+ token_embedding = torch.cat(convs, dim=-1)
+
+ # apply the highway layers (batch_size * sequence_length, n_filters)
+ token_embedding = self._highways(token_embedding)
+
+ # final projection (batch_size * sequence_length, embedding_dim)
+ token_embedding = self._projection(token_embedding)
+
+        # reshape back to (batch_size, sequence_length, embedding_dim)
+ return token_embedding.view(batch_size, sequence_length, -1)
class Highway(torch.nn.Module):
@@ -543,6 +511,7 @@ class Highway(torch.nn.Module):
activation : ``Callable[[torch.Tensor], torch.Tensor]``, optional (default=``torch.nn.functional.relu``)
The non-linearity to use in the highway layers.
"""
+
def __init__(self,
input_dim: int,
num_layers: int = 1,
@@ -573,6 +542,7 @@ class Highway(torch.nn.Module):
current_input = gate * linear_part + (1 - gate) * nonlinear_part
return current_input
+
class _ElmoModel(nn.Module):
"""
该Module是ElmoEmbedding中进行所有的heavy lifting的地方。做的工作,包括
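
One highway step, as computed in `Highway.forward` above; how the projection is split into the nonlinear candidate and the carry gate is an assumption of this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 4
layer = nn.Linear(dim, dim * 2)  # plays the role of one of Highway's _layers
x = torch.randn(2, dim)

proj = layer(x)
nonlinear_part, gate_logits = proj[:, :dim], proj[:, dim:]
gate = torch.sigmoid(gate_logits)
out = gate * x + (1 - gate) * F.relu(nonlinear_part)  # carry the input vs. transform it
print(out.shape)  # torch.Size([2, 4])
```
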
@@ -582,10 +552,30 @@ class _ElmoModel(nn.Module):
(4) 设计一个保存token的embedding,允许缓存word的表示。
"""
- def __init__(self, model_dir:str, vocab:Vocabulary=None, cache_word_reprs:bool=False):
- super(_ElmoModel, self).__init__()
- config = json.load(open(os.path.join(model_dir, 'structure_config.json'), 'r'))
+ def __init__(self, model_dir: str, vocab: Vocabulary = None, cache_word_reprs: bool = False):
+ super(_ElmoModel, self).__init__()
+        self.model_dir = model_dir
+        walker = os.walk(self.model_dir)
+        config_file = None
+        weight_file = None
+        config_count = 0
+        weight_count = 0
+        for path, dir_list, file_list in walker:
+            for file_name in file_list:
+                if ".json" in file_name:
+                    config_file = file_name
+                    config_count += 1
+                elif ".pkl" in file_name:
+                    weight_file = file_name
+                    weight_count += 1
+            if config_count > 1 or weight_count > 1:
+                raise Exception(f"Multiple config files(*.json) or weight files(*.pkl) detected in {model_dir}.")
+            elif config_count == 0 or weight_count == 0:
+                raise Exception(f"No config file or weight file found in {model_dir}")
+
+ config = json.load(open(os.path.join(model_dir, config_file), 'r'))
+ self.weight_file = os.path.join(model_dir, weight_file)
self.config = config
OOV_TAG = '<oov>'
@@ -595,152 +585,103 @@ class _ElmoModel(nn.Module):
BOW_TAG = '<bow>'
EOW_TAG = '<eow>'
- # 将加载embedding放到这里
- token_embedder_states = torch.load(os.path.join(model_dir, 'token_embedder.pkl'), map_location='cpu')
-
- # For the model trained with word form word encoder.
- if config['token_embedder']['word_dim'] > 0:
- word_lexicon = {}
- with codecs.open(os.path.join(model_dir, 'word.dic'), 'r', encoding='utf-8') as fpi:
- for line in fpi:
- tokens = line.strip().split('\t')
- if len(tokens) == 1:
- tokens.insert(0, '\u3000')
- token, i = tokens
- word_lexicon[token] = int(i)
- # 做一些sanity check
- for special_word in [PAD_TAG, OOV_TAG, BOS_TAG, EOS_TAG]:
- assert special_word in word_lexicon, f"{special_word} not found in word.dic."
- # 根据vocab调整word_embedding
- pre_word_embedding = token_embedder_states.pop('word_emb_layer.embedding.weight')
- word_emb_layer = nn.Embedding(len(vocab)+2, config['token_embedder']['word_dim']) #多增加两个是为了与
- found_word_count = 0
- for word, index in vocab:
- if index == vocab.unknown_idx: # 因为fastNLP的unknow是 而在这里是所以ugly强制适配一下
- index_in_pre = word_lexicon[OOV_TAG]
- found_word_count += 1
- elif index == vocab.padding_idx: # 需要pad对齐
- index_in_pre = word_lexicon[PAD_TAG]
- found_word_count += 1
- elif word in word_lexicon:
- index_in_pre = word_lexicon[word]
- found_word_count += 1
- else:
- index_in_pre = word_lexicon[OOV_TAG]
- word_emb_layer.weight.data[index] = pre_word_embedding[index_in_pre]
- print(f"{found_word_count} out of {len(vocab)} words were found in pretrained elmo embedding.")
- word_emb_layer.weight.data[-1] = pre_word_embedding[word_lexicon[EOS_TAG]]
- word_emb_layer.weight.data[-2] = pre_word_embedding[word_lexicon[BOS_TAG]]
- self.word_vocab = vocab
- else:
- word_emb_layer = None
-
# For the model trained with character-based word encoder.
- if config['token_embedder']['char_dim'] > 0:
- char_lexicon = {}
- with codecs.open(os.path.join(model_dir, 'char.dic'), 'r', encoding='utf-8') as fpi:
- for line in fpi:
- tokens = line.strip().split('\t')
- if len(tokens) == 1:
- tokens.insert(0, '\u3000')
- token, i = tokens
- char_lexicon[token] = int(i)
- # 做一些sanity check
- for special_word in [PAD_TAG, OOV_TAG, BOW_TAG, EOW_TAG]:
- assert special_word in char_lexicon, f"{special_word} not found in char.dic."
- # 从vocab中构建char_vocab
- char_vocab = Vocabulary(unknown=OOV_TAG, padding=PAD_TAG)
- # 需要保证与在里面
- char_vocab.add_word(BOW_TAG)
- char_vocab.add_word(EOW_TAG)
- for word, index in vocab:
- char_vocab.add_word_lst(list(word))
- # 保证, 也在
- char_vocab.add_word_lst(list(BOS_TAG))
- char_vocab.add_word_lst(list(EOS_TAG))
- # 根据char_lexicon调整
- char_emb_layer = nn.Embedding(len(char_vocab), int(config['token_embedder']['char_dim']))
- pre_char_embedding = token_embedder_states.pop('char_emb_layer.embedding.weight')
- found_char_count = 0
- for char, index in char_vocab: # 调整character embedding
- if char in char_lexicon:
- index_in_pre = char_lexicon.get(char)
- found_char_count += 1
- else:
- index_in_pre = char_lexicon[OOV_TAG]
- char_emb_layer.weight.data[index] = pre_char_embedding[index_in_pre]
- print(f"{found_char_count} out of {len(char_vocab)} characters were found in pretrained elmo embedding.")
- # 生成words到chars的映射
- if config['token_embedder']['name'].lower() == 'cnn':
- max_chars = config['token_embedder']['max_characters_per_token']
- elif config['token_embedder']['name'].lower() == 'lstm':
- max_chars = max(map(lambda x: len(x[0]), vocab)) + 2 # 需要补充两个与
+ char_lexicon = {}
+ with codecs.open(os.path.join(model_dir, 'char.dic'), 'r', encoding='utf-8') as fpi:
+ for line in fpi:
+ tokens = line.strip().split('\t')
+ if len(tokens) == 1:
+ tokens.insert(0, '\u3000')
+ token, i = tokens
+ char_lexicon[token] = int(i)
+
+    # run some sanity checks on the special tokens
+ for special_word in [PAD_TAG, OOV_TAG, BOW_TAG, EOW_TAG]:
+ assert special_word in char_lexicon, f"{special_word} not found in char.dic."
+
+    # build char_vocab from vocab
+ char_vocab = Vocabulary(unknown=OOV_TAG, padding=PAD_TAG)
+    # make sure <bow>, <eow>, <bos> and <eos> are in char_vocab
+ char_vocab.add_word_lst([BOW_TAG, EOW_TAG, BOS_TAG, EOS_TAG])
+
+ for word, index in vocab:
+ char_vocab.add_word_lst(list(word))
+
+ self.bos_index, self.eos_index, self._pad_index = len(vocab), len(vocab) + 1, vocab.padding_idx
+    # align with char_lexicon; one extra slot is reserved for word padding (its char representation is all zeros)
+ char_emb_layer = nn.Embedding(len(char_vocab) + 1, int(config['char_cnn']['embedding']['dim']),
+ padding_idx=len(char_vocab))
+
+    # load the pretrained weights; elmo_model holds the state_dicts of both the char_cnn and the lstm
+ elmo_model = torch.load(os.path.join(self.model_dir, weight_file), map_location='cpu')
+
+ char_embed_weights = elmo_model["char_cnn"]['char_emb_layer.weight']
+
+ found_char_count = 0
+    for char, index in char_vocab:  # remap the character embedding
+ if char in char_lexicon:
+ index_in_pre = char_lexicon.get(char)
+ found_char_count += 1
else:
- raise ValueError('Unknown token_embedder: {0}'.format(config['token_embedder']['name']))
- # 增加, 所以加2.
- self.words_to_chars_embedding = nn.Parameter(torch.full((len(vocab)+2, max_chars),
- fill_value=char_vocab.to_index(PAD_TAG), dtype=torch.long),
- requires_grad=False)
- for word, index in vocab:
- if len(word)+2>max_chars:
- word = word[:max_chars-2]
- if index==vocab.padding_idx: # 如果是pad的话,需要和给定的对齐
- word = PAD_TAG
- elif index==vocab.unknown_idx:
- word = OOV_TAG
- char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(c) for c in word] + [char_vocab.to_index(EOW_TAG)]
- char_ids += [char_vocab.to_index(PAD_TAG)]*(max_chars-len(char_ids))
- self.words_to_chars_embedding[index] = torch.LongTensor(char_ids)
- for index, word in enumerate([BOS_TAG, EOS_TAG]): # 加上,
- if len(word)+2>max_chars:
- word = word[:max_chars-2]
- char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(c) for c in word] + [char_vocab.to_index(EOW_TAG)]
- char_ids += [char_vocab.to_index(PAD_TAG)]*(max_chars-len(char_ids))
- self.words_to_chars_embedding[index+len(vocab)] = torch.LongTensor(char_ids)
- self.char_vocab = char_vocab
- else:
- char_emb_layer = None
-
- if config['token_embedder']['name'].lower() == 'cnn':
- self.token_embedder = ConvTokenEmbedder(
- config, word_emb_layer, char_emb_layer)
- elif config['token_embedder']['name'].lower() == 'lstm':
- self.token_embedder = LstmTokenEmbedder(
- config, word_emb_layer, char_emb_layer)
- self.token_embedder.load_state_dict(token_embedder_states, strict=False)
- if config['token_embedder']['word_dim'] > 0 and vocab._no_create_word_length > 0: # 需要映射,使得来自于dev, test的idx指向unk
- words_to_words = nn.Parameter(torch.arange(len(vocab)+2).long(), requires_grad=False)
- for word, idx in vocab:
- if vocab._is_word_no_create_entry(word):
- words_to_words[idx] = vocab.unknown_idx
- setattr(self.token_embedder, 'words_to_words', words_to_words)
- self.output_dim = config['encoder']['projection_dim']
-
- if config['encoder']['name'].lower() == 'elmo':
- self.encoder = ElmobiLm(config)
- elif config['encoder']['name'].lower() == 'lstm':
- self.encoder = LstmbiLm(config)
- self.encoder.load_state_dict(torch.load(os.path.join(model_dir, 'encoder.pkl'),
- map_location='cpu'))
-
- self.bos_index = len(vocab)
- self.eos_index = len(vocab) + 1
- self._pad_index = vocab.padding_idx
+ index_in_pre = char_lexicon[OOV_TAG]
+ char_emb_layer.weight.data[index] = char_embed_weights[index_in_pre]
+
+ print(f"{found_char_count} out of {len(char_vocab)} characters were found in pretrained elmo embedding.")
+    # build the mapping from words to their character ids
+ max_chars = config['char_cnn']['max_characters_per_token']
+
+ self.words_to_chars_embedding = nn.Parameter(torch.full((len(vocab) + 2, max_chars),
+ fill_value=len(char_vocab),
+ dtype=torch.long),
+ requires_grad=False)
+ for word, index in list(iter(vocab)) + [(BOS_TAG, len(vocab)), (EOS_TAG, len(vocab) + 1)]:
+ if len(word) + 2 > max_chars:
+ word = word[:max_chars - 2]
+ if index == self._pad_index:
+ continue
+ elif word == BOS_TAG or word == EOS_TAG:
+ char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(word)] + [
+ char_vocab.to_index(EOW_TAG)]
+ char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids))
+ else:
+ char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(c) for c in word] + [
+ char_vocab.to_index(EOW_TAG)]
+ char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids))
+ self.words_to_chars_embedding[index] = torch.LongTensor(char_ids)
+
+ self.char_vocab = char_vocab
+
+ self.token_embedder = ConvTokenEmbedder(
+ config, self.weight_file, None, char_emb_layer)
+ elmo_model["char_cnn"]['char_emb_layer.weight'] = char_emb_layer.weight
+ self.token_embedder.load_state_dict(elmo_model["char_cnn"])
+
+ self.output_dim = config['lstm']['projection_dim']
+
+ # lstm encoder
+ self.encoder = ElmobiLm(config)
+ self.encoder.load_state_dict(elmo_model["lstm"])
if cache_word_reprs:
- if config['token_embedder']['char_dim']>0: # 只有在使用了chars的情况下有用
+        if config['char_cnn']['embedding']['dim'] > 0:  # only useful when characters are used
print("Start to generate cache word representations.")
batch_size = 320
- num_batches = self.words_to_chars_embedding.size(0)//batch_size + \
- int(self.words_to_chars_embedding.size(0)%batch_size!=0)
- self.cached_word_embedding = nn.Embedding(self.words_to_chars_embedding.size(0),
- config['encoder']['projection_dim'])
+                # the word count includes <bos> and <eos>
+ word_size = self.words_to_chars_embedding.size(0)
+ num_batches = word_size // batch_size + \
+ int(word_size % batch_size != 0)
+
+ self.cached_word_embedding = nn.Embedding(word_size,
+ config['lstm']['projection_dim'])
with torch.no_grad():
for i in range(num_batches):
- words = torch.arange(i*batch_size, min((i+1)*batch_size, self.words_to_chars_embedding.size(0))).long()
+ words = torch.arange(i * batch_size,
+ min((i + 1) * batch_size, word_size)).long()
chars = self.words_to_chars_embedding[words].unsqueeze(1) # batch_size x 1 x max_chars
- word_reprs = self.token_embedder(words.unsqueeze(1), chars).detach() # batch_size x 1 x config['encoder']['projection_dim']
+ word_reprs = self.token_embedder(words.unsqueeze(1),
+                                                     chars).detach()  # batch_size x 1 x config['lstm']['projection_dim']
self.cached_word_embedding.weight.data[words] = word_reprs.squeeze(1)
+
print("Finish generating cached word representations. Going to delete the character encoder.")
del self.token_embedder, self.words_to_chars_embedding
else:
@@ -758,8 +699,10 @@ class _ElmoModel(nn.Module):
seq_len = words.ne(self._pad_index).sum(dim=-1)
expanded_words[:, 1:-1] = words
expanded_words[:, 0].fill_(self.bos_index)
- expanded_words[torch.arange(batch_size).to(words), seq_len+1] = self.eos_index
+ expanded_words[torch.arange(batch_size).to(words), seq_len + 1] = self.eos_index
seq_len = seq_len + 2
+ zero_tensor = expanded_words.new_zeros(expanded_words.shape)
+ mask = (expanded_words == zero_tensor).unsqueeze(-1)
if hasattr(self, 'cached_word_embedding'):
token_embedding = self.cached_word_embedding(expanded_words)
else:
@@ -767,22 +710,19 @@ class _ElmoModel(nn.Module):
chars = self.words_to_chars_embedding[expanded_words]
else:
chars = None
- token_embedding = self.token_embedder(expanded_words, chars)
- if self.config['encoder']['name'] == 'elmo':
- encoder_output = self.encoder(token_embedding, seq_len)
- if encoder_output.size(2) < max_len+2:
- dummy_tensor = encoder_output.new_zeros(encoder_output.size(0), batch_size,
- max_len + 2 - encoder_output.size(2), encoder_output.size(-1))
- encoder_output = torch.cat([encoder_output, dummy_tensor], 2)
- sz = encoder_output.size() # 2, batch_size, max_len, hidden_size
- token_embedding = torch.cat([token_embedding, token_embedding], dim=2).view(1, sz[1], sz[2], sz[3])
- encoder_output = torch.cat([token_embedding, encoder_output], dim=0)
- elif self.config['encoder']['name'] == 'lstm':
- encoder_output = self.encoder(token_embedding, seq_len)
- else:
- raise ValueError('Unknown encoder: {0}'.format(self.config['encoder']['name']))
+ token_embedding = self.token_embedder(expanded_words, chars) # batch_size x max_len x embed_dim
+
+ encoder_output = self.encoder(token_embedding, seq_len)
+ if encoder_output.size(2) < max_len + 2:
+ num_layers, _, output_len, hidden_size = encoder_output.size()
+ dummy_tensor = encoder_output.new_zeros(num_layers, batch_size,
+ max_len + 2 - output_len, hidden_size)
+ encoder_output = torch.cat((encoder_output, dummy_tensor), 2)
+ sz = encoder_output.size() # 2, batch_size, max_len, hidden_size
+ token_embedding = token_embedding.masked_fill(mask, 0)
+ token_embedding = torch.cat((token_embedding, token_embedding), dim=2).view(1, sz[1], sz[2], sz[3])
+ encoder_output = torch.cat((token_embedding, encoder_output), dim=0)
        # drop <bos> and <eos>; the deletion here is not exact, but it should not affect the final result
encoder_output = encoder_output[:, :, 1:-1]
-
return encoder_output
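The rewritten `_ElmoModel` above keeps a frozen `words_to_chars_embedding` lookup that maps every word id to a padded character-id sequence. A minimal, self-contained sketch of that lookup, using a made-up toy character vocabulary whose tag names mirror the tags in the hunk:

```python
# Toy sketch of the word-to-chars lookup built in the hunk above; the
# vocabulary and ids here are made up for illustration.
import torch

max_chars = 7
char2id = {'<pad>': 0, '<bow>': 1, '<eow>': 2, 'c': 3, 'a': 4, 't': 5}

def word_to_char_ids(word):
    # <bow> c1 ... cn <eow>, right-padded with <pad> up to max_chars
    ids = [char2id['<bow>']] + [char2id[c] for c in word] + [char2id['<eow>']]
    ids += [char2id['<pad>']] * (max_chars - len(ids))
    return torch.tensor(ids, dtype=torch.long)

print(word_to_char_ids('cat'))  # tensor([1, 3, 4, 5, 2, 0, 0])
```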
diff --git a/fastNLP/modules/aggregator/attention.py b/fastNLP/modules/encoder/attention.py
similarity index 96%
rename from fastNLP/modules/aggregator/attention.py
rename to fastNLP/modules/encoder/attention.py
index 4101b033..0a42d889 100644
--- a/fastNLP/modules/aggregator/attention.py
+++ b/fastNLP/modules/encoder/attention.py
@@ -8,9 +8,9 @@ import torch
import torch.nn.functional as F
from torch import nn
-from ..dropout import TimestepDropout
+from fastNLP.modules.dropout import TimestepDropout
-from ..utils import initial_parameter
+from fastNLP.modules.utils import initial_parameter
class DotAttention(nn.Module):
@@ -19,7 +19,7 @@ class DotAttention(nn.Module):
    TODO: complete this docstring.
"""
- def __init__(self, key_size, value_size, dropout=0):
+ def __init__(self, key_size, value_size, dropout=0.0):
super(DotAttention, self).__init__()
self.key_size = key_size
self.value_size = value_size
@@ -37,7 +37,7 @@ class DotAttention(nn.Module):
"""
output = torch.matmul(Q, K.transpose(1, 2)) / self.scale
if mask_out is not None:
- output.masked_fill_(mask_out, -1e8)
+ output.masked_fill_(mask_out, -1e18)
output = self.softmax(output)
output = self.drop(output)
return torch.matmul(output, V)
@@ -45,8 +45,7 @@ class DotAttention(nn.Module):
class MultiHeadAttention(nn.Module):
"""
- 别名::class:`fastNLP.modules.MultiHeadAttention` :class:`fastNLP.modules.aggregator.attention.MultiHeadAttention`
-
+    Alias: :class:`fastNLP.modules.MultiHeadAttention` :class:`fastNLP.modules.encoder.attention.MultiHeadAttention`
    :param input_size: int, the dimension of the input; it is also the dimension of the output.
    :param key_size: int, the dimension of each head.
@@ -67,9 +66,8 @@ class MultiHeadAttention(nn.Module):
self.k_in = nn.Linear(input_size, in_size)
self.v_in = nn.Linear(input_size, in_size)
        # unlike the original comment suggested, dropout is now applied to the attention weights inside the dot-product
- self.attention = DotAttention(key_size=key_size, value_size=value_size, dropout=0)
+ self.attention = DotAttention(key_size=key_size, value_size=value_size, dropout=dropout)
self.out = nn.Linear(value_size * num_head, input_size)
- self.drop = TimestepDropout(dropout)
self.reset_parameters()
def reset_parameters(self):
@@ -105,7 +103,7 @@ class MultiHeadAttention(nn.Module):
# concat all heads, do output linear
atte = atte.permute(1, 2, 0, 3).contiguous().view(batch, sq, -1)
- output = self.drop(self.out(atte))
+ output = self.out(atte)
return output
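The two behavioural changes in this hunk are the larger mask fill value (-1e18 instead of -1e8) and moving dropout from the output projection into the attention weights. A standalone sketch of the masked scaled dot-product, showing that the large negative fill drives the masked positions to zero weight after softmax:

```python
# Sketch of the masked scaled dot-product used by DotAttention; -1e18 matches
# the fill value in the hunk above.
import torch
import torch.nn.functional as F

Q = torch.randn(2, 4, 8)           # batch x query_len x key_size
K = torch.randn(2, 4, 8)
V = torch.randn(2, 4, 8)
mask_out = torch.zeros(2, 4, 4, dtype=torch.bool)
mask_out[:, :, -1] = True          # hide the last key position

scores = torch.matmul(Q, K.transpose(1, 2)) / (8 ** 0.5)
scores = scores.masked_fill(mask_out, -1e18)
attn = F.softmax(scores, dim=-1)   # masked positions get ~0 weight
out = torch.matmul(attn, V)        # batch x query_len x value_size
print(attn[0, 0])                  # the last entry is 0
```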
diff --git a/fastNLP/modules/encoder/bert.py b/fastNLP/modules/encoder/bert.py
index 757973fe..1819cc69 100644
--- a/fastNLP/modules/encoder/bert.py
+++ b/fastNLP/modules/encoder/bert.py
@@ -2,35 +2,22 @@
import os
from torch import nn
import torch
-from ...io.file_utils import _get_base_url, cached_path
+from ...io.file_utils import _get_base_url, cached_path, PRETRAINED_BERT_MODEL_DIR
from ._bert import _WordPieceBertModel, BertModel
+
class BertWordPieceEncoder(nn.Module):
"""
    Loads a BERT model; after loading, call the index_dataset method to generate the word_pieces field in the dataset.
- :param fastNLP.Vocabulary vocab: 词表
    :param str model_dir_or_name: the directory of the model, or a model name. Defaults to ``en-base-uncased``
    :param str layers: which layers go into the final representation; comma-separated layer indices, negative indices count from the last layer
    :param bool requires_grad: whether gradients are required.
"""
- def __init__(self, model_dir_or_name:str='en-base-uncased', layers:str='-1',
- requires_grad:bool=False):
+ def __init__(self, model_dir_or_name: str='en-base-uncased', layers: str='-1',
+ requires_grad: bool=False):
super().__init__()
PRETRAIN_URL = _get_base_url('bert')
- PRETRAINED_BERT_MODEL_DIR = {'en': 'bert-base-cased-f89bfe08.zip',
- 'en-base-uncased': 'bert-base-uncased-3413b23c.zip',
- 'en-base-cased': 'bert-base-cased-f89bfe08.zip',
- 'en-large-uncased': 'bert-large-uncased-20939f45.zip',
- 'en-large-cased': 'bert-large-cased-e0cf90fc.zip',
-
- 'cn': 'bert-base-chinese-29d0a84a.zip',
- 'cn-base': 'bert-base-chinese-29d0a84a.zip',
-
- 'multilingual': 'bert-base-multilingual-cased-1bd364ee.zip',
- 'multilingual-base-uncased': 'bert-base-multilingual-uncased-f8730fe4.zip',
- 'multilingual-base-cased': 'bert-base-multilingual-cased-1bd364ee.zip',
- }
if model_dir_or_name in PRETRAINED_BERT_MODEL_DIR:
model_name = PRETRAINED_BERT_MODEL_DIR[model_dir_or_name]
@@ -89,4 +76,4 @@ class BertWordPieceEncoder(nn.Module):
outputs = self.model(word_pieces, token_type_ids)
outputs = torch.cat([*outputs], dim=-1)
- return outputs
\ No newline at end of file
+ return outputs
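With the model map moved into `PRETRAINED_BERT_MODEL_DIR` in `file_utils`, the encoder can be built from a model name alone. A hypothetical usage sketch (it would download the pretrained weights on first use; the import path is assumed from the file location above):

```python
# Hypothetical usage of BertWordPieceEncoder; weights download on first use.
from fastNLP.modules.encoder.bert import BertWordPieceEncoder

encoder = BertWordPieceEncoder(model_dir_or_name='en-base-uncased', layers='-1')
# encoder.index_dataset(data_set) would add a 'word_pieces' field, after which
# encoder(word_pieces) returns the concatenated layer representations.
```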
diff --git a/fastNLP/modules/encoder/embedding.py b/fastNLP/modules/encoder/embedding.py
index 005cfe75..050a423a 100644
--- a/fastNLP/modules/encoder/embedding.py
+++ b/fastNLP/modules/encoder/embedding.py
@@ -35,15 +35,15 @@ class Embedding(nn.Module):
    An Embedding component. Use self.num_embeddings to get the vocabulary size and self.embedding_dim to get the embedding dimension."""
- def __init__(self, init_embed, dropout=0.0, dropout_word=0, unk_index=None):
+ def __init__(self, init_embed, word_dropout=0, dropout=0.0, unk_index=None):
"""
        :param tuple(int,int),torch.FloatTensor,nn.Embedding,numpy.ndarray init_embed: the size of the Embedding (a tuple(int, int) whose
            first int is vocab_size and whose second int is embed_dim); if a Tensor, Embedding or ndarray is given, it initializes the Embedding directly;
- 也可以传入TokenEmbedding对象
+    :param float word_dropout: the probability of randomly replacing a word with unk_index, so that the unk token gets enough training;
+        it also has a regularizing effect on the network.
    :param float dropout: dropout applied to the output of the Embedding.
- :param float dropout_word: 按照一定比例随机将word设置为unk的idx,这样可以使得unk这个token得到足够的训练
- :param int unk_index: drop word时替换为的index,如果init_embed为TokenEmbedding不需要传入该值。
+    :param int unk_index: the index substituted when a word is dropped. fastNLP's Vocabulary uses 1 as unk_index by default.
"""
super(Embedding, self).__init__()
@@ -52,21 +52,21 @@ class Embedding(nn.Module):
self.dropout = nn.Dropout(dropout)
if not isinstance(self.embed, TokenEmbedding):
self._embed_size = self.embed.weight.size(1)
- if dropout_word>0 and not isinstance(unk_index, int):
+ if word_dropout>0 and not isinstance(unk_index, int):
raise ValueError("When drop word is set, you need to pass in the unk_index.")
else:
self._embed_size = self.embed.embed_size
unk_index = self.embed.get_word_vocab().unknown_idx
self.unk_index = unk_index
- self.dropout_word = dropout_word
+ self.word_dropout = word_dropout
def forward(self, x):
"""
:param torch.LongTensor x: [batch, seq_len]
:return: torch.Tensor : [batch, seq_len, embed_dim]
"""
- if self.dropout_word>0 and self.training:
- mask = torch.ones_like(x).float() * self.dropout_word
+ if self.word_dropout>0 and self.training:
+ mask = torch.ones_like(x).float() * self.word_dropout
            mask = torch.bernoulli(mask).byte()  # the larger word_dropout is, the more positions are set to 1
x = x.masked_fill(mask, self.unk_index)
x = self.embed(x)
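The renamed `word_dropout` works by sampling a Bernoulli mask at the dropout rate and overwriting the selected positions with `unk_index`, so the unk vector keeps receiving gradient during training. A standalone sketch of just that trick:

```python
# Standalone sketch of the word-dropout trick used above: sample a Bernoulli
# mask at rate word_dropout and replace the chosen ids with unk_index.
import torch

def drop_words(x, word_dropout=0.2, unk_index=1):
    mask = torch.bernoulli(torch.ones_like(x).float() * word_dropout).bool()
    return x.masked_fill(mask, unk_index)

x = torch.randint(2, 100, (2, 6))
print(drop_words(x))  # some ids become 1 (unk); more as the rate grows
```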
@@ -117,11 +117,38 @@ class Embedding(nn.Module):
class TokenEmbedding(nn.Module):
- def __init__(self, vocab):
+ def __init__(self, vocab, word_dropout=0.0, dropout=0.0):
super(TokenEmbedding, self).__init__()
- assert vocab.padding_idx is not None, "You vocabulary must have padding."
+ assert vocab.padding is not None, "Vocabulary must have a padding entry."
self._word_vocab = vocab
self._word_pad_index = vocab.padding_idx
+ if word_dropout>0:
+ assert vocab.unknown is not None, "Vocabulary must have unknown entry when you want to drop a word."
+ self.word_dropout = word_dropout
+ self._word_unk_index = vocab.unknown_idx
+ self.dropout_layer = nn.Dropout(dropout)
+
+ def drop_word(self, words):
+ """
+        Randomly replace word ids with unknown_index at the configured rate.
+
+ :param torch.LongTensor words: batch_size x max_len
+ :return:
+ """
+ if self.word_dropout > 0 and self.training:
+ mask = torch.ones_like(words).float() * self.word_dropout
+            mask = torch.bernoulli(mask).byte()  # the larger word_dropout is, the more positions are set to 1
+ words = words.masked_fill(mask, self._word_unk_index)
+ return words
+
+ def dropout(self, words):
+ """
+        Apply dropout to the embedded word representations.
+
+ :param torch.FloatTensor words: batch_size x max_len x embed_size
+ :return:
+ """
+ return self.dropout_layer(words)
@property
def requires_grad(self):
@@ -147,8 +174,16 @@ class TokenEmbedding(nn.Module):
def embed_size(self) -> int:
return self._embed_size
+ @property
+ def embedding_dim(self) -> int:
+ return self._embed_size
+
@property
def num_embedding(self) -> int:
+ """
+        Note: this value may be larger than the actual size of the embedding matrix.
+ :return:
+ """
return len(self._word_vocab)
def get_word_vocab(self):
@@ -163,6 +198,9 @@ class TokenEmbedding(nn.Module):
def size(self):
        return torch.Size([self.num_embedding, self._embed_size])
+ @abstractmethod
+ def forward(self, *input):
+ raise NotImplementedError
class StaticEmbedding(TokenEmbedding):
"""
@@ -179,15 +217,17 @@ class StaticEmbedding(TokenEmbedding):
    :param model_dir_or_name: two ways to refer to a pretrained static embedding: pass the file name of the embedding, or pass the
        name of an embedding. Currently supported names are {`en` or `en-glove-840b-300`: glove.840B.300d, `en-glove-6b-50`: glove.6B.50d,
        `en-word2vec-300`: GoogleNews-vectors-negative300}. In the latter case the cache is checked first and the model is downloaded automatically if absent.
- :param requires_grad: 是否需要gradient. 默认为True
- :param init_method: 如何初始化没有找到的值。可以使用torch.nn.init.*中各种方法。调用该方法时传入一个tensor对象。
- :param normailize: 是否对vector进行normalize,使得每个vector的norm为1。
+    :param bool requires_grad: whether gradients are required. Defaults to True.
+    :param callable init_method: how to initialize words that were not found. Any method from torch.nn.init.* may be used; it is called with a tensor argument.
+    :param bool lower: whether to lowercase the words in vocab before matching against the pretrained vocabulary. Set lower to False
+        if your vocabulary contains uppercase words, or if uppercase words should get their own vectors.
+    :param float word_dropout: the probability of replacing a word with unk; this both trains the unk token and acts as a regularizer.
+    :param float dropout: the dropout probability applied to the embedding output; 0.1 randomly zeroes 10% of the values.
+    :param bool normalize: whether to normalize every vector to unit norm.
"""
def __init__(self, vocab: Vocabulary, model_dir_or_name: str='en', requires_grad: bool=True, init_method=None,
- normalize=False):
- super(StaticEmbedding, self).__init__(vocab)
-
- # 优先定义需要下载的static embedding有哪些。这里估计需要自己搞一个server,
+ lower=False, dropout=0, word_dropout=0, normalize=False):
+ super(StaticEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout)
        # resolve the cache path
if model_dir_or_name.lower() in PRETRAIN_STATIC_FILES:
@@ -202,8 +242,40 @@ class StaticEmbedding(TokenEmbedding):
raise ValueError(f"Cannot recognize {model_dir_or_name}.")
        # load the embedding
- embedding = self._load_with_vocab(model_path, vocab=vocab, init_method=init_method,
- normalize=normalize)
+ if lower:
+ lowered_vocab = Vocabulary(padding=vocab.padding, unknown=vocab.unknown)
+ for word, index in vocab:
+ if not vocab._is_word_no_create_entry(word):
+                    lowered_vocab.add_word(word.lower())  # first add the words that need their own entries
+            for word in vocab._no_create_word.keys():  # words that do not need their own entries
+ if word in vocab:
+ lowered_word = word.lower()
+ if lowered_word not in lowered_vocab.word_count:
+ lowered_vocab.add_word(lowered_word)
+ lowered_vocab._no_create_word[lowered_word] += 1
+ print(f"All word in vocab have been lowered. There are {len(vocab)} words, {len(lowered_vocab)} unique lowered "
+ f"words.")
+ embedding = self._load_with_vocab(model_path, vocab=lowered_vocab, init_method=init_method,
+ normalize=normalize)
+            # adapt the index mapping to the original vocabulary
+ if not hasattr(self, 'words_to_words'):
+                self.words_to_words = torch.arange(len(lowered_vocab)).long()
+ if lowered_vocab.unknown:
+ unknown_idx = lowered_vocab.unknown_idx
+ else:
+                unknown_idx = embedding.size(0) - 1  # otherwise the last row is the unknown vector
+ words_to_words = nn.Parameter(torch.full((len(vocab),), fill_value=unknown_idx).long(),
+ requires_grad=False)
+ for word, index in vocab:
+ if word not in lowered_vocab:
+ word = word.lower()
+                    if lowered_vocab._is_word_no_create_entry(word):  # words without their own entries already default to unknown
+ continue
+ words_to_words[index] = self.words_to_words[lowered_vocab.to_index(word)]
+ self.words_to_words = words_to_words
+ else:
+ embedding = self._load_with_vocab(model_path, vocab=vocab, init_method=init_method,
+ normalize=normalize)
self.embedding = nn.Embedding(num_embeddings=embedding.shape[0], embedding_dim=embedding.shape[1],
padding_idx=vocab.padding_idx,
max_norm=None, norm_type=2, scale_grad_by_freq=False,
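The `lower=True` branch above loads vectors against a lowercased copy of the vocabulary and then redirects every original word id through a `words_to_words` index map. A toy, self-contained sketch of that remapping idea (the two dictionaries are made up):

```python
# Toy sketch of the words_to_words remapping: each original word id is
# redirected to the row of its lowercased form in the loaded matrix.
import torch

orig_vocab = {'<pad>': 0, '<unk>': 1, 'The': 2, 'the': 3, 'Cat': 4}
lowered    = {'<pad>': 0, '<unk>': 1, 'the': 2, 'cat': 3}

words_to_words = torch.full((len(orig_vocab),), lowered['<unk>'], dtype=torch.long)
for word, idx in orig_vocab.items():
    key = word if word in lowered else word.lower()
    if key in lowered:
        words_to_words[idx] = lowered[key]
print(words_to_words)  # tensor([0, 1, 2, 2, 3]); 'The' and 'the' share one row
```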
@@ -301,7 +373,7 @@ class StaticEmbedding(TokenEmbedding):
if vocab._no_create_word_length>0:
            if vocab.unknown is None:  # create a dedicated unknown entry
unknown_idx = len(matrix)
- vectors = torch.cat([vectors, torch.zeros(1, dim)], dim=0).contiguous()
+ vectors = torch.cat((vectors, torch.zeros(1, dim)), dim=0).contiguous()
else:
unknown_idx = vocab.unknown_idx
words_to_words = nn.Parameter(torch.full((len(vocab),), fill_value=unknown_idx).long(),
@@ -330,12 +402,15 @@ class StaticEmbedding(TokenEmbedding):
"""
if hasattr(self, 'words_to_words'):
words = self.words_to_words[words]
- return self.embedding(words)
+ words = self.drop_word(words)
+ words = self.embedding(words)
+ words = self.dropout(words)
+ return words
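A hedged usage sketch of the extended `StaticEmbedding` signature; the sentence is made up, and `'en-glove-6b-50'` is one of the names listed in the docstring above (the weights download on first use):

```python
from fastNLP import Vocabulary

vocab = Vocabulary()
vocab.add_word_lst("The cat sat on the mat".split())
embed = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50',
                        lower=True, word_dropout=0.01, dropout=0.3)
# embed(words) now applies word dropout, the lookup, then feature dropout.
```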
class ContextualEmbedding(TokenEmbedding):
- def __init__(self, vocab: Vocabulary):
- super(ContextualEmbedding, self).__init__(vocab)
+ def __init__(self, vocab: Vocabulary, word_dropout:float=0.0, dropout:float=0.0):
+ super(ContextualEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout)
def add_sentence_cache(self, *datasets, batch_size=32, device='cpu', delete_weights: bool=True):
"""
@@ -438,19 +513,17 @@ class ElmoEmbedding(ContextualEmbedding):
    :param model_dir_or_name: two ways to refer to a pretrained ELMo embedding: pass the file name of the ELMo weights, or pass the name of an ELMo release.
        Currently supported releases are {`en`: English ELMo, `cn`: Chinese ELMo}. In the latter case the cache is checked first and the model is downloaded automatically if absent.
    :param layers: str, which layers to return, with the layers separated by commas. '2' returns the second layer, '1,2' the last two. The results of the chosen layers are
- 按照这个顺序concat起来。默认为'2'。
- :param requires_grad: bool, 该层是否需要gradient. 默认为False
+        concatenated in this order. Defaults to '2'. 'mix' combines the representations of different layers with learnable weights (whether the weights are trainable follows requires_grad;
+        they are initialized to mean-pool the three layers; ElmoEmbedding.set_mix_weights_requires_grad() can make only the mix weights trainable.)
+    :param requires_grad: bool, whether this layer requires gradients. Defaults to False.
+    :param float word_dropout: the probability of replacing a word with unk; this both trains the unk token and acts as a regularizer.
+    :param float dropout: the dropout probability applied to the embedding output; 0.1 randomly zeroes 10% of the values.
    :param cache_word_reprs: whether to cache word representations. If True, an embedding is generated for every word at initialization
        and the character encoder is deleted; the cached embeddings are used directly afterwards. Defaults to False.
"""
- def __init__(self, vocab: Vocabulary, model_dir_or_name: str='en',
- layers: str='2', requires_grad: bool=False, cache_word_reprs: bool=False):
- super(ElmoEmbedding, self).__init__(vocab)
- layers = list(map(int, layers.split(',')))
- assert len(layers) > 0, "Must choose one output"
- for layer in layers:
- assert 0 <= layer <= 2, "Layer index should be in range [0, 2]."
- self.layers = layers
+ def __init__(self, vocab: Vocabulary, model_dir_or_name: str='en', layers: str='2', requires_grad: bool=False,
+ word_dropout=0.0, dropout=0.0, cache_word_reprs: bool=False):
+ super(ElmoEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout)
        # check model_dir_or_name and download the model if it is missing
if model_dir_or_name.lower() in PRETRAINED_ELMO_MODEL_DIR:
@@ -464,8 +537,49 @@ class ElmoEmbedding(ContextualEmbedding):
else:
raise ValueError(f"Cannot recognize {model_dir_or_name}.")
self.model = _ElmoModel(model_dir, vocab, cache_word_reprs=cache_word_reprs)
+
+ if layers=='mix':
+ self.layer_weights = nn.Parameter(torch.zeros(self.model.config['lstm']['n_layers']+1),
+ requires_grad=requires_grad)
+ self.gamma = nn.Parameter(torch.ones(1), requires_grad=requires_grad)
+ self._get_outputs = self._get_mixed_outputs
+ self._embed_size = self.model.config['lstm']['projection_dim'] * 2
+ else:
+ layers = list(map(int, layers.split(',')))
+ assert len(layers) > 0, "Must choose one output"
+ for layer in layers:
+ assert 0 <= layer <= 2, "Layer index should be in range [0, 2]."
+ self.layers = layers
+ self._get_outputs = self._get_layer_outputs
+ self._embed_size = len(self.layers) * self.model.config['lstm']['projection_dim'] * 2
+
self.requires_grad = requires_grad
- self._embed_size = len(self.layers) * self.model.config['encoder']['projection_dim'] * 2
+
+ def _get_mixed_outputs(self, outputs):
+ # outputs: num_layers x batch_size x max_len x hidden_size
+ # return: batch_size x max_len x hidden_size
+ weights = F.softmax(self.layer_weights+1/len(outputs), dim=0).to(outputs)
+ outputs = torch.einsum('l,lbij->bij', weights, outputs)
+ return self.gamma.to(outputs)*outputs
+
+ def set_mix_weights_requires_grad(self, flag=True):
+ """
+        When layers was set to 'mix' at initialization, this method controls whether the mix weights are trainable. If layers is not
+        'mix', calling this method has no effect.
+        :param bool flag: whether the weights mixing the representations of different layers are trainable.
+ :return:
+ """
+ if hasattr(self, 'layer_weights'):
+ self.layer_weights.requires_grad = flag
+ self.gamma.requires_grad = flag
+
+ def _get_layer_outputs(self, outputs):
+ if len(self.layers) == 1:
+ outputs = outputs[self.layers[0]]
+ else:
+ outputs = torch.cat(tuple([*outputs[self.layers]]), dim=-1)
+
+ return outputs
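`_get_mixed_outputs` implements the usual ELMo scalar mix: softmax-normalized per-layer weights followed by a learned global scale `gamma`. A standalone numerical sketch of the same combination:

```python
# Standalone sketch of the 'mix' combination: softmax layer weights, then a
# weighted sum over the layer axis, then a learned scale gamma.
import torch
import torch.nn.functional as F

outputs = torch.randn(3, 2, 5, 10)   # layers x batch x len x hidden
layer_weights = torch.zeros(3)       # zeros give a uniform initial mix
gamma = torch.ones(1)

weights = F.softmax(layer_weights, dim=0)
mixed = torch.einsum('l,lbij->bij', weights, outputs)
print((gamma * mixed).shape)         # torch.Size([2, 5, 10])
```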
def forward(self, words: torch.LongTensor):
"""
@@ -476,19 +590,18 @@ class ElmoEmbedding(ContextualEmbedding):
:param words: batch_size x max_len
:return: torch.FloatTensor. batch_size x max_len x (512*len(self.layers))
"""
+ words = self.drop_word(words)
outputs = self._get_sent_reprs(words)
if outputs is not None:
- return outputs
+ return self.dropout(outputs)
outputs = self.model(words)
- if len(self.layers) == 1:
- outputs = outputs[self.layers[0]]
- else:
- outputs = torch.cat([*outputs[self.layers]], dim=-1)
-
- return outputs
+ outputs = self._get_outputs(outputs)
+ return self.dropout(outputs)
def _delete_model_weights(self):
- del self.layers, self.model
+ for name in ['layers', 'model', 'layer_weights', 'gamma']:
+ if hasattr(self, name):
+ delattr(self, name)
@property
def requires_grad(self):
@@ -529,13 +642,16 @@ class BertEmbedding(ContextualEmbedding):
    :param str layers: which layers go into the final representation; comma-separated layer indices, negative indices count from the last layer.
    :param str pool_method: in BERT every word is represented by several word pieces; pool_method decides how a word's representation
        is computed from its word pieces. Supports ``last``, ``first``, ``avg``, ``max``.
+    :param float word_dropout: the probability of replacing a word with unk; this both trains the unk token and acts as a regularizer.
+    :param float dropout: the dropout probability applied to the embedding output; 0.1 randomly zeroes 10% of the values.
    :param bool include_cls_sep: when BERT computes a sentence representation, [CLS] and [SEP] are added at the ends; this flag controls whether
        they are kept in the result, which makes the word embedding two tokens longer than the input and may cause problems with :class::StackEmbedding.
    :param bool requires_grad: whether gradients are required.
"""
def __init__(self, vocab: Vocabulary, model_dir_or_name: str='en-base-uncased', layers: str='-1',
- pool_method: str='first', include_cls_sep: bool=False, requires_grad: bool=False):
- super(BertEmbedding, self).__init__(vocab)
+ pool_method: str='first', word_dropout=0, dropout=0, requires_grad: bool=False,
+ include_cls_sep: bool=False):
+ super(BertEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout)
        # check model_dir_or_name and download the model if it is missing
if model_dir_or_name.lower() in PRETRAINED_BERT_MODEL_DIR:
@@ -566,13 +682,14 @@ class BertEmbedding(ContextualEmbedding):
:param torch.LongTensor words: [batch_size, max_len]
:return: torch.FloatTensor. batch_size x max_len x (768*len(self.layers))
"""
+ words = self.drop_word(words)
outputs = self._get_sent_reprs(words)
if outputs is not None:
- return outputs
+            return self.dropout(outputs)
outputs = self.model(words)
outputs = torch.cat([*outputs], dim=-1)
- return outputs
+        return self.dropout(outputs)
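Note the two returns above must pass `outputs`, not `words`, through the dropout layer; dropping out the raw word ids would be a type error as well as a logic error. Separately, `pool_method` decides how a word's several word-piece vectors collapse into one word vector; a standalone sketch of the four options on a toy tensor:

```python
# Sketch of the pool_method options for a word split into 3 word pieces.
import torch

pieces = torch.randn(3, 768)        # word_pieces x hidden
first = pieces[0]                   # 'first'
last = pieces[-1]                   # 'last'
avg = pieces.mean(dim=0)            # 'avg'
mx = pieces.max(dim=0).values       # 'max'
print(first.shape, last.shape, avg.shape, mx.shape)
```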
@property
def requires_grad(self):
@@ -614,8 +731,8 @@ class CNNCharEmbedding(TokenEmbedding):
"""
    Alias: :class:`fastNLP.modules.CNNCharEmbedding` :class:`fastNLP.modules.encoder.embedding.CNNCharEmbedding`
- 使用CNN生成character embedding。CNN的结果为, embed(x) -> Dropout(x) -> CNN(x) -> activation(x) -> pool
- -> fc. 不同的kernel大小的fitler结果是concat起来的。
+    Generates character-level word embeddings with a CNN. The pipeline is embed(x) -> Dropout(x) -> CNN(x) -> activation(x) -> pool -> fc -> Dropout.
+    The outputs of filters with different kernel sizes are concatenated.
Example::
@@ -625,23 +742,24 @@ class CNNCharEmbedding(TokenEmbedding):
    :param vocab: the vocabulary
    :param embed_size: the size of this word embedding. Defaults to 50.
    :param char_emb_size: the size of the character embedding; characters are collected from vocab. Defaults to 50.
- :param dropout: 以多大的概率drop
+    :param float word_dropout: the probability of replacing a word with unk; this both trains the unk token and acts as a regularizer.
+    :param float dropout: the dropout probability
    :param filter_nums: the numbers of filters; must have the same length as kernels. Defaults to [40, 30, 20].
    :param kernel_sizes: the kernel sizes. Defaults to [5, 3, 1].
    :param pool_method: the pooling method used when merging character representations into one; supports 'avg' and 'max'.
    :param activation: the activation applied after the CNN; supports 'relu', 'sigmoid', 'tanh' or a custom callable.
    :param min_char_freq: the minimum character frequency. Defaults to 2.
"""
- def __init__(self, vocab: Vocabulary, embed_size: int=50, char_emb_size: int=50, dropout:float=0.5,
- filter_nums: List[int]=(40, 30, 20), kernel_sizes: List[int]=(5, 3, 1), pool_method: str='max',
- activation='relu', min_char_freq: int=2):
- super(CNNCharEmbedding, self).__init__(vocab)
+ def __init__(self, vocab: Vocabulary, embed_size: int=50, char_emb_size: int=50, word_dropout:float=0,
+ dropout:float=0.5, filter_nums: List[int]=(40, 30, 20), kernel_sizes: List[int]=(5, 3, 1),
+ pool_method: str='max', activation='relu', min_char_freq: int=2):
+ super(CNNCharEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout)
for kernel in kernel_sizes:
assert kernel % 2 == 1, "Only odd kernel is allowed."
assert pool_method in ('max', 'avg')
- self.dropout = nn.Dropout(dropout, inplace=True)
+ self.dropout = nn.Dropout(dropout)
self.pool_method = pool_method
# activation function
if isinstance(activation, str):
@@ -691,6 +809,7 @@ class CNNCharEmbedding(TokenEmbedding):
:param words: [batch_size, max_len]
:return: [batch_size, max_len, embed_size]
"""
+ words = self.drop_word(words)
batch_size, max_len = words.size()
chars = self.words_to_chars_embedding[words] # batch_size x max_len x max_word_len
word_lengths = self.word_lengths[words] # batch_size x max_len
@@ -699,7 +818,7 @@ class CNNCharEmbedding(TokenEmbedding):
        # positions equal to 1 are masked out
        chars_masks = chars.eq(self.char_pad_index)  # batch_size x max_len x max_word_len; 1 marks the padding positions
chars = self.char_embedding(chars) # batch_size x max_len x max_word_len x embed_size
- self.dropout(chars)
+ chars = self.dropout(chars)
reshaped_chars = chars.reshape(batch_size*max_len, max_word_len, -1)
reshaped_chars = reshaped_chars.transpose(1, 2) # B' x E x M
conv_chars = [conv(reshaped_chars).transpose(1, 2).reshape(batch_size, max_len, max_word_len, -1)
@@ -713,7 +832,7 @@ class CNNCharEmbedding(TokenEmbedding):
conv_chars = conv_chars.masked_fill(chars_masks.unsqueeze(-1), 0)
chars = torch.sum(conv_chars, dim=-2)/chars_masks.eq(0).sum(dim=-1, keepdim=True).float()
chars = self.fc(chars)
- return chars
+ return self.dropout(chars)
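A hedged usage sketch of `CNNCharEmbedding` with the new `word_dropout`/`dropout` arguments; the sentence and hyperparameters are illustrative and follow the defaults documented above:

```python
from fastNLP import Vocabulary

vocab = Vocabulary()
vocab.add_word_lst("this is a demo sentence".split())
char_embed = CNNCharEmbedding(vocab, embed_size=50, char_emb_size=50,
                              word_dropout=0.01, dropout=0.3,
                              filter_nums=[40, 30, 20], kernel_sizes=[5, 3, 1])
# char_embed(words) -> [batch_size, max_len, 50]
```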
@property
def requires_grad(self):
@@ -760,6 +879,7 @@ class LSTMCharEmbedding(TokenEmbedding):
    :param vocab: the vocabulary
    :param embed_size: the size of the embedding. Defaults to 50.
    :param char_emb_size: the size of the character embedding. Defaults to 50.
+    :param float word_dropout: the probability of replacing a word with unk; this both trains the unk token and acts as a regularizer.
    :param dropout: the dropout probability
    :param hidden_size: the hidden size of the LSTM; halved when bidirectional. Defaults to 50.
    :param pool_method: supports 'max' and 'avg'
@@ -767,15 +887,16 @@ class LSTMCharEmbedding(TokenEmbedding):
    :param min_char_freq: the minimum character frequency. Defaults to 2.
    :param bidirectional: whether to encode with a bidirectional LSTM. Defaults to True.
"""
- def __init__(self, vocab: Vocabulary, embed_size: int=50, char_emb_size: int=50, dropout:float=0.5, hidden_size=50,
- pool_method: str='max', activation='relu', min_char_freq: int=2, bidirectional=True):
+ def __init__(self, vocab: Vocabulary, embed_size: int=50, char_emb_size: int=50, word_dropout:float=0,
+                 dropout:float=0.5, hidden_size=50, pool_method: str='max', activation='relu', min_char_freq: int=2,
+ bidirectional=True):
        super(LSTMCharEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout)
        assert hidden_size % 2 == 0, "Only an even hidden_size is allowed."
assert pool_method in ('max', 'avg')
self.pool_method = pool_method
- self.dropout = nn.Dropout(dropout, inplace=True)
+ self.dropout = nn.Dropout(dropout)
# activation function
if isinstance(activation, str):
if activation.lower() == 'relu':
@@ -824,6 +945,7 @@ class LSTMCharEmbedding(TokenEmbedding):
:param words: [batch_size, max_len]
:return: [batch_size, max_len, embed_size]
"""
+ words = self.drop_word(words)
batch_size, max_len = words.size()
chars = self.words_to_chars_embedding[words] # batch_size x max_len x max_word_len
word_lengths = self.word_lengths[words] # batch_size x max_len
@@ -848,7 +970,7 @@ class LSTMCharEmbedding(TokenEmbedding):
chars = self.fc(chars)
- return chars
+ return self.dropout(chars)
@property
def requires_grad(self):
@@ -887,17 +1009,21 @@ class StackEmbedding(TokenEmbedding):
    :param embeds: a list of TokenEmbedding objects; every TokenEmbedding must share the same vocabulary
+    :param float word_dropout: the probability of replacing a word with unk; this both trains the unk token and acts as a regularizer. The member embeddings
+        see the same positions replaced by unknown. If dropout is set here, the member embeddings should not set dropout again.
+    :param float dropout: the dropout probability applied to the embedding output; 0.1 randomly zeroes 10% of the values.
"""
- def __init__(self, embeds: List[TokenEmbedding]):
+ def __init__(self, embeds: List[TokenEmbedding], word_dropout=0, dropout=0):
vocabs = []
for embed in embeds:
- vocabs.append(embed.get_word_vocab())
+ if hasattr(embed, 'get_word_vocab'):
+ vocabs.append(embed.get_word_vocab())
_vocab = vocabs[0]
for vocab in vocabs[1:]:
- assert vocab == _vocab, "All embeddings should use the same word vocabulary."
+ assert vocab == _vocab, "All embeddings in StackEmbedding should use the same word vocabulary."
- super(StackEmbedding, self).__init__(_vocab)
+ super(StackEmbedding, self).__init__(_vocab, word_dropout=word_dropout, dropout=dropout)
assert isinstance(embeds, list)
for embed in embeds:
assert isinstance(embed, TokenEmbedding), "Only TokenEmbedding type is supported."
@@ -949,7 +1075,9 @@ class StackEmbedding(TokenEmbedding):
        :return: the shape of the output depends on the embeddings that make up this StackEmbedding
"""
outputs = []
+ words = self.drop_word(words)
for embed in self.embeds:
outputs.append(embed(words))
- return torch.cat(outputs, dim=-1)
+ outputs = self.dropout(torch.cat(outputs, dim=-1))
+ return outputs
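A hedged sketch of composing embeddings with the updated `StackEmbedding`; both members must share one vocabulary, and the word-level dropout runs once, before the member embeddings:

```python
from fastNLP import Vocabulary

vocab = Vocabulary()
vocab.add_word_lst("stack word and char features".split())
word_embed = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50')
char_embed = CNNCharEmbedding(vocab, embed_size=30)
stacked = StackEmbedding([word_embed, char_embed], word_dropout=0.01, dropout=0.3)
# stacked(words) -> [batch_size, max_len, 50 + 30]
```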
diff --git a/fastNLP/modules/aggregator/pooling.py b/fastNLP/modules/encoder/pooling.py
similarity index 96%
rename from fastNLP/modules/aggregator/pooling.py
rename to fastNLP/modules/encoder/pooling.py
index 51438aae..8337fe32 100644
--- a/fastNLP/modules/aggregator/pooling.py
+++ b/fastNLP/modules/encoder/pooling.py
@@ -1,7 +1,8 @@
__all__ = [
"MaxPool",
"MaxPoolWithMask",
- "AvgPool"
+ "AvgPool",
+ "AvgPoolWithMask"
]
import torch
import torch.nn as nn
@@ -9,7 +10,7 @@ import torch.nn as nn
class MaxPool(nn.Module):
"""
- 别名::class:`fastNLP.modules.MaxPool` :class:`fastNLP.modules.aggregator.pooling.MaxPool`
+    Alias: :class:`fastNLP.modules.MaxPool` :class:`fastNLP.modules.encoder.pooling.MaxPool`
    A max-pooling module.
@@ -58,7 +59,7 @@ class MaxPool(nn.Module):
class MaxPoolWithMask(nn.Module):
"""
- 别名::class:`fastNLP.modules.MaxPoolWithMask` :class:`fastNLP.modules.aggregator.pooling.MaxPoolWithMask`
+    Alias: :class:`fastNLP.modules.MaxPoolWithMask` :class:`fastNLP.modules.encoder.pooling.MaxPoolWithMask`
    Max pooling with a mask matrix; positions where the mask is 0 are ignored during pooling.
"""
@@ -98,7 +99,7 @@ class KMaxPool(nn.Module):
class AvgPool(nn.Module):
"""
- 别名::class:`fastNLP.modules.AvgPool` :class:`fastNLP.modules.aggregator.pooling.AvgPool`
+    Alias: :class:`fastNLP.modules.AvgPool` :class:`fastNLP.modules.encoder.pooling.AvgPool`
    Given input of shape [batch_size, max_len, hidden_size], applies average pooling over the time dimension; the output is [batch_size, hidden_size].
"""
@@ -125,7 +126,7 @@ class AvgPool(nn.Module):
class AvgPoolWithMask(nn.Module):
"""
- 别名::class:`fastNLP.modules.AvgPoolWithMask` :class:`fastNLP.modules.aggregator.pooling.AvgPoolWithMask`
+    Alias: :class:`fastNLP.modules.AvgPoolWithMask` :class:`fastNLP.modules.encoder.pooling.AvgPoolWithMask`
    Given input of shape [batch_size, max_len, hidden_size], applies average pooling over the time dimension; the output is [batch_size, hidden_size].
    Only positions where the mask is 1 are considered during pooling.
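The newly exported `AvgPoolWithMask` computes a masked mean: sum only the valid time steps, then divide by the count of valid positions. A standalone sketch of the arithmetic:

```python
# Standalone sketch of masked average pooling as exported above.
import torch

x = torch.randn(2, 5, 8)                      # batch x max_len x hidden
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]]).float()
pooled = (x * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)
print(pooled.shape)                           # torch.Size([2, 8])
```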
diff --git a/fastNLP/modules/encoder/star_transformer.py b/fastNLP/modules/encoder/star_transformer.py
index 1eec7c13..097fbebb 100644
--- a/fastNLP/modules/encoder/star_transformer.py
+++ b/fastNLP/modules/encoder/star_transformer.py
@@ -34,12 +34,14 @@ class StarTransformer(nn.Module):
super(StarTransformer, self).__init__()
self.iters = num_layers
- self.norm = nn.ModuleList([nn.LayerNorm(hidden_size) for _ in range(self.iters)])
+ self.norm = nn.ModuleList([nn.LayerNorm(hidden_size, eps=1e-6) for _ in range(self.iters)])
+ # self.emb_fc = nn.Conv2d(hidden_size, hidden_size, 1)
+ self.emb_drop = nn.Dropout(dropout)
self.ring_att = nn.ModuleList(
- [_MSA1(hidden_size, nhead=num_head, head_dim=head_dim, dropout=dropout)
+ [_MSA1(hidden_size, nhead=num_head, head_dim=head_dim, dropout=0.0)
for _ in range(self.iters)])
self.star_att = nn.ModuleList(
- [_MSA2(hidden_size, nhead=num_head, head_dim=head_dim, dropout=dropout)
+ [_MSA2(hidden_size, nhead=num_head, head_dim=head_dim, dropout=0.0)
for _ in range(self.iters)])
if max_len is not None:
@@ -66,18 +68,19 @@ class StarTransformer(nn.Module):
smask = torch.cat([torch.zeros(B, 1, ).byte().to(mask), mask], 1)
embs = data.permute(0, 2, 1)[:, :, :, None] # B H L 1
- if self.pos_emb:
+        if self.pos_emb and False:  # NOTE: the positional embedding is deliberately disabled here
P = self.pos_emb(torch.arange(L, dtype=torch.long, device=embs.device) \
.view(1, L)).permute(0, 2, 1).contiguous()[:, :, :, None] # 1 H L 1
embs = embs + P
-
+ embs = norm_func(self.emb_drop, embs)
nodes = embs
relay = embs.mean(2, keepdim=True)
ex_mask = mask[:, None, :, None].expand(B, H, L, 1)
r_embs = embs.view(B, H, 1, L)
for i in range(self.iters):
ax = torch.cat([r_embs, relay.expand(B, H, 1, L)], 2)
- nodes = nodes + F.leaky_relu(self.ring_att[i](norm_func(self.norm[i], nodes), ax=ax))
+ nodes = F.leaky_relu(self.ring_att[i](norm_func(self.norm[i], nodes), ax=ax))
+ #nodes = F.leaky_relu(self.ring_att[i](nodes, ax=ax))
relay = F.leaky_relu(self.star_att[i](relay, torch.cat([relay, nodes], 2), smask))
nodes = nodes.masked_fill_(ex_mask, 0)
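The rewritten loop normalizes node states with `norm_func` before each ring-attention step. The helper itself is outside this hunk; the sketch below is an assumption about its shape handling (LayerNorm over the hidden dimension of a B x H x L x 1 tensor), not the repository's exact implementation:

```python
# Assumed sketch of norm_func: apply a LayerNorm over the hidden (H) axis of a
# B x H x L x 1 tensor by permuting H to the last position and back.
import torch
from torch import nn

def norm_func(f, x):                      # x: B x H x L x 1
    return f(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

x = torch.randn(2, 16, 7, 1)
ln = nn.LayerNorm(16, eps=1e-6)
print(norm_func(ln, x).shape)             # torch.Size([2, 16, 7, 1])
```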
diff --git a/fastNLP/modules/encoder/transformer.py b/fastNLP/modules/encoder/transformer.py
index 698ff95c..d6bf2f1e 100644
--- a/fastNLP/modules/encoder/transformer.py
+++ b/fastNLP/modules/encoder/transformer.py
@@ -3,7 +3,7 @@ __all__ = [
]
from torch import nn
-from ..aggregator.attention import MultiHeadAttention
+from fastNLP.modules.encoder.attention import MultiHeadAttention
from ..dropout import TimestepDropout
diff --git a/legacy/api/api.py b/legacy/api/api.py
index d5d1df6b..1408731f 100644
--- a/legacy/api/api.py
+++ b/legacy/api/api.py
@@ -8,7 +8,8 @@ import os
from fastNLP.core.dataset import DataSet
from .utils import load_url
from .processor import ModelProcessor
-from fastNLP.io.dataset_loader import _cut_long_sentence, ConllLoader
+from fastNLP.io.dataset_loader import _cut_long_sentence
+from fastNLP.io.data_loader import ConllLoader
from fastNLP.core.instance import Instance
from ..api.pipeline import Pipeline
from fastNLP.core.metrics import SpanFPreRecMetric
diff --git a/reproduction/CNN-sentence_classification/.gitignore b/reproduction/CNN-sentence_classification/.gitignore
deleted file mode 100644
index 4ae0ed76..00000000
--- a/reproduction/CNN-sentence_classification/.gitignore
+++ /dev/null
@@ -1,110 +0,0 @@
-# Byte-compiled / optimized / DLL files
-__pycache__/
-*.py[cod]
-*$py.class
-
-# C extensions
-*.so
-
-# Distribution / packaging
-.Python
-build/
-develop-eggs/
-dist/
-downloads/
-eggs/
-.eggs/
-lib/
-lib64/
-parts/
-sdist/
-var/
-wheels/
-*.egg-info/
-.installed.cfg
-*.egg
-MANIFEST
-
-# PyInstaller
-# Usually these files are written by a python script from a template
-# before PyInstaller builds the exe, so as to inject date/other infos into it.
-*.manifest
-*.spec
-
-# Installer logs
-pip-log.txt
-pip-delete-this-directory.txt
-
-# Unit test / coverage reports
-htmlcov/
-.tox/
-.coverage
-.coverage.*
-.cache
-nosetests.xml
-coverage.xml
-*.cover
-.hypothesis/
-.pytest_cache/
-
-# Translations
-*.mo
-*.pot
-
-# Django stuff:
-*.log
-local_settings.py
-db.sqlite3
-
-# Flask stuff:
-instance/
-.webassets-cache
-
-# Scrapy stuff:
-.scrapy
-
-# Sphinx documentation
-docs/_build/
-
-# PyBuilder
-target/
-
-# Jupyter Notebook
-.ipynb_checkpoints
-
-# pyenv
-.python-version
-
-# celery beat schedule file
-celerybeat-schedule
-
-# SageMath parsed files
-*.sage.py
-
-# Environments
-.env
-.venv
-env/
-venv/
-ENV/
-env.bak/
-venv.bak/
-
-# Spyder project settings
-.spyderproject
-.spyproject
-
-# Rope project settings
-.ropeproject
-
-# mkdocs documentation
-/site
-
-# mypy
-.mypy_cache
-
-#custom
-GoogleNews-vectors-negative300.bin/
-GoogleNews-vectors-negative300.bin.gz
-models/
-*.swp
diff --git a/reproduction/CNN-sentence_classification/README.md b/reproduction/CNN-sentence_classification/README.md
deleted file mode 100644
index ee752779..00000000
--- a/reproduction/CNN-sentence_classification/README.md
+++ /dev/null
@@ -1,77 +0,0 @@
-## Introduction
-This is the implementation of [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882) paper in PyTorch.
-* MRDataset, non-static-model(word2vec rained by Mikolov etal. (2013) on 100 billion words of Google News)
-* It can be run in both CPU and GPU
-* The best accuracy is 82.61%, which is better than 81.5% in the paper
-(by Jingyuan Liu @Fudan University; Email:(fdjingyuan@outlook.com) Welcome to discussion!)
-
-## Requirement
-* python 3.6
-* pytorch > 0.1
-* numpy
-* gensim
-
-## Run
-STEP 1
-install packages like gensim (other needed pakages is the same)
-```
-pip install gensim
-```
-
-STEP 2
-install MRdataset and word2vec resources
-* MRdataset: you can download the dataset in (https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz)
-* word2vec: you can download the file in (https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit)
-
-Since this file is more than 1.5G, I did not display in folders. If you download the file, please remember modify the path in Function def word_embeddings(path = './GoogleNews-vectors-negative300.bin/'):
-
-
-STEP 3
-train the model
-```
-python train.py
-```
-you will get the information printed in the screen, like
-```
-Epoch [1/20], Iter [100/192] Loss: 0.7008
-Test Accuracy: 71.869159 %
-Epoch [2/20], Iter [100/192] Loss: 0.5957
-Test Accuracy: 75.700935 %
-Epoch [3/20], Iter [100/192] Loss: 0.4934
-Test Accuracy: 78.130841 %
-
-......
-Epoch [20/20], Iter [100/192] Loss: 0.0364
-Test Accuracy: 81.495327 %
-Best Accuracy: 82.616822 %
-Best Model: models/cnn.pkl
-```
-
-## Hyperparameters
-According to the paper and experiment, I set:
-
-|Epoch|Kernel Size|dropout|learning rate|batch size|
-|---|---|---|---|---|
-|20|\(h,300,100\)|0.5|0.0001|50|
-
-h = [3,4,5]
-If the accuracy is not improved, the learning rate will \*0.8.
-
-## Result
-I just tried one dataset : MR. (Other 6 dataset in paper SST-1, SST-2, TREC, CR, MPQA)
-There are four models in paper: CNN-rand, CNN-static, CNN-non-static, CNN-multichannel.
-I have tried CNN-non-static:A model with pre-trained vectors from word2vec.
-All words—including the unknown ones that are randomly initialized and the pretrained vectors are fine-tuned for each task
-(which has almost the best performance and the most difficut to implement among the four models)
-
-|Dataset|Class Size|Best Result|Kim's Paper Result|
-|---|---|---|---|
-|MR|2|82.617%(CNN-non-static)|81.5%(CNN-nonstatic)|
-
-
-
-## Reference
-* [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882)
-* https://github.com/Shawn1993/cnn-text-classification-pytorch
-* https://github.com/junwang4/CNN-sentence-classification-pytorch-2017/blob/master/utils.py
-
diff --git a/reproduction/CNN-sentence_classification/dataset.py b/reproduction/CNN-sentence_classification/dataset.py
deleted file mode 100644
index 4cbe17a4..00000000
--- a/reproduction/CNN-sentence_classification/dataset.py
+++ /dev/null
@@ -1,136 +0,0 @@
-import codecs
-import random
-import re
-
-import gensim
-import numpy as np
-from gensim import corpora
-from torch.utils.data import Dataset
-
-
-def clean_str(string):
- """
- Tokenization/string cleaning for all datasets except for SST.
- Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
- """
- string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
- string = re.sub(r"\'s", " \'s", string)
- string = re.sub(r"\'ve", " \'ve", string)
- string = re.sub(r"n\'t", " n\'t", string)
- string = re.sub(r"\'re", " \'re", string)
- string = re.sub(r"\'d", " \'d", string)
- string = re.sub(r"\'ll", " \'ll", string)
- string = re.sub(r",", " , ", string)
- string = re.sub(r"!", " ! ", string)
- string = re.sub(r"\(", " \( ", string)
- string = re.sub(r"\)", " \) ", string)
- string = re.sub(r"\?", " \? ", string)
- string = re.sub(r"\s{2,}", " ", string)
- return string.strip()
-
-
-def pad_sentences(sentence, padding_word=" "):
- sequence_length = 64
- sent = sentence.split()
- padded_sentence = sentence + padding_word * (sequence_length - len(sent))
- return padded_sentence
-
-
-# data loader
-class MRDataset(Dataset):
- def __init__(self):
-
- # load positive and negative sentenses from files
- with codecs.open("./rt-polaritydata/rt-polarity.pos", encoding='ISO-8859-1') as f:
- positive_examples = list(f.readlines())
- with codecs.open("./rt-polaritydata/rt-polarity.neg", encoding='ISO-8859-1') as f:
- negative_examples = list(f.readlines())
- # s.strip: clear "\n"; clear_str; pad
- positive_examples = [pad_sentences(clean_str(s.strip())) for s in positive_examples]
- negative_examples = [pad_sentences(clean_str(s.strip())) for s in negative_examples]
- self.examples = positive_examples + negative_examples
- self.sentences_texts = [sample.split() for sample in self.examples]
-
- # word dictionary
- dictionary = corpora.Dictionary(self.sentences_texts)
- self.word2id_dict = dictionary.token2id # transform to dict, like {"human":0, "a":1,...}
-
- # set lables: postive is 1; negative is 0
- positive_labels = [1 for _ in positive_examples]
- negative_labels = [0 for _ in negative_examples]
- self.lables = positive_labels + negative_labels
- examples_lables = list(zip(self.examples, self.lables))
- random.shuffle(examples_lables)
- self.MRDataset_frame = examples_lables
-
- # transform word to id
- self.MRDataset_wordid = \
- [(
- np.array([self.word2id_dict[word] for word in sent[0].split()], dtype=np.int64),
- sent[1]
- ) for sent in self.MRDataset_frame]
-
- def word_embeddings(self, path="./GoogleNews-vectors-negative300.bin/GoogleNews-vectors-negative300.bin"):
- # establish from google
- model = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)
-
- print('Please wait ... (it could take a while to load the file : {})'.format(path))
- word_dict = self.word2id_dict
- embedding_weights = np.random.uniform(-0.25, 0.25, (len(self.word2id_dict), 300))
-
- for word in word_dict:
- word_id = word_dict[word]
- if word in model.wv.vocab:
- embedding_weights[word_id, :] = model[word]
- return embedding_weights
-
- def __len__(self):
- return len(self.MRDataset_frame)
-
- def __getitem__(self, idx):
-
- sample = self.MRDataset_wordid[idx]
- return sample
-
- def getsent(self, idx):
-
- sample = self.MRDataset_wordid[idx][0]
- return sample
-
- def getlabel(self, idx):
-
- label = self.MRDataset_wordid[idx][1]
- return label
-
- def word2id(self):
-
- return self.word2id_dict
-
- def id2word(self):
-
- id2word_dict = dict([val, key] for key, val in self.word2id_dict.items())
- return id2word_dict
-
-
-class train_set(Dataset):
-
- def __init__(self, samples):
- self.train_frame = samples
-
- def __len__(self):
- return len(self.train_frame)
-
- def __getitem__(self, idx):
- return self.train_frame[idx]
-
-
-class test_set(Dataset):
-
- def __init__(self, samples):
- self.test_frame = samples
-
- def __len__(self):
- return len(self.test_frame)
-
- def __getitem__(self, idx):
- return self.test_frame[idx]
diff --git a/reproduction/CNN-sentence_classification/model.py b/reproduction/CNN-sentence_classification/model.py
deleted file mode 100644
index 0aca34c7..00000000
--- a/reproduction/CNN-sentence_classification/model.py
+++ /dev/null
@@ -1,42 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-
-class CNN_text(nn.Module):
- def __init__(self, kernel_h=[3, 4, 5], kernel_num=100, embed_num=1000, embed_dim=300, num_classes=2, dropout=0.5,
- L2_constrain=3,
- pretrained_embeddings=None):
- super(CNN_text, self).__init__()
-
- self.embedding = nn.Embedding(embed_num, embed_dim)
- self.dropout = nn.Dropout(dropout)
- if pretrained_embeddings is not None:
- self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))
-
- # the network structure
- # Conv2d: input- N,C,H,W output- (50,100,62,1)
- self.conv1 = nn.ModuleList([nn.Conv2d(1, kernel_num, (K, embed_dim)) for K in kernel_h])
- self.fc1 = nn.Linear(len(kernel_h) * kernel_num, num_classes)
-
- def max_pooling(self, x):
- x = F.relu(self.conv1(x)).squeeze(3) # N,C,L - (50,100,62)
- x = F.max_pool1d(x, x.size(2)).squeeze(2)
- # x.size(2)=62 squeeze: (50,100,1) -> (50,100)
- return x
-
- def forward(self, x):
- x = self.embedding(x) # output: (N,H,W) = (50,64,300)
- x = x.unsqueeze(1) # (N,C,H,W)
- x = [F.relu(conv(x)).squeeze(3) for conv in self.conv1] # [N, C, H(50,100,62),(50,100,61),(50,100,60)]
- x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x] # [N,C(50,100),(50,100),(50,100)]
- x = torch.cat(x, 1)
- x = self.dropout(x)
- x = self.fc1(x)
- return x
-
-
-if __name__ == '__main__':
- model = CNN_text(kernel_h=[1, 2, 3, 4], embed_num=3, embed_dim=2)
- x = torch.LongTensor([[1, 2, 1, 2, 0]])
- print(model(x))
diff --git a/reproduction/CNN-sentence_classification/train.py b/reproduction/CNN-sentence_classification/train.py
deleted file mode 100644
index 6e35ee5e..00000000
--- a/reproduction/CNN-sentence_classification/train.py
+++ /dev/null
@@ -1,92 +0,0 @@
-import os
-
-import torch
-import torch.nn as nn
-from torch.autograd import Variable
-
-from . import dataset as dst
-from .model import CNN_text
-
-# Hyper Parameters
-batch_size = 50
-learning_rate = 0.0001
-num_epochs = 20
-cuda = True
-
-# split Dataset
-dataset = dst.MRDataset()
-length = len(dataset)
-
-train_dataset = dataset[:int(0.9 * length)]
-test_dataset = dataset[int(0.9 * length):]
-
-train_dataset = dst.train_set(train_dataset)
-test_dataset = dst.test_set(test_dataset)
-
-# Data Loader
-train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
- batch_size=batch_size,
- shuffle=True)
-
-test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
- batch_size=batch_size,
- shuffle=False)
-
-# cnn
-
-cnn = CNN_text(embed_num=len(dataset.word2id()), pretrained_embeddings=dataset.word_embeddings())
-if cuda:
- cnn.cuda()
-
-# Loss and Optimizer
-criterion = nn.CrossEntropyLoss()
-optimizer = torch.optim.Adam(cnn.parameters(), lr=learning_rate)
-
-# train and test
-best_acc = None
-
-for epoch in range(num_epochs):
- # Train the Model
- cnn.train()
- for i, (sents, labels) in enumerate(train_loader):
- sents = Variable(sents)
- labels = Variable(labels)
- if cuda:
- sents = sents.cuda()
- labels = labels.cuda()
- optimizer.zero_grad()
- outputs = cnn(sents)
- loss = criterion(outputs, labels)
- loss.backward()
- optimizer.step()
-
- if (i + 1) % 100 == 0:
- print('Epoch [%d/%d], Iter [%d/%d] Loss: %.4f'
- % (epoch + 1, num_epochs, i + 1, len(train_dataset) // batch_size, loss.data[0]))
-
- # Test the Model
- cnn.eval()
- correct = 0
- total = 0
- for sents, labels in test_loader:
- sents = Variable(sents)
- if cuda:
- sents = sents.cuda()
- labels = labels.cuda()
- outputs = cnn(sents)
- _, predicted = torch.max(outputs.data, 1)
- total += labels.size(0)
- correct += (predicted == labels).sum()
- acc = 100. * correct / total
- print('Test Accuracy: %f %%' % (acc))
-
- if best_acc is None or acc > best_acc:
- best_acc = acc
- if os.path.exists("models") is False:
- os.makedirs("models")
- torch.save(cnn.state_dict(), 'models/cnn.pkl')
- else:
- learning_rate = learning_rate * 0.8
-
-print("Best Accuracy: %f %%" % best_acc)
-print("Best Model: models/cnn.pkl")
diff --git a/reproduction/Char-aware_NLM/LICENSE b/reproduction/Char-aware_NLM/LICENSE
deleted file mode 100644
index 9689f68b..00000000
--- a/reproduction/Char-aware_NLM/LICENSE
+++ /dev/null
@@ -1,21 +0,0 @@
-MIT License
-
-Copyright (c) 2017
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
\ No newline at end of file
diff --git a/reproduction/Char-aware_NLM/README.md b/reproduction/Char-aware_NLM/README.md
deleted file mode 100644
index 4bb06386..00000000
--- a/reproduction/Char-aware_NLM/README.md
+++ /dev/null
@@ -1,40 +0,0 @@
-
-# PyTorch-Character-Aware-Neural-Language-Model
-
-This is the PyTorch implementation of character-aware neural language model proposed in this [paper](https://arxiv.org/abs/1508.06615) by Yoon Kim.
-
-## Requiredments
-The code is run and tested with **Python 3.5.2** and **PyTorch 0.3.1**.
-
-## HyperParameters
-| HyperParam | value |
-| ------ | :-------|
-| LSTM batch size | 20 |
-| LSTM sequence length | 35 |
-| LSTM hidden units | 300 |
-| epochs | 35 |
-| initial learning rate | 1.0 |
-| character embedding dimension | 15 |
-
-## Demo
-Train the model with split train/valid/test data.
-
-`python train.py`
-
-The trained model will saved in `cache/net.pkl`.
-Test the model.
-
-`python test.py`
-
-Best result on test set:
-PPl=127.2163
-cross entropy loss=4.8459
-
-## Acknowledgement
-This implementation borrowed ideas from
-
-https://github.com/jarfo/kchar
-
-https://github.com/cronos123/Character-Aware-Neural-Language-Models
-
-
diff --git a/reproduction/Char-aware_NLM/main.py b/reproduction/Char-aware_NLM/main.py
deleted file mode 100644
index 6467d98d..00000000
--- a/reproduction/Char-aware_NLM/main.py
+++ /dev/null
@@ -1,9 +0,0 @@
-PICKLE = "./save/"
-
-
-def train():
- pass
-
-
-if __name__ == "__main__":
- train()
diff --git a/reproduction/Char-aware_NLM/model.py b/reproduction/Char-aware_NLM/model.py
deleted file mode 100644
index 7880d6eb..00000000
--- a/reproduction/Char-aware_NLM/model.py
+++ /dev/null
@@ -1,145 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-
-class Highway(nn.Module):
- """Highway network"""
-
- def __init__(self, input_size):
- super(Highway, self).__init__()
- self.fc1 = nn.Linear(input_size, input_size, bias=True)
- self.fc2 = nn.Linear(input_size, input_size, bias=True)
-
- def forward(self, x):
- t = F.sigmoid(self.fc1(x))
- return torch.mul(t, F.relu(self.fc2(x))) + torch.mul(1 - t, x)
-
-
-class charLM(nn.Module):
- """CNN + highway network + LSTM
- # Input:
- 4D tensor with shape [batch_size, in_channel, height, width]
- # Output:
- 2D Tensor with shape [batch_size, vocab_size]
- # Arguments:
- char_emb_dim: the size of each character's attention
- word_emb_dim: the size of each word's attention
- vocab_size: num of unique words
- num_char: num of characters
- use_gpu: True or False
- """
-
- def __init__(self, char_emb_dim, word_emb_dim,
- vocab_size, num_char, use_gpu):
- super(charLM, self).__init__()
- self.char_emb_dim = char_emb_dim
- self.word_emb_dim = word_emb_dim
- self.vocab_size = vocab_size
-
- # char attention layer
- self.char_embed = nn.Embedding(num_char, char_emb_dim)
-
- # convolutions of filters with different sizes
- self.convolutions = []
-
- # list of tuples: (the number of filter, width)
- self.filter_num_width = [(25, 1), (50, 2), (75, 3), (100, 4), (125, 5), (150, 6)]
-
- for out_channel, filter_width in self.filter_num_width:
- self.convolutions.append(
- nn.Conv2d(
- 1, # in_channel
- out_channel, # out_channel
- kernel_size=(char_emb_dim, filter_width), # (height, width)
- bias=True
- )
- )
-
- self.highway_input_dim = sum([x for x, y in self.filter_num_width])
-
- self.batch_norm = nn.BatchNorm1d(self.highway_input_dim, affine=False)
-
- # highway net
- self.highway1 = Highway(self.highway_input_dim)
- self.highway2 = Highway(self.highway_input_dim)
-
- # LSTM
- self.lstm_num_layers = 2
-
- self.lstm = nn.LSTM(input_size=self.highway_input_dim,
- hidden_size=self.word_emb_dim,
- num_layers=self.lstm_num_layers,
- bias=True,
- dropout=0.5,
- batch_first=True)
-
- # output layer
- self.dropout = nn.Dropout(p=0.5)
- self.linear = nn.Linear(self.word_emb_dim, self.vocab_size)
-
- if use_gpu:
- for x in range(len(self.convolutions)):
- self.convolutions[x] = self.convolutions[x].cuda()
- self.highway1 = self.highway1.cuda()
- self.highway2 = self.highway2.cuda()
- self.lstm = self.lstm.cuda()
- self.dropout = self.dropout.cuda()
- self.char_embed = self.char_embed.cuda()
- self.linear = self.linear.cuda()
- self.batch_norm = self.batch_norm.cuda()
-
- def forward(self, x, hidden):
- # Input: Variable of Tensor with shape [num_seq, seq_len, max_word_len+2]
- # Return: Variable of Tensor with shape [num_words, len(word_dict)]
- lstm_batch_size = x.size()[0]
- lstm_seq_len = x.size()[1]
-
- x = x.contiguous().view(-1, x.size()[2])
- # [num_seq*seq_len, max_word_len+2]
-
- x = self.char_embed(x)
- # [num_seq*seq_len, max_word_len+2, char_emb_dim]
-
- x = torch.transpose(x.view(x.size()[0], 1, x.size()[1], -1), 2, 3)
- # [num_seq*seq_len, 1, char_emb_dim, max_word_len+2]
-
- x = self.conv_layers(x)
- # [num_seq*seq_len, total_num_filters]
-
- x = self.batch_norm(x)
- # [num_seq*seq_len, total_num_filters]
-
- x = self.highway1(x)
- x = self.highway2(x)
- # [num_seq*seq_len, total_num_filters]
-
- x = x.contiguous().view(lstm_batch_size, lstm_seq_len, -1)
- # [num_seq, seq_len, total_num_filters]
-
- x, hidden = self.lstm(x, hidden)
- # [num_seq, seq_len, hidden_size] (the LSTM was built with batch_first=True)
-
- x = self.dropout(x)
- # [num_seq, seq_len, hidden_size]
-
- x = x.contiguous().view(lstm_batch_size * lstm_seq_len, -1)
- # [num_seq*seq_len, hidden_size]
-
- x = self.linear(x)
- # [num_seq*seq_len, vocab_size]
- return x, hidden
-
- def conv_layers(self, x):
- chosen_list = list()
- for conv in self.convolutions:
- feature_map = torch.tanh(conv(x))
- # (batch_size, out_channel, 1, (max_word_len+2)-width+1)
- chosen = torch.max(feature_map, 3)[0]
- # (batch_size, out_channel, 1)
- chosen = chosen.squeeze(2)  # squeeze dim 2 explicitly so batch_size == 1 stays safe
- # (batch_size, out_channel)
- chosen_list.append(chosen)
-
- # (batch_size, total_num_filters)
- return torch.cat(chosen_list, 1)
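
For readers skimming the diff, here is a hedged sketch of how this (now deleted) charLM was driven. The input/output shapes come from the docstring and the comments in forward(); the concrete sizes and the import path are illustrative assumptions, not something the repo pins down.

```python
import torch
from model import charLM  # assumes the model.py deleted above is importable

# Illustrative sizes only (the README trained with batch 20, seq len 35,
# char embedding 15, word/LSTM dim 300).
num_char, vocab_size = 60, 10000
char_emb_dim, word_emb_dim = 15, 300
batch_size, seq_len, max_word_len = 20, 35, 10

net = charLM(char_emb_dim, word_emb_dim, vocab_size, num_char, use_gpu=False)

# Character indices, shape [batch, seq_len, max_word_len + 2]
# (the +2 presumably covers begin/end-of-word markers added in preprocessing).
x = torch.randint(0, num_char, (batch_size, seq_len, max_word_len + 2))

# Initial LSTM state: (h0, c0), each [num_layers=2, batch, word_emb_dim].
hidden = (torch.zeros(2, batch_size, word_emb_dim),
          torch.zeros(2, batch_size, word_emb_dim))

logits, hidden = net(x, hidden)
print(logits.shape)  # torch.Size([batch_size * seq_len, vocab_size])
```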
diff --git a/reproduction/Char-aware_NLM/test.py b/reproduction/Char-aware_NLM/test.py
deleted file mode 100644
index abf3f44d..00000000
--- a/reproduction/Char-aware_NLM/test.py
+++ /dev/null
@@ -1,117 +0,0 @@
-import os
-from collections import namedtuple
-
-import numpy as np
-import torch
-import torch.nn as nn
-from torch.autograd import Variable
-from utilities import batch_generator, read_data, text2vec
-
-
-def to_var(x):
- if torch.cuda.is_available():
- x = x.cuda()
- return Variable(x)
-
-
-def test(net, data, opt):
- net.eval()
-
- test_input = torch.from_numpy(data.test_input)
- test_label = torch.from_numpy(data.test_label)
-
- num_seq = test_input.size()[0] // opt.lstm_seq_len
- test_input = test_input[:num_seq * opt.lstm_seq_len, :]
- # [num_seq, seq_len, max_word_len+2]
- test_input = test_input.view(-1, opt.lstm_seq_len, opt.max_word_len + 2)
-
- criterion = nn.CrossEntropyLoss()
-
- loss_list = []
- num_hits = 0
- total = 0
- iterations = test_input.size()[0] // opt.lstm_batch_size
- test_generator = batch_generator(test_input, opt.lstm_batch_size)
- label_generator = batch_generator(test_label, opt.lstm_batch_size * opt.lstm_seq_len)
-
- hidden = (to_var(torch.zeros(2, opt.lstm_batch_size, opt.word_embed_dim)),
- to_var(torch.zeros(2, opt.lstm_batch_size, opt.word_embed_dim)))
-
- add_loss = 0.0
- for t in range(iterations):
- batch_input = test_generator.__next__()
- batch_label = label_generator.__next__()
-
- net.zero_grad()
- hidden = [state.detach() for state in hidden]
- test_output, hidden = net(to_var(batch_input), hidden)
-
- test_loss = criterion(test_output, to_var(batch_label)).data
- loss_list.append(test_loss)
- add_loss += test_loss
-
- print("Test Loss={0:.4f}".format(float(add_loss) / iterations))
- print("Test PPL={0:.4f}".format(float(np.exp(add_loss / iterations))))
-
-
-#############################################################
-
-if __name__ == "__main__":
-
- word_embed_dim = 300
- char_embedding_dim = 15
-
- if not os.path.exists("cache/prep.pt"):
- raise FileNotFoundError("Cannot find cache/prep.pt -- run preprocessing/training first")
-
- objects = torch.load("cache/prep.pt")
-
- word_dict = objects["word_dict"]
- char_dict = objects["char_dict"]
- reverse_word_dict = objects["reverse_word_dict"]
- max_word_len = objects["max_word_len"]
- num_words = len(word_dict)
-
- print("word/char dictionary built. Start making inputs.")
-
- if not os.path.exists("cache/data_sets.pt"):
-
- test_text = read_data("./test.txt")
- test_set = np.array(text2vec(test_text, char_dict, max_word_len))
-
- # Labels are next-word indices in word_dict, with the same length as the inputs
- test_label = np.array([word_dict[w] for w in test_text[1:]] + [word_dict[test_text[-1]]])
-
- category = {"test": test_set, "tlabel": test_label}
- torch.save(category, "cache/data_sets.pt")
- # the train split is unused by test() below, so it may be empty here
- train_set = train_label = None
- else:
- data_sets = torch.load("cache/data_sets.pt")
- test_set = data_sets["test"]
- test_label = data_sets["tlabel"]
- train_set = data_sets.get("tdata")
- train_label = data_sets.get("trlabel")
-
- DataTuple = namedtuple("DataTuple", "test_input test_label train_input train_label")
- data = DataTuple(test_input=test_set,
- test_label=test_label, train_label=train_label, train_input=train_set)
-
- print("Loaded data sets. Start building network.")
-
- USE_GPU = True
- cnn_batch_size = 700
- lstm_seq_len = 35
- lstm_batch_size = 20
-
- net = torch.load("cache/net.pkl")
-
- Options = namedtuple("Options", ["cnn_batch_size", "lstm_seq_len",
- "max_word_len", "lstm_batch_size", "word_embed_dim"])
- opt = Options(cnn_batch_size=lstm_seq_len * lstm_batch_size,
- lstm_seq_len=lstm_seq_len,
- max_word_len=max_word_len,
- lstm_batch_size=lstm_batch_size,
- word_embed_dim=word_embed_dim)
-
- print("Network built. Start testing.")
-
- test(net, data, opt)
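
utilities.py is not part of this diff, so the exact batch_generator used above is unknown; a minimal sketch consistent with how test.py calls it (a tensor plus a batch size, consumed via __next__) might look like this:

```python
def batch_generator(data, batch_size):
    """Yield successive batch_size-sized slices of `data` along dim 0.

    An assumed reconstruction: the real utilities.py may differ
    (e.g. it could shuffle or pad).
    """
    for start in range(0, data.size(0), batch_size):
        yield data[start:start + batch_size]
```

Note that test.py sizes `iterations` as `test_input.size()[0] // opt.lstm_batch_size`, so under this sketch every batch it actually pulls is full; a trailing partial batch is simply never requested.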
diff --git a/reproduction/Char-aware_NLM/test.txt b/reproduction/Char-aware_NLM/test.txt
deleted file mode 100644
index 92aaec44..00000000
--- a/reproduction/Char-aware_NLM/test.txt
+++ /dev/null
@@ -1,320 +0,0 @@
- no it was n't black monday
- but while the new york stock exchange did n't fall apart friday as the dow jones industrial average plunged N points most of it in the final hour it barely managed to stay this side of chaos
- some circuit breakers installed after the october N crash failed their first test traders say unable to cool the selling panic in both stocks and futures
- the N stock specialist firms on the big board floor the buyers and sellers of last resort who were criticized after the N crash once again could n't handle the selling pressure
- big investment banks refused to step up to the plate to support the beleaguered floor traders by buying big blocks of stock traders say
- heavy selling of standard & poor 's 500-stock index futures in chicago beat stocks downward
- seven big board stocks ual amr bankamerica walt disney capital cities\/abc philip morris and pacific telesis group stopped trading and never resumed
- the has already begun
- the equity market was
- once again the specialists were not able to handle the imbalances on the floor of the new york stock exchange said christopher senior vice president at securities corp
- james chairman of specialists henderson brothers inc. it is easy to say the specialist is n't doing his job
- when the dollar is in a even central banks ca n't stop it
- speculators are calling for a degree of liquidity that is not there in the market
- many money managers and some traders had already left their offices early friday afternoon on a warm autumn day because the stock market was so quiet
- then in a plunge the dow jones industrials in barely an hour surrendered about a third of their gains this year up a 190.58-point or N N loss on the day in trading volume
- trading accelerated to N million shares a record for the big board
- at the end of the day N million shares were traded
- the dow jones industrials closed at N
- the dow 's decline was second in point terms only to the black monday crash that occurred oct. N N
- in percentage terms however the dow 's dive was the ever and the sharpest since the market fell N or N N a week after black monday
- the dow fell N N on black monday
- shares of ual the parent of united airlines were extremely active all day friday reacting to news and rumors about the proposed $ N billion buy-out of the airline by an group
- wall street 's takeover-stock speculators or risk arbitragers had placed unusually large bets that a takeover would succeed and ual stock would rise
- at N p.m. edt came the news the big board was trading in ual pending news
- on the exchange floor as soon as ual stopped trading we for a panic said one top floor trader
- several traders could be seen shaking their heads when the news
- for weeks the market had been nervous about takeovers after campeau corp. 's cash crunch spurred concern about the prospects for future highly leveraged takeovers
- and N minutes after the ual trading halt came news that the ual group could n't get financing for its bid
- at this point the dow was down about N points
- the market
- arbitragers could n't dump their ual stock but they rid themselves of nearly every rumor stock they had
- for example their selling caused trading halts to be declared in usair group which closed down N N to N N delta air lines which fell N N to N N and industries which sank N to N N
- these stocks eventually reopened
- but as panic spread speculators began to sell blue-chip stocks such as philip morris and international business machines to offset their losses
- when trading was halted in philip morris the stock was trading at N down N N while ibm closed N N lower at N
- selling because of waves of automatic stop-loss orders which are triggered by computer when prices fall to certain levels
- most of the stock selling pressure came from wall street professionals including computer-guided program traders
- traders said most of their major institutional investors on the other hand sat tight
- now at N one of the market 's post-crash reforms took hold as the s&p N futures contract had plunged N points equivalent to around a drop in the dow industrials
- under an agreement signed by the big board and the chicago mercantile exchange trading was temporarily halted in chicago
- after the trading halt in the s&p N pit in chicago waves of selling continued to hit stocks themselves on the big board and specialists continued to prices down
- as a result the link between the futures and stock markets apart
- without the of stock-index futures the barometer of where traders think the overall stock market is headed many traders were afraid to trust stock prices quoted on the big board
- the futures halt was even by big board floor traders
- it things up said one major specialist
- this confusion effectively halted one form of program trading stock index arbitrage that closely links the futures and stock markets and has been blamed by some for the market 's big swings
- in a stock-index arbitrage sell program traders buy or sell big baskets of stocks and offset the trade in futures to lock in a price difference
- when the airline information came through it every model we had for the marketplace said a managing director at one of the largest program-trading firms
- we did n't even get a chance to do the programs we wanted to do
- but stocks kept falling
- the dow industrials were down N points at N p.m. before the halt
- at N p.m. at the end of the cooling off period the average was down N points
- meanwhile during the the s&p trading halt s&p futures sell orders began up while stocks in new york kept falling sharply
- big board chairman john j. phelan said yesterday the circuit breaker worked well
- i just think it 's at this point to get into a debate if index arbitrage would have helped or hurt things
- under another post-crash system big board president richard mr. phelan was flying to as the market was falling was talking on an hot line to the other exchanges the securities and exchange commission and the federal reserve board
- he out at a high-tech center on the floor of the big board where he could watch on prices and pending stock orders
- at about N p.m. edt s&p futures resumed trading and for a brief time the futures and stock markets started to come back in line
- buyers stepped in to the futures pit
- but the of s&p futures sell orders weighed on the market and the link with stocks began to fray again
- at about N the s&p market to still another limit of N points down and trading was locked again
- futures traders say the s&p was that the dow could fall as much as N points
- during this time small investors began ringing their brokers wondering whether another crash had begun
- at prudential-bache securities inc. which is trying to cater to small investors some brokers thought this would be the final
- that 's when george l. ball chairman of the prudential insurance co. of america unit took to the internal system to declare that the plunge was only mechanical
- i have a that this particular decline today is something more about less
- it would be my to advise clients not to sell to look for an opportunity to buy mr. ball told the brokers
- at merrill lynch & co. the nation 's biggest brokerage firm a news release was prepared merrill lynch comments on market drop
- the release cautioned that there are significant differences between the current environment and that of october N and that there are still attractive investment opportunities in the stock market
- however jeffrey b. lane president of shearson lehman hutton inc. said that friday 's plunge is going to set back relations with customers because it the concern of volatility
- and i think a lot of people will on program trading
- it 's going to bring the debate right back to the
- as the dow average ground to its final N loss friday the s&p pit stayed locked at its trading limit
- jeffrey of program trader investment group said N s&p contracts were for sale on the close the equivalent of $ N million in stock
- but there were no buyers
- while friday 's debacle involved mainly professional traders rather than investors it left the market vulnerable to continued selling this morning traders said
- stock-index futures contracts settled at much lower prices than indexes of the stock market itself
- at those levels stocks are set up to be by index arbitragers who lock in profits by buying futures when futures prices fall and simultaneously sell off stocks
- but nobody knows at what level the futures and stocks will open today
- the between the stock and futures markets friday will undoubtedly cause renewed debate about whether wall street is properly prepared for another crash situation
- the big board 's mr. said our performance was good
- but the exchange will look at the performance of all specialists in all stocks
- obviously we 'll take a close look at any situation in which we think the obligations were n't met he said
- see related story fed ready to big funds wsj oct. N N
- but specialists complain privately that just as in the N crash the firms big investment banks that support the market by trading big blocks of stock stayed on the sidelines during friday 's
- mr. phelan said it will take another day or two to analyze who was buying and selling friday
- concerning your sept. N page-one article on prince charles and the it 's a few hundred years since england has been a kingdom
- it 's now the united kingdom of great britain and northern ireland northern ireland scotland and oh yes england too
- just thought you 'd like to know
- george
- ports of call inc. reached agreements to sell its remaining seven aircraft to buyers that were n't disclosed
- the agreements bring to a total of nine the number of planes the travel company has sold this year as part of a restructuring
- the company said a portion of the $ N million realized from the sales will be used to repay its bank debt and other obligations resulting from the currently suspended operations
- earlier the company announced it would sell its aging fleet of boeing co. because of increasing maintenance costs
- a consortium of private investors operating as funding co. said it has made a $ N million cash bid for most of l.j. hooker corp. 's real-estate and holdings
- the $ N million bid includes the assumption of an estimated $ N million in secured liabilities on those properties according to those making the bid
- the group is led by jay chief executive officer of investment corp. in and a. boyd simpson chief executive of the atlanta-based simpson organization inc
- mr. 's company specializes in commercial real-estate investment and claims to have $ N billion in assets mr. simpson is a developer and a former senior executive of l.j. hooker
- the assets are good but they require more money and management than can be provided in l.j. hooker 's current situation said mr. simpson in an interview
- hooker 's philosophy was to build and sell
- we want to build and hold
- l.j. hooker based in atlanta is operating with protection from its creditors under chapter N of the u.s. bankruptcy code
- its parent company hooker corp. of sydney australia is currently being managed by a court-appointed provisional
- sanford chief executive of l.j. hooker said yesterday in a statement that he has not yet seen the bid but that he would review it and bring it to the attention of the creditors committee
- the $ N million bid is estimated by mr. simpson as representing N N of the value of all hooker real-estate holdings in the u.s.
- not included in the bid are teller or b. altman & co. l.j. hooker 's department-store chains
- the offer covers the massive N forest fair mall in cincinnati the N fashion mall in columbia s.c. and the N town center mall in
- the mall opened sept. N with a 's as its the columbia mall is expected to open nov. N
- other hooker properties included are a office tower in atlanta expected to be completed next february vacant land sites in florida and ohio l.j. hooker international the commercial real-estate brokerage company that once did business as merrill lynch commercial real estate plus other shopping centers
- the consortium was put together by the london-based investment banking company that is a subsidiary of security pacific corp
- we do n't anticipate any problems in raising the funding for the bid said campbell the head of mergers and acquisitions at in an interview
- is acting as the consortium 's investment bankers
- according to people familiar with the consortium the bid was project a reference to the film in which a played by actress is saved from a businessman by a police officer named john
- l.j. hooker was a small company based in atlanta in N when mr. simpson was hired to push it into commercial development
- the company grew modestly until N when a majority position in hooker corp. was acquired by australian developer george currently hooker 's chairman
- mr. to launch an ambitious but $ N billion acquisition binge that included teller and b. altman & co. as well as majority positions in merksamer jewelers a sacramento chain inc. the retailer and inc. the southeast department-store chain
- eventually mr. simpson and mr. had a falling out over the direction of the company and mr. simpson said he resigned in N
- since then hooker corp. has sold its interest in the chain back to 's management and is currently attempting to sell the b. altman & co. chain
- in addition robert chief executive of the chain is seeking funds to buy out the hooker interest in his company
- the merksamer chain is currently being offered for sale by first boston corp
- reached in mr. said that he believes the various hooker can become profitable with new management
- these are n't mature assets but they have the potential to be so said mr.
- managed properly and with a long-term outlook these can become investment-grade quality properties
- canadian production totaled N metric tons in the week ended oct. N up N N from the preceding week 's total of N tons statistics canada a federal agency said
- the week 's total was up N N from N tons a year earlier
- the total was N tons up N N from N tons a year earlier
- the treasury plans to raise $ N million in new cash thursday by selling about $ N billion of 52-week bills and $ N billion of maturing bills
- the bills will be dated oct. N and will mature oct. N N
- they will be available in minimum denominations of $ N
- bids must be received by N p.m. edt thursday at the treasury or at federal reserve banks or branches
- as small investors their mutual funds with phone calls over the weekend big fund managers said they have a strong defense against any wave of withdrawals cash
- unlike the weekend before black monday the funds were n't with heavy withdrawal requests
- and many fund managers have built up cash levels and say they will be buying stock this week
- at fidelity investments the nation 's largest fund company telephone volume was up sharply but it was still at just half the level of the weekend preceding black monday in N
- the boston firm said redemptions were running at less than one-third the level two years ago
- as of yesterday afternoon the redemptions represented less than N N of the total cash position of about $ N billion of fidelity 's stock funds
- two years ago there were massive redemption levels over the weekend and a lot of fear around said c. bruce who runs fidelity investments ' $ N billion fund
- this feels more like a deal
- people are n't
- the test may come today
- friday 's stock market sell-off came too late for many investors to act
- some shareholders have held off until today because any fund exchanges made after friday 's close would take place at today 's closing prices
- stock fund redemptions during the N debacle did n't begin to until after the market opened on black monday
- but fund managers say they 're ready
- many have raised cash levels which act as a buffer against steep market declines
- mario for instance holds cash positions well above N N in several of his funds
- windsor fund 's john and mutual series ' michael price said they had raised their cash levels to more than N N and N N respectively this year
- even peter lynch manager of fidelity 's $ N billion fund the nation 's largest stock fund built up cash to N N or $ N million
- one reason is that after two years of monthly net redemptions the fund posted net inflows of money from investors in august and september
- i 've let the money build up mr. lynch said who added that he has had trouble finding stocks he likes
- not all funds have raised cash levels of course
- as a group stock funds held N N of assets in cash as of august the latest figures available from the investment company institute
- that was modestly higher than the N N and N N levels in august and september of N
- also persistent redemptions would force some fund managers to dump stocks to raise cash
- but a strong level of investor withdrawals is much more unlikely this time around fund managers said
- a major reason is that investors already have sharply scaled back their purchases of stock funds since black monday
- sales have rebounded in recent months but monthly net purchases are still running at less than half N levels
- there 's not nearly as much said john chairman of vanguard group inc. a big valley forge pa. fund company
- many fund managers argue that now 's the time to buy
- vincent manager of the $ N billion wellington fund added to his positions in bristol-myers squibb woolworth and dun & bradstreet friday
- and today he 'll be looking to buy drug stocks like eli lilly pfizer and american home products whose dividend yields have been bolstered by stock declines
- fidelity 's mr. lynch for his part snapped up southern co. shares friday after the stock got
- if the market drops further today he said he 'll be buying blue chips such as bristol-myers and kellogg
- if they stocks like that he said it presents an opportunity that is the kind of thing you dream about
- major mutual-fund groups said phone calls were at twice the normal weekend pace yesterday
- but most investors were seeking share prices and other information
- trading volume was only modestly higher than normal
- still fund groups are n't taking any chances
- they hope to avoid the phone lines and other that some fund investors in october N
- fidelity on saturday opened its N investor centers across the country
- the centers normally are closed through the weekend
- in addition east coast centers will open at N edt this morning instead of the normal N
- t. rowe price associates inc. increased its staff of phone representatives to handle investor requests
- the group noted that some investors moved money from stock funds to money-market funds
- but most investors seemed to be in an information mode rather than in a transaction mode said steven a vice president
- and vanguard among other groups said it was adding more phone representatives today to help investors get through
- in an unusual move several funds moved to calm investors with on their phone lines
- we view friday 's market decline as offering us a buying opportunity as long-term investors a recording at & co. funds said over the weekend
- the group had a similar recording for investors
- several fund managers expect a rough market this morning before prices stabilize
- some early selling is likely to stem from investors and portfolio managers who want to lock in this year 's fat profits
- stock funds have averaged a staggering gain of N N through september according to lipper analytical services inc
- who runs shearson lehman hutton inc. 's $ N million sector analysis portfolio predicts the market will open down at least N points on technical factors and some panic selling
- but she expects prices to rebound soon and is telling investors she expects the stock market wo n't decline more than N N to N N from recent highs
- this is not a major crash she said
- nevertheless ms. said she was with phone calls over the weekend from nervous shareholders
- half of them are really scared and want to sell she said but i 'm trying to talk them out of it
- she added if they all were bullish i 'd really be upset
- the backdrop to friday 's slide was different from that of the october N crash fund managers argue
- two years ago unlike today the dollar was weak interest rates were rising and the market was very they say
- from the investors ' standpoint institutions and individuals learned a painful lesson by selling at the lows on black monday said stephen boesel manager of the $ N million t. rowe price growth and income fund
- this time i do n't think we 'll get a panic reaction
- newport corp. said it expects to report earnings of between N cents and N cents a share somewhat below analysts ' estimates of N cents to N cents
- the maker of scientific instruments and laser parts said orders fell below expectations in recent months
- a spokesman added that sales in the current quarter will about equal the quarter 's figure when newport reported net income of $ N million or N cents a share on $ N million in sales
- from the strike by N machinists union members against boeing co. reached air carriers friday as america west airlines announced it will postpone its new service out of houston because of delays in receiving aircraft from the seattle jet maker
- peter vice president for planning at the phoenix ariz. carrier said in an interview that the work at boeing now entering its 13th day has caused some turmoil in our scheduling and that more than N passengers who were booked to fly out of houston on america west would now be put on other airlines
- mr. said boeing told america west that the N it was supposed to get this thursday would n't be delivered until nov. N the day after the airline had been planning to service at houston with four daily flights including three to phoenix and one to las vegas
- now those routes are n't expected to begin until jan
- boeing is also supposed to send to america west another N aircraft as well as a N by year 's end
- those too are almost certain to arrive late
- at this point no other america west flights including its new service at san antonio texas newark n.j. and calif. have been affected by the delays in boeing deliveries
- nevertheless the company 's reaction the effect that a huge manufacturer such as boeing can have on other parts of the economy
- it also is sure to help the machinists put added pressure on the company
- i just do n't feel that the company can really stand or would want a prolonged tom baker president of machinists ' district N said in an interview yesterday
- i do n't think their customers would like it very much
- america west though is a smaller airline and therefore more affected by the delayed delivery of a single plane than many of its competitors would be
- i figure that american and united probably have such a hard time counting all the planes in their fleets they might not miss one at all mr. said
- indeed a random check friday did n't seem to indicate that the strike was having much of an effect on other airline operations
- southwest airlines has a boeing N set for delivery at the end of this month and expects to have the plane on time
- it 's so close to completion boeing 's told us there wo n't be a problem said a southwest spokesman
- a spokesman for amr corp. said boeing has assured american airlines it will deliver a N on time later this month
- american is preparing to take delivery of another N in early december and N more next year and is n't anticipating any changes in that timetable
- in seattle a boeing spokesman explained that the company has been in constant communication with all of its customers and that it was impossible to predict what further disruptions might be triggered by the strike
- meanwhile supervisors and employees have been trying to finish some N aircraft mostly N and N jumbo jets at the company 's wash. plant that were all but completed before the
- as of friday four had been delivered and a fifth plane a N was supposed to be out over the weekend to air china
- no date has yet been set to get back to the bargaining table
- we want to make sure they know what they want before they come back said doug hammond the federal mediator who has been in contact with both sides since the strike began
- the investment community for one has been anticipating a resolution
- though boeing 's stock price was battered along with the rest of the market friday it actually has risen over the last two weeks on the strength of new orders
- the market has taken two views that the labor situation will get settled in the short term and that things look very for boeing in the long term said howard an analyst at j. lawrence inc
- boeing 's shares fell $ N friday to close at $ N in composite trading on the new york stock exchange
- but mr. baker said he thinks the earliest a pact could be struck would be the end of this month that the company and union may resume negotiations as early as this week
- still he said it 's possible that the strike could last considerably longer
- i would n't expect an immediate resolution to anything
- last week boeing chairman frank sent striking workers a letter saying that to my knowledge boeing 's offer represents the best overall three-year contract of any major u.s. industrial firm in recent history
- but mr. baker called the letter and the company 's offer of a N N wage increase over the life of the pact plus bonuses very weak
- he added that the company the union 's resolve and the workers ' with being forced to work many hours overtime
- in separate developments talks have broken off between machinists representatives at lockheed corp. and the calif. aerospace company
- the union is continuing to work through its expired contract however
- it had planned a strike vote for next sunday but that has been pushed back indefinitely
- united auto workers local N which represents N workers at boeing 's helicopter unit in delaware county pa. said it agreed to extend its contract on a basis with a notification to cancel while it continues bargaining
- the accord expired yesterday
- and boeing on friday said it received an order from for four model N valued at a total of about $ N million
- the planes long range versions of the will be delivered with & engines
- & is a unit of united technologies inc
- is based in amsterdam
- a boeing spokeswoman said a delivery date for the planes is still being worked out for a variety of reasons but not because of the strike
- contributed to this article
- ltd. said its utilities arm is considering building new electric power plants some valued at more than one billion canadian dollars us$ N million in great britain and elsewhere
- 's senior vice president finance said its canadian utilities ltd. unit is reviewing projects in eastern canada and conventional electric power generating plants elsewhere including britain where the british government plans to allow limited competition in electrical generation from private-sector suppliers as part of its privatization program
- the projects are big
- they can be c$ N billion plus mr. said
- but we would n't go into them alone and canadian utilities ' equity stake would be small he said
- we 'd like to be the operator of the project and a modest equity investor
- our long suit is our proven ability to operate power plants he said
- mr. would n't offer regarding 's proposed british project but he said it would compete for customers with two huge british power generating companies that would be formed under the country 's plan to its massive water and electric utilities
- britain 's government plans to raise about # N billion $ N billion from the sale of most of its giant water and electric utilities beginning next month
- the planned electric utility sale scheduled for next year is alone expected to raise # N billion making it the world 's largest public offering
- under terms of the plan independent would be able to compete for N N of customers until N and for another N N between N and N
- canadian utilities had N revenue of c$ N billion mainly from its natural gas and electric utility businesses in alberta where the company serves about N customers
- there seems to be a move around the world to the generation of electricity mr. said and canadian utilities hopes to capitalize on it
- this is a real thrust on our utility side he said adding that canadian utilities is also projects in countries though he would be specific
- canadian utilities is n't alone in exploring power generation opportunities in britain in anticipation of the privatization program
- we 're certainly looking at some power generating projects in england said bruce vice president corporate strategy and corporate planning with enron corp. houston a big natural gas producer and pipeline operator
- mr.