Changelog for release 0.4.0 (tag: v0.4.0):

* CRF: add support for BMESO-type tags; add comments to Vocabulary
* BucketSampler: add an error check
* Fix a bug in ClipGradientCallback; remove the print in LRSchedulerCallback (a pbar should later be passed in for printing); add comments to MLP
* Update the MLP module
* Add metric comments; fix a bug in the Trainer save process
* Update README.md to fix the tutorial link
* Add ENAS (Efficient Neural Architecture Search)
* Add ignore_type in DataSet.add_field
* AutoPadder will not pad when dtype is None; add ignore_type in DataSet.apply
* Fix a potential padder bug in FieldArray
* Fix a typo in CRF, as well as a possible source of numerical instability
* Fix a possible bug in CRF
* Change two default init arguments of Trainer into None
* Changes to Callbacks: add several read-only properties to Callback, set through the manager; optimize the code to lighten the load on @transfer
* Move the ENAS-related code into the automl directory; fix a bug in fast_param_mapping
* Trainer now creates the save directory automatically
* Vocabulary now prints its contents
* Add an iteration method to Vocabulary; fix a bug where CRF scores could be negative
* Add a SQuAD metric
* Add a sigmoid activation function in MLP
* Add the Star-Transformer model; add ConllLoader for all kinds of CoNLL-format files; add JsonLoader for JSON-format files; add SSTLoader for SST-2 & SST-5; change the Callback interface; fix batch multi-processing when killed; add a README listing models and their performance
* Fix tests; fix callback & tests; update README
* Fix several bugs; adjust callbacks
* Prepare the 0.4.0 release
* Update the readme
* Support parallel loss; prevent incorrect loss computation in the multi-GPU case
* Update the advance_tutorial Jupyter notebook
* embedding_loader: add load_with_vocab() and load_without_vocab(), which no longer require embed_dim and detect automatically whether the file is word2vec or GloVe format; Vocabulary: add from_dataset() and index_dataset() to avoid multi-line dataset-indexing code; utils: add a cache_results() decorator for caching function return values; Callback: add an update_every property
* DataSet.apply() now reports the offending index on error; Vocabulary.from_dataset() and index_dataset() report the vocab order on error; EmbedLoader skips malformed lines when reading embeddings
* Update attention; doc tools; fix some doc errors
* Switch to Chinese comments; add a Viterbi decoding method
* Sample version
* Add pad-sequence for LSTM; add CSV, CoNLL, and JSON file readers; update the dataloader and remove useless ones; fix Trainer loss printing; fix tests; fix test_tutorial
* Add comments; test the documentation; several local work-in-progress commits; reorder the documentation
* Add documents; update pooling; update bert; update the documents in MLP and snli
* Merge the self-attention module into attention.py
* Update the documents on losses.py, DataSet, and metrics
* Remove the print in LSTM; change use_cuda of Trainer and Tester to device; extend the Trainer documentation
* Add comments for Trainer
* Improve the trainer and callback documentation; rename parts of the code so it is hidden from the docs
* Update the char-level encoder and the documents on embedding.py
* Add comments and revise some code; add get_embeddings
* Adjust the documentation configuration; change embedding initialization to init_embed
* Add multi-GPU support to Trainer and Tester
* Add tests; fix JsonLoader
* Remove the commented tutorial
* Add get_field_names to DataSet
* Fix bugs; add Const; revise some comments
* Add a model runner and model tests for easier model testing
* Rework the docs configuration and architecture; revise a large part of the core documentation (TODO: finish the trainer and tester docs; investigate doc examples and tests)
* Finish checking the core comments; revise the io comments
* Switch everything to relative imports; small change
* Remove api/automl from the install files; fix a seq_len bug in metrics; fix a naming error in sampler
* Fix a bug for compatibility with CPU-only PyTorch (TODO: similar bugs may exist elsewhere); fix references in the docs
* Replace tqdm.autonotebook with tqdm.auto
* Fix batch & vocab
* Upload the *.rst documentation files and several TODOs
* Discuss and consolidate several modules; core tests and small fixes; remove some redundant docs
* Update the init and const files
* Add CNN tests; fix a little bug
* Update attention; fix and improve the tests
* Finish the quick-start tutorial
* Rename the sequence_modeling docs to sequence_labeling and re-run apidoc to clean up after the rename; fix the documentation format
* Unify the scattered seq_len_to_mask implementations into core.utils.seq_len_to_mask
* Add a hint line; show dataset_loader in the docs; note that Dataset.read_csv will be replaced by CSVLoader
* Finish the documentation linking Callback and Trainer; partially update the index
* Remove redundant prints; remove the word-segmentation metric because it could cause errors
* Fix Chinese names in the docs; finish the detailed-introduction document; add the tutorial ipynb files; revise some introductory docs
* Revise the models and modules landing pages; add the titlesonly setting; revise the titles shown in the module docs; revise the opening introductions of core, io, modules, and models
* Use .. todo:: to hide TODO comments that might be pulled into the docs; revise some comments
* Delete an old metric in the tests; revise the tutorials test file
* Move features not yet ready for release into the legacy folder; delete tests that cannot run; revise the callback test file; delete outdated tutorials and test files
* Change the cache_results parameters; revise the io test files and delete some outdated tests
* Fix bugs; fix the failing tests in test_utils.py
* Fix compatibility with pad_sequence in PyTorch 1.1; revise Trainer's pbar
* Fix a bug in metrics; add metric tests
* Add a model summary; add aliases; remove nested layers in encoder
* Reorder the imports and __all__ exports of core, models, and modules; rename some files
* Fix var runn
* Add a clear method to Vocabulary; minor PEP 8 tweaks; update the cache_results example
* Warn when indices in a callback may be None; DataSet supports indexing by List
* Fix a typo; revise README.md
* Update the documents on bert and encoder/bert
* Add a fitlog callback for recording experiments with fitlog; fix a typo; update dataset_loader
* Add a link to the fitlog documentation; add DataSet Loader documentation
* Add the Star-Transformer reproduction
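The changelog mentions unifying the scattered implementations into a single core.utils.seq_len_to_mask. Its semantics can be sketched in plain Python; the real helper operates on numpy arrays and torch tensors, so the list-based stand-in below is for illustration only.

```python
def seq_len_to_mask(seq_len, max_len=None):
    """Turn per-sequence lengths into a boolean padding mask.

    mask[i][j] is True while position j is inside sequence i,
    and False on the padded tail.
    """
    if max_len is None:
        max_len = max(seq_len)
    return [[j < length for j in range(max_len)] for length in seq_len]

mask = seq_len_to_mask([2, 3])
print(mask)  # [[True, True, False], [True, True, True]]
```

Downstream code (losses, metrics) can then multiply or index by this mask to ignore padding positions.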
@@ -0,0 +1,7 @@ | |||
include requirements.txt | |||
include LICENSE | |||
include README.md | |||
prune test/ | |||
prune reproduction/ | |||
prune fastNLP/api | |||
prune fastNLP/automl |
@@ -6,87 +6,108 @@ | |||
![Hex.pm](https://img.shields.io/hexpm/l/plug.svg) | |||
[![Documentation Status](https://readthedocs.org/projects/fastnlp/badge/?version=latest)](http://fastnlp.readthedocs.io/?badge=latest) | |||
FastNLP is a modular Natural Language Processing system based on PyTorch, built for fast development of NLP models. | |||
fastNLP is a lightweight NLP toolkit. You can use it to quickly build a named-entity recognition (NER), Chinese word segmentation, or text classification system, or use it to construct complex network models for research. Its features include:
- A unified tabular data container that keeps preprocessing simple and clear, with built-in DataSet Loaders for many datasets that remove the need for preprocessing code.
- Handy NLP utilities, such as pretrained-embedding loading and intermediate-result caching.
- Thorough Chinese documentation.
- Many high-level modules, such as Variational LSTM, Transformer, and CRF.
- Packaged models such as CNNText and Biaffine, ready for direct use.
- A convenient and extensible Trainer, with many built-in callbacks for experiment logging, exception catching, and more.
## Installation Guide
fastNLP depends on the following packages:
+ numpy | |||
+ torch>=0.4.0 | |||
+ tqdm | |||
+ nltk | |||
The installation of torch may depend on your operating system and CUDA version; see the PyTorch website for details.
Once the dependencies are installed, run the following command to install fastNLP:
```shell | |||
pip install fastNLP | |||
``` | |||
## Built-in Components
Most neural networks used for NLP tasks can be viewed as a composition of three kinds of modules: encoder, aggregator, and decoder.
![](./docs/source/figures/text_classification.png) | |||
fastNLP ships many components of these three kinds in its modules package, helping users quickly assemble the network they need. The functions and common examples of the three kinds are:
A deep learning NLP model is the composition of three types of modules: | |||
<table> | |||
<tr> | |||
<td><b> module type </b></td> | |||
<td><b> functionality </b></td> | |||
<td><b> example </b></td> | |||
<td><b> Type </b></td>
<td><b> Functionality </b></td>
<td><b> Example </b></td>
</tr> | |||
<tr> | |||
<td> encoder </td> | |||
<td> encode the input into some abstract representation </td> | |||
<td> encode the input into a vector with representational power </td>
<td> embedding, RNN, CNN, transformer </td>
</tr> | |||
<tr> | |||
<td> aggregator </td> | |||
<td> aggregate and reduce information </td> | |||
<td> aggregate information from multiple vectors </td>
<td> self-attention, max-pooling </td> | |||
</tr> | |||
<tr> | |||
<td> decoder </td> | |||
<td> decode the representation into the output </td> | |||
<td> decode a representation vector into the desired output form </td>
<td> MLP, CRF </td> | |||
</tr> | |||
</table> | |||
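The encoder/aggregator/decoder composition in the table above can be sketched without any framework. The classes below are hypothetical stand-ins, not fastNLP APIs (real fastNLP components are PyTorch nn.Modules); they only illustrate how a text classifier arises from chaining the three module kinds.

```python
class Encoder:
    """Encoder: map each token id to a (toy) vector representation."""
    def __init__(self, vocab_size, dim):
        # Deterministic toy embedding table instead of learned weights.
        self.table = [[(i + j) % 7 / 7.0 for j in range(dim)]
                      for i in range(vocab_size)]
    def __call__(self, token_ids):
        return [self.table[t] for t in token_ids]

class MaxPoolAggregator:
    """Aggregator: reduce a sequence of vectors to one vector (max over time)."""
    def __call__(self, vectors):
        return [max(col) for col in zip(*vectors)]

class MLPDecoder:
    """Decoder: turn the pooled vector into per-class scores."""
    def __init__(self, weights):
        self.weights = weights  # one weight row per output class
    def __call__(self, vec):
        return [sum(w * x for w, x in zip(row, vec)) for row in self.weights]

def classify(token_ids, encoder, aggregator, decoder):
    # The whole model is just the composition decoder(aggregator(encoder(x))).
    return decoder(aggregator(encoder(token_ids)))

enc = Encoder(vocab_size=100, dim=4)
agg = MaxPoolAggregator()
dec = MLPDecoder(weights=[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
scores = classify([3, 14, 15], enc, agg, dec)
print(len(scores))  # one score per class → 2
```

Swapping any single stage (e.g. a CNN encoder for the embedding table, or a CRF decoder for the MLP) changes the task without touching the other two stages, which is the point of the modular split.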
For example: | |||
![](docs/source/figures/text_classification.png) | |||
## Requirements | |||
- Python>=3.6 | |||
- numpy>=1.14.2 | |||
- torch>=0.4.0 | |||
- tensorboardX | |||
- tqdm>=4.28.1 | |||
## Complete Models
fastNLP implements many complete models for different NLP tasks, all of them trained and tested.
## Resources | |||
You can find the details in the following two places:
- [Introduction](reproduction/)
- [Source code](fastNLP/models/)
- [Tutorials](https://github.com/fastnlp/fastNLP/tree/master/tutorials) | |||
- [Documentation](https://fastnlp.readthedocs.io/en/latest/) | |||
- [Source Code](https://github.com/fastnlp/fastNLP) | |||
## Installation | |||
Run the following commands to install fastNLP package. | |||
```shell | |||
pip install fastNLP | |||
``` | |||
## Project Structure
![](./docs/source/figures/workflow.png) | |||
## Project Structure | |||
The overall fastNLP workflow is shown in the figure above, and the project structure is as follows:
<table> | |||
<tr> | |||
<td><b> fastNLP </b></td> | |||
<td> an open-source NLP library </td> | |||
</tr> | |||
<tr> | |||
<td><b> fastNLP.api </b></td> | |||
<td> APIs for end-to-end prediction </td> | |||
<td> an open-source natural language processing library </td>
</tr> | |||
<tr> | |||
<td><b> fastNLP.core </b></td> | |||
<td> data representation & train/test procedure </td> | |||
<td> implements the core functionality, including data-processing components, the trainer, the tester, etc. </td>
</tr> | |||
<tr> | |||
<td><b> fastNLP.models </b></td> | |||
<td> a collection of NLP models </td> | |||
<td> implements a number of complete neural network models </td>
</tr> | |||
<tr> | |||
<td><b> fastNLP.modules </b></td> | |||
<td> a collection of PyTorch sub-models/components/wheels </td> | |||
<td> implements many components for building neural network models </td>
</tr> | |||
<tr> | |||
<td><b> fastNLP.io </b></td> | |||
<td> readers & savers </td> | |||
<td> implements reading and writing, including data loading and model I/O </td>
</tr> | |||
</table> | |||
## Reference Resources
- [Tutorials](https://github.com/fastnlp/fastNLP/tree/master/tutorials)
- [Documentation](https://fastnlp.readthedocs.io/en/latest/)
- [Source code](https://github.com/fastnlp/fastNLP)
*In memory of @FengZiYjun. May his soul rest in peace. We will miss you very very much!* |
@@ -3,6 +3,7 @@ | |||
# You can set these variables from the command line. | |||
SPHINXOPTS = | |||
SPHINXAPIDOC = sphinx-apidoc | |||
SPHINXBUILD = sphinx-build | |||
SPHINXPROJ = fastNLP | |||
SOURCEDIR = source | |||
@@ -12,6 +13,12 @@ BUILDDIR = build | |||
help: | |||
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) | |||
apidoc: | |||
$(SPHINXAPIDOC) -efM -o source ../$(SPHINXPROJ) | |||
server: | |||
cd build/html && python -m http.server | |||
.PHONY: help Makefile | |||
# Catch-all target: route all unknown targets to Sphinx using the new | |||
@@ -14,6 +14,7 @@ | |||
# | |||
import os | |||
import sys | |||
sys.path.insert(0, os.path.abspath('../../')) | |||
# -- Project information ----------------------------------------------------- | |||
@@ -23,10 +24,9 @@ copyright = '2018, xpqiu' | |||
author = 'xpqiu' | |||
# The short X.Y version | |||
version = '0.2' | |||
version = '0.4' | |||
# The full version, including alpha/beta/rc tags | |||
release = '0.2' | |||
release = '0.4' | |||
# -- General configuration --------------------------------------------------- | |||
@@ -42,9 +42,15 @@ extensions = [ | |||
'sphinx.ext.viewcode', | |||
'sphinx.ext.autosummary', | |||
'sphinx.ext.mathjax', | |||
'sphinx.ext.todo' | |||
] | |||
autodoc_default_options = { | |||
'member-order': 'bysource', | |||
'special-members': '__init__', | |||
'undoc-members': True, | |||
} | |||
# Add any paths that contain templates here, relative to this directory. | |||
templates_path = ['_templates'] | |||
@@ -62,17 +68,16 @@ master_doc = 'index' | |||
# | |||
# This is also used if you do content translation via gettext catalogs. | |||
# Usually you set "language" from the command line for these cases. | |||
language = None | |||
language = "zh_CN" | |||
# List of patterns, relative to source directory, that match files and | |||
# directories to ignore when looking for source files. | |||
# This pattern also affects html_static_path and html_extra_path . | |||
exclude_patterns = [] | |||
exclude_patterns = ['modules.rst'] | |||
# The name of the Pygments (syntax highlighting) style to use. | |||
pygments_style = 'sphinx' | |||
# -- Options for HTML output ------------------------------------------------- | |||
# The theme to use for HTML and HTML Help pages. See the documentation for | |||
@@ -84,7 +89,10 @@ html_theme = 'sphinx_rtd_theme' | |||
# further. For a list of options available for each theme, see the | |||
# documentation. | |||
# | |||
# html_theme_options = {} | |||
html_theme_options = { | |||
'collapse_navigation': False, | |||
'titles_only': True | |||
} | |||
# Add any paths that contain custom static files (such as style sheets) here, | |||
# relative to this directory. They are copied after the builtin static files, | |||
@@ -107,22 +115,21 @@ html_static_path = ['_static'] | |||
# Output file base name for HTML help builder. | |||
htmlhelp_basename = 'fastNLPdoc' | |||
# -- Options for LaTeX output ------------------------------------------------ | |||
latex_elements = { | |||
# The paper size ('letterpaper' or 'a4paper'). | |||
# | |||
# 'papersize': 'letterpaper', | |||
# The font size ('10pt', '11pt' or '12pt'). | |||
# | |||
# 'pointsize': '10pt', | |||
# Additional stuff for the LaTeX preamble. | |||
# | |||
# 'preamble': '', | |||
# Latex figure (float) alignment | |||
# | |||
# 'figure_align': 'htbp', | |||
@@ -136,7 +143,6 @@ latex_documents = [ | |||
'xpqiu', 'manual'), | |||
] | |||
# -- Options for manual page output ------------------------------------------ | |||
# One entry per manual page. List of tuples | |||
@@ -146,7 +152,6 @@ man_pages = [ | |||
[author], 1) | |||
] | |||
# -- Options for Texinfo output ---------------------------------------------- | |||
# Grouping the document tree into Texinfo files. List of tuples | |||
@@ -159,4 +164,14 @@ texinfo_documents = [ | |||
] | |||
# -- Extension configuration ------------------------------------------------- | |||
# -- Extension configuration ------------------------------------------------- | |||
def maybe_skip_member(app, what, name, obj, skip, options): | |||
if name.startswith("_"): | |||
return True | |||
if obj.__doc__ is None: | |||
return True | |||
return False | |||
def setup(app): | |||
app.connect('autodoc-skip-member', maybe_skip_member) |
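The autodoc-skip-member hook added to conf.py above can be checked standalone: Sphinx calls it for every candidate member, and a True return hides that member from the generated docs. The sample objects below are hypothetical, for illustration only.

```python
# Same rule as the conf.py hook: hide private names and undocumented members.
def maybe_skip_member(app, what, name, obj, skip, options):
    if name.startswith("_"):
        return True
    if obj.__doc__ is None:
        return True
    return False

class Documented:
    """A public member with a docstring stays visible."""

def undocumented():
    pass

# hidden: leading underscore
assert maybe_skip_member(None, "method", "_private", Documented, False, None)
# hidden: no docstring
assert maybe_skip_member(None, "function", "undocumented", undocumented, False, None)
# kept: public and documented
assert not maybe_skip_member(None, "class", "Documented", Documented, False, None)
```

Note that, as written, the hook also returns True for names like `__init__`, which may override the `'special-members': '__init__'` entry in `autodoc_default_options` earlier in this file.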
@@ -1,36 +0,0 @@ | |||
fastNLP.api | |||
============ | |||
fastNLP.api.api | |||
---------------- | |||
.. automodule:: fastNLP.api.api | |||
:members: | |||
fastNLP.api.converter | |||
---------------------- | |||
.. automodule:: fastNLP.api.converter | |||
:members: | |||
fastNLP.api.model\_zoo | |||
----------------------- | |||
.. automodule:: fastNLP.api.model_zoo | |||
:members: | |||
fastNLP.api.pipeline | |||
--------------------- | |||
.. automodule:: fastNLP.api.pipeline | |||
:members: | |||
fastNLP.api.processor | |||
---------------------- | |||
.. automodule:: fastNLP.api.processor | |||
:members: | |||
.. automodule:: fastNLP.api | |||
:members: |
@@ -0,0 +1,7 @@ | |||
fastNLP.core.batch | |||
================== | |||
.. automodule:: fastNLP.core.batch | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.core.callback | |||
===================== | |||
.. automodule:: fastNLP.core.callback | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.core.const | |||
================== | |||
.. automodule:: fastNLP.core.const | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.core.dataset | |||
==================== | |||
.. automodule:: fastNLP.core.dataset | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.core.field | |||
================== | |||
.. automodule:: fastNLP.core.field | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.core.instance | |||
===================== | |||
.. automodule:: fastNLP.core.instance | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.core.losses | |||
=================== | |||
.. automodule:: fastNLP.core.losses | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.core.metrics | |||
==================== | |||
.. automodule:: fastNLP.core.metrics | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.core.optimizer | |||
====================== | |||
.. automodule:: fastNLP.core.optimizer | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -1,84 +1,29 @@ | |||
fastNLP.core | |||
============= | |||
fastNLP.core.batch | |||
------------------- | |||
.. automodule:: fastNLP.core.batch | |||
:members: | |||
fastNLP.core.dataset | |||
--------------------- | |||
.. automodule:: fastNLP.core.dataset | |||
:members: | |||
fastNLP.core.fieldarray | |||
------------------------ | |||
.. automodule:: fastNLP.core.fieldarray | |||
:members: | |||
fastNLP.core.instance | |||
---------------------- | |||
.. automodule:: fastNLP.core.instance | |||
:members: | |||
fastNLP.core.losses | |||
-------------------- | |||
.. automodule:: fastNLP.core.losses | |||
:members: | |||
fastNLP.core.metrics | |||
--------------------- | |||
.. automodule:: fastNLP.core.metrics | |||
:members: | |||
fastNLP.core.optimizer | |||
----------------------- | |||
.. automodule:: fastNLP.core.optimizer | |||
:members: | |||
fastNLP.core.predictor | |||
----------------------- | |||
.. automodule:: fastNLP.core.predictor | |||
:members: | |||
fastNLP.core.sampler | |||
--------------------- | |||
.. automodule:: fastNLP.core.sampler | |||
:members: | |||
fastNLP.core.tester | |||
-------------------- | |||
.. automodule:: fastNLP.core.tester | |||
:members: | |||
fastNLP.core.trainer | |||
--------------------- | |||
.. automodule:: fastNLP.core.trainer | |||
:members: | |||
fastNLP.core.utils | |||
------------------- | |||
.. automodule:: fastNLP.core.utils | |||
:members: | |||
fastNLP.core.vocabulary | |||
------------------------ | |||
.. automodule:: fastNLP.core.vocabulary | |||
:members: | |||
fastNLP.core | |||
============ | |||
.. automodule:: fastNLP.core | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: | |||
Submodules
---------- | |||
.. toctree:: | |||
:titlesonly: | |||
fastNLP.core.batch | |||
fastNLP.core.callback | |||
fastNLP.core.const | |||
fastNLP.core.dataset | |||
fastNLP.core.field | |||
fastNLP.core.instance | |||
fastNLP.core.losses | |||
fastNLP.core.metrics | |||
fastNLP.core.optimizer | |||
fastNLP.core.sampler | |||
fastNLP.core.tester | |||
fastNLP.core.trainer | |||
fastNLP.core.utils | |||
fastNLP.core.vocabulary | |||
@@ -0,0 +1,7 @@ | |||
fastNLP.core.sampler | |||
==================== | |||
.. automodule:: fastNLP.core.sampler | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.core.tester | |||
=================== | |||
.. automodule:: fastNLP.core.tester | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.core.trainer | |||
==================== | |||
.. automodule:: fastNLP.core.trainer | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.core.utils | |||
================== | |||
.. automodule:: fastNLP.core.utils | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.core.vocabulary | |||
======================= | |||
.. automodule:: fastNLP.core.vocabulary | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.io.base\_loader | |||
======================= | |||
.. automodule:: fastNLP.io.base_loader | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.io.dataset\_loader | |||
========================== | |||
.. automodule:: fastNLP.io.dataset_loader | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.io.embed\_loader | |||
======================== | |||
.. automodule:: fastNLP.io.embed_loader | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.io.model\_io | |||
==================== | |||
.. automodule:: fastNLP.io.model_io | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -1,42 +1,19 @@ | |||
fastNLP.io | |||
=========== | |||
fastNLP.io | |||
========== | |||
fastNLP.io.base\_loader | |||
------------------------ | |||
.. automodule:: fastNLP.io.base_loader | |||
:members: | |||
fastNLP.io.config\_io | |||
---------------------- | |||
.. automodule:: fastNLP.io.config_io | |||
:members: | |||
fastNLP.io.dataset\_loader | |||
--------------------------- | |||
.. automodule:: fastNLP.io.dataset_loader | |||
:members: | |||
fastNLP.io.embed\_loader | |||
------------------------- | |||
.. automodule:: fastNLP.io.embed_loader | |||
:members: | |||
fastNLP.io.logger | |||
------------------ | |||
.. automodule:: fastNLP.io.logger | |||
.. automodule:: fastNLP.io | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: | |||
fastNLP.io.model\_io | |||
--------------------- | |||
Submodules
---------- | |||
.. automodule:: fastNLP.io.model_io | |||
:members: | |||
.. toctree:: | |||
:titlesonly: | |||
fastNLP.io.base_loader | |||
fastNLP.io.dataset_loader | |||
fastNLP.io.embed_loader | |||
fastNLP.io.model_io | |||
.. automodule:: fastNLP.io | |||
:members: |
@@ -0,0 +1,7 @@ | |||
fastNLP.models.biaffine\_parser | |||
=============================== | |||
.. automodule:: fastNLP.models.biaffine_parser | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.models.cnn\_text\_classification | |||
======================================== | |||
.. automodule:: fastNLP.models.cnn_text_classification | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -1,42 +1,20 @@ | |||
fastNLP.models | |||
=============== | |||
fastNLP.models | |||
============== | |||
fastNLP.models.base\_model | |||
--------------------------- | |||
.. automodule:: fastNLP.models.base_model | |||
:members: | |||
fastNLP.models.biaffine\_parser | |||
-------------------------------- | |||
.. automodule:: fastNLP.models.biaffine_parser | |||
:members: | |||
fastNLP.models.char\_language\_model | |||
------------------------------------- | |||
.. automodule:: fastNLP.models.char_language_model | |||
:members: | |||
fastNLP.models.cnn\_text\_classification | |||
----------------------------------------- | |||
.. automodule:: fastNLP.models.cnn_text_classification | |||
:members: | |||
fastNLP.models.sequence\_modeling | |||
---------------------------------- | |||
.. automodule:: fastNLP.models.sequence_modeling | |||
.. automodule:: fastNLP.models | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: | |||
fastNLP.models.snli | |||
-------------------- | |||
Submodules
---------- | |||
.. automodule:: fastNLP.models.snli | |||
:members: | |||
.. toctree:: | |||
:titlesonly: | |||
fastNLP.models.biaffine_parser | |||
fastNLP.models.cnn_text_classification | |||
fastNLP.models.sequence_labeling | |||
fastNLP.models.snli | |||
fastNLP.models.star_transformer | |||
.. automodule:: fastNLP.models | |||
:members: |
@@ -0,0 +1,7 @@ | |||
fastNLP.models.sequence\_labeling | |||
================================= | |||
.. automodule:: fastNLP.models.sequence_labeling | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.models.snli | |||
=================== | |||
.. automodule:: fastNLP.models.snli | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.models.star\_transformer | |||
================================ | |||
.. automodule:: fastNLP.models.star_transformer | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.modules.aggregator.attention | |||
==================================== | |||
.. automodule:: fastNLP.modules.aggregator.attention | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.modules.aggregator.pooling | |||
================================== | |||
.. automodule:: fastNLP.modules.aggregator.pooling | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -1,36 +1,17 @@ | |||
fastNLP.modules.aggregator | |||
=========================== | |||
fastNLP.modules.aggregator | |||
========================== | |||
fastNLP.modules.aggregator.attention | |||
------------------------------------- | |||
.. automodule:: fastNLP.modules.aggregator.attention | |||
:members: | |||
fastNLP.modules.aggregator.avg\_pool | |||
------------------------------------- | |||
.. automodule:: fastNLP.modules.aggregator.avg_pool | |||
:members: | |||
fastNLP.modules.aggregator.kmax\_pool | |||
-------------------------------------- | |||
.. automodule:: fastNLP.modules.aggregator.kmax_pool | |||
.. automodule:: fastNLP.modules.aggregator | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: | |||
fastNLP.modules.aggregator.max\_pool | |||
------------------------------------- | |||
.. automodule:: fastNLP.modules.aggregator.max_pool | |||
:members: | |||
Submodules
---------- | |||
fastNLP.modules.aggregator.self\_attention | |||
------------------------------------------- | |||
.. toctree:: | |||
:titlesonly: | |||
.. automodule:: fastNLP.modules.aggregator.self_attention | |||
:members: | |||
fastNLP.modules.aggregator.attention | |||
fastNLP.modules.aggregator.pooling | |||
.. automodule:: fastNLP.modules.aggregator | |||
:members: |
@@ -0,0 +1,7 @@ | |||
fastNLP.modules.decoder.CRF | |||
=========================== | |||
.. automodule:: fastNLP.modules.decoder.crf | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.modules.decoder.MLP | |||
=========================== | |||
.. automodule:: fastNLP.modules.decoder.mlp | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -1,18 +1,18 @@ | |||
fastNLP.modules.decoder | |||
======================== | |||
fastNLP.modules.decoder | |||
======================= | |||
fastNLP.modules.decoder.CRF | |||
---------------------------- | |||
.. automodule:: fastNLP.modules.decoder.CRF | |||
.. automodule:: fastNLP.modules.decoder | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: | |||
fastNLP.modules.decoder.MLP | |||
---------------------------- | |||
Submodules
---------- | |||
.. automodule:: fastNLP.modules.decoder.MLP | |||
:members: | |||
.. toctree:: | |||
:titlesonly: | |||
fastNLP.modules.decoder.crf | |||
fastNLP.modules.decoder.mlp | |||
fastNLP.modules.decoder.utils | |||
.. automodule:: fastNLP.modules.decoder | |||
:members: |
@@ -0,0 +1,7 @@ | |||
fastNLP.modules.decoder.utils | |||
============================= | |||
.. automodule:: fastNLP.modules.decoder.utils | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.modules.encoder.bert | |||
============================ | |||
.. automodule:: fastNLP.modules.encoder.bert | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.modules.encoder.char\_encoder | |||
===================================== | |||
.. automodule:: fastNLP.modules.encoder.char_encoder | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.modules.encoder.conv\_maxpool | |||
===================================== | |||
.. automodule:: fastNLP.modules.encoder.conv_maxpool | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.modules.encoder.embedding | |||
================================= | |||
.. automodule:: fastNLP.modules.encoder.embedding | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.modules.encoder.lstm | |||
============================ | |||
.. automodule:: fastNLP.modules.encoder.lstm | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -1,60 +1,23 @@ | |||
fastNLP.modules.encoder | |||
======================== | |||
fastNLP.modules.encoder | |||
======================= | |||
fastNLP.modules.encoder.char\_embedding | |||
---------------------------------------- | |||
.. automodule:: fastNLP.modules.encoder.char_embedding | |||
:members: | |||
fastNLP.modules.encoder.conv | |||
----------------------------- | |||
.. automodule:: fastNLP.modules.encoder.conv | |||
:members: | |||
fastNLP.modules.encoder.conv\_maxpool | |||
-------------------------------------- | |||
.. automodule:: fastNLP.modules.encoder.conv_maxpool | |||
:members: | |||
fastNLP.modules.encoder.embedding | |||
---------------------------------- | |||
.. automodule:: fastNLP.modules.encoder.embedding | |||
:members: | |||
fastNLP.modules.encoder.linear | |||
------------------------------- | |||
.. automodule:: fastNLP.modules.encoder.linear | |||
:members: | |||
fastNLP.modules.encoder.lstm | |||
----------------------------- | |||
.. automodule:: fastNLP.modules.encoder.lstm | |||
:members: | |||
fastNLP.modules.encoder.masked\_rnn | |||
------------------------------------ | |||
.. automodule:: fastNLP.modules.encoder.masked_rnn | |||
:members: | |||
fastNLP.modules.encoder.transformer | |||
------------------------------------ | |||
.. automodule:: fastNLP.modules.encoder.transformer | |||
.. automodule:: fastNLP.modules.encoder | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: | |||
fastNLP.modules.encoder.variational\_rnn | |||
----------------------------------------- | |||
Submodules
---------- | |||
.. automodule:: fastNLP.modules.encoder.variational_rnn | |||
:members: | |||
.. toctree:: | |||
:titlesonly: | |||
fastNLP.modules.encoder.bert | |||
fastNLP.modules.encoder.char_encoder | |||
fastNLP.modules.encoder.conv_maxpool | |||
fastNLP.modules.encoder.embedding | |||
fastNLP.modules.encoder.lstm | |||
fastNLP.modules.encoder.star_transformer | |||
fastNLP.modules.encoder.transformer | |||
fastNLP.modules.encoder.variational_rnn | |||
.. automodule:: fastNLP.modules.encoder | |||
:members: |
@@ -0,0 +1,7 @@ | |||
fastNLP.modules.encoder.star\_transformer | |||
========================================= | |||
.. automodule:: fastNLP.modules.encoder.star_transformer | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.modules.encoder.transformer | |||
=================================== | |||
.. automodule:: fastNLP.modules.encoder.transformer | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -0,0 +1,7 @@ | |||
fastNLP.modules.encoder.variational\_rnn | |||
======================================== | |||
.. automodule:: fastNLP.modules.encoder.variational_rnn | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: |
@@ -1,30 +1,17 @@ | |||
fastNLP.modules | |||
================ | |||
fastNLP.modules | |||
=============== | |||
.. toctree:: | |||
fastNLP.modules.aggregator | |||
fastNLP.modules.decoder | |||
fastNLP.modules.encoder | |||
fastNLP.modules.dropout | |||
------------------------ | |||
.. automodule:: fastNLP.modules.dropout | |||
:members: | |||
fastNLP.modules.other\_modules | |||
------------------------------- | |||
.. automodule:: fastNLP.modules.other_modules | |||
.. automodule:: fastNLP.modules | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: | |||
fastNLP.modules.utils | |||
---------------------- | |||
.. automodule:: fastNLP.modules.utils | |||
:members: | |||
Submodules
----------- | |||
.. toctree:: | |||
:titlesonly: | |||
.. automodule:: fastNLP.modules | |||
:members: | |||
fastNLP.modules.aggregator | |||
fastNLP.modules.decoder | |||
fastNLP.modules.encoder |
@@ -1,13 +1,20 @@ | |||
fastNLP | |||
======== | |||
API Documentation
=================
.. automodule:: fastNLP | |||
:members: | |||
:undoc-members: | |||
:show-inheritance: | |||
Internal Modules
----------------
.. toctree:: | |||
:titlesonly: | |||
:maxdepth: 3 | |||
fastNLP.api | |||
fastNLP.core | |||
fastNLP.io | |||
fastNLP.models | |||
fastNLP.modules | |||
fastNLP.models | |||
.. automodule:: fastNLP | |||
:members: |
@@ -1,63 +1,79 @@ | |||
fastNLP documentation | |||
fastNLP Chinese Documentation
=============================
A Modularized and Extensible Toolkit for Natural Language Processing. Currently still in incubation. | |||
fastNLP is a lightweight NLP toolkit. You can use it to quickly build a named-entity recognition (NER), Chinese word segmentation, or text classification system,
or use it to construct complex network models for research. Its features include:
Introduction | |||
- A unified tabular data container that keeps preprocessing simple and clear, with built-in DataSet Loaders for many datasets that remove the need for preprocessing code.
- Handy NLP utilities, such as pretrained-embedding loading and intermediate-result caching.
- Thorough Chinese documentation.
- Many high-level modules, such as Variational LSTM, Transformer, and CRF.
- Packaged models such as CNNText and Biaffine, ready for direct use.
- A convenient and extensible Trainer, with many built-in callbacks for experiment logging, exception catching, and more.
Built-in Components
-------------------
FastNLP is a modular Natural Language Processing system based on | |||
PyTorch, built for fast development of NLP models. | |||
Most neural networks used for NLP tasks can be viewed as a composition of three kinds of modules: encoder, aggregator, and decoder.
A deep learning NLP model is the composition of three types of modules: | |||
.. image:: figures/text_classification.png | |||
fastNLP 在 :mod:`~fastNLP.modules` 模块中内置了三种模块的诸多组件,可以帮助用户快速搭建自己所需的网络。 | |||
三种模块的功能和常见组件如下: | |||
+-----------------------+-----------------------+-----------------------+ | |||
| module type | functionality | example | | |||
+=======================+=======================+=======================+ | |||
| encoder | encode the input into | embedding, RNN, CNN, | | |||
| | some abstract | transformer | | |||
| | representation | | | |||
| encoder | 将输入编码为具有具 | embedding, RNN, CNN, | | |||
| | 有表示能力的向量 | transformer | | |||
+-----------------------+-----------------------+-----------------------+ | |||
| aggregator | aggregate and reduce | self-attention, | | |||
| | information | max-pooling | | |||
| aggregator | 从多个向量中聚合信息 | self-attention, | | |||
| | | max-pooling | | |||
+-----------------------+-----------------------+-----------------------+ | |||
| decoder | decode the | MLP, CRF | | |||
| | representation into | | | |||
| | the output | | | |||
| decoder | 将具有某种表示意义的 | MLP, CRF | | |||
| | 向量解码为需要的输出 | | | |||
| | 形式 | | | |||
+-----------------------+-----------------------+-----------------------+ | |||
For example: | |||
.. image:: figures/text_classification.png | |||
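The encoder → aggregator → decoder composition above can be sketched in a few lines of plain PyTorch. The toy class below is illustrative only, not one of fastNLP's built-in modules:

```python
import torch
import torch.nn as nn

class TinyTextClassifier(nn.Module):
    """A toy composition of the three module types: encoder -> aggregator -> decoder."""

    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, embed_dim)  # encoder: tokens -> vectors
        self.decoder = nn.Linear(embed_dim, num_classes)    # decoder: vector -> output

    def forward(self, words):
        x = self.encoder(words)           # (batch, seq_len, embed_dim)
        x, _ = torch.max(x, dim=1)        # aggregator: max-pooling over the sequence
        return {"pred": self.decoder(x)}  # fastNLP models return a dict

model = TinyTextClassifier(vocab_size=100, embed_dim=16, num_classes=5)
out = model(torch.randint(0, 100, (2, 7)))  # a batch of 2 sequences of length 7
print(tuple(out["pred"].shape))  # (2, 5)
```

Real models differ only in which encoder, aggregator, and decoder components they plug into this skeleton.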
Built-in Models
----------------

fastNLP ships with complete models such as :class:`~fastNLP.models.CNNText` and
:class:`~fastNLP.models.SeqLabeling` in the :mod:`~fastNLP.models` module, ready for direct use.

.. todo::
    Describe these models in a table (model name + description + results on tasks)

User's Guide
----------------

.. toctree::
    :maxdepth: 1

    Installation <user/installation>
    Quickstart <user/quickstart>
    Detailed Guide <user/tutorial_one>

API Reference
-------------

Besides the user's guide, you can also consult the API reference to find the tools you need.

.. toctree::
    :maxdepth: 2

    fastNLP API <fastNLP>

fitlog
------

fitlog is a tool developed by our team for logging experiments and managing code.
You can read its documentation `here <https://fitlog.readthedocs.io/zh/latest/>`_.

Indices and Tables
==================

* :ref:`genindex`
fastNLP
=======

.. toctree::
    :titlesonly:
    :maxdepth: 4

    fastNLP
fastNLP 10-Minute Tutorial
==========================

The original tutorial is at https://github.com/fastnlp/fastNLP/blob/master/tutorials/fastnlp_10min_tutorial.ipynb

fastNLP provides convenient utilities for preprocessing data and for training and testing models.

DataSet & Instance
------------------

fastNLP uses DataSet and Instance to store and process data. Each DataSet represents a dataset and each Instance represents a sample. A DataSet holds multiple Instances, and each Instance can store arbitrary user-defined content.

There are several read\_\* methods that read data from files into a DataSet with ease.
.. code:: ipython3

    from fastNLP import DataSet
    from fastNLP import Instance

    # read data from a csv file into a DataSet
    win_path = "C:\\Users\zyfeng\Desktop\FudanNLP\\fastNLP\\test\\data_for_tests\\tutorial_sample_dataset.csv"
    dataset = DataSet.read_csv(win_path, headers=('raw_sentence', 'label'), sep='\t')
    print(dataset[0])

.. parsed-literal::

    {'raw_sentence': A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .,
    'label': 1}

.. code:: ipython3

    # DataSet.append(Instance) adds a new sample
    dataset.append(Instance(raw_sentence='fake data', label='0'))
    dataset[-1]

.. parsed-literal::

    {'raw_sentence': fake data,
    'label': 0}

.. code:: ipython3

    # DataSet.apply(func, new_field_name) preprocesses the data
    # lowercase all letters
    dataset.apply(lambda x: x['raw_sentence'].lower(), new_field_name='raw_sentence')

    # convert the labels to int
    dataset.apply(lambda x: int(x['label']), new_field_name='label_seq', is_target=True)

    # drop empty sentences, then split each sentence on whitespace
    dataset.drop(lambda x: len(x['raw_sentence'].split()) == 0)
    def split_sent(ins):
        return ins['raw_sentence'].split()
    dataset.apply(split_sent, new_field_name='words', is_input=True)

.. code:: ipython3

    # DataSet.drop(func) filters out samples
    # drop sentences shorter than a given length
    dataset.drop(lambda x: len(x['words']) <= 3)

.. code:: ipython3

    # split into training and test sets
    train_data, test_data = dataset.split(0.3)
    print("Train size: ", len(train_data))
    print("Test size: ", len(test_data))

.. parsed-literal::

    Train size:  54
    Test size:
Vocabulary
----------

fastNLP's Vocabulary makes it easy to build a vocabulary and map words to indices.

.. code:: ipython3

    from fastNLP import Vocabulary

    # build the vocabulary with Vocabulary.add(word)
    vocab = Vocabulary(min_freq=2)
    train_data.apply(lambda x: [vocab.add(word) for word in x['words']])
    vocab.build_vocab()

    # index the sentences with Vocabulary.to_index(word)
    train_data.apply(lambda x: [vocab.to_index(word) for word in x['words']], new_field_name='word_seq', is_input=True)
    test_data.apply(lambda x: [vocab.to_index(word) for word in x['words']], new_field_name='word_seq', is_input=True)

    print(test_data[0])

.. parsed-literal::

    {'raw_sentence': the plot is romantic comedy boilerplate from start to finish .,
    'label': 2,
    'label_seq': 2,
    'words': ['the', 'plot', 'is', 'romantic', 'comedy', 'boilerplate', 'from', 'start', 'to', 'finish', '.'],
    'word_seq': [2, 13, 9, 24, 25, 26, 15, 27, 11, 28, 3]}
.. code:: ipython3

    # if you are working on projects such as reinforcement learning or GANs, you can also iterate the dataset directly
    from fastNLP.core.batch import Batch
    from fastNLP.core.sampler import RandomSampler

    batch_iterator = Batch(dataset=train_data, batch_size=2, sampler=RandomSampler())
    for batch_x, batch_y in batch_iterator:
        print("batch_x has: ", batch_x)
        print("batch_y has: ", batch_y)
        break

.. parsed-literal::

    batch_x has:  {'words': array([list(['this', 'kind', 'of', 'hands-on', 'storytelling', 'is', 'ultimately', 'what', 'makes', 'shanghai', 'ghetto', 'move', 'beyond', 'a', 'good', ',', 'dry', ',', 'reliable', 'textbook', 'and', 'what', 'allows', 'it', 'to', 'rank', 'with', 'its', 'worthy', 'predecessors', '.']),
           list(['the', 'entire', 'movie', 'is', 'filled', 'with', 'deja', 'vu', 'moments', '.'])],
          dtype=object), 'word_seq': tensor([[  19,  184,    6,    1,  481,    9,  206,   50,   91, 1210, 1609, 1330,
              495,    5,   63,    4, 1269,    4,    1, 1184,    7,   50, 1050,   10,
                8, 1611,   16,   21, 1039,    1,    2],
            [   3,  711,   22,    9, 1282,   16, 2482, 2483,  200,    2,    0,    0,
                0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
                0,    0,    0,    0,    0,    0,    0]])}
    batch_y has:  {'label_seq': tensor([3, 2])}
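In the ``word_seq`` tensor above, the shorter sentence is padded with 0 up to the maximum length in the batch. A minimal pure-Python sketch of this padding step (``pad_batch`` and ``pad_idx`` are illustrative names, not fastNLP API):

```python
def pad_batch(seqs, pad_idx=0):
    """Pad index sequences with pad_idx so every row has the batch's max length."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_idx] * (max_len - len(s)) for s in seqs]

batch = pad_batch([[3, 711, 22, 9], [19, 184]])
print(batch)  # [[3, 711, 22, 9], [19, 184, 0, 0]]
```

fastNLP performs this padding automatically inside Batch, using each field's padder.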
Model
-----

.. code:: ipython3

    # define a simple PyTorch model
    from fastNLP.models import CNNText
    model = CNNText(embed_num=len(vocab), embed_dim=50, num_classes=5, padding=2, dropout=0.1)
    model

.. parsed-literal::

    CNNText(
      (embed): Embedding(
        (embed): Embedding(77, 50, padding_idx=0)
        (dropout): Dropout(p=0.0)
      )
      (conv_pool): ConvMaxpool(
        (convs): ModuleList(
          (0): Conv1d(50, 3, kernel_size=(3,), stride=(1,), padding=(2,))
          (1): Conv1d(50, 4, kernel_size=(4,), stride=(1,), padding=(2,))
          (2): Conv1d(50, 5, kernel_size=(5,), stride=(1,), padding=(2,))
        )
      )
      (dropout): Dropout(p=0.1)
      (fc): Linear(
        (linear): Linear(in_features=12, out_features=5, bias=True)
      )
    )

Trainer & Tester
----------------

Use fastNLP's Trainer to train the model.
.. code:: ipython3

    from fastNLP import Trainer
    from copy import deepcopy
    from fastNLP import CrossEntropyLoss
    from fastNLP import AccuracyMetric

.. code:: ipython3

    # overfitting sanity check
    copy_model = deepcopy(model)
    overfit_trainer = Trainer(model=copy_model,
                              train_data=test_data,
                              dev_data=test_data,
                              loss=CrossEntropyLoss(pred="output", target="label_seq"),
                              metrics=AccuracyMetric(),
                              n_epochs=10,
                              save_path=None)
    overfit_trainer.train()

.. parsed-literal::

    training epochs started 2018-12-07 14:07:20

.. parsed-literal::

    HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=20), HTML(value='')), layout=Layout(display='…

.. parsed-literal::

    Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.037037
    Epoch 2/10. Step:4/20. AccuracyMetric: acc=0.296296
    Epoch 3/10. Step:6/20. AccuracyMetric: acc=0.333333
    Epoch 4/10. Step:8/20. AccuracyMetric: acc=0.555556
    Epoch 5/10. Step:10/20. AccuracyMetric: acc=0.611111
    Epoch 6/10. Step:12/20. AccuracyMetric: acc=0.481481
    Epoch 7/10. Step:14/20. AccuracyMetric: acc=0.62963
    Epoch 8/10. Step:16/20. AccuracyMetric: acc=0.685185
    Epoch 9/10. Step:18/20. AccuracyMetric: acc=0.722222
    Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.777778

.. code:: ipython3

    # instantiate a Trainer with the model and data, then train
    trainer = Trainer(model=model,
                      train_data=train_data,
                      dev_data=test_data,
                      loss=CrossEntropyLoss(pred="output", target="label_seq"),
                      metrics=AccuracyMetric(),
                      n_epochs=5)
    trainer.train()
    print('Train finished!')
.. parsed-literal::

    training epochs started 2018-12-07 14:08:10

.. parsed-literal::

    HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=5), HTML(value='')), layout=Layout(display='i…

.. parsed-literal::

    Epoch 1/5. Step:1/5. AccuracyMetric: acc=0.037037
    Epoch 2/5. Step:2/5. AccuracyMetric: acc=0.037037
    Epoch 3/5. Step:3/5. AccuracyMetric: acc=0.037037
    Epoch 4/5. Step:4/5. AccuracyMetric: acc=0.185185
    Epoch 5/5. Step:5/5. AccuracyMetric: acc=0.240741
    Train finished!

.. code:: ipython3

    from fastNLP import Tester

    tester = Tester(data=test_data, model=model, metrics=AccuracyMetric())
    acc = tester.test()

.. parsed-literal::

    [tester]
    AccuracyMetric: acc=0.240741

In Summary
----------

Pseudocode logic of the fastNLP Trainer
---------------------------------------
1. Prepare a DataSet; suppose it contains the following fields
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    ['raw_sentence', 'word_seq1', 'word_seq2', 'raw_label', 'label']
    Use DataSet.set_input('word_seq1', 'word_seq2', flag=True) to mark 'word_seq1' and 'word_seq2' as input.
    Use DataSet.set_target('label', flag=True) to mark 'label' as target.

2. Initialize the model
~~~~~~~~~~~~~~~~~~~~~~~

::

    class Model(nn.Module):
        def __init__(self):
            xxx
        def forward(self, word_seq1, word_seq2):
            # (1) the parameter names here must match the names of the input fields in the DataSet,
            #     because values are bound by parameter name
            # (2) there may be more input fields than parameters here, but never fewer
            xxxx
            # the output must be a dict

3. The Trainer's training loop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    (1) Take a batch of batch_size samples from the DataSet and call Model.forward.
    (2) Pass the result of Model.forward together with the fields marked as target into the loss.
        Different models may use different keys in forward's output dict, e.g. {'pred': xxx} or {'output': xxx};
        likewise, the target field may have different names, e.g. 'label' or 'target'.
        To handle this, our losses provide a mapping mechanism.
        For example, CrossEntropyLoss takes (pred, target) as input; if forward's output is {'output': xxx}
        and the target field is 'label', initialize the loss as CrossEntropyLoss(pred='output', target='label').
    (3) Metrics work the same way:
        metric computation also takes values from forward's output and from the target fields, resolved via the same mapping.
FAQ
---

1. Why do input and target need to be set in the DataSet?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    Only data marked as input or target is fetched during training.
    (1.1) Only fields marked as input are considered when binding arguments for Model.forward.
    (1.2) Values passed to the loss or metrics come from:
          (a) the output of Model.forward
          (b) the fields marked as target

2. Fields in the DataSet are bound to forward's parameters by parameter name
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    For example, if x and seq_lens are input fields in the DataSet, then forward should be
        def forward(self, x, seq_lens):
            pass
    Fields are matched to parameters by name.

The overall workflow
--------------------

1. Load data into a DataSet
~~~~~~~~~~~~~~~~~~~~~~~~~~~

2. Preprocess the DataSet with apply
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    (2.1) During preprocessing, mark some fields as input and some as target.

3. Build the model
~~~~~~~~~~~~~~~~~~

::

    (3.1) The parameter names of the forward function must match the names of the fields marked as input in the DataSet.
          For example, if x and seq_lens are input fields, forward should be
              def forward(self, x, seq_lens):
                  pass
          Fields are matched to parameters by name.
    (3.2) The model's forward must return a dict.
          We recommend returning {"pred": xx}.
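The name-based binding described above can be sketched in a few lines of plain Python using the standard library; ``bind_forward_args`` is an illustrative helper, not part of fastNLP:

```python
import inspect

def bind_forward_args(forward, input_fields):
    """Select, by parameter name, the input fields a forward function expects."""
    params = [p for p in inspect.signature(forward).parameters if p != 'self']
    missing = [p for p in params if p not in input_fields]
    if missing:
        # more input fields than parameters is fine; fewer is an error
        raise TypeError(f"input fields missing for parameters: {missing}")
    return {name: input_fields[name] for name in params}

def forward(x, seq_lens):
    return {"pred": x}

batch = {"x": [[1, 2], [3]], "seq_lens": [2, 1], "unused": 0}
print(bind_forward_args(forward, batch))  # {'x': [[1, 2], [3]], 'seq_lens': [2, 1]}
```

This mirrors rule (2) of "Initialize the model": extra input fields are ignored, while a parameter without a matching input field raises an error.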
fastNLP 1-Minute Tutorial
=========================

The original tutorial is at https://github.com/fastnlp/fastNLP/blob/master/tutorials/fastnlp_1min_tutorial.ipynb

step 1
------

Read the dataset.

.. code:: ipython3

    from fastNLP import DataSet

    # linux_path = "../test/data_for_tests/tutorial_sample_dataset.csv"
    win_path = "C:\\Users\zyfeng\Desktop\FudanNLP\\fastNLP\\test\\data_for_tests\\tutorial_sample_dataset.csv"
    ds = DataSet.read_csv(win_path, headers=('raw_sentence', 'label'), sep='\t')

step 2
------

Preprocess the data: 1. convert types 2. split off a validation set 3. build the vocabulary.

.. code:: ipython3

    # lowercase all letters
    ds.apply(lambda x: x['raw_sentence'].lower(), new_field_name='raw_sentence')

    # convert the labels to int
    ds.apply(lambda x: int(x['label']), new_field_name='label_seq', is_target=True)

    def split_sent(ins):
        return ins['raw_sentence'].split()
    ds.apply(split_sent, new_field_name='words', is_input=True)

.. code:: ipython3

    # split into training / validation sets
    train_data, dev_data = ds.split(0.3)
    print("Train size: ", len(train_data))
    print("Test size: ", len(dev_data))

.. parsed-literal::

    Train size:  54
    Test size:  23

.. code:: ipython3

    from fastNLP import Vocabulary
    vocab = Vocabulary(min_freq=2)
    train_data.apply(lambda x: [vocab.add(word) for word in x['words']])

    # index the sentences with Vocabulary.to_index(word)
    train_data.apply(lambda x: [vocab.to_index(word) for word in x['words']], new_field_name='word_seq', is_input=True)
    dev_data.apply(lambda x: [vocab.to_index(word) for word in x['words']], new_field_name='word_seq', is_input=True)

step 3
------

Define the model.

.. code:: ipython3

    from fastNLP.models import CNNText
    model = CNNText(embed_num=len(vocab), embed_dim=50, num_classes=5, padding=2, dropout=0.1)

step 4
------

Start training.

.. code:: ipython3

    from fastNLP import Trainer, CrossEntropyLoss, AccuracyMetric
    trainer = Trainer(model=model,
                      train_data=train_data,
                      dev_data=dev_data,
                      loss=CrossEntropyLoss(),
                      metrics=AccuracyMetric()
                      )
    trainer.train()
    print('Train finished!')

.. parsed-literal::

    training epochs started 2018-12-07 14:03:41

.. parsed-literal::

    HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=6), HTML(value='')), layout=Layout(display='i…

.. parsed-literal::

    Epoch 1/3. Step:2/6. AccuracyMetric: acc=0.26087
    Epoch 2/3. Step:4/6. AccuracyMetric: acc=0.347826
    Epoch 3/3. Step:6/6. AccuracyMetric: acc=0.608696
    Train finished!

This concludes the tutorial. See the advanced tutorial for more operations.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1,5 +0,0 @@ | |||
fastNLP 进阶教程 | |||
=============== | |||
教程原文见 https://github.com/fastnlp/fastNLP/blob/master/tutorials/fastnlp_advanced_tutorial/advance_tutorial.ipynb | |||
@@ -1,5 +0,0 @@ | |||
fastNLP 开发者指南 | |||
=============== | |||
原文见 https://github.com/fastnlp/fastNLP/blob/master/tutorials/tutorial_for_developer.md | |||
===============
Installation
===============

.. contents::
   :local:

fastNLP depends on the following packages::

    torch>=0.4.0
    numpy
    tqdm
    nltk

Installing torch may depend on your operating system and CUDA version; see the
`PyTorch website <https://pytorch.org/get-started/locally/>`_ for details.

With the dependencies in place, you can install fastNLP from the command line:

.. code:: shell

    pip install fastNLP
===============
Quickstart
===============

This is a simple classification task (data from `kaggle <https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews>`_ ):
given a piece of text, predict which of the labels 0~4 it belongs to.

We can use the :class:`~fastNLP.io.CSVLoader` class from fastNLP's io module to read our data from a csv file with ease.

.. code-block:: python

    from fastNLP.io import CSVLoader

    loader = CSVLoader(headers=('raw_sentence', 'label'), sep='\t')
    dataset = loader.load("./sample_data/tutorial_sample_dataset.csv")

The value of ``dataset[0]`` is shown below. Each sample in the dataset contains two fields,
``raw_sentence`` and ``label``, both of type ``str``::

    {'raw_sentence': A series of escapades demonstrating the adage that what is good for the
    goose is also good for the gander , some of which occasionally amuses but none of which
    amounts to much of a story . type=str,
    'label': 1 type=str}

We use the :meth:`~fastNLP.DataSet.apply` method of the :class:`~fastNLP.DataSet` class to lowercase the letters in ``raw_sentence`` and tokenize the sentences.

.. code-block:: python

    dataset.apply(lambda x: x['raw_sentence'].lower(), new_field_name='sentence')
    dataset.apply(lambda x: x['sentence'].split(), new_field_name='words', is_input=True)

Then we use the :class:`~fastNLP.Vocabulary` class to collect the words appearing in the data and convert the word sequences into numeric sequences usable for training.

.. code-block:: python

    from fastNLP import Vocabulary

    vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')
    vocab.index_dataset(dataset, field_name='words', new_field_name='words')
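Conceptually, ``Vocabulary(min_freq=2)`` keeps only words that occur at least twice and maps everything else to an unknown index. A minimal pure-Python sketch of that idea (not fastNLP's implementation; ``build_vocab`` and the index values are illustrative):

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2, specials=('<pad>', '<unk>')):
    """Map each frequent-enough word to an index; rarer words fall back to <unk>."""
    counts = Counter(w for toks in token_lists for w in toks)
    word2idx = {w: i for i, w in enumerate(specials)}
    for word, freq in counts.most_common():
        if freq >= min_freq:
            word2idx[word] = len(word2idx)
    return word2idx

sents = [["the", "plot", "is", "good"], ["the", "movie", "is", "bad"]]
vocab = build_vocab(sents)
unk = vocab['<unk>']
print([vocab.get(w, unk) for w in ["the", "plot", "is", "rare"]])  # [2, 1, 3, 1]
```

fastNLP's ``from_dataset`` / ``index_dataset`` pair does this counting and index replacement for you, directly on the DataSet.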
Meanwhile, we convert the labels, originally of type str, to integers and mark them as the training ``target``.

.. code-block:: python

    dataset.apply(lambda x: int(x['label']), new_field_name='target', is_target=True)

Now we can import fastNLP's built-in text classification model :class:`~fastNLP.models.CNNText` .

.. code-block:: python

    from fastNLP.models import CNNText

    model = CNNText((len(vocab), 50), num_classes=5, padding=2, dropout=0.1)

The network structure of :class:`~fastNLP.models.CNNText` is as follows::

    CNNText(
      (embed): Embedding(
        177, 50
        (dropout): Dropout(p=0.0)
      )
      (conv_pool): ConvMaxpool(
        (convs): ModuleList(
          (0): Conv1d(50, 3, kernel_size=(3,), stride=(1,), padding=(2,))
          (1): Conv1d(50, 4, kernel_size=(4,), stride=(1,), padding=(2,))
          (2): Conv1d(50, 5, kernel_size=(5,), stride=(1,), padding=(2,))
        )
      )
      (dropout): Dropout(p=0.1)
      (fc): Linear(in_features=12, out_features=5, bias=True)
    )

Next we use the :meth:`~fastNLP.DataSet.split` method of the :class:`~fastNLP.DataSet` class to split the dataset into
``train_data`` and ``dev_data``, used for training and validation respectively.

.. code-block:: python

    train_data, dev_data = dataset.split(0.2)

Finally we train with fastNLP's :class:`~fastNLP.Trainer`. Training requires the model ``model``, the training set ``train_data``,
the validation set ``dev_data``, a loss function ``loss``, and an evaluation metric ``metrics``.
Here the loss is fastNLP's :class:`~fastNLP.CrossEntropyLoss` and the metric is fastNLP's :class:`~fastNLP.AccuracyMetric` accuracy metric.

.. code-block:: python

    from fastNLP import Trainer, CrossEntropyLoss, AccuracyMetric

    trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data,
                      loss=CrossEntropyLoss(), metrics=AccuracyMetric())
    trainer.train()

The training output is as follows::

    input fields after batch(if batch size is 2):
        words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
    target fields after batch(if batch size is 2):
        target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])

    training epochs started 2019-05-09-10-59-39
    Evaluation at Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.333333
    Evaluation at Epoch 2/10. Step:4/20. AccuracyMetric: acc=0.533333
    Evaluation at Epoch 3/10. Step:6/20. AccuracyMetric: acc=0.533333
    Evaluation at Epoch 4/10. Step:8/20. AccuracyMetric: acc=0.533333
    Evaluation at Epoch 5/10. Step:10/20. AccuracyMetric: acc=0.6
    Evaluation at Epoch 6/10. Step:12/20. AccuracyMetric: acc=0.8
    Evaluation at Epoch 7/10. Step:14/20. AccuracyMetric: acc=0.8
    Evaluation at Epoch 8/10. Step:16/20. AccuracyMetric: acc=0.733333
    Evaluation at Epoch 9/10. Step:18/20. AccuracyMetric: acc=0.733333
    Evaluation at Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.733333
    In Epoch:6/Step:12, got best dev performance:AccuracyMetric: acc=0.8
    Reloaded the best model.

This tutorial only briefly introduces the fastNLP workflow; see :doc:`/user/tutorial_one` for a detailed walkthrough.
================
Detailed Guide
================

We use the same task as in :doc:`/user/quickstart` for a detailed walkthrough: given a piece of text,
predict which of the labels 0~4 it belongs to
(data from `kaggle <https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews>`_ ).

--------------------
Data Processing
--------------------

Reading data
~~~~~~~~~~~~

We can use the :class:`~fastNLP.io.CSVLoader` class in fastNLP's :mod:`fastNLP.io` module to read our data from a csv file with ease.
The ``dataset`` here is an object of fastNLP's :class:`~fastNLP.DataSet` class.

.. code-block:: python

    from fastNLP.io import CSVLoader

    loader = CSVLoader(headers=('raw_sentence', 'label'), sep='\t')
    dataset = loader.load("./sample_data/tutorial_sample_dataset.csv")

Besides reading data, fastNLP also provides Loader classes for other file types, loaders for embeddings, and more. See :doc:`/fastNLP.io` .

Instance and DataSet
~~~~~~~~~~~~~~~~~~~~

A fastNLP :class:`~fastNLP.DataSet` object is like a two-dimensional table: each column is a :mod:`~fastNLP.core.field`
and each row is an :mod:`~fastNLP.core.instance` . We can manually append :class:`~fastNLP.Instance` objects to a dataset.

.. code-block:: python

    from fastNLP import Instance

    dataset.append(Instance(raw_sentence='fake data', label='0'))

The value of ``dataset[-1]`` is now as follows; each sample contains two
:mod:`~fastNLP.core.field` , ``raw_sentence`` and ``label``, both of type ``str`` ::

    {'raw_sentence': fake data type=str, 'label': 0 type=str}

Modifying fields
~~~~~~~~~~~~~~~~

We use the :meth:`~fastNLP.DataSet.apply` method of the :class:`~fastNLP.DataSet` class to lowercase ``raw_sentence`` and tokenize the sentences,
and we also convert the ``label`` :mod:`~fastNLP.core.field` to integers and rename it ``target``.

.. code-block:: python

    dataset.apply(lambda x: x['raw_sentence'].lower(), new_field_name='sentence')
    dataset.apply_field(lambda x: x.split(), field_name='sentence', new_field_name='words')
    dataset.apply(lambda x: int(x['label']), new_field_name='target')

``words`` and ``target`` are already enough to train :class:`~fastNLP.models.CNNText`, but the documentation of
:class:`~fastNLP.models.CNNText` shows that :meth:`~fastNLP.models.CNNText.forward` also accepts an optional ``seq_len`` argument.
So we use :meth:`~fastNLP.DataSet.apply_field` again to add a :mod:`~fastNLP.core.field` named ``seq_len``.

.. code-block:: python

    dataset.apply_field(lambda x: len(x), field_name='words', new_field_name='seq_len')

Note that :meth:`~fastNLP.DataSet.apply_field` is similar to :meth:`~fastNLP.DataSet.apply`,
but the `lambda` it receives operates on a single :mod:`~fastNLP.core.field` of an :class:`~fastNLP.Instance`,
while the `lambda` passed to :meth:`~fastNLP.DataSet.apply` operates on the whole :class:`~fastNLP.Instance` .

.. note::
    A `lambda` is an anonymous function, an important Python feature. ``lambda x: len(x)`` behaves the same as the function below::

        def func_lambda(x):
            return len(x)

    You can also pass more complex functions to :meth:`~fastNLP.DataSet.apply_field` and :meth:`~fastNLP.DataSet.apply` .

Using Vocabulary
~~~~~~~~~~~~~~~~

We then use the :class:`~fastNLP.Vocabulary` class to collect the words appearing in the data, and use :meth:`~fastNLP.Vocabulary.index_dataset`
to convert the word sequences into numeric sequences usable for training.

.. code-block:: python

    from fastNLP import Vocabulary

    vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')
    vocab.index_dataset(dataset, field_name='words', new_field_name='words')

Splitting the dataset
~~~~~~~~~~~~~~~~~~~~~

Besides modifying :mod:`~fastNLP.core.field` , we can also split a :class:`~fastNLP.DataSet` into training, development, and test sets.
The code below shows how to use :meth:`~fastNLP.DataSet.split` (in practice it should come after the renaming and input-setting code in the next two sections).

.. code-block:: python

    train_dev_data, test_data = dataset.split(0.1)
    train_data, dev_data = train_dev_data.split(0.1)
    len(train_data), len(dev_data), len(test_data)
------------------------------
Training with Built-in Models
------------------------------

Input/output naming of built-in models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

fastNLP ships with several complete neural network models (see :doc:`/fastNLP.models` ); we train with :class:`~fastNLP.models.CNNText` here.
To use the built-in :class:`~fastNLP.models.CNNText`, we must rename the :mod:`~fastNLP.core.field` in our :class:`~fastNLP.DataSet` .
In this example the model inputs (the parameters of the forward method) are ``words`` and ``seq_len``, the prediction output is ``pred``,
and the gold answer is ``target``. See :doc:`/fastNLP.core.const` for the naming conventions.

Instead of consulting the documentation, you can also use the :class:`~fastNLP.Const` class for naming. The code below shows the
:meth:`~fastNLP.DataSet.rename_field` method for renaming a :mod:`~fastNLP.core.field` in a :class:`~fastNLP.DataSet` ,
together with the usage of the :class:`~fastNLP.Const` class.

.. code-block:: python

    from fastNLP import Const

    dataset.rename_field('words', Const.INPUT)
    dataset.rename_field('seq_len', Const.INPUT_LEN)
    dataset.rename_field('target', Const.TARGET)

After renaming the :mod:`~fastNLP.core.field` in the :class:`~fastNLP.DataSet` , we still need to declare the inputs and targets used in
training, via the :meth:`~fastNLP.DataSet.set_input` and :meth:`~fastNLP.DataSet.set_target` functions.

.. code-block:: python

    dataset.set_input(Const.INPUT, Const.INPUT_LEN)
    dataset.set_target(Const.TARGET)
Quick training
~~~~~~~~~~~~~~

Now we can import the built-in text classification model :class:`~fastNLP.models.CNNText` and train it with :class:`~fastNLP.Trainer`
(the ``loss`` and ``metrics`` used here are defined in the following two sections).

.. code-block:: python

    from fastNLP.models import CNNText
    from fastNLP import Trainer

    model_cnn = CNNText((len(vocab), 50), num_classes=5, padding=2, dropout=0.1)

    trainer = Trainer(model=model_cnn, train_data=train_data, dev_data=dev_data,
                      loss=loss, metrics=metrics)
    trainer.train()

The training output is as follows::

    input fields after batch(if batch size is 2):
        words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
    target fields after batch(if batch size is 2):
        target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])

    training epochs started 2019-05-09-10-59-39
    Evaluation at Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.333333
    Evaluation at Epoch 2/10. Step:4/20. AccuracyMetric: acc=0.533333
    Evaluation at Epoch 3/10. Step:6/20. AccuracyMetric: acc=0.533333
    Evaluation at Epoch 4/10. Step:8/20. AccuracyMetric: acc=0.533333
    Evaluation at Epoch 5/10. Step:10/20. AccuracyMetric: acc=0.6
    Evaluation at Epoch 6/10. Step:12/20. AccuracyMetric: acc=0.8
    Evaluation at Epoch 7/10. Step:14/20. AccuracyMetric: acc=0.8
    Evaluation at Epoch 8/10. Step:16/20. AccuracyMetric: acc=0.733333
    Evaluation at Epoch 9/10. Step:18/20. AccuracyMetric: acc=0.733333
    Evaluation at Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.733333
    In Epoch:6/Step:12, got best dev performance:AccuracyMetric: acc=0.8
    Reloaded the best model.
Loss function
~~~~~~~~~~~~~

Training requires a loss function. Below is the cross-entropy loss commonly used in classification. Note its **initialization parameters**:
the ``pred`` argument names a key in the dict returned by the model's forward method,
and the ``target`` argument names the :mod:`~fastNLP.core.field` in the :class:`~fastNLP.DataSet` that holds the labels.
Here we use :class:`~fastNLP.Const` to help with naming; if the return values of your own model's forward method or
the :mod:`~fastNLP.core.field` names in your dataset differ from this example, set ``pred`` and ``target`` to match your own code.

.. code-block:: python

    from fastNLP import CrossEntropyLoss

    # in this example, loss = CrossEntropyLoss() is equivalent to the line below
    loss = CrossEntropyLoss(pred=Const.OUTPUT, target=Const.TARGET)

Evaluation metric
~~~~~~~~~~~~~~~~~

Training also requires an evaluation metric; here we use accuracy. The parameter `naming rules` are the same as above:
``pred`` names a key in the dict returned by forward, and ``target`` names the label :mod:`~fastNLP.core.field` in the :class:`~fastNLP.DataSet` .

.. code-block:: python

    from fastNLP import AccuracyMetric

    # in this example, metrics = AccuracyMetric() is equivalent to the line below
    metrics = AccuracyMetric(pred=Const.OUTPUT, target=Const.TARGET)

Quick testing
~~~~~~~~~~~~~

Matching :class:`~fastNLP.Trainer`, fastNLP also provides :class:`~fastNLP.Tester` for quick testing, used as follows:

.. code-block:: python

    from fastNLP import Tester

    tester = Tester(test_data, model_cnn, metrics=AccuracyMetric())
    tester.test()
------------------------
Writing Your Own Model
------------------------

Because fastNLP is a framework built on `PyTorch <https://pytorch.org/>`_ , we can write our own neural network models as PyTorch models.
Unlike a standard PyTorch model, the forward method of a fastNLP model returns a dict that must contain at least a "pred" key,
and the parameter names of forward must match the names set with :meth:`~fastNLP.DataSet.set_input` in the :class:`~fastNLP.DataSet` .

The model is defined as follows:

.. code-block:: python

    import torch
    import torch.nn as nn

    class LSTMText(nn.Module):
        def __init__(self, vocab_size, embedding_dim, output_dim, hidden_dim=64, num_layers=2, dropout=0.5):
            super().__init__()

            self.embedding = nn.Embedding(vocab_size, embedding_dim)
            self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=True, dropout=dropout)
            self.fc = nn.Linear(hidden_dim * 2, output_dim)
            self.dropout = nn.Dropout(dropout)

        def forward(self, words):
            # (input) words : (batch_size, seq_len)
            words = words.permute(1, 0)
            # words : (seq_len, batch_size)

            embedded = self.dropout(self.embedding(words))
            # embedded : (seq_len, batch_size, embedding_dim)
            output, (hidden, cell) = self.lstm(embedded)
            # output: (seq_len, batch_size, hidden_dim * 2)
            # hidden: (num_layers * 2, batch_size, hidden_dim)
            # cell: (num_layers * 2, batch_size, hidden_dim)

            hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
            hidden = self.dropout(hidden)
            # hidden: (batch_size, hidden_dim * 2)

            pred = self.fc(hidden.squeeze(0))
            # pred: (batch_size, output_dim)
            return {"pred": pred}

The model is used the same way as the built-in :class:`~fastNLP.models.CNNText` :

.. code-block:: python

    model_lstm = LSTMText(len(vocab), 50, 5)

    trainer = Trainer(model=model_lstm, train_data=train_data, dev_data=dev_data,
                      loss=loss, metrics=metrics)
    trainer.train()

    tester = Tester(test_data, model_lstm, metrics=AccuracyMetric())
    tester.test()

.. todo::
    Writing models with :doc:`/fastNLP.modules`
-------------------------- | |||
自己编写训练过程 | |||
-------------------------- | |||
如果你想用类似 PyTorch 的使用方法,自己编写训练过程,你可以参考下面这段代码。其中使用了 fastNLP 提供的 :class:`~fastNLP.Batch` | |||
来获得小批量训练的小批量数据,使用 :class:`~fastNLP.BucketSampler` 做为 :class:`~fastNLP.Batch` 的参数来选择采样的方式。 | |||
这段代码中使用了 PyTorch 的 `torch.optim.Adam` 优化器 和 `torch.nn.CrossEntropyLoss` 损失函数,并自己计算了正确率 | |||
.. code-block:: python | |||
from fastNLP import BucketSampler | |||
from fastNLP import Batch | |||
import torch | |||
import time | |||
model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1) | |||
def train(epoch, data): | |||
optim = torch.optim.Adam(model.parameters(), lr=0.001) | |||
lossfunc = torch.nn.CrossEntropyLoss() | |||
batch_size = 32 | |||
train_sampler = BucketSampler(batch_size=batch_size, seq_len_field_name='seq_len') | |||
train_batch = Batch(batch_size=batch_size, dataset=data, sampler=train_sampler) | |||
start_time = time.time() | |||
for i in range(epoch): | |||
loss_list = [] | |||
for batch_x, batch_y in train_batch: | |||
optim.zero_grad() | |||
output = model(batch_x['words']) | |||
loss = lossfunc(output['pred'], batch_y['target']) | |||
loss.backward() | |||
optim.step() | |||
loss_list.append(loss.item()) | |||
print('Epoch {:d} Avg Loss: {:.2f}'.format(i, sum(loss_list) / len(loss_list)),end=" ") | |||
print('{:d}ms'.format(round((time.time()-start_time)*1000))) | |||
loss_list.clear() | |||
train(10, train_data) | |||
tester = Tester(test_data, model, metrics=AccuracyMetric()) | |||
tester.test() | |||
The output of this code is as follows::

    Epoch 0 Avg Loss: 2.76 17ms
    Epoch 1 Avg Loss: 2.55 29ms
    Epoch 2 Avg Loss: 2.37 41ms
    Epoch 3 Avg Loss: 2.30 53ms
    Epoch 4 Avg Loss: 2.12 65ms
    Epoch 5 Avg Loss: 2.16 76ms
    Epoch 6 Avg Loss: 1.88 88ms
    Epoch 7 Avg Loss: 1.84 99ms
    Epoch 8 Avg Loss: 1.71 111ms
    Epoch 9 Avg Loss: 1.62 122ms
    [tester]
    AccuracyMetric: acc=0.142857
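The :class:`~fastNLP.BucketSampler` used above groups instances of similar length into the same batch to reduce padding. The idea can be sketched in plain Python; this is a simplified illustration (the function name and parameters here are illustrative, not fastNLP's actual implementation):

```python
import random

def bucket_sample(seq_lens, batch_size, num_buckets=2, seed=0):
    """Return batches of indices where each batch holds sequences of
    similar length (illustrative sketch of the bucketing idea)."""
    rng = random.Random(seed)
    # sort indices by sequence length, then split the order into buckets
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i])
    bucket_size = (len(order) + num_buckets - 1) // num_buckets
    buckets = [order[i:i + bucket_size] for i in range(0, len(order), bucket_size)]
    batches = []
    for bucket in buckets:
        rng.shuffle(bucket)  # shuffle within a bucket for randomness
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    rng.shuffle(batches)  # shuffle the order in which batches are served
    return batches

lens = [3, 15, 4, 16, 5, 14]
batches = bucket_sample(lens, batch_size=2, num_buckets=2)
```

Because sorting happens before bucketing, every batch contains sequences of nearly equal length, so little padding is wasted, while the two shuffles keep the ordering stochastic.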
----------------------------------------
Enhancing the Trainer with Callbacks
----------------------------------------

If you don't want to implement the whole training loop yourself, but only want to add some functionality of your own
during training (for example, printing the total time elapsed from the start of training to the end of the current
batch), you can use the :class:`~fastNLP.Callback` class provided by fastNLP. In the example below, we implement this
feature by subclassing :class:`~fastNLP.Callback`.
.. code-block:: python

    from fastNLP import Callback

    start_time = time.time()

    class MyCallback(Callback):
        def on_epoch_end(self):
            print('Sum Time: {:d}ms\n\n'.format(round((time.time()-start_time)*1000)))


    model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
    trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data,
                      loss=CrossEntropyLoss(), metrics=AccuracyMetric(), callbacks=[MyCallback()])
    trainer.train()
The training output is as follows::

    input fields after batch(if batch size is 2):
        words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 16])
        seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
    target fields after batch(if batch size is 2):
        target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])

    training epochs started 2019-05-12-21-38-40

    Evaluation at Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.285714
    Sum Time: 51ms

    …………………………

    Evaluation at Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.857143
    Sum Time: 212ms

    In Epoch:10/Step:20, got best dev performance:AccuracyMetric: acc=0.857143
    Reloaded the best model.
This example only illustrates how to use the :class:`~fastNLP.Callback` class. Many features needed in practice
(such as negative sampling, learning rate decay and early stopping) are already implemented in fastNLP. You can
import and use them directly; see :doc:`/fastNLP.core.callback` for details.
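Under the hood, the Trainer simply invokes each registered callback's hook method at fixed points in the training loop. The mechanism can be sketched in plain Python; this is a simplified illustration of the pattern, not fastNLP's actual code (all class and method names below are illustrative):

```python
class Callback:
    # no-op hooks; subclasses override only the ones they need
    def on_epoch_begin(self): pass
    def on_epoch_end(self): pass

class Trainer:
    def __init__(self, n_epochs, callbacks):
        self.n_epochs = n_epochs
        self.callbacks = callbacks

    def _fire(self, hook):
        # forward one hook call to every registered callback
        for cb in self.callbacks:
            getattr(cb, hook)()

    def train(self):
        for _ in range(self.n_epochs):
            self._fire('on_epoch_begin')
            # ... forward pass / loss / backward / optimizer step go here ...
            self._fire('on_epoch_end')

class CountingCallback(Callback):
    def __init__(self):
        self.epochs_seen = 0
    def on_epoch_end(self):
        self.epochs_seen += 1

cb = CountingCallback()
Trainer(n_epochs=3, callbacks=[cb]).train()
```

The design keeps the training loop itself fixed while letting users inject behavior at well-defined points, which is why features like early stopping can be shipped as reusable callback classes.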
@@ -0,0 +1,5 @@ | |||
=================
Research Guide
=================

This article describes how to use fastNLP together with fitlog for research experiments.
@@ -1,3 +1,59 @@ | |||
""" | |||
fastNLP 由 :mod:`~fastNLP.core` 、 :mod:`~fastNLP.io` 、:mod:`~fastNLP.modules`、:mod:`~fastNLP.models` | |||
等子模块组成,你可以点进去查看每个模块的文档。 | |||
- :mod:`~fastNLP.core` 是fastNLP 的核心模块,包括 DataSet、 Trainer、 Tester 等组件。详见文档 :doc:`/fastNLP.core` | |||
- :mod:`~fastNLP.io` 是实现输入输出的模块,包括了数据集的读取,模型的存取等功能。详见文档 :doc:`/fastNLP.io` | |||
- :mod:`~fastNLP.modules` 包含了用于搭建神经网络模型的诸多组件,可以帮助用户快速搭建自己所需的网络。详见文档 :doc:`/fastNLP.modules` | |||
- :mod:`~fastNLP.models` 包含了一些使用 fastNLP 实现的完整网络模型,包括CNNText、SeqLabeling等常见模型。详见文档 :doc:`/fastNLP.models` | |||
fastNLP 中最常用的组件可以直接从 fastNLP 包中 import ,他们的文档如下: | |||
""" | |||
__all__ = [ | |||
"Instance", | |||
"FieldArray", | |||
"Batch", | |||
"Vocabulary", | |||
"DataSet", | |||
"Const", | |||
"Trainer", | |||
"Tester", | |||
"Callback", | |||
"GradientClipCallback", | |||
"EarlyStopCallback", | |||
"TensorboardCallback", | |||
"LRScheduler", | |||
"ControlC", | |||
"Padder", | |||
"AutoPadder", | |||
"EngChar2DPadder", | |||
"AccuracyMetric", | |||
"SpanFPreRecMetric", | |||
"SQuADMetric", | |||
"Optimizer", | |||
"SGD", | |||
"Adam", | |||
"Sampler", | |||
"SequentialSampler", | |||
"BucketSampler", | |||
"RandomSampler", | |||
"LossFunc", | |||
"CrossEntropyLoss", | |||
"L1Loss", "BCELoss", | |||
"NLLLoss", | |||
"LossInForward", | |||
"cache_results" | |||
] | |||
__version__ = '0.4.0' | |||
from .core import * | |||
from . import models | |||
from . import modules |
@@ -1 +0,0 @@ | |||
from .api import CWS, POS, Parser |
@@ -1,13 +1,30 @@ | |||
""" | |||
core 模块里实现了 fastNLP 的核心框架,常用的功能都可以从 fastNLP 包中直接 import。当然你也同样可以从 core 模块的子模块中 import, | |||
例如 Batch 组件有两种 import 的方式:: | |||
# 直接从 fastNLP 中 import | |||
from fastNLP import Batch | |||
# 从 core 模块的子模块 batch 中 import | |||
from fastNLP.core.batch import Batch | |||
对于常用的功能,你只需要在 :doc:`fastNLP` 中查看即可。如果想了解各个子模块的具体作用,您可以在下面找到每个子模块的具体文档。 | |||
.. todo:: | |||
介绍core 的子模块的分工,好像必要性不大 | |||
""" | |||
from .batch import Batch | |||
# from .dataset import DataSet | |||
from .fieldarray import FieldArray | |||
from .callback import Callback, GradientClipCallback, EarlyStopCallback, TensorboardCallback, LRScheduler, ControlC | |||
from .const import Const | |||
from .dataset import DataSet | |||
from .field import FieldArray, Padder, AutoPadder, EngChar2DPadder | |||
from .instance import Instance | |||
from .losses import LossFunc, CrossEntropyLoss, L1Loss, BCELoss, NLLLoss, LossInForward | |||
from .metrics import AccuracyMetric | |||
from .metrics import AccuracyMetric, SpanFPreRecMetric, SQuADMetric | |||
from .optimizer import Optimizer, SGD, Adam | |||
from .sampler import SequentialSampler, BucketSampler, RandomSampler, BaseSampler | |||
from .sampler import SequentialSampler, BucketSampler, RandomSampler, Sampler | |||
from .tester import Tester | |||
from .trainer import Trainer | |||
from .utils import cache_results, seq_len_to_mask | |||
from .vocabulary import Vocabulary | |||
from ..io.dataset_loader import DataSet | |||
@@ -1,28 +1,64 @@ | |||
""" | |||
batch 模块实现了 fastNLP 所需的 Batch 类。 | |||
""" | |||
__all__ = [ | |||
"Batch" | |||
] | |||
import atexit | |||
from queue import Empty, Full | |||
import numpy as np | |||
import torch | |||
from fastNLP.core.sampler import RandomSampler | |||
import torch.multiprocessing as mp | |||
class Batch(object): | |||
"""Batch is an iterable object which iterates over mini-batches. | |||
from .sampler import RandomSampler | |||
Example:: | |||
_python_is_exit = False | |||
for batch_x, batch_y in Batch(data_set, batch_size=16, sampler=SequentialSampler()): | |||
# ... | |||
:param DataSet dataset: a DataSet object | |||
:param int batch_size: the size of the batch | |||
:param Sampler sampler: a Sampler object | |||
:param bool as_numpy: If True, return Numpy array. Otherwise, return torch tensors. | |||
:param bool prefetch: If True, use multiprocessing to fetch next batch when training. | |||
:param str or torch.device device: the batch's device, if as_numpy is True, device is ignored. | |||
""" | |||
def _set_python_is_exit(): | |||
global _python_is_exit | |||
_python_is_exit = True | |||
def __init__(self, dataset, batch_size, sampler=RandomSampler(), as_numpy=False, prefetch=False): | |||
atexit.register(_set_python_is_exit) | |||
class Batch(object): | |||
""" | |||
别名::class:`fastNLP.Batch` :class:`fastNLP.core.batch.Batch` | |||
Batch 用于从 `DataSet` 中按一定的顺序, 依次按 ``batch_size`` 的大小将数据取出. | |||
组成 `x` 和 `y` | |||
Example:: | |||
batch = Batch(data_set, batch_size=16, sampler=SequentialSampler()) | |||
num_batch = len(batch) | |||
for batch_x, batch_y in batch: | |||
# do stuff ... | |||
:param dataset: :class:`~fastNLP.DataSet` 对象, 数据集 | |||
:param int batch_size: 取出的batch大小 | |||
:param sampler: 规定使用的 :class:`~fastNLP.Sampler` 方式. 若为 ``None`` , 使用 :class:`~fastNLP.RandomSampler`. | |||
Default: ``None`` | |||
:param bool as_numpy: 若为 ``True`` , 输出batch为 numpy.array. 否则为 :class:`torch.Tensor`. | |||
Default: ``False`` | |||
:param bool prefetch: 若为 ``True`` 使用多进程预先取出下一batch. | |||
Default: ``False`` | |||
""" | |||
def __init__(self, dataset, batch_size, sampler=None, as_numpy=False, prefetch=False): | |||
self.dataset = dataset | |||
self.batch_size = batch_size | |||
if sampler is None: | |||
sampler = RandomSampler() | |||
self.sampler = sampler | |||
self.as_numpy = as_numpy | |||
self.idx_list = None | |||
@@ -31,37 +67,38 @@ class Batch(object): | |||
self.cur_batch_indices = None | |||
self.prefetch = prefetch | |||
self.lengths = 0 | |||
def fetch_one(self): | |||
if self.curidx >= len(self.idx_list): | |||
return None | |||
else: | |||
endidx = min(self.curidx + self.batch_size, len(self.idx_list)) | |||
batch_x, batch_y = {}, {} | |||
indices = self.idx_list[self.curidx:endidx] | |||
self.cur_batch_indices = indices | |||
for field_name, field in self.dataset.get_all_fields().items(): | |||
if field.is_target or field.is_input: | |||
batch = field.get(indices) | |||
if not self.as_numpy and field.padder is not None: | |||
batch = to_tensor(batch, field.dtype) | |||
batch = _to_tensor(batch, field.dtype) | |||
if field.is_target: | |||
batch_y[field_name] = batch | |||
if field.is_input: | |||
batch_x[field_name] = batch | |||
self.curidx = endidx | |||
return batch_x, batch_y | |||
def __iter__(self): | |||
""" | |||
        Iterate over the dataset and fetch batch data. The fetch process does not block the iteration process.
:return: | |||
""" | |||
if self.prefetch: | |||
return run_batch_iter(self) | |||
return self._run_batch_iter(self) | |||
def batch_iter(): | |||
self.init_iter() | |||
while 1: | |||
@@ -69,21 +106,78 @@ class Batch(object): | |||
if res is None: | |||
break | |||
yield res | |||
return batch_iter() | |||
def init_iter(self): | |||
self.idx_list = self.sampler(self.dataset) | |||
self.curidx = 0 | |||
self.lengths = self.dataset.get_length() | |||
def __len__(self): | |||
return self.num_batches | |||
def get_batch_indices(self): | |||
""" | |||
取得当前batch在DataSet中所在的index下标序列 | |||
:return list(int) indexes: 下标序列 | |||
""" | |||
return self.cur_batch_indices | |||
@staticmethod | |||
def _run_fetch(batch, q): | |||
try: | |||
global _python_is_exit | |||
batch.init_iter() | |||
# print('start fetch') | |||
while 1: | |||
res = batch.fetch_one() | |||
# print('fetch one') | |||
while 1: | |||
try: | |||
q.put(res, timeout=3) | |||
break | |||
except Full: | |||
if _python_is_exit: | |||
return | |||
if res is None: | |||
# print('fetch done, waiting processing') | |||
break | |||
# print('fetch exit') | |||
except Exception as e: | |||
q.put(e) | |||
finally: | |||
q.join() | |||
@staticmethod | |||
def _run_batch_iter(batch): | |||
q = mp.JoinableQueue(maxsize=10) | |||
fetch_p = mp.Process(target=Batch._run_fetch, args=(batch, q)) | |||
fetch_p.daemon = True | |||
fetch_p.start() | |||
# print('fork fetch process') | |||
while 1: | |||
try: | |||
res = q.get(timeout=1) | |||
q.task_done() | |||
# print('get fetched') | |||
if res is None: | |||
break | |||
elif isinstance(res, Exception): | |||
raise res | |||
yield res | |||
except Empty as e: | |||
if fetch_p.is_alive(): | |||
continue | |||
else: | |||
break | |||
fetch_p.terminate() | |||
fetch_p.join() | |||
# print('iter done') | |||
def to_tensor(batch, dtype): | |||
def _to_tensor(batch, dtype): | |||
try: | |||
if dtype in (int, np.int8, np.int16, np.int32, np.int64): | |||
batch = torch.LongTensor(batch) | |||
@@ -92,42 +186,3 @@ def to_tensor(batch, dtype): | |||
except: | |||
pass | |||
return batch | |||
def run_fetch(batch, q): | |||
batch.init_iter() | |||
# print('start fetch') | |||
while 1: | |||
res = batch.fetch_one() | |||
# print('fetch one') | |||
q.put(res) | |||
if res is None: | |||
# print('fetch done, waiting processing') | |||
q.join() | |||
break | |||
# print('fetch exit') | |||
def run_batch_iter(batch): | |||
q = mp.JoinableQueue(maxsize=10) | |||
fetch_p = mp.Process(target=run_fetch, args=(batch, q)) | |||
fetch_p.daemon = True | |||
fetch_p.start() | |||
# print('fork fetch process') | |||
while 1: | |||
try: | |||
res = q.get(timeout=1) | |||
q.task_done() | |||
# print('get fetched') | |||
if res is None: | |||
break | |||
yield res | |||
except Exception as e: | |||
if fetch_p.is_alive(): | |||
continue | |||
else: | |||
break | |||
fetch_p.terminate() | |||
fetch_p.join() | |||
# print('iter done') | |||
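The prefetch path above forks a producer process that fills a bounded `JoinableQueue` while the main process consumes from it. The same producer/consumer idea can be sketched with a background thread; this is a simplified, thread-based illustration of the pattern (the function name is illustrative), not the actual multiprocessing code:

```python
import queue
import threading

def prefetch_iter(make_batches, maxsize=10):
    """Yield batches produced by a background thread; None is the end sentinel."""
    q = queue.Queue(maxsize=maxsize)

    def producer():
        for batch in make_batches():
            q.put(batch)   # blocks when the bounded queue is full
        q.put(None)        # sentinel: no more batches

    t = threading.Thread(target=producer, daemon=True)
    t.start()
    while True:
        res = q.get()
        if res is None:
            break
        yield res
    t.join()

batches = list(prefetch_iter(lambda: iter([[1, 2], [3, 4], [5]])))
```

The bounded queue is what makes this "prefetching": the producer can run at most `maxsize` batches ahead of the consumer, overlapping batch preparation with the training step without unbounded memory growth.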
@@ -1,130 +1,298 @@ | |||
r""" | |||
callback模块实现了 fastNLP 中的许多 callback 类,用于增强 :class:`~fastNLP.Trainer` 类。 | |||
虽然Trainer本身已经集成了一些功能,但仍然不足以囊括训练过程中可能需要到的功能, | |||
比如负采样,learning rate decay, Early Stop等。 | |||
为了解决这个问题fastNLP引入了callback的机制,Callback 是一种在Trainer训练过程中特定阶段会运行的函数集合。 | |||
关于Trainer的详细文档,请参见 :doc:`trainer 模块<fastNLP.core.trainer>` | |||
我们将 :meth:`~fastNLP.Trainer.train` 这个函数内部分为以下的阶段,在对应阶段会触发相应的调用::
callback.on_train_begin() # 开始进行训练 | |||
for i in range(1, n_epochs+1): | |||
callback.on_epoch_begin() # 开始新的epoch | |||
for batch_x, batch_y in Batch: | |||
callback.on_batch_begin(batch_x, batch_y, indices) # batch_x是设置为input的field,batch_y是设置为target的field | |||
获取模型输出 | |||
callback.on_loss_begin() | |||
计算loss | |||
callback.on_backward_begin() # 可以进行一些检查,比如loss是否为None | |||
反向梯度回传 | |||
callback.on_backward_end() # 进行梯度截断等 | |||
进行参数更新 | |||
callback.on_step_end() | |||
callback.on_batch_end() | |||
# 根据设置进行evaluation,比如这是本epoch最后一个batch或者达到一定step | |||
if do evaluation: | |||
callback.on_valid_begin() | |||
进行dev data上的验证 | |||
callback.on_valid_end() # 可以进行在其它数据集上进行验证 | |||
callback.on_epoch_end() # epoch结束调用 | |||
callback.on_train_end() # 训练结束 | |||
callback.on_exception() # 这是一个特殊的步骤,在训练过程中遭遇exception会跳转到这里。 | |||
如下面的例子所示,我们可以使用内置的 callback 类,或者继承 :class:`~fastNLP.core.callback.Callback` | |||
定义自己的 callback 类:: | |||
from fastNLP import Callback, EarlyStopCallback, Trainer, CrossEntropyLoss, AccuracyMetric | |||
from fastNLP.models import CNNText | |||
start_time = time.time() | |||
class MyCallback(Callback): | |||
def on_epoch_end(self): | |||
print('{:d}ms\n\n'.format(round((time.time()-start_time)*1000))) | |||
model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1) | |||
trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data, loss=CrossEntropyLoss(), | |||
metrics=AccuracyMetric(), callbacks=[MyCallback(),EarlyStopCallback(10)]) | |||
trainer.train() | |||
""" | |||
__all__ = [ | |||
"Callback", | |||
"GradientClipCallback", | |||
"EarlyStopCallback", | |||
"TensorboardCallback", | |||
"LRScheduler", | |||
"ControlC", | |||
"CallbackException", | |||
"EarlyStopError" | |||
] | |||
import os | |||
import torch | |||
from tensorboardX import SummaryWriter | |||
from fastNLP.io.model_io import ModelSaver, ModelLoader | |||
from copy import deepcopy | |||
try: | |||
from tensorboardX import SummaryWriter | |||
tensorboardX_flag = True | |||
except: | |||
tensorboardX_flag = False | |||
from ..io.model_io import ModelSaver, ModelLoader | |||
from .dataset import DataSet | |||
from .tester import Tester | |||
try: | |||
import fitlog | |||
except: | |||
pass | |||
class Callback(object): | |||
"""An Interface for all callbacks. | |||
""" | |||
别名::class:`fastNLP.Callback` :class:`fastNLP.core.callback.Callback` | |||
Any customized callback should implement at least one of the following methods. | |||
Callback是fastNLP中被设计用于增强 :class:`~fastNLP.Trainer` 的类。 | |||
如果Callback被传递给了 Trainer , 则 Trainer 会在对应的阶段调用Callback的函数, | |||
具体调用时机可以通过 :doc:`trainer 模块<fastNLP.core.trainer>` 查看。 | |||
这是Callback的基类,所有的callback必须继承自这个类 | |||
""" | |||
def __init__(self): | |||
super(Callback, self).__init__() | |||
self.trainer = None # 在Trainer内部被重新赋值 | |||
self._trainer = None # 在Trainer内部被重新赋值 | |||
@property | |||
def trainer(self): | |||
""" | |||
该属性可以通过self.trainer获取到,一般情况下不需要使用这个属性。 | |||
""" | |||
return self._trainer | |||
@property | |||
def step(self): | |||
"""当前运行到的step, 范围为[1, self.n_steps+1)""" | |||
return self._trainer.step | |||
@property | |||
def n_steps(self): | |||
"""Trainer一共会运行多少步""" | |||
return self._trainer.n_steps | |||
@property | |||
def batch_size(self): | |||
"""train和evaluate时的batch_size为多大""" | |||
return self._trainer.batch_size | |||
@property | |||
def epoch(self): | |||
"""当前运行的epoch数,范围是[1, self.n_epochs+1)""" | |||
return self._trainer.epoch | |||
@property | |||
def n_epochs(self): | |||
"""一共会运行多少个epoch""" | |||
return self._trainer.n_epochs | |||
@property | |||
def optimizer(self): | |||
"""初始化Trainer时传递的Optimizer""" | |||
return self._trainer.optimizer | |||
@property | |||
def model(self): | |||
"""正在被Trainer训练的模型""" | |||
return self._trainer.model | |||
@property | |||
def pbar(self): | |||
"""如果在Callback中需要打印内容,请使用self.pbar.write(str)。否则可能出现命令行显示效果不太好的问题。在 | |||
on_train_begin(), on_train_end(), on_exception()中请不要使用该属性,通过print输出即可。""" | |||
return self._trainer.pbar | |||
@property | |||
def update_every(self): | |||
"""Trainer中的模型多少次反向传播才进行一次梯度更新,在Trainer初始化时传入的。""" | |||
return self._trainer.update_every | |||
@property | |||
def batch_per_epoch(self): | |||
"""每个epoch一共有多少个batch,只有在on_epoch_begin之后才能调用该属性。""" | |||
return self._trainer.batch_per_epoch | |||
def on_train_begin(self): | |||
# before the main training loop | |||
pass | |||
""" | |||
在Train过程开始之前调用。 | |||
def on_epoch_begin(self, cur_epoch, total_epoch): | |||
# at the beginning of each epoch | |||
:return: | |||
""" | |||
pass | |||
def on_epoch_begin(self): | |||
""" | |||
在每个epoch开始之前调用一次 | |||
def on_batch_begin(self, batch_x, batch_y, indices): | |||
# at the beginning of each step/mini-batch | |||
:return: | |||
""" | |||
pass | |||
def on_batch_begin(self, batch_x, batch_y, indices): | |||
""" | |||
每次采集到一个batch的数据则调用一次。这里对batch_x或batch_y删除添加内容是可以影响到Trainer中内容的。所以在这一步 | |||
可以进行一些负采样之类的操作 | |||
def on_loss_begin(self, batch_y, predict_y): | |||
# after data_forward, and before loss computation | |||
:param dict batch_x: DataSet中被设置为input的field的batch。 | |||
:param dict batch_y: DataSet中被设置为target的field的batch。 | |||
:param list(int) indices: 这次采样使用到的indices,可以通过DataSet[indices]获取出这个batch采出的Instance,在一些 | |||
情况下可以帮助定位是哪个Sample导致了错误。仅在Trainer的prefetch为False时可用。 | |||
:return: | |||
""" | |||
pass | |||
def on_loss_begin(self, batch_y, predict_y): | |||
""" | |||
在计算loss前调用,即这里修改batch_y或predict_y的值是可以影响到loss计算的。 | |||
def on_backward_begin(self, loss, model): | |||
# after loss computation, and before gradient backward | |||
:param dict batch_y: 在DataSet中被设置为target的field的batch集合。 | |||
:param dict predict_y: 模型的forward()返回的结果。 | |||
:return: | |||
""" | |||
pass | |||
def on_backward_begin(self, loss): | |||
""" | |||
在loss得到之后,但在反向传播之前。可能可以进行loss是否为NaN的检查。 | |||
def on_backward_end(self, model): | |||
:param torch.Tensor loss: 计算得到的loss值 | |||
:return: | |||
""" | |||
pass | |||
def on_backward_end(self): | |||
""" | |||
反向梯度传播已完成,但由于update_every的设置,可能并不是每一次调用都有梯度。到这一步,还没有更新参数。 | |||
def on_step_end(self, optimizer): | |||
:return: | |||
""" | |||
pass | |||
def on_step_end(self): | |||
""" | |||
到这里模型的参数已经按照梯度更新。但可能受update_every影响,并不是每次都更新了。 | |||
def on_batch_end(self, *args): | |||
# at the end of each step/mini-batch | |||
:return: | |||
""" | |||
pass | |||
def on_batch_end(self): | |||
""" | |||
这一步与on_step_end是紧接着的。只是为了对称性加上了这一步。 | |||
def on_valid_begin(self): | |||
""" | |||
pass | |||
def on_valid_end(self, eval_result, metric_key, optimizer): | |||
def on_valid_begin(self): | |||
""" | |||
        每次执行验证集的evaluation后会调用。传入eval_result
如果Trainer中设置了验证,则发生验证前会调用该函数 | |||
:param eval_result: Dict[str: Dict[str: float]], evaluation的结果 | |||
:param metric_key: str | |||
:param optimizer: | |||
:return: | |||
""" | |||
pass | |||
def on_epoch_end(self, cur_epoch, n_epoch, optimizer): | |||
def on_valid_end(self, eval_result, metric_key, optimizer, is_better_eval): | |||
""" | |||
每个epoch结束将会调用该方法 | |||
每次执行验证集的evaluation后会调用。 | |||
:param cur_epoch: int, 当前的batch。从1开始。 | |||
:param n_epoch: int, 总的batch数 | |||
:param optimizer: 传入Trainer的optimizer。 | |||
:param Dict[str: Dict[str: float]] eval_result: , evaluation的结果。一个例子为{'AccuracyMetric':{'acc':1.0}},即 | |||
传入的dict是有两层,第一层是metric的名称,第二层是metric的具体指标。 | |||
:param str metric_key: 初始化Trainer时传入的metric_key。 | |||
:param torch.Optimizer optimizer: Trainer中使用的优化器。 | |||
:param bool is_better_eval: 当前dev结果是否比之前的好。 | |||
:return: | |||
""" | |||
pass | |||
def on_train_end(self, model): | |||
def on_epoch_end(self): | |||
""" | |||
每个epoch结束将会调用该方法 | |||
""" | |||
pass | |||
def on_train_end(self): | |||
""" | |||
训练结束,调用该方法 | |||
:param model: nn.Module, 传入Trainer的模型 | |||
:return: | |||
""" | |||
pass | |||
def on_exception(self, exception, model): | |||
def on_exception(self, exception): | |||
""" | |||
当训练过程出现异常,会触发该方法 | |||
:param exception: 某种类型的Exception,比如KeyboardInterrupt等 | |||
:param model: 传入Trainer的模型 | |||
:return: | |||
""" | |||
pass | |||
def transfer(func): | |||
def _transfer(func): | |||
"""装饰器,将对CallbackManager的调用转发到各个Callback子类. | |||
:param func: | |||
:return: | |||
""" | |||
def wrapper(manager, *arg): | |||
returns = [] | |||
for callback in manager.callbacks: | |||
for env_name, env_value in manager.env.items(): | |||
setattr(callback, env_name, env_value) | |||
returns.append(getattr(callback, func.__name__)(*arg)) | |||
return returns | |||
return wrapper | |||
class CallbackManager(Callback): | |||
"""A manager for all callbacks passed into Trainer. | |||
It collects resources inside Trainer and raise callbacks. | |||
""" | |||
def __init__(self, env, callbacks=None): | |||
""" | |||
内部使用的Callback管理类 | |||
:param dict env: The key is the name of the Trainer attribute(str). The value is the attribute itself. | |||
:param Callback callbacks: | |||
:param List[Callback] callbacks: | |||
""" | |||
super(CallbackManager, self).__init__() | |||
# set attribute of trainer environment | |||
self.env = env | |||
self.callbacks = [] | |||
if callbacks is not None: | |||
if isinstance(callbacks, list): | |||
@@ -135,108 +303,87 @@ class CallbackManager(Callback): | |||
raise TypeError(f"Expect sub-classes of Callback. Got {type(obj)}") | |||
else: | |||
raise TypeError(f"Expect callbacks in CallbackManager(callbacks) to be list. Got {type(callbacks)}.") | |||
@transfer | |||
for env_name, env_val in env.items(): | |||
for callback in self.callbacks: | |||
setattr(callback, '_' + env_name, env_val) # Callback.trainer | |||
@_transfer | |||
def on_train_begin(self): | |||
pass | |||
@transfer | |||
def on_epoch_begin(self, cur_epoch, total_epoch): | |||
@_transfer | |||
def on_epoch_begin(self): | |||
pass | |||
@transfer | |||
@_transfer | |||
def on_batch_begin(self, batch_x, batch_y, indices): | |||
pass | |||
@transfer | |||
@_transfer | |||
def on_loss_begin(self, batch_y, predict_y): | |||
pass | |||
@transfer | |||
def on_backward_begin(self, loss, model): | |||
@_transfer | |||
def on_backward_begin(self, loss): | |||
pass | |||
@transfer | |||
def on_backward_end(self, model): | |||
@_transfer | |||
def on_backward_end(self): | |||
pass | |||
@transfer | |||
def on_step_end(self, optimizer): | |||
@_transfer | |||
def on_step_end(self): | |||
pass | |||
@transfer | |||
@_transfer | |||
def on_batch_end(self): | |||
pass | |||
@transfer | |||
@_transfer | |||
def on_valid_begin(self): | |||
pass | |||
@transfer | |||
def on_valid_end(self, eval_result, metric_key, optimizer): | |||
@_transfer | |||
def on_valid_end(self, eval_result, metric_key, optimizer, is_better_eval): | |||
pass | |||
@transfer | |||
def on_epoch_end(self, cur_epoch, n_epoch, optimizer): | |||
@_transfer | |||
def on_epoch_end(self): | |||
pass | |||
@transfer | |||
def on_train_end(self, model): | |||
@_transfer | |||
def on_train_end(self): | |||
pass | |||
@transfer | |||
def on_exception(self, exception, model): | |||
@_transfer | |||
def on_exception(self, exception): | |||
pass | |||
class DummyCallback(Callback): | |||
def on_train_begin(self, *arg): | |||
print(arg) | |||
def on_epoch_end(self, cur_epoch, n_epoch, optimizer): | |||
print(cur_epoch, n_epoch, optimizer) | |||
class EchoCallback(Callback): | |||
def on_train_begin(self): | |||
print("before_train") | |||
def on_epoch_begin(self, cur_epoch, total_epoch): | |||
print("before_epoch") | |||
def on_batch_begin(self, batch_x, batch_y, indices): | |||
print("before_batch") | |||
def on_loss_begin(self, batch_y, predict_y): | |||
print("before_loss") | |||
def on_backward_begin(self, loss, model): | |||
print("before_backward") | |||
def on_batch_end(self): | |||
print("after_batch") | |||
class GradientClipCallback(Callback): | |||
""" | |||
别名::class:`fastNLP.GradientClipCallback` :class:`fastNLP.core.callback.GradientClipCallback` | |||
def on_epoch_end(self, cur_epoch, n_epoch, optimizer): | |||
print("after_epoch") | |||
每次backward前,将parameter的gradient clip到某个范围。 | |||
def on_train_end(self, model): | |||
print("after_train") | |||
:param None,torch.Tensor,List[torch.Tensor] parameters: 一般通过model.parameters()获得。如果为None则默认对Trainer | |||
的model中所有参数进行clip | |||
:param float clip_value: 将gradient 限制到[-clip_value, clip_value]。clip_value应该为正数 | |||
    :param str clip_type: 支持'norm', 'value'两种::

        1 'norm', 将所有参数的gradient整体rescale,使梯度的总范数不超过clip_value
        2 'value', 将gradient限制在[-clip_value, clip_value], 小于-clip_value的gradient被赋值为-clip_value;
          大于clip_value的gradient被赋值为clip_value.
class GradientClipCallback(Callback): | |||
""" | |||
def __init__(self, parameters=None, clip_value=1, clip_type='norm'): | |||
"""每次backward前,将parameter的gradient clip到某个范围。 | |||
:param parameters: None, torch.Tensor或List[torch.Tensor], 一般通过model.parameters()获得。如果为None则默认对Trainer | |||
的model中所有参数进行clip | |||
:param clip_value: float, 将gradient 限制到[-clip_value, clip_value]。clip_value应该为正数 | |||
:param clip_type: str, 支持'norm', 'value'两种。 | |||
(1) 'norm', 将gradient的norm rescale到[-clip_value, clip_value] | |||
(2) 'value', 将gradient限制在[-clip_value, clip_value], 小于-clip_value的gradient被赋值为-clip_value; 大于 | |||
clip_value的gradient被赋值为clip_value. | |||
""" | |||
super().__init__() | |||
from torch import nn | |||
if clip_type == 'norm': | |||
self.clip_fun = nn.utils.clip_grad_norm_ | |||
@@ -246,36 +393,30 @@ class GradientClipCallback(Callback): | |||
raise ValueError("Only supports `norm` or `value` right now.") | |||
self.parameters = parameters | |||
self.clip_value = clip_value | |||
def on_backward_end(self, model): | |||
self.clip_fun(model.parameters(), self.clip_value) | |||
class CallbackException(BaseException): | |||
def __init__(self, msg): | |||
super(CallbackException, self).__init__(msg) | |||
class EarlyStopError(CallbackException): | |||
def __init__(self, msg): | |||
super(EarlyStopError, self).__init__(msg) | |||
def on_backward_end(self): | |||
if self.parameters is None: | |||
self.clip_fun(self.model.parameters(), self.clip_value) | |||
else: | |||
self.clip_fun(self.parameters, self.clip_value) | |||
class EarlyStopCallback(Callback): | |||
def __init__(self, patience): | |||
""" | |||
""" | |||
别名::class:`fastNLP.EarlyStopCallback` :class:`fastNLP.core.callback.EarlyStopCallback` | |||
多少个epoch没有变好就停止训练,相关类 :class:`EarlyStopError` | |||
:param int patience: 停止之前等待的epoch数 | |||
""" | |||
:param int patience: epoch的数量 | |||
""" | |||
def __init__(self, patience): | |||
super(EarlyStopCallback, self).__init__() | |||
self.trainer = None # override by CallbackManager | |||
self.patience = patience | |||
self.wait = 0 | |||
self.epoch = 0 | |||
def on_valid_end(self, eval_result, metric_key, optimizer): | |||
self.epoch += 1 | |||
if not self.trainer._better_eval_result(eval_result): | |||
def on_valid_end(self, eval_result, metric_key, optimizer, is_better_eval): | |||
if not is_better_eval: | |||
# current result is getting worse | |||
if self.wait == self.patience: | |||
raise EarlyStopError("Early stopping raised.") | |||
@@ -283,44 +424,135 @@ class EarlyStopCallback(Callback): | |||
self.wait += 1 | |||
else: | |||
self.wait = 0 | |||
def on_exception(self, exception, model): | |||
def on_exception(self, exception): | |||
if isinstance(exception, EarlyStopError): | |||
print("Early Stopping triggered in epoch {}!".format(self.epoch)) | |||
else: | |||
raise exception # 抛出陌生Error | |||
class FitlogCallback(Callback): | |||
""" | |||
别名: :class:`fastNLP.FitlogCallback` :class:`fastNLP.core.callback.FitlogCallback` | |||
该callback将loss和progress自动写入到fitlog中; 如果Trainer有dev的数据,将自动把dev的结果写入到log中; 同时还支持传入 | |||
一个(或多个)test数据集进行测试(只有在trainer具有dev时才能使用),每次在dev上evaluate之后会在这些数据集上验证一下。 | |||
并将验证结果写入到fitlog中。这些数据集的结果是根据dev上最好的结果报道的,即如果dev在第3个epoch取得了最佳,则 | |||
fitlog中记录的关于这些数据集的结果就是来自第三个epoch的结果。 | |||
:param DataSet,dict(DataSet) data: 传入DataSet对象,会使用多个Trainer中的metric对数据进行验证。如果需要传入多个 | |||
DataSet请通过dict的方式传入,dict的key将作为对应dataset的name传递给fitlog。若tester不为None时,data需要通过 | |||
dict的方式传入。如果仅传入DataSet, 则被命名为test | |||
:param Tester tester: Tester对象,将在on_valid_end时调用。tester中的DataSet会被称为为`test` | |||
:param int verbose: 是否在终端打印内容,0不打印 | |||
:param bool log_exception: fitlog是否记录发生的exception信息 | |||
""" | |||
def __init__(self, data=None, tester=None, verbose=0, log_exception=False): | |||
super().__init__() | |||
self.datasets = {} | |||
self.testers = {} | |||
self._log_exception = log_exception | |||
if tester is not None: | |||
assert isinstance(tester, Tester), "Only fastNLP.Tester allowed." | |||
assert isinstance(data, dict) or data is None, "If tester is not None, only dict[DataSet] allowed for data." | |||
if data is not None: | |||
assert 'test' not in data, "Cannot use `test` as DataSet key, when tester is passed." | |||
setattr(tester, 'verbose', 0) | |||
self.testers['test'] = tester | |||
if isinstance(data, dict): | |||
for key, value in data.items(): | |||
assert isinstance(value, DataSet), f"Only DataSet object is allowed, not {type(value)}." | |||
for key, value in data.items(): | |||
self.datasets[key] = value | |||
elif isinstance(data, DataSet): | |||
self.datasets['test'] = data | |||
else: | |||
raise TypeError("data receives dict[DataSet] or DataSet object.") | |||
self.verbose = verbose | |||
def on_train_begin(self): | |||
if (len(self.datasets)>0 or len(self.testers)>0 ) and self.trainer.dev_data is None: | |||
raise RuntimeError("Trainer has no dev data, you cannot pass extra data to do evaluation.") | |||
if len(self.datasets)>0: | |||
for key, data in self.datasets.items(): | |||
tester = Tester(data=data, model=self.model, batch_size=self.batch_size, metrics=self.trainer.metrics, | |||
verbose=0) | |||
self.testers[key] = tester | |||
fitlog.add_progress(total_steps=self.n_steps) | |||
def on_backward_begin(self, loss): | |||
fitlog.add_loss(loss.item(), name='loss', step=self.step, epoch=self.epoch) | |||
def on_valid_end(self, eval_result, metric_key, optimizer, better_result): | |||
if better_result: | |||
eval_result = deepcopy(eval_result) | |||
eval_result['step'] = self.step | |||
eval_result['epoch'] = self.epoch | |||
fitlog.add_best_metric(eval_result) | |||
fitlog.add_metric(eval_result, step=self.step, epoch=self.epoch) | |||
if len(self.testers)>0: | |||
for key, tester in self.testers.items(): | |||
try: | |||
eval_result = tester.test() | |||
if self.verbose!=0: | |||
self.pbar.write("Evaluation on DataSet {}:".format(key)) | |||
self.pbar.write(tester._format_eval_results(eval_result)) | |||
fitlog.add_metric(eval_result, name=key, step=self.step, epoch=self.epoch) | |||
if better_result: | |||
fitlog.add_best_metric(eval_result, name=key) | |||
except Exception: | |||
self.pbar.write("Exception happens when evaluate on DataSet named `{}`.".format(key)) | |||
def on_train_end(self): | |||
fitlog.finish() | |||
def on_exception(self, exception): | |||
fitlog.finish(status=1) | |||
if self._log_exception: | |||
fitlog.add_other(str(exception), name='except_info') | |||
class LRScheduler(Callback): | |||
def __init__(self, lr_scheduler): | |||
"""对PyTorch LR Scheduler的包装 | |||
""" | |||
别名::class:`fastNLP.LRScheduler` :class:`fastNLP.core.callback.LRScheduler` | |||
:param lr_scheduler: PyTorch的lr_scheduler | |||
""" | |||
对PyTorch LR Scheduler的包装以使得其可以被Trainer所使用 | |||
:param torch.optim.lr_scheduler._LRScheduler lr_scheduler: PyTorch的lr_scheduler | |||
""" | |||
def __init__(self, lr_scheduler): | |||
super(LRScheduler, self).__init__() | |||
import torch.optim | |||
if isinstance(lr_scheduler, torch.optim.lr_scheduler._LRScheduler): | |||
self.scheduler = lr_scheduler | |||
else: | |||
raise ValueError(f"Expect torch.optim.lr_scheduler for LRScheduler. Got {type(lr_scheduler)}.") | |||
def on_epoch_begin(self, cur_epoch, total_epoch): | |||
self.scheduler.step() | |||
print("scheduler step ", "lr=", self.trainer.optimizer.param_groups[0]["lr"]) | |||
def on_epoch_begin(self): | |||
self.scheduler.step(self.epoch) | |||
class ControlC(Callback):
    """
    Alias :class:`fastNLP.ControlC` :class:`fastNLP.core.callback.ControlC`

    :param bool quit_all: if True, exit the whole program when Ctrl+C is detected; otherwise only quit the Trainer
    """

    def __init__(self, quit_all):
        super(ControlC, self).__init__()
        if type(quit_all) != bool:
            raise ValueError("In KeyBoardInterrupt, quit_all argument must be a bool.")
        self.quit_all = quit_all

    def on_exception(self, exception):
        if isinstance(exception, KeyboardInterrupt):
            if self.quit_all is True:
                import sys
@@ -335,7 +567,7 @@ class SmoothValue(object): | |||
    def __init__(self, beta: float):
        self.beta, self.n, self.mov_avg = beta, 0, 0
        self.smooth = None

    def add_value(self, val: float) -> None:
        "Add `val` to calculate updated smoothed value."
        self.n += 1
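The body of `add_value` is elided by the diff hunk below. In the fastai-style LR finder this class is adapted from, the smoothing is a debiased exponential moving average; the sketch below assumes that formula (`SmoothValueSketch` is an illustrative name, not the class in this file):

```python
class SmoothValueSketch:
    """Debiased exponential moving average (assumed SmoothValue behavior)."""

    def __init__(self, beta: float):
        self.beta, self.n, self.mov_avg = beta, 0, 0
        self.smooth = None

    def add_value(self, val: float) -> None:
        # standard EMA update, then divide by (1 - beta^n) to remove the
        # zero-initialization bias of the first few steps
        self.n += 1
        self.mov_avg = self.beta * self.mov_avg + (1 - self.beta) * val
        self.smooth = self.mov_avg / (1 - self.beta ** self.n)

sv = SmoothValueSketch(0.8)
sv.add_value(1.0)
# debiasing makes the first smoothed value equal the first input
assert abs(sv.smooth - 1.0) < 1e-9
```

With this debiasing, `smooth` tracks recent losses without the cold-start lag a plain moving average would show.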
@@ -344,48 +576,58 @@ class SmoothValue(object): | |||
class LRFinder(Callback):
    """
    Alias :class:`fastNLP.LRFinder` :class:`fastNLP.core.callback.LRFinder`

    Uses the first epoch to search for the best learning rate, which is applied from the second epoch on

    :param float start_lr: lower bound of the learning rate
    :param float end_lr: upper bound of the learning rate
    """

    def __init__(self, start_lr=1e-6, end_lr=10):
        super(LRFinder, self).__init__()
        self.start_lr, self.end_lr = start_lr, end_lr

        self.stop = False
        self.best_loss = 0.
        self.best_lr = None
        self.loss_history = []
        self.smooth_value = SmoothValue(0.8)
        self.opt = None
        self.find = None
        self.loader = ModelLoader()

    @property
    def lr_gen(self):
        scale = (self.end_lr - self.start_lr) / self.batch_per_epoch
        return (self.start_lr + scale * (step + 1) for step in range(self.batch_per_epoch))

    @property
    def num_it(self):
        return self.batch_per_epoch

    def on_epoch_begin(self):
        if self.epoch == 1:  # first epoch
            self.opt = self.trainer.optimizer  # pytorch optimizer
            self.opt.param_groups[0]["lr"] = self.start_lr
            # save model
            ModelSaver("tmp").save_pytorch(self.trainer.model, param_only=True)
            self.find = True
    def on_backward_begin(self, loss):
        if self.find:
            if torch.isnan(loss) or self.stop is True:
                self.stop = True
                return
            loss_val = loss.detach().mean().item()
            self.loss_history.append(loss_val)
            self.smooth_value.add_value(loss_val)
            if self.best_loss == 0. or self.smooth_value.smooth < self.best_loss:
                self.best_loss = self.smooth_value.smooth
                self.best_lr = self.opt.param_groups[0]["lr"]
    def on_batch_end(self, *args):
        if self.find:
            lr = next(self.lr_gen, None)
@@ -394,24 +636,31 @@ class LRFinder(Callback): | |||
                return
            self.opt.param_groups[0]["lr"] = lr
            # self.loader.load_pytorch(self.trainer.model, "tmp")
    def on_epoch_end(self):
        if self.epoch == 1:  # first epoch
            self.opt.param_groups[0]["lr"] = self.best_lr
            self.find = False
            # reset model
            ModelLoader().load_pytorch(self.trainer.model, "tmp")
            self.pbar.write("Model reset. \nFind best lr={}".format(self.best_lr))
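The schedule that `lr_gen` explores is a linear ramp from `start_lr` to `end_lr` across the batches of the first epoch; the standalone arithmetic looks like this:

```python
# Linear LR ramp used by LRFinder's lr_gen property: one value per batch,
# reaching end_lr exactly at the last batch of the first epoch
start_lr, end_lr, batch_per_epoch = 1e-6, 10.0, 5
scale = (end_lr - start_lr) / batch_per_epoch
lrs = [start_lr + scale * (step + 1) for step in range(batch_per_epoch)]
assert len(lrs) == batch_per_epoch
assert abs(lrs[-1] - end_lr) < 1e-9  # last batch hits the upper bound
```

Each batch the callback applies the next value from this ramp and records the smoothed loss; the LR with the lowest smoothed loss becomes `best_lr`.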
class TensorboardCallback(Callback):
    """
    Alias :class:`fastNLP.TensorboardCallback` :class:`fastNLP.core.callback.TensorboardCallback`

    Accepts one or more of the following strings as arguments:
    - "model"
    - "loss"
    - "metric"

    .. warning::
        fastNLP has stopped maintaining this feature; please wait for the next fastNLP release compatible with
        PyTorch 1.1, or use fitlog, which integrates tightly with fastNLP (see :doc:`/user/with_fitlog`).
    """
    def __init__(self, *options):
        super(TensorboardCallback, self).__init__()
        args = {"model", "loss", "metric"}
@@ -421,15 +670,18 @@ class TensorboardCallback(Callback): | |||
        self.options = options
        self._summary_writer = None
        self.graph_added = False
    def on_train_begin(self):
        save_dir = self.trainer.save_path
        if save_dir is None:
            path = os.path.join("./", 'tensorboard_logs_{}'.format(self.trainer.start_time))
        else:
            path = os.path.join(save_dir, 'tensorboard_logs_{}'.format(self.trainer.start_time))
        if tensorboardX_flag:
            self._summary_writer = SummaryWriter(path)
        else:
            self._summary_writer = None
    def on_batch_begin(self, batch_x, batch_y, indices):
        if "model" in self.options and self.graph_added is False:
            # tensorboardX has a serious bug here; drawing the model graph is not possible for now
@@ -439,37 +691,53 @@ class TensorboardCallback(Callback): | |||
            # args = args[0] if len(args) == 1 else args
            # self._summary_writer.add_graph(self.trainer.model, torch.zeros(32, 2))
            self.graph_added = True
    def on_backward_begin(self, loss):
        if "loss" in self.options and self._summary_writer:
            self._summary_writer.add_scalar("loss", loss.item(), global_step=self.trainer.step)

        if "model" in self.options and self._summary_writer:
            for name, param in self.trainer.model.named_parameters():
                if param.requires_grad:
                    self._summary_writer.add_scalar(name + "_mean", param.mean(), global_step=self.trainer.step)
                    # self._summary_writer.add_scalar(name + "_std", param.std(), global_step=self.trainer.step)
                    self._summary_writer.add_scalar(name + "_grad_mean", param.grad.mean(),
                                                    global_step=self.trainer.step)
    def on_valid_end(self, eval_result, metric_key, optimizer, is_better_eval):
        if "metric" in self.options and self._summary_writer:
            for name, metric in eval_result.items():
                for metric_key, metric_val in metric.items():
                    self._summary_writer.add_scalar("valid_{}_{}".format(name, metric_key), metric_val,
                                                    global_step=self.trainer.step)
    def on_train_end(self):
        if self._summary_writer:
            self._summary_writer.close()
            del self._summary_writer

    def on_exception(self, exception):
        if hasattr(self, "_summary_writer"):
            self._summary_writer.close()
            del self._summary_writer
if __name__ == "__main__":
    manager = CallbackManager(env={"n_epoch": 3}, callbacks=[DummyCallback(), DummyCallback()])
    manager.on_train_begin(10, 11, 12)
    # print(manager.after_epoch())
class CallbackException(BaseException):
    """
    To break out of training from a callback, raise a CallbackException and catch it in on_exception.

    :param str msg: the exception message.
    """

    def __init__(self, msg):
        super(CallbackException, self).__init__(msg)


class EarlyStopError(CallbackException):
    """
    Used to break out of the Trainer's training loop when early stopping.
    """

    def __init__(self, msg):
        super(EarlyStopError, self).__init__(msg)
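The intended control flow — a callback raises `EarlyStopError`, and the training loop catches any `CallbackException` subclass to stop cleanly — can be sketched as follows (the loop shown is a stand-in, not the actual Trainer code):

```python
# Self-contained sketch of the exception-based early-stop control flow.
class CallbackException(BaseException):
    def __init__(self, msg):
        super().__init__(msg)

class EarlyStopError(CallbackException):
    pass

def training_loop(patience_exceeded):
    # stand-in for the Trainer loop: a callback raises, the loop catches
    try:
        if patience_exceeded:
            raise EarlyStopError("early stopping")
        return "finished"
    except CallbackException as e:
        return "stopped: {}".format(e)

assert training_loop(True) == "stopped: early stopping"
```

Deriving from `BaseException` rather than `Exception` keeps these control-flow exceptions from being swallowed by generic `except Exception` blocks inside user code.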
@@ -0,0 +1,59 @@ | |||
class Const:
    """
    Naming constants for fastNLP fields.

    .. todo::
        turn the list below into a table

    Full list::

        INPUT       sequence input of the model     words   (numbered: words1, words2)
        CHAR_INPUT  character input of the model    chars   (numbered: chars1, chars2)
        INPUT_LEN   sequence length                 seq_len (numbered: seq_len1, seq_len2)
        OUTPUT      model output                    pred    (numbered: pred1, pred2)
        TARGET      ground-truth target             target  (numbered: target1, target2)
        LOSS        loss                            loss    (numbered: loss1, loss2)
    """
    INPUT = 'words'
    CHAR_INPUT = 'chars'
    INPUT_LEN = 'seq_len'
    OUTPUT = 'pred'
    TARGET = 'target'
    LOSS = 'loss'

    @staticmethod
    def INPUTS(i):
        """Name of the i-th ``INPUT``"""
        i = int(i) + 1
        return Const.INPUT + str(i)

    @staticmethod
    def CHAR_INPUTS(i):
        """Name of the i-th ``CHAR_INPUT``"""
        i = int(i) + 1
        return Const.CHAR_INPUT + str(i)

    @staticmethod
    def INPUT_LENS(i):
        """Name of the i-th ``INPUT_LEN``"""
        i = int(i) + 1
        return Const.INPUT_LEN + str(i)

    @staticmethod
    def OUTPUTS(i):
        """Name of the i-th ``OUTPUT``"""
        i = int(i) + 1
        return Const.OUTPUT + str(i)

    @staticmethod
    def TARGETS(i):
        """Name of the i-th ``TARGET``"""
        i = int(i) + 1
        return Const.TARGET + str(i)

    @staticmethod
    def LOSSES(i):
        """Name of the i-th ``LOSS``"""
        i = int(i) + 1
        return Const.LOSS + str(i)
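One detail worth calling out: the argument to `INPUTS` etc. is 0-indexed, while the produced field names carry 1-based suffixes. A minimal reproduction of that behavior (`ConstSketch` is an illustrative stand-in for the class above):

```python
class ConstSketch:
    # minimal reproduction of Const's naming scheme
    INPUT = 'words'

    @staticmethod
    def INPUTS(i):
        """Name of the i-th INPUT field: 0-indexed argument, 1-based suffix."""
        return ConstSketch.INPUT + str(int(i) + 1)

print(ConstSketch.INPUTS(0))  # words1
```

Using these constants instead of string literals keeps field names consistent between DataSet, model `forward` signatures, and metrics.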
@@ -1,121 +1,37 @@ | |||
"""
The field module implements FieldArray and several Padders. A FieldArray is how one column of a
:class:`~fastNLP.DataSet` is stored; for the underlying design, see :doc:`fastNLP.core.dataset`
"""
__all__ = [
    "FieldArray",
    "Padder",
    "AutoPadder",
    "EngChar2DPadder"
]

from copy import deepcopy

import numpy as np
class FieldArray(object):
    """
    Alias :class:`fastNLP.FieldArray` :class:`fastNLP.core.field.FieldArray`

    FieldArray is the type used to store one field of a :class:`~fastNLP.DataSet`.

    :param str name: name of the FieldArray
    :param list,numpy.ndarray content: the list's elements may be list, int or float
    :param bool is_target: whether this field is a target field.
    :param bool is_input: whether this field is an input field.
    :param padder: a :class:`~fastNLP.Padder` object. The padder assigned to a FieldArray is deepcopied; to change a
        padder parameter you must go through fieldarray.set_pad_val(). Defaults to None, i.e. use
        :class:`~fastNLP.AutoPadder`.
    :param bool ignore_type: whether to ignore the type of this field. Can be set to True if the field does not need
        to be converted to torch.FloatTensor or torch.LongTensor. See :class:`~fastNLP.DataSet` for the exact meaning.
    """

    def __init__(self, name, content, is_target=None, is_input=None, padder=None, ignore_type=False):
        self.name = name
        if isinstance(content, list):
            # if the DataSet is initialized from a dict, content may be a 2-d list / 2-d array / 3-d list
@@ -132,30 +48,37 @@ class FieldArray(object): | |||
            raise TypeError("content in FieldArray can only be list or numpy.ndarray, got {}.".format(type(content)))
        if len(content) == 0:
            raise RuntimeError("Cannot initialize FieldArray with empty list.")
        self.content = content  # 1-d, 2-d or 3-d list; shapes may be ragged
        self.content_dim = None  # dimensionality of the content list

        if padder is None:
            padder = AutoPadder(pad_val=0)
        else:
            assert isinstance(padder, Padder), "padder must be of type Padder."
            padder = deepcopy(padder)
        self.set_padder(padder)
        self.ignore_type = ignore_type

        self.BASIC_TYPES = (int, float, str)  # accepted basic Python types in content (np.ndarray not included here)
        self.pytype = None
        self.dtype = None
        self._is_input = None
        self._is_target = None

        if is_input is not None or is_target is not None:
            self.is_input = is_input
            self.is_target = is_target

    def _set_dtype(self):
        if self.ignore_type is False:
            self.pytype = self._type_detection(self.content)
            self.dtype = self._map_to_np_type(self.pytype)
    @property
    def is_input(self):
        return self._is_input

    @is_input.setter
    def is_input(self, value):
        """
@@ -164,33 +87,34 @@ class FieldArray(object): | |||
        if value is True:
            self._set_dtype()
        self._is_input = value

    @property
    def is_target(self):
        return self._is_target
    @is_target.setter
    def is_target(self, value):
        """
        Called when field_array.is_target is set to True / False
        """
        if value is True:
            self._set_dtype()
        self._is_target = value
    def _type_detection(self, content):
        """
        Called when this field is set as is_input or is_target
        """
        if len(content) == 0:
            raise RuntimeError("Empty list in Field {}.".format(self.name))

        type_set = set([type(item) for item in content])
        if list in type_set:
            if len(type_set) > 1:
                # list mixed with non-list
                raise RuntimeError("Mixed data types in Field {}: {}".format(self.name, list(type_set)))
            # list with more than one dimension
            inner_type_set = set()
            for l in content:
@@ -213,7 +137,7 @@ class FieldArray(object): | |||
                    return self._basic_type_detection(inner_inner_type_set)
                else:
                    # list mixed with non-list
                    raise RuntimeError("Mixed data types in Field {}: {}".format(self.name, list(inner_type_set)))
        else:
            # 1-d list
            for content_type in type_set:
@@ -222,7 +146,7 @@ class FieldArray(object): | |||
                        self.name, self.BASIC_TYPES, content_type))
            self.content_dim = 1
            return self._basic_type_detection(type_set)
    def _basic_type_detection(self, type_set):
        """
        :param type_set: a set of Python types
@@ -237,21 +161,21 @@ class FieldArray(object): | |||
                return float
            else:
                # str mixed with int or float
                raise RuntimeError("Mixed data types in Field {}: {}".format(self.name, list(type_set)))
        else:
            # str, int and float mixed together
            raise RuntimeError("Mixed data types in Field {}: {}".format(self.name, list(type_set)))

    def _1d_list_check(self, val):
        """Raise an error if val is not a 1-d list
        """
        type_set = set((type(obj) for obj in val))
        if any(obj not in self.BASIC_TYPES for obj in type_set):
            raise ValueError("Mixed data types in Field {}: {}".format(self.name, list(type_set)))
        self._basic_type_detection(type_set)
        # otherwise: _basic_type_detection will raise error
        return True

    def _2d_list_check(self, val):
        """Raise an error if val is not a 2-d list
        """
@@ -264,110 +188,132 @@ class FieldArray(object): | |||
                    inner_type_set.add(type(obj))
        self._basic_type_detection(inner_type_set)
        return True

    @staticmethod
    def _map_to_np_type(basic_type):
        type_mapping = {int: np.int64, float: np.float64, str: np.str, np.ndarray: np.ndarray}
        return type_mapping[basic_type]

    def __repr__(self):
        return "FieldArray {}: {}".format(self.name, self.content.__repr__())
    def append(self, val):
        """
        Append val to the end of this field. If the field has already been set as input or target, the type of val
        is checked against the existing content before appending.

        :param Any val: the value to append.
        """
        if self.ignore_type is False:
            if isinstance(val, list):
                pass
            elif isinstance(val, tuple):  # make sure the outermost container is a list
                val = list(val)
            elif isinstance(val, np.ndarray):
                val = val.tolist()
            elif any((isinstance(val, t) for t in self.BASIC_TYPES)):
                pass
            else:
                raise RuntimeError(
                    "Unexpected data type {}. Should be list, np.array, or {}".format(type(val), self.BASIC_TYPES))

            if self.is_input is True or self.is_target is True:
                if type(val) == list:
                    if len(val) == 0:
                        raise ValueError("Cannot append an empty list.")
                    if self.content_dim == 2 and self._1d_list_check(val):
                        # 1-d list check
                        pass
                    elif self.content_dim == 3 and self._2d_list_check(val):
                        # 2-d list check
                        pass
                    else:
                        raise RuntimeError(
                            "Dimension not matched: expect dim={}, got {}.".format(self.content_dim - 1, val))
                elif type(val) in self.BASIC_TYPES and self.content_dim == 1:
                    # scalar check
                    if type(val) == float and self.pytype == int:
                        self.pytype = float
                        self.dtype = self._map_to_np_type(self.pytype)
                else:
                    raise RuntimeError(
                        "Unexpected data type {}. Should be list, np.array, or {}".format(type(val), self.BASIC_TYPES))
        self.content.append(val)
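The dimension check inside `append` enforces that a field whose stored content is N-dimensional only accepts values of dimension N-1 (content_dim 2 routes through `_1d_list_check`, content_dim 3 through `_2d_list_check`). A standalone sketch of that rule (`list_dim` is an illustrative helper, not part of fastNLP):

```python
def list_dim(val):
    """Nesting depth of a Python list; 0 for scalars (illustrative helper)."""
    d = 0
    while isinstance(val, list):
        d += 1
        val = val[0] if val else None
    return d

content = [[1, 2], [3, 4, 5]]   # a field with content_dim == 2
assert list_dim([6, 7]) == list_dim(content) - 1       # 1-d value: append allowed
assert list_dim([[6], [7]]) != list_dim(content) - 1   # 2-d value: would be rejected
```

This is what guarantees that every instance in a batch can later be stacked into one rectangular array by the padder.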
    def __getitem__(self, indices):
        return self.get(indices, pad=False)

    def __setitem__(self, idx, val):
        assert isinstance(idx, int)
        self.content[idx] = val
    def get(self, indices, pad=True):
        """
        Return the content at the given indices

        :param int,List[int] indices: fetch the content corresponding to these indices.
        :param bool pad: whether to pad the returned result; only effective when indices is a List[int]
        :return: the content at the given indices, either a single value or a List
        """
        if isinstance(indices, int):
            return self.content[indices]
        if self.is_input is False and self.is_target is False:
            raise RuntimeError("Please specify either is_input or is_target is True for {}".format(self.name))

        contents = [self.content[i] for i in indices]
        if self.padder is None or pad is False:
            return np.array(contents)
        else:
            return self.padder(contents, field_name=self.name, field_ele_dtype=self.dtype)
    def set_padder(self, padder):
        """
        Set the padder used when this field is padded; if None, no padding is performed.

        :param padder: a :class:`~fastNLP.Padder` object; set to None to remove the padder.
        """
        if padder is not None:
            assert isinstance(padder, Padder), "padder must be of type Padder."
            self.padder = deepcopy(padder)
        else:
            self.padder = None
    def set_pad_val(self, pad_val):
        """
        Change the padder's pad_val.

        :param int pad_val: set this field's pad value to this value.
        """
        if self.padder is not None:
            self.padder.set_pad_val(pad_val)
        return self
    def __len__(self):
        """
        Returns the size of FieldArray.

        :return int length:
        """
        return len(self.content)
    def to(self, other):
        """
        Copy other's attributes to this FieldArray (other must be a FieldArray).
        The copied attributes are is_input, is_target, padder and ignore_type

        :param other: :class:`~fastNLP.FieldArray` the field to copy attributes from
        :return: :class:`~fastNLP.FieldArray`
        """
        assert isinstance(other, FieldArray), "Only support FieldArray type, not {}.".format(type(other))

        self.is_input = other.is_input
        self.is_target = other.is_target
        self.padder = other.padder
        self.ignore_type = other.ignore_type

        return self
def _is_iterable(content):
    try:
        _ = (e for e in content)
    except TypeError:
@@ -375,24 +321,161 @@ def is_iterable(content): | |||
    return True
class Padder:
    """
    Alias :class:`fastNLP.Padder` :class:`fastNLP.core.field.Padder`

    Every padder must inherit from this class and override the __call__ method.
    A padder pads a batch. The elements passed in are handled in place: modifying an element directly may change
    the data, so deepcopy before any in-place modification.

    .. py:function:: __call__(self, contents, field_name, field_ele_dtype):

        :param list(Any) contents: the elements passed in are in-place; modifying an element directly may change
            the data, so deepcopy before any in-place modification.
        :param str field_name: name of the field.
        :param np.int64,np.float64,np.str,None field_ele_dtype: dtype of the innermost elements of this field.
            None if the field's ignore_type is True.
        :return: np.array([padded_element])
    """

    def __init__(self, pad_val=0, **kwargs):
        self.pad_val = pad_val

    def set_pad_val(self, pad_val):
        self.pad_val = pad_val

    def __call__(self, contents, field_name, field_ele_dtype):
        """
        The input is a List. Suppose we have the following DataSet.

        :param list(Any) contents: the elements passed in are in-place; modifying an element directly may change
            the data, so deepcopy before any in-place modification.
        :param str field_name: name of the field.
        :param np.int64,np.float64,np.str,None field_ele_dtype: dtype of the innermost elements of this field.
            None if the field's ignore_type is True.
        :return: np.array([padded_element])

        Example::

            from fastNLP import DataSet
            from fastNLP import Instance
            dataset = DataSet()
            dataset.append(Instance(sent='this is a demo', length=4,
                                    chars=[['t', 'h', 'i', 's'], ['i', 's'], ['a'], ['d', 'e', 'm', 'o']]))
            dataset.append(Instance(sent='another one', length=2,
                                    chars=[['a', 'n', 'o', 't', 'h', 'e', 'r'], ['o', 'n', 'e']]))

            If we call
            batch = dataset.get([0,1], pad=True)
            the sent field's padder __call__ will receive
            [
                'this is a demo',
                'another one'
            ]
            the length field's padder __call__ will receive
            [4, 2]
            the chars field's padder __call__ will receive
            [
                [['t', 'h', 'i', 's'], ['i', 's'], ['a'], ['d', 'e', 'm', 'o']],
                [['a', 'n', 'o', 't', 'h', 'e', 'r'], ['o', 'n', 'e']]
            ]
            i.e. the content of one field from every instance is gathered into a List and passed in
        """
        raise NotImplementedError
class AutoPadder(Padder):
    """
    Alias :class:`fastNLP.AutoPadder` :class:`fastNLP.core.field.AutoPadder`

    Decides automatically from the contents whether padding is needed.

    1 If the element type (the dtype of the innermost elements of the field, visible via FieldArray.dtype; e.g.
    ['This', 'is', ...] has element type np.str and [[1,2], ...] has element type np.int64) is not
    (np.int64, np.float64), no padding is performed

    2 If the element type is (np.int64, np.float64),

    2.1 if the field content is a single (np.int64, np.float64) value, e.g. seq_len, no padding is performed

    2.2 if the field content is a List, the Lists within a batch are padded to the same length. If that List
    contains nested Lists that also need padding, use a different padder.
    That is, a field of the form [1, 2, 3, ...] per Instance can be padded; [[1,2], [3,4, ...]] cannot
    """

    def __init__(self, pad_val=0):
        """
        :param pad_val: int, this index is used at padded positions
        """
        super().__init__(pad_val=pad_val)
    def _is_two_dimension(self, contents):
        """
        Check whether contents has exactly two dimensions. [[1,2], [3]] has two dimensions;
        [[[1,2], [3, 4, 5]], [[4,5]]] has three
        :param contents:
        :return:
        """
        value = contents[0]
        if isinstance(value, (np.ndarray, list)):
            value = value[0]
            if isinstance(value, (np.ndarray, list)):
                return False
            return True
        return False
    def __call__(self, contents, field_name, field_ele_dtype):
        if not _is_iterable(contents[0]):
            array = np.array([content for content in contents], dtype=field_ele_dtype)
        elif field_ele_dtype in (np.int64, np.float64) and self._is_two_dimension(contents):
            max_len = max([len(content) for content in contents])
            array = np.full((len(contents), max_len), self.pad_val, dtype=field_ele_dtype)
            for i, content in enumerate(contents):
                array[i][:len(content)] = content
        elif field_ele_dtype is None:
            array = np.array(contents)  # when ignore_type=True, return contents directly
        else:  # should only be str
            array = np.array([content for content in contents])
        return array
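The core padding step in `AutoPadder.__call__` — fill a (batch, max_len) array with `pad_val` and copy each sequence in — can be sketched standalone (`auto_pad` is an illustrative function, not the fastNLP class):

```python
import numpy as np

def auto_pad(contents, pad_val=0):
    # pad a batch of 1-d int sequences to the batch's max length, as AutoPadder does
    max_len = max(len(content) for content in contents)
    array = np.full((len(contents), max_len), pad_val, dtype=np.int64)
    for i, content in enumerate(contents):
        array[i, :len(content)] = content
    return array

batch = [[1, 2, 3], [4, 5]]
print(auto_pad(batch))
# [[1 2 3]
#  [4 5 0]]
```

Note the padding length is dynamic per batch, which keeps short batches small instead of padding everything to a global maximum.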
class EngChar2DPadder(Padder):
    """
    Alias :class:`fastNLP.EngChar2DPadder` :class:`fastNLP.core.field.EngChar2DPadder`

    Performs character-level 2D padding for English. The field content should look like
    [['T', 'h', 'i', 's'], ['a'], ['d', 'e', 'm', 'o']], but this Padder can only handle the case where the
    indices are ints.

    The padded batch has shape (batch_size, max_sentence_length, max_word_length), where max_sentence_length is
    the longest sentence in the batch and max_word_length is the longest word in the batch

    Example::

        from fastNLP import DataSet
        from fastNLP import EngChar2DPadder
        from fastNLP import Vocabulary
        dataset = DataSet({'sent': ['This is the first demo', 'This is the second demo']})
        dataset.apply(lambda ins:[list(word) for word in ins['sent'].split()], new_field_name='chars')
        vocab = Vocabulary()
        vocab.from_dataset(dataset, field_name='chars')
        vocab.index_dataset(dataset, field_name='chars')
        dataset.set_input('chars')
        padder = EngChar2DPadder()
        dataset.set_padder('chars', padder)  # the chars field now uses EngChar2DPadder
    """

    def __init__(self, pad_val=0, pad_length=0):
        """
        :param pad_val: int, this index is used at padded positions
        :param pad_length: int, if 0, pad to the longest word length in the batch; if greater than 0, pad or
            truncate every word to this length.
        """
        super().__init__(pad_val=pad_val)
        self.pad_length = pad_length
    def _exactly_three_dims(self, contents, field_name):
        """
        Check that the given contents has exactly 3 dimensions; raise an error otherwise. Conceptually, the first
        dimension is batch, the second is word, the third is character
@@ -411,10 +494,10 @@ class EngChar2DPadder(PadderBase): | |||
            value = value[0]
        except:
            raise ValueError("Field:{} only has two dimensions.".format(field_name))

        if _is_iterable(value):
            raise ValueError("Field:{} has more than 3 dimension.".format(field_name))
    def __call__(self, contents, field_name, field_ele_dtype):
        """
        Expects input similar to
@@ -441,12 +524,12 @@ class EngChar2DPadder(PadderBase): | |||
        max_sent_length = max(len(word_lst) for word_lst in contents)
        batch_size = len(contents)
        dtype = type(contents[0][0][0])

        padded_array = np.full((batch_size, max_sent_length, max_char_length), fill_value=self.pad_val,
                               dtype=dtype)
        for b_idx, word_lst in enumerate(contents):
            for c_idx, char_lst in enumerate(word_lst):
                chars = char_lst[:max_char_length]
                padded_array[b_idx, c_idx, :len(chars)] = chars

        return padded_array
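The 2D case pads along both the sentence axis and the word axis; a standalone sketch of the shape logic when `pad_length=0` (dynamic per-batch word length; `char_pad` is an illustrative function, not the fastNLP class):

```python
import numpy as np

def char_pad(contents, pad_val=0):
    # produce (batch_size, max_sentence_length, max_word_length), as EngChar2DPadder does
    max_char = max(len(chars) for words in contents for chars in words)
    max_sent = max(len(words) for words in contents)
    out = np.full((len(contents), max_sent, max_char), pad_val, dtype=np.int64)
    for b, words in enumerate(contents):
        for w, chars in enumerate(words):
            out[b, w, :len(chars)] = chars[:max_char]
    return out

batch = [[[1, 2, 3], [4]], [[5, 6]]]
print(char_pad(batch).shape)  # (2, 2, 3)
```

Entirely-missing words (here the second word of the second sentence) come out as all-`pad_val` rows, which is what downstream character encoders expect to mask.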
@@ -1,38 +1,52 @@ | |||
"""
The instance module implements Instance, which corresponds to a sample in fastNLP: one sample can be thought of as
an Instance object. For a walkthrough, see the table in the documentation of :doc:`fastNLP.core.dataset`
"""
__all__ = [
    "Instance"
]


class Instance(object):
    """
    Alias :class:`fastNLP.Instance` :class:`fastNLP.core.instance.Instance`

    Instance is the class corresponding to one sample in fastNLP. Every sample is an Instance object.
    Instance is usually used together with :class:`~fastNLP.DataSet`; it is initialized as in the Example below

    Example::

        >>>from fastNLP import Instance
        >>>ins = Instance(field_1=[1, 1, 1], field_2=[2, 2, 2])
        >>>ins["field_1"]
        [1, 1, 1]
        >>>ins.add_field("field_3", [3, 3, 3])
        >>>ins = Instance(**{'x1': 1, 'x2':np.zeros((3, 4))})
    """
def __init__(self, **fields): | |||
""" | |||
:param fields: 可能是一维或者二维的 list or np.array | |||
""" | |||
self.fields = fields | |||
def add_field(self, field_name, field): | |||
"""Add a new field to the instance. | |||
""" | |||
向Instance中增加一个field | |||
:param field_name: str, the name of the field. | |||
:param str field_name: 新增field的名称 | |||
:param Any field: 新增field的内容 | |||
""" | |||
self.fields[field_name] = field | |||
def __getitem__(self, name): | |||
if name in self.fields: | |||
return self.fields[name] | |||
else: | |||
raise KeyError("{} not found".format(name)) | |||
def __setitem__(self, name, field): | |||
return self.add_field(name, field) | |||
def __repr__(self): | |||
s = '\'' | |||
return "{" + ",\n".join( | |||
@@ -1,33 +1,50 @@
"""
The losses module defines the loss functions needed by fastNLP, usually passed to a
:class:`~fastNLP.Trainer` as an argument.
"""
__all__ = [
    "LossBase",
    "LossFunc",
    "LossInForward",
    "CrossEntropyLoss",
    "BCELoss",
    "L1Loss",
    "NLLLoss"
]

import inspect
from collections import defaultdict

import torch
import torch.nn.functional as F

from .utils import _CheckError
from .utils import _CheckRes
from .utils import _build_args
from .utils import _check_arg_dict_list
from .utils import _check_function_or_method
from .utils import _get_func_signature


class LossBase(object):
    """
    Base class of all losses. See the source code if you want to understand how it works.
    """

    def __init__(self):
        self.param_map = {}
        self._checked = False

    def get_loss(self, *args, **kwargs):
        raise NotImplementedError
    def _init_param_map(self, key_map=None, **kwargs):
        """Check the validity of key_map and the other parameter mappings, and add them to self.param_map.

        :param dict key_map: a dict describing the key mappings
        :param kwargs: every key-value pair among the keyword arguments is also treated as a mapping
        :return: None
        """
        value_counter = defaultdict(set)
@@ -55,21 +72,21 @@ class LossBase(object):
        for value, key_set in value_counter.items():
            if len(key_set) > 1:
                raise ValueError(f"Several parameters:{key_set} are provided with one output {value}.")

        # check consistency between signature and param_map
        func_spect = inspect.getfullargspec(self.get_loss)
        func_args = [arg for arg in func_spect.args if arg != 'self']
        for func_param, input_param in self.param_map.items():
            if func_param not in func_args:
                raise NameError(
                    f"Parameter `{func_param}` is not in {_get_func_signature(self.get_loss)}. Please check the "
                    f"initialization parameters, or change its signature.")

        # evaluate should not have varargs.
        # if func_spect.varargs:
        #     raise NameError(f"Delete `*{func_spect.varargs}` in {_get_func_signature(self.get_loss)}(Do not use "
        #                     f"positional argument.).")

    def _fast_param_map(self, pred_dict, target_dict):
        """Only used as an inner function. When pred_dict and target_dict are unambiguous (e.g. each
        has a single element), the user does not need to pass a key_map.
@@ -84,34 +101,34 @@ class LossBase(object):
            fast_param['target'] = list(target_dict.values())[0]
            return fast_param
        return fast_param
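The fast path above rests on a simple observation: when both dicts hold exactly one value each, the pred/target assignment is unambiguous and no user-supplied key_map is needed. A minimal stand-alone sketch of that idea (not the library code itself):

```python
def fast_param_map(pred_dict, target_dict):
    """If each dict holds exactly one value, the pred/target mapping is
    unambiguous -- a sketch of the idea behind LossBase._fast_param_map."""
    fast_param = {}
    if len(pred_dict) == 1 and len(target_dict) == 1:
        fast_param['pred'] = list(pred_dict.values())[0]
        fast_param['target'] = list(target_dict.values())[0]
    return fast_param

# one key on each side: mapping is inferred, whatever the keys are called
print(fast_param_map({'output': [0, 1]}, {'label': [0, 1]}))
```

With two or more keys on either side the sketch returns an empty dict, which is the signal to fall back to the full param_map machinery.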
    def __call__(self, pred_dict, target_dict, check=False):
        """
        :param dict pred_dict: the dict returned by the model's forward function
        :param dict target_dict: the dict made of the key-value pairs in DataSet.batch_y
        :param Boolean check: whether to re-check the mapping on every call; defaults to no check
        :return:
        """
        fast_param = self._fast_param_map(pred_dict, target_dict)
        if fast_param:
            loss = self.get_loss(**fast_param)
            return loss

        if not self._checked:
            # 1. check consistency between signature and param_map
            func_spect = inspect.getfullargspec(self.get_loss)
            func_args = set([arg for arg in func_spect.args if arg != 'self'])
            for func_arg, input_arg in self.param_map.items():
                if func_arg not in func_args:
                    raise NameError(f"`{func_arg}` not in {_get_func_signature(self.get_loss)}.")

            # 2. only part of the param_map are passed, left are not
            for arg in func_args:
                if arg not in self.param_map:
                    self.param_map[arg] = arg  # This param does not need mapping.
            self._evaluate_args = func_args
            self._reverse_param_map = {input_arg: func_arg for func_arg, input_arg in self.param_map.items()}

        # need to wrap inputs in dict.
        mapped_pred_dict = {}
        mapped_target_dict = {}
@@ -131,7 +148,7 @@ class LossBase(object):
                    not_duplicate_flag += 1
                if not_duplicate_flag == 3:
                    duplicated.append(input_arg)

        # missing
        if not self._checked:
            check_res = _check_arg_dict_list(self.get_loss, [mapped_pred_dict, mapped_target_dict])
@@ -141,37 +158,50 @@ class LossBase(object):
            for idx, func_arg in enumerate(missing):
                # Don't delete `` in this information, nor add ``
                replaced_missing[idx] = f"{self.param_map[func_arg]}" + f"(assign to `{func_arg}` " \
                                                                       f"in `{self.__class__.__name__}`)"

            check_res = _CheckRes(missing=replaced_missing,
                                  unused=check_res.unused,
                                  duplicated=duplicated,
                                  required=check_res.required,
                                  all_needed=check_res.all_needed,
                                  varargs=check_res.varargs)

            if check_res.missing or check_res.duplicated:
                raise _CheckError(check_res=check_res,
                                  func_signature=_get_func_signature(self.get_loss))
        refined_args = _build_args(self.get_loss, **mapped_pred_dict, **mapped_target_dict)

        loss = self.get_loss(**refined_args)
        self._checked = True

        return loss
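The core of `__call__` is renaming: keys from the model output and the target dict are remapped onto the loss function's own argument names before the call. A self-contained sketch of that renaming step, using only `inspect` (the function and mapping names here are illustrative, not fastNLP API):

```python
import inspect

def map_and_call(func, param_map, pred_dict, target_dict):
    """Rename the keys of pred_dict/target_dict according to param_map
    (func-argument name -> provided key name) and call func with the
    matching arguments -- a sketch of what LossBase.__call__ does."""
    func_args = [a for a in inspect.getfullargspec(func).args if a != 'self']
    # every argument not mentioned in param_map maps to itself
    full_map = {arg: param_map.get(arg, arg) for arg in func_args}
    provided = {**pred_dict, **target_dict}
    kwargs = {arg: provided[name] for arg, name in full_map.items() if name in provided}
    return func(**kwargs)

def get_loss(pred, target):
    # a toy mean-absolute-error "loss" on plain lists
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

# 'target' is provided under the key 'label'; the map bridges the names
loss = map_and_call(get_loss, {'target': 'label'}, {'pred': [1.0, 2.0]}, {'label': [0.0, 2.0]})
```

This is why renaming a field in the DataSet, or an argument of `get_loss`, only requires updating the mapping rather than the loss code.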
class LossFunc(LossBase):
    """
    Alias: :class:`fastNLP.LossFunc` :class:`fastNLP.core.losses.LossFunc`

    A wrapper that lets users plug in a custom loss function.

    :param func: the user-defined loss function; it should be a function or an object for which
        callable(func) is True
    :param dict key_map: the parameter mapping table. Keys are loss function parameter names, values
        are Model/DataSet parameter names. At training time, fastNLP's Trainer looks in the model's
        return value and in the target=True fields of the training DataSet for a parameter named
        value, and passes it to func as the parameter named key.
    :param kwargs: besides key_map, mappings can also be set via keyword arguments

    Example::

        >>> func = torch.nn.CrossEntropyLoss()
        >>> loss_func = LossFunc(func, input="pred", target="label")
        # This builds a loss class whose loss is computed by func: a parameter named `pred` is looked
        # up in the model's return value or in the target=True fields of the DataSet and passed to
        # func as the parameter named `input`; a parameter named `label` is looked up and passed to
        # func as the parameter named `target`.
    """

    def __init__(self, func, key_map=None, **kwargs):
        super(LossFunc, self).__init__()
        _check_function_or_method(func)
        if key_map is not None:
@@ -181,78 +211,129 @@ class LossFunc(LossBase):
        if len(kwargs) > 0:
            for key, val in kwargs.items():
                self.param_map.update({key: val})

        self.get_loss = func
class CrossEntropyLoss(LossBase):
    """
    Alias: :class:`fastNLP.CrossEntropyLoss` :class:`fastNLP.core.losses.CrossEntropyLoss`

    Cross-entropy loss.

    :param pred: the mapping for `pred` in the parameter map; None means the mapping is `pred` -> `pred`
    :param target: the mapping for `target` in the parameter map; None means the mapping is `target` -> `target`
    :param padding_idx: the padding index; entries of target equal to padding_idx are ignored when
        computing the loss

    Example::

        >>> loss = CrossEntropyLoss(pred='pred', target='label', padding_idx=0)
    """

    def __init__(self, pred=None, target=None, padding_idx=-100):
        # TODO some checks are needed: when F.cross_entropy gets a pred of shape (16, 10, 4), the
        # TODO target should in principle be (16, 10), but in practice (16, 4) is required
        super(CrossEntropyLoss, self).__init__()
        self._init_param_map(pred=pred, target=target)
        self.padding_idx = padding_idx

    def get_loss(self, pred, target):
        return F.cross_entropy(input=pred, target=target,
                               ignore_index=self.padding_idx)


class L1Loss(LossBase):
    """
    Alias: :class:`fastNLP.L1Loss` :class:`fastNLP.core.losses.L1Loss`

    L1 loss.

    :param pred: the mapping for `pred` in the parameter map; None means the mapping is `pred` -> `pred`
    :param target: the mapping for `target` in the parameter map; None means the mapping is `target` -> `target`
    """

    def __init__(self, pred=None, target=None):
        super(L1Loss, self).__init__()
        self._init_param_map(pred=pred, target=target)

    def get_loss(self, pred, target):
        return F.l1_loss(input=pred, target=target)


class BCELoss(LossBase):
    """
    Alias: :class:`fastNLP.BCELoss` :class:`fastNLP.core.losses.BCELoss`

    Binary cross-entropy loss.

    :param pred: the mapping for `pred` in the parameter map; None means the mapping is `pred` -> `pred`
    :param target: the mapping for `target` in the parameter map; None means the mapping is `target` -> `target`
    """

    def __init__(self, pred=None, target=None):
        super(BCELoss, self).__init__()
        self._init_param_map(pred=pred, target=target)

    def get_loss(self, pred, target):
        return F.binary_cross_entropy(input=pred, target=target)


class NLLLoss(LossBase):
    """
    Alias: :class:`fastNLP.NLLLoss` :class:`fastNLP.core.losses.NLLLoss`

    Negative log-likelihood loss.

    :param pred: the mapping for `pred` in the parameter map; None means the mapping is `pred` -> `pred`
    :param target: the mapping for `target` in the parameter map; None means the mapping is `target` -> `target`
    """

    def __init__(self, pred=None, target=None):
        super(NLLLoss, self).__init__()
        self._init_param_map(pred=pred, target=target)

    def get_loss(self, pred, target):
        return F.nll_loss(input=pred, target=target)
class LossInForward(LossBase):
    """
    Alias: :class:`fastNLP.LossInForward` :class:`fastNLP.core.losses.LossInForward`

    Reads the loss out of the dict returned by forward().

    :param str loss_key: the key under which forward() stores the loss; defaults to 'loss'
    """

    def __init__(self, loss_key='loss'):
        super().__init__()
        if not isinstance(loss_key, str):
            raise TypeError(f"Only str allowed for loss_key, got {type(loss_key)}.")
        self.loss_key = loss_key

    def get_loss(self, **kwargs):
        if self.loss_key not in kwargs:
            check_res = _CheckRes(
                missing=[self.loss_key + f"(assign to `{self.loss_key}` in `{self.__class__.__name__}`"],
                unused=[],
                duplicated=[],
                required=[],
                all_needed=[],
                varargs=[])
            raise _CheckError(check_res=check_res, func_signature=_get_func_signature(self.get_loss))
        return kwargs[self.loss_key]

    def __call__(self, pred_dict, target_dict, check=False):
        loss = self.get_loss(**pred_dict)

        if not (isinstance(loss, torch.Tensor) and len(loss.size()) == 0):
            if not isinstance(loss, torch.Tensor):
                raise TypeError(f"Loss expected to be a torch.Tensor, got {type(loss)}")
            # a multi-GPU forward may return one loss per device; reduce to the mean
            loss = torch.sum(loss) / (loss.view(-1)).size(0)
            # raise RuntimeError(f"The size of loss expects to be torch.Size([]), got {loss.size()}")

        return loss
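The two behaviours of LossInForward, looking the loss up by key and averaging a non-scalar loss (as DataParallel returns one loss per GPU), can be sketched without torch; `loss_in_forward` below is a hypothetical pure-Python analogue, not the fastNLP class:

```python
def loss_in_forward(forward_out, loss_key='loss'):
    """Fetch a precomputed loss from the dict returned by forward(),
    raising when the key is missing; if several per-device losses come
    back (list/tuple), reduce them to their mean, as the real class does
    with torch.sum / view."""
    if loss_key not in forward_out:
        raise KeyError(f"`{loss_key}` not found in forward() output")
    loss = forward_out[loss_key]
    if isinstance(loss, (list, tuple)):
        loss = sum(loss) / len(loss)
    return loss

# scalar loss passes through; a per-GPU pair is averaged
single = loss_in_forward({'loss': 0.3, 'pred': [1, 0]})
reduced = loss_in_forward({'loss': (1.0, 3.0)})
```

The averaging step is what the "防止多卡的情况导致无法正确计算loss" commit in the log above addressed.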
@@ -271,7 +352,7 @@ def squash(predict, truth, **kwargs):
    :param predict: Tensor, model output
    :param truth: Tensor, truth from dataset
    :param kwargs: extra arguments
    :return predict, truth: predict & truth after processing
    """
    return predict.view(-1, predict.size()[-1]), truth.view(-1, )
@@ -315,20 +396,20 @@ def mask(predict, truth, **kwargs):
    :param predict: Tensor, [batch_size , max_len , tag_size]
    :param truth: Tensor, [batch_size , max_len]
    :param kwargs: extra arguments, kwargs["mask"]: ByteTensor, [batch_size , max_len], the mask
        Tensor. The positions that are 1 will be selected.
    :return predict, truth: predict & truth after processing
    """
    if kwargs.get("mask") is None:
        return predict, truth
    mask = kwargs["mask"]

    predict, truth = squash(predict, truth)
    mask = mask.view(-1, )

    predict = torch.masked_select(predict.permute(1, 0), mask).view(predict.size()[-1], -1).permute(1, 0)
    truth = torch.masked_select(truth, mask)

    return predict, truth
@@ -343,4 +424,3 @@ def make_mask(lens, tar_len):
    mask = [torch.ge(lens, i + 1) for i in range(tar_len)]
    mask = torch.stack(mask, 1)
    return mask
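What `squash` and `mask` do together, flatten the (batch, seq) structure and keep only the positions where the mask is 1, can be shown with plain lists instead of tensors (a sketch under that simplification; the real functions operate on torch Tensors):

```python
def squash_and_mask(predict, truth, mask):
    """Flatten (batch, seq) nested lists and keep only positions where
    mask is 1 -- a pure-Python sketch of the squash()/mask() helpers."""
    flat_pred = [p for row in predict for p in row]
    flat_truth = [t for row in truth for t in row]
    flat_mask = [m for row in mask for m in row]
    kept = [(p, t) for p, t, m in zip(flat_pred, flat_truth, flat_mask) if m]
    preds, truths = zip(*kept) if kept else ((), ())
    return list(preds), list(truths)

# the last position of the second sequence is padding (mask 0) and is dropped
p, t = squash_and_mask([[1, 2], [3, 4]], [[1, 0], [3, 0]], [[1, 1], [1, 0]])
```

Dropping padded positions before the loss is exactly why `padding_idx` / masks matter for sequence models.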
@@ -1,57 +1,82 @@
"""
The optimizer module defines the optimizers needed by fastNLP, usually passed to a
:class:`~fastNLP.Trainer` as an argument.
"""
__all__ = [
    "Optimizer",
    "SGD",
    "Adam"
]

import torch


class Optimizer(object):
    """
    Alias: :class:`fastNLP.Optimizer` :class:`fastNLP.core.optimizer.Optimizer`

    :param model_params: a generator. E.g. ``model.parameters()`` for PyTorch models.
    :param kwargs: additional parameters.
    """

    def __init__(self, model_params, **kwargs):
        if model_params is not None and not hasattr(model_params, "__next__"):
            raise RuntimeError("model parameters should be a generator, rather than {}.".format(type(model_params)))
        self.model_params = model_params
        self.settings = kwargs

    def construct_from_pytorch(self, model_params):
        raise NotImplementedError

    def _get_require_grads_param(self, params):
        """
        Drop the parameters in params that do not require gradients.

        :param iterable params: parameters
        :return: list(nn.Parameters)
        """
        return [param for param in params if param.requires_grad]


class SGD(Optimizer):
    """
    Alias: :class:`fastNLP.SGD` :class:`fastNLP.core.optimizer.SGD`

    :param float lr: learning rate. Default: 0.001
    :param float momentum: momentum. Default: 0
    :param model_params: a generator. E.g. ``model.parameters()`` for PyTorch models.
    """

    def __init__(self, lr=0.001, momentum=0, model_params=None):
        if not isinstance(lr, float):
            raise TypeError("learning rate has to be float.")
        super(SGD, self).__init__(model_params, lr=lr, momentum=momentum)

    def construct_from_pytorch(self, model_params):
        if self.model_params is None:
            # careful! generator cannot be assigned.
            return torch.optim.SGD(self._get_require_grads_param(model_params), **self.settings)
        else:
            return torch.optim.SGD(self._get_require_grads_param(self.model_params), **self.settings)


class Adam(Optimizer):
    """
    Alias: :class:`fastNLP.Adam` :class:`fastNLP.core.optimizer.Adam`

    :param float lr: learning rate
    :param float weight_decay:
    :param model_params: a generator. E.g. ``model.parameters()`` for PyTorch models.
    """

    def __init__(self, lr=0.001, weight_decay=0, betas=(0.9, 0.999), eps=1e-8, amsgrad=False, model_params=None):
        if not isinstance(lr, float):
            raise TypeError("learning rate has to be float.")
        super(Adam, self).__init__(model_params, lr=lr, betas=betas, eps=eps, amsgrad=amsgrad,
                                   weight_decay=weight_decay)

    def construct_from_pytorch(self, model_params):
        if self.model_params is None:
            # careful! generator cannot be assigned.
            return torch.optim.Adam(self._get_require_grads_param(model_params), **self.settings)
        else:
            return torch.optim.Adam(self._get_require_grads_param(self.model_params), **self.settings)
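The `_get_require_grads_param` filter simply drops frozen parameters before they reach the underlying torch optimizer. A torch-free sketch of that filter; the `Param` class below is a hypothetical stand-in for `torch.nn.Parameter`, carrying only the `requires_grad` flag:

```python
class Param:
    """Hypothetical stand-in for torch.nn.Parameter: only name + requires_grad."""
    def __init__(self, name, requires_grad=True):
        self.name = name
        self.requires_grad = requires_grad

def get_require_grads_param(params):
    """Keep only the parameters that require gradients, so frozen
    (e.g. pretrained-embedding) weights never reach the optimizer --
    mirroring Optimizer._get_require_grads_param."""
    return [p for p in params if p.requires_grad]

trainable = get_require_grads_param([Param('w'), Param('frozen_emb', requires_grad=False)])
```

Filtering here, rather than in the training loop, means every optimizer built through `construct_from_pytorch` gets the same behaviour.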
@@ -1,15 +1,20 @@
"""
.. todo::
    check whether this class is still needed
"""
from collections import defaultdict

import torch

from . import Batch
from . import DataSet
from . import SequentialSampler
from .utils import _build_args


class Predictor(object):
    """
    An interface for predicting outputs based on trained models.

    It does not care about evaluations of the model, which is different from Tester.
    This is a high-level model wrapper to be called by FastNLP.
@@ -1,89 +1,93 @@
"""
The sampler submodule implements the samplers needed by fastNLP.
"""
__all__ = [
    "Sampler",
    "BucketSampler",
    "SequentialSampler",
    "RandomSampler"
]

from itertools import chain

import numpy as np


class Sampler(object):
    """
    Alias: :class:`fastNLP.Sampler` :class:`fastNLP.core.sampler.Sampler`

    The base class of all samplers. It decides in which order the elements of the data are drawn.

    Subclasses must implement the ``__call__`` method: it takes a `DataSet` object and returns a
    list of element indices.
    """

    def __call__(self, data_set):
        """
        :param DataSet data_set: the `DataSet` object to sample from
        :return result: list(int), the index sequence; elements of ``data_set`` are taken in the
            order given by ``result``
        """
        raise NotImplementedError


class SequentialSampler(Sampler):
    """
    Alias: :class:`fastNLP.SequentialSampler` :class:`fastNLP.core.sampler.SequentialSampler`

    A `Sampler` that draws elements in their original order.
    """

    def __call__(self, data_set):
        return list(range(len(data_set)))


class RandomSampler(Sampler):
    """
    Alias: :class:`fastNLP.RandomSampler` :class:`fastNLP.core.sampler.RandomSampler`

    A `Sampler` that draws elements in random order.
    """

    def __call__(self, data_set):
        return list(np.random.permutation(len(data_set)))


class BucketSampler(Sampler):
    """
    Alias: :class:`fastNLP.BucketSampler` :class:`fastNLP.core.sampler.BucketSampler`

    A bucketed `RandomSampler`: it randomly draws elements of similar length together.

    :param int num_buckets: the number of buckets
    :param int batch_size: the batch size
    :param str seq_len_field_name: the name of the `field` holding the sequence lengths
    """

    def __init__(self, num_buckets=10, batch_size=32, seq_len_field_name='seq_len'):
        self.num_buckets = num_buckets
        self.batch_size = batch_size
        self.seq_len_field_name = seq_len_field_name

    def __call__(self, data_set):
        seq_lens = data_set.get_all_fields()[self.seq_len_field_name].content
        total_sample_num = len(seq_lens)

        bucket_indexes = []
        assert total_sample_num >= self.num_buckets, "The number of samples is smaller than the number of buckets."
        num_sample_per_bucket = total_sample_num // self.num_buckets
        for i in range(self.num_buckets):
            bucket_indexes.append([num_sample_per_bucket * i, num_sample_per_bucket * (i + 1)])
        bucket_indexes[-1][1] = total_sample_num

        sorted_seq_lens = list(sorted([(idx, seq_len) for
                                       idx, seq_len in zip(range(total_sample_num), seq_lens)],
                                      key=lambda x: x[1]))

        batchs = []

        left_init_indexes = []
        for b_idx in range(self.num_buckets):
            start_idx = bucket_indexes[b_idx][0]
@@ -98,7 +102,7 @@ class BucketSampler(BaseSampler):
        if len(left_init_indexes) != 0:
            batchs.append(left_init_indexes)
        np.random.shuffle(batchs)

        return list(chain(*batchs))
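The bucket idea, sort by length, batch within length-homogeneous buckets, then shuffle the batches, can be sketched in stdlib Python (an illustration of the technique, not the BucketSampler implementation; the fixed `seed` is only there to make the sketch deterministic):

```python
import random

def bucket_sample(seq_lens, num_buckets, batch_size, seed=0):
    """Group indices of similar sequence length into buckets, shuffle and
    batch within each bucket, then shuffle the batches -- the idea behind
    BucketSampler, which keeps padding per batch small."""
    rng = random.Random(seed)
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i])
    bucket_size = max(1, len(order) // num_buckets)
    buckets = [order[i:i + bucket_size] for i in range(0, len(order), bucket_size)]
    batches = []
    for bucket in buckets:
        rng.shuffle(bucket)                      # randomness within a bucket
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    rng.shuffle(batches)                         # randomness across batches
    return [idx for batch in batches for idx in batch]

indices = bucket_sample([5, 1, 4, 2, 3, 6], num_buckets=2, batch_size=2)
```

Every index appears exactly once, but elements that land in the same batch have similar lengths, so less padding is wasted than with a plain RandomSampler.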
@@ -136,10 +140,10 @@ def k_means_1d(x, k, max_iter=100):
    if len(sorted_x) < k:
        raise ValueError("too few buckets")
    gap = len(sorted_x) / k

    centroids = np.array([sorted_x[int(x * gap)] for x in range(k)])
    assign = None

    for i in range(max_iter):
        # Cluster Assignment step
        assign = np.array([np.argmin([np.absolute(x_i - x) for x in centroids]) for x_i in x])
@@ -171,7 +175,7 @@ def k_means_bucketing(lengths, buckets):
    bucket_data = [[] for _ in buckets]
    num_buckets = len(buckets)
    _, assignments = k_means_1d(lengths, num_buckets)

    for idx, bucket_id in enumerate(assignments):
        if buckets[bucket_id] is None or lengths[idx] <= buckets[bucket_id]:
            bucket_data[bucket_id].append(idx)
@@ -1,50 +1,109 @@
"""
The tester module implements the Tester class needed by fastNLP, which runs a performance test
given data, a model, and metrics.

Example::

    import numpy as np
    import torch
    from torch import nn
    from fastNLP import Tester
    from fastNLP import DataSet
    from fastNLP import AccuracyMetric

    class Model(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(1, 1)
        def forward(self, a):
            return {'pred': self.fc(a.unsqueeze(1)).squeeze(1)}

    model = Model()

    dataset = DataSet({'a': np.arange(10, dtype=float), 'b': np.arange(10, dtype=float)*2})

    dataset.set_input('a')
    dataset.set_target('b')

    tester = Tester(dataset, model, metrics=AccuracyMetric())
    eval_results = tester.test()

The metric mapping rules here are the same as in :class:`fastNLP.Trainer`; for usage details see
part 1.3 of :doc:`the trainer module<fastNLP.core.trainer>`.

Before evaluation starts, Tester calls model.eval() to mark the start of the evaluation phase
(disabling nn.Dropout() etc.), and after evaluation finishes it calls model.train() to restore the
training state.
"""
import warnings

import torch
import torch.nn as nn

from .batch import Batch
from .dataset import DataSet
from .metrics import _prepare_metrics
from .sampler import SequentialSampler
from .utils import _CheckError
from .utils import _build_args
from .utils import _check_loss_evaluate
from .utils import _move_dict_value_to_device
from .utils import _get_func_signature
from .utils import _get_model_device
from .utils import _move_model_to_device

__all__ = [
    "Tester"
]


class Tester(object):
    """
    Alias: :class:`fastNLP.Tester` :class:`fastNLP.core.tester.Tester`

    Tester runs a performance test given data, a model, and metrics. Pass in a model, data, and
    metrics to evaluate.

    :param data: the dataset to test on, of type :class:`~fastNLP.DataSet`
    :param torch.nn.module model: the model to use
    :param metrics: a :class:`~fastNLP.core.metrics.MetricBase` or a list of
        :class:`~fastNLP.core.metrics.MetricBase`
    :param int batch_size: the batch size to use during evaluation
    :param str,int,torch.device,list(int) device: the device to load the model onto. Defaults to
        None, i.e. the Tester does not manage where the model computes. The following inputs are
        supported:

        1. str: ['cpu', 'cuda', 'cuda:0', 'cuda:1', ...], i.e. the cpu, the first visible GPU, the
           first visible GPU, and the second visible GPU, respectively;

        2. torch.device: load the model onto the given torch.device;

        3. int: use the GPU whose device_id equals this value;

        4. list(int): if more than one device is given, wrap the model with torch.nn.DataParallel
           and use the given devices;

        5. None: do nothing to the model. If the model passed in is a torch.nn.DataParallel, this
           value must be None.

        If predictions are produced via predict(), multi-GPU (DataParallel) evaluation is not
        possible, and only the model on the first GPU is used.
    :param int verbose: if 0, print nothing; if 1, print the evaluation result.
    """

    def __init__(self, data, model, metrics, batch_size=16, device=None, verbose=1):
        super(Tester, self).__init__()

        if not isinstance(data, DataSet):
            raise TypeError(f"The type of data must be `fastNLP.DataSet`, got `{type(data)}`.")
        if not isinstance(model, nn.Module):
            raise TypeError(f"The type of model must be `torch.nn.Module`, got `{type(model)}`.")

        self.metrics = _prepare_metrics(metrics)

        self.data = data
        self._model = _move_model_to_device(model, device=device)
        self.batch_size = batch_size
        self.verbose = verbose

        # predict() cannot be used if the model is wrapped in DataParallel
        if isinstance(self._model, nn.DataParallel):
            if hasattr(self._model.module, 'predict') and not hasattr(self._model, 'predict'):
                warnings.warn("Cannot use DataParallel to test your model, because your model offers a predict()"
                              " function, while DataParallel has no predict() function.")
                self._model = self._model.module

        # check predict
        if hasattr(self._model, 'predict'):
            self._predict_func = self._model.predict
@@ -54,14 +113,15 @@ class Tester(object):
                                f"for evaluation, not `{type(self._predict_func)}`.")
        else:
            self._predict_func = self._model.forward

    def test(self):
        """Start the evaluation and return the results.

        :return Dict[Dict]: a two-level nested dict; the first level is keyed by metric name, and
            the second level holds that metric's figures.
            An example for AccuracyMetric is {'AccuracyMetric': {'acc': 1.0}}.
        """
        # turn on the testing mode; clean up the history
        self._model_device = _get_model_device(self._model)
        network = self._model
        self._mode(network, is_test=True)
        data_iterator = Batch(self.data, self.batch_size, sampler=SequentialSampler(), as_numpy=False)
@@ -72,28 +132,28 @@ class Tester(object):
                    _move_dict_value_to_device(batch_x, batch_y, device=self._model_device)
                    pred_dict = self._data_forward(self._predict_func, batch_x)
                    if not isinstance(pred_dict, dict):
                        raise TypeError(f"The return value of {_get_func_signature(self._predict_func)} "
                                        f"must be `dict`, got {type(pred_dict)}.")
                    for metric in self.metrics:
                        metric(pred_dict, batch_y)
                for metric in self.metrics:
                    eval_result = metric.get_metric()
                    if not isinstance(eval_result, dict):
                        raise TypeError(f"The return value of {_get_func_signature(metric.get_metric)} must be "
                                        f"`dict`, got {type(eval_result)}")
                    metric_name = metric.__class__.__name__
                    eval_results[metric_name] = eval_result
            except _CheckError as e:
                prev_func_signature = _get_func_signature(self._predict_func)
                _check_loss_evaluate(prev_func_signature=prev_func_signature, func_signature=e.func_signature,
                                     check_res=e.check_res, pred_dict=pred_dict, target_dict=batch_y,
                                     dataset=self.data, check_level=0)

        if self.verbose >= 1:
            print("[tester] \n{}".format(self._format_eval_results(eval_results)))
        self._mode(network, is_test=False)
        return eval_results

    def _mode(self, model, is_test=False):
        """Set train mode or test mode. This is for PyTorch currently.
@@ -105,13 +165,13 @@ class Tester(object):
            model.eval()
        else:
            model.train()

    def _data_forward(self, func, x):
        """A forward pass of the model."""
        x = _build_args(func, **x)
        y = func(**x)
        return y

    def _format_eval_results(self, results):
        """Override this method to support more print formats.
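The evaluation loop accumulates metric state over every batch and only then calls `get_metric()`, nesting the result under the metric's class name. That accumulate-then-collect shape can be illustrated with a minimal metric object; `AccuracyMetricSketch` is hypothetical, not the fastNLP class:

```python
class AccuracyMetricSketch:
    """A minimal accumulate-then-get metric, shaped like the objects
    Tester.test() iterates over (hypothetical, not fastNLP's AccuracyMetric)."""
    def __init__(self):
        self.correct = 0
        self.total = 0

    def __call__(self, pred_dict, target_dict):
        # called once per batch, accumulating state
        for p, t in zip(pred_dict['pred'], target_dict['target']):
            self.correct += int(p == t)
            self.total += 1

    def get_metric(self):
        # called once after all batches
        return {'acc': self.correct / self.total}

metric = AccuracyMetricSketch()
for batch_x, batch_y in [({'pred': [1, 0]}, {'target': [1, 1]}),
                         ({'pred': [1]}, {'target': [1]})]:
    metric(batch_x, batch_y)
eval_results = {type(metric).__name__: metric.get_metric()}
```

The outer dict keyed by `type(metric).__name__` is exactly the `{'AccuracyMetric': {'acc': 1.0}}` structure the `test()` docstring describes.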
@@ -1,87 +1,428 @@ | |||
r""" | |||
Trainer在fastNLP中用于组织单任务的训练过程,可以避免用户在不同训练任务中重复撰写以下步骤的代码
(1) epoch循环; | |||
(2) 将数据分成不同的Batch; | |||
(3) 对Batch进行pad; | |||
(4) 每个epoch结束或一定step后进行验证集验证; | |||
(5) 保存获得更好验证性能的模型。 | |||
1 Trainer的基本使用 | |||
下面的例子是使用神经网络来预测一个序列中是否有偶数个1。
Example:: | |||
import numpy as np | |||
from torch import nn | |||
import torch | |||
import torch.nn.functional as F | |||
from torch.optim import SGD | |||
from fastNLP import DataSet | |||
from fastNLP import Trainer | |||
from fastNLP import CrossEntropyLoss | |||
from fastNLP import AccuracyMetric | |||
from fastNLP.modules.decoder import MLP | |||
# 模型 | |||
class Model(nn.Module): | |||
def __init__(self, input_num): | |||
super().__init__() | |||
self.fcs = MLP([input_num, 40, 40, 2], 'relu') | |||
def forward(self, x): | |||
x = self.fcs(x) | |||
return {'pred': x} | |||
model = Model(10) | |||
# 生成数据 | |||
def generate_pseudo_dataset(num_samples):
data = np.random.randint(2, size=(num_samples, 10))
label = np.sum(data, axis=1) % 2
dataset = DataSet({'x': data.astype(float), 'label': label})
dataset.set_input('x')
dataset.set_target('label')
return dataset
tr_dataset = generate_pseudo_dataset(1000)
dev_data = generate_pseudo_dataset(100)
# 训练 | |||
trainer = Trainer(tr_dataset, model, loss=CrossEntropyLoss(target='label'), | |||
optimizer=SGD(model.parameters(), lr=0.1),n_epochs=1000, | |||
dev_data = dev_data, metrics=AccuracyMetric(target='label')) | |||
trainer.train() | |||
由上面的例子可以看出通过使用Trainer,可以使得训练部分的代码大幅减少。 | |||
使用Trainer需要满足以下几个条件: | |||
1.1 模型 | |||
1 模型的forward()的参数名需要与DataSet中的名字对应。实际上fastNLP在将DataSet中的数据传递给模型forward()时,是 | |||
通过匹配名称实现的。所以上例中,如果Model的forward函数修改为forward(self, data), 则DataSet中的'x'这个field就应该 | |||
改名为'data'。 | |||
2 传递给forward()的参数是DataSet中被设置为input的那些field。但如果forward()中没有对应的参数,则不会将数据传递 | |||
给forward()。例如,DataSet中'x1', 'x2'都是input,但是模型的函数为forward(self, x1), 那么'x2'不会传递给forward()。 | |||
3 模型的forward()返回值需要为一个dict。 | |||
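上述按名称匹配参数的过程,可以用下面这段不依赖fastNLP的纯Python代码来示意(`build_args`等函数名只是示意用的假设,并非fastNLP的真实接口):

```python
import inspect

def build_args(func, **kwargs):
    # 示意:只把名字能与func形参对应上的key传入;缺少无默认值的形参则报错
    params = inspect.signature(func).parameters
    missing = [name for name, p in params.items()
               if p.default is inspect.Parameter.empty and name not in kwargs]
    if missing:
        raise NameError("missing param: {}".format(missing))
    return {name: kwargs[name] for name in params if name in kwargs}

def forward(x1):
    # 假设的forward:只需要'x1'
    return {'pred': x1 * 2}

batch_x = {'x1': 3, 'x2': 4}   # 'x2'虽然是input,但forward()不需要,不会被传入
args = build_args(forward, **batch_x)
print(forward(**args))          # {'pred': 6}
```

如果把DataSet中的field改名而forward的形参没有同步修改,对应的就是这里的NameError。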
1.2 Loss | |||
fastNLP中为了不限制forward函数的返回内容数量(比如一些复杂任务需要返回多个内容,如Dependency Parsing),
:mod:`Loss<fastNLP.core.losses>` 与 :mod:`Metric<fastNLP.core.metrics>` 都使用了通过名称来匹配相应内容的策略。如上面的例子中
Example:: | |||
trainer = Trainer(tr_dataset, model, loss=CrossEntropyLoss(target='label'), | |||
optimizer=SGD(model.parameters(), lr=0.1),n_epochs=1000, | |||
dev_data = dev_data, metrics=AccuracyMetric(target='label')) | |||
loss被设置为了 :class:`~fastNLP.CrossEntropyLoss` , 但在初始化的时候传入了target='label'这个参数, | |||
:class:`~fastNLP.CrossEntropyLoss` 的初始化参数为(pred=None, target=None, padding_idx=-100)。 | |||
这里的两个参数分别为计算CrossEntropy时需要使用到的模型的预测值与真实值。 | |||
其中 `pred` 一般来自于模型forward()的返回结果,`target` 一般是来自于DataSet中被设置为target的field。 | |||
由于每个人对真实值或者model的返回值取名并不一样,所以fastNLP的 :mod:`Loss<fastNLP.core.losses>` 提供一种类似于映射的机制来匹配对应的值, | |||
比如这里 :class:`~fastNLP.CrossEntropyLoss` 将尝试找到名为'label'的内容来作为真实值得到loss; | |||
而pred=None, 则 :class:`~fastNLP.CrossEntropyLoss` 使用'pred'作为名称匹配预测值, | |||
正好forward的返回值也叫pred,所以这里不需要申明pred。 | |||
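这种在初始化时指定映射、再到forward()输出与target field中按名称取值的机制,大致可以如下示意(`map_args`为假设的函数名,只是对真实实现的简化):

```python
def map_args(key_map, pred_dict, target_dict):
    # 示意:key_map形如{'pred': 'pred', 'target': 'label'},
    # 在forward()的输出与被设为target的field中按名称查找对应的值
    merged = dict(pred_dict)
    merged.update(target_dict)
    missing = [v for v in key_map.values() if v not in merged]
    if missing:
        raise NameError("missing param: {}".format(missing))
    return {k: merged[v] for k, v in key_map.items()}

# CrossEntropyLoss(target='label')相当于key_map={'pred': 'pred', 'target': 'label'}
args = map_args({'pred': 'pred', 'target': 'label'},
                pred_dict={'pred': [0.2, 0.8]},   # forward()的返回值
                target_dict={'label': 1})          # DataSet中被设为target的field
print(args)   # {'pred': [0.2, 0.8], 'target': 1}
```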
尽管fastNLP使用了映射机制来使得loss的计算变得比较灵活,但有些情况下loss必须在模型中进行计算,比如使用了CRF的模型。 | |||
fastNLP中提供了 :class:`~fastNLP.LossInForward` 这个loss。 | |||
这个loss的原理是直接在forward()的返回结果中找到loss_key(默认寻找'loss')指定的那个tensor,并使用它作为loss。 | |||
如果Trainer初始化没有提供loss则默认使用 :class:`~fastNLP.LossInForward` 。 | |||
.. todo:: | |||
补充一个例子 详细例子可以参照 | |||
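LossInForward取loss的逻辑本身很简单,可以粗略示意如下(同样只是示意代码,并非fastNLP源码):

```python
def loss_in_forward(forward_output, loss_key='loss'):
    # 示意:直接从forward()返回的dict中取出loss_key对应的值作为loss
    if loss_key not in forward_output:
        raise KeyError("no `{}` in forward output: {}".format(
            loss_key, list(forward_output.keys())))
    return forward_output[loss_key]

print(loss_in_forward({'loss': 0.25, 'pred': None}))   # 0.25
```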
1.3 Metric | |||
:mod:`Metric<fastNLP.core.metrics>` 使用了与上述Loss一样的策略,即使用名称进行匹配。 | |||
AccuracyMetric(target='label')的情况与CrossEntropyLoss 是同理的。 | |||
在进行验证时,可能用到的计算与forward()中不太一致,没有办法直接从forward()的结果中得到预测值,这时模型可以提供一个predict()方法, | |||
如果提供的模型具有predict方法,则在模型验证时将调用predict()方法获取预测结果, | |||
传入到predict()的参数也是从DataSet中被设置为input的field中选择出来的; | |||
与forward()一样,返回值需要为一个dict。 | |||
.. todo:: | |||
补充一个例子 具体例子可以参考 | |||
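验证时在predict()与forward()之间进行选择的逻辑,可以示意为(纯示意代码,函数名为假设):

```python
def choose_eval_func(model):
    # 示意:模型若提供predict()则验证时优先使用,否则退回forward()
    if hasattr(model, 'predict') and callable(model.predict):
        return model.predict
    return model.forward

class OnlyForward:
    def forward(self, a):
        return {'pred': a}

class WithPredict(OnlyForward):
    def predict(self, a):
        return {'pred': a + 1}

print(choose_eval_func(OnlyForward())(1))   # {'pred': 1}
print(choose_eval_func(WithPredict())(1))   # {'pred': 2}
```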
2 Trainer的代码检查 | |||
由于在fastNLP中采取了映射的机制,所以难免可能存在对应出错的情况。Trainer提供一种映射检查机制,可以通过check_code_level来进行控制。
比如下面的例子中,展示了由于各种原因产生的报错
Example2.1 | |||
:: | |||
import numpy as np | |||
from torch import nn | |||
import torch | |||
from torch.optim import SGD | |||
from fastNLP import Trainer | |||
from fastNLP import DataSet | |||
class Model(nn.Module): | |||
def __init__(self): | |||
super().__init__() | |||
self.fc = nn.Linear(1, 1) | |||
def forward(self, x, b): | |||
loss = torch.mean((self.fc(x)-b)**2) | |||
return {'loss': loss} | |||
model = Model() | |||
dataset = DataSet({'a': np.arange(10), 'b':np.arange(10)*2}) | |||
dataset.set_input('a', 'b') | |||
trainer = Trainer(dataset, model, SGD(model.parameters()))
# 会报以下的错误 | |||
# input fields after batch(if batch size is 2): | |||
# a: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) | |||
# b: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) | |||
# There is no target field. | |||
# .... | |||
# NameError: | |||
# Problems occurred when calling Model.forward(self, x, b) | |||
# missing param: ['x'] | |||
# unused field: ['a'] | |||
# Suggestion: You need to provide ['x'] in DataSet and set it as input. | |||
这里就是由于在Trainer初始化的时候,fastNLP会尝试使用一个batch_size=2的batch去运行一遍forward()以及backward()。这里有两类 | |||
信息可以为你提供参考 | |||
1 'input fields after batch...'这部分显示的是train dataset经过Batch操作后,每个field对应的类型以及shape。这里
因为train dataset没有target所以没有显示。根据这里可以看出是否正确将需要的内容设置为了input或target。 | |||
2 NameError,NameError发生在映射出错的情况。这里报错的原因是由于尝试进行forward计算时(可以通过Model.forward(self, x, b)判断 | |||
出当前是在调取forward),却没有获取到forward()函数中需要的'x';在报错信息中同时指出了缺'x',而'a'没有被使用,那么可能 | |||
就是由于field的名称不对。这里将dataset中'a'这个field的名称改为'x',或者model的参数从'x'修改为'a'都可以解决问题。 | |||
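上面报错中的missing param/unused field,大致是通过类似下面的对比检查出来的(`check_forward`为示意用的假设函数,并非fastNLP内部实现):

```python
import inspect

def check_forward(func, input_fields):
    # 示意:对比forward()的形参与被设为input的field,
    # 找出forward缺少的参数与DataSet中未被使用的field
    params = [p for p in inspect.signature(func).parameters if p != 'self']
    missing = [p for p in params if p not in input_fields]
    unused = [f for f in input_fields if f not in params]
    return missing, unused

def forward(x, b):
    ...

missing, unused = check_forward(forward, input_fields=['a', 'b'])
print(missing, unused)   # ['x'] ['a']
```

与上面的报错一致:缺少'x',而'a'没有被使用。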
下面的例子是由于loss计算的时候找不到需要的值 | |||
Example2.2 | |||
:: | |||
import numpy as np | |||
from torch import nn | |||
from torch.optim import SGD | |||
from fastNLP import Trainer | |||
from fastNLP import DataSet | |||
from fastNLP import L1Loss | |||
import torch | |||
class Model(nn.Module): | |||
def __init__(self): | |||
super().__init__() | |||
self.fc = nn.Linear(1, 1) | |||
def forward(self, a): | |||
return {'pred_b': self.fc(a.unsqueeze(1)).squeeze(1), 'No use':1} | |||
model = Model() | |||
dataset = DataSet({'a': np.arange(10, dtype=float), 'b':np.arange(10, dtype=float)*2}) | |||
dataset.set_input('a') | |||
dataset.set_target('b') | |||
trainer = Trainer(dataset, model, loss=L1Loss(target='label'), optimizer=SGD(model.parameters(), lr=0.001)) | |||
# 报错信息如下 | |||
# input fields after batch(if batch size is 2): | |||
# a: (1)type:torch.Tensor (2)dtype:torch.float32, (3)shape:torch.Size([2]) | |||
# target fields after batch(if batch size is 2): | |||
# b: (1)type:torch.Tensor (2)dtype:torch.float32, (3)shape:torch.Size([2]) | |||
# .... | |||
# NameError: | |||
# Problems occurred when calling L1Loss.get_loss(self, pred, target) | |||
# missing param: ['pred(assign to `pred` in `L1Loss`)', 'label(assign to `target` in `L1Loss`)'] | |||
# unused field: ['b'] | |||
# unused param: ['pred_b', 'No use'] | |||
# target field: ['b'] | |||
# param from Model.forward(self, a): ['pred_b', 'No use'] | |||
# Suggestion: (1). Check key assignment for `target` when initialize L1Loss. Or provide `label` in DataSet or output of Model.forward(self, a). | |||
# (2). Check key assignment for `pred` when initialize L1Loss. Or provide `pred` in DataSet or output of Model.forward(self, a). | |||
报错信息也包含两部分: | |||
1 第一部分与上面是一样的 | |||
2 这里报错的原因是由于计算loss的时候找不到相应的值(通过L1Loss.get_loss(self, pred, target)判断出来的); | |||
报错的原因是因为 `pred` 和 `label` (我们在初始化L1Loss时将target指定为了label)都没有找到。 | |||
这里'unused field'是DataSet中出现了,但却没有被设置为input或者target的field; | |||
'unused param'是forward()中返回且没有被使用到的内容;'target field'是被设置为了target的field; | |||
'param from Model.forward(self, a)'是forward()返回的所有key。"Suggestion"是关于当前错误处理的建议。 | |||
但是在一些情况下,比如forward()的返回值只有一个,target也只有一个,fastNLP不会进行名称匹配,而是直接将forward()的结果作为pred,
将DataSet中的target作为target。上面的例子在返回值中多加入一个'No use',正是为了避免这种自动匹配,从而展示名称匹配的过程。
下面是带有dev dataset时如果出现错误会发生的报错, | |||
Example2.3 | |||
:: | |||
import numpy as np | |||
from torch import nn | |||
from torch.optim import SGD | |||
from fastNLP import Trainer | |||
from fastNLP import DataSet | |||
from fastNLP import AccuracyMetric | |||
import torch | |||
class Model(nn.Module): | |||
def __init__(self): | |||
super().__init__() | |||
self.fc = nn.Linear(1, 1) | |||
def forward(self, a, b): | |||
loss = torch.mean((self.fc(a.float().unsqueeze(1))-b.float())**2) | |||
return {'loss': loss} | |||
def predict(self, a): # 使用predict()进行验证 | |||
return {'output':self.fc(a.float().unsqueeze(1))} #这里return的值不包含'pred'这个key | |||
model = Model() | |||
dataset = DataSet({'a': np.arange(10), 'b':np.arange(10)*2}) | |||
dev_data = DataSet({'a': np.arange(10, 20), 'b':np.arange(10, 20)*2}) | |||
dataset.set_input('a', 'b') | |||
dev_data.set_input('a') # 这里没有设置target | |||
trainer = Trainer(dataset, model, loss=None, optimizer=SGD(model.parameters(), lr=0.001), | |||
dev_data=dev_data, metrics=AccuracyMetric()) | |||
# 报错信息 | |||
# ... | |||
# NameError: | |||
# Problems occurred when calling AccuracyMetric.evaluate(self, pred, target, seq_len=None) | |||
# missing param: ['pred(assign to `pred` in `AccuracyMetric`)', 'target(assign to `target` in `AccuracyMetric`)'] | |||
# unused param: ['output'] | |||
# target field: [] | |||
# param from Model.predict(self, a): ['output'] | |||
# Suggestion: (1). Check key assignment for `pred` when initialize AccuracyMetric. Or provide `pred` in DataSet or output of Model.predict(self, a). | |||
# (2). Check key assignment for `target` when initialize AccuracyMetric. Or provide `target` in DataSet or output of Model.predict(self, a). | |||
报错信息和前面都是类似的,但是可以通过'AccuracyMetric.evaluate(self, pred, target, seq_len=None)'看出这里是evaluation | |||
的时候发生了错误。这样避免了需要在完成一整个epoch的训练才能发现evaluation弄错的情况。这里的修改是通过在初始化metric的时候 | |||
指明通过'output'获取`pred`, 即AccuracyMetric(pred='output')。 | |||
可以通过check_code_level调节检查的强度。默认为0,即仅在出现错误时报错停止;设为-1则关闭检查。
3 Trainer与callback | |||
虽然Trainer本身已经集成了一些功能,但仍然不足以囊括训练过程中可能用到的功能,比如负采样,learning rate decay, Early Stop等。
为了解决这个问题fastNLP引入了callback的机制,:class:`~fastNLP.Callback` 是一种在Trainer训练过程中特定阶段会运行的函数集合, | |||
所有的 :class:`~fastNLP.Callback` 都具有on_*(比如on_train_start, on_backward_begin)等函数。 | |||
如果 Callback 实现了该函数,则Trainer运行至对应阶段,会进行调用,例如:: | |||
from fastNLP import Callback, EarlyStopCallback, Trainer, CrossEntropyLoss, AccuracyMetric | |||
from fastNLP.models import CNNText | |||
import time
start_time = time.time()
class MyCallback(Callback): | |||
def on_epoch_end(self): | |||
print('{:d}ms\n\n'.format(round((time.time()-start_time)*1000))) | |||
model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1) | |||
trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data, loss=CrossEntropyLoss(), | |||
metrics=AccuracyMetric(), callbacks=[MyCallback(),EarlyStopCallback(10)]) | |||
trainer.train() | |||
这里,我们通过继承 :class:`~fastNLP.Callback` 类定义了自己的 callback,并和内置的 :class:`~fastNLP.EarlyStopCallback`
一起传给了 :class:`~fastNLP.Trainer` ,增强了 :class:`~fastNLP.Trainer` 的功能。
fastNLP已经自带了很多callback函数供使用,可以参考 :doc:`fastNLP.core.callback` 。 | |||
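callback机制的核心,不过是在训练的各个阶段依次调用每个callback的同名方法,可以简化示意如下(非fastNLP的真实实现):

```python
class Callback:
    # 示意:回调基类,各阶段方法默认什么都不做
    def on_epoch_end(self):
        pass

class CallbackManager:
    # 示意:Trainer在对应阶段调用manager,manager再依次调用每个callback
    def __init__(self, callbacks):
        self.callbacks = callbacks
    def on_epoch_end(self):
        for cb in self.callbacks:
            cb.on_epoch_end()

calls = []
class Recorder(Callback):
    def on_epoch_end(self):
        calls.append('epoch_end')

CallbackManager([Recorder(), Recorder()]).on_epoch_end()
print(calls)   # ['epoch_end', 'epoch_end']
```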
""" | |||
__all__ = [ | |||
"Trainer" | |||
] | |||
import os | |||
import time | |||
from datetime import datetime, timedelta
import numpy as np | |||
import torch | |||
import torch.nn as nn
try:
from tqdm.auto import tqdm
except:
from .utils import _pseudo_tqdm as tqdm
from .batch import Batch | |||
from .callback import CallbackManager, CallbackException | |||
from .dataset import DataSet | |||
from .losses import _prepare_losser | |||
from .metrics import _prepare_metrics | |||
from .optimizer import Optimizer | |||
from .sampler import Sampler | |||
from .sampler import RandomSampler | |||
from .sampler import SequentialSampler | |||
from .tester import Tester | |||
from .utils import _CheckError | |||
from .utils import _build_args | |||
from .utils import _check_forward_error | |||
from .utils import _check_loss_evaluate | |||
from .utils import _move_dict_value_to_device | |||
from .utils import _get_func_signature | |||
from .utils import _get_model_device | |||
from .utils import _move_model_to_device | |||
class Trainer(object): | |||
""" | |||
别名::class:`fastNLP.Trainer` :class:`fastNLP.core.trainer.Trainer` | |||
Trainer在fastNLP中用于组织单任务的训练过程,可以避免用户在不同训练任务中重复撰写 | |||
(1) epoch循环; | |||
(2) 将数据分成不同的Batch; | |||
(3) 对Batch进行pad; | |||
(4) 每个epoch结束或一定step后进行验证集验证; | |||
(5) 保存获得更好验证性能的模型等。 | |||
详细的介绍参见 :doc:`fastNLP.core.trainer` | |||
:param train_data: 训练集, :class:`~fastNLP.DataSet` 类型。 | |||
:param nn.modules model: 待训练的模型 | |||
:param optimizer: `torch.optim.Optimizer` 优化器。如果为None,则Trainer使用默认的Adam(model.parameters(), lr=4e-3)这个优化器 | |||
:param int batch_size: 训练和验证的时候的batch大小。 | |||
:param loss: 使用的 :class:`~fastNLP.core.losses.LossBase` 对象。当为None时,默认使用 :class:`~fastNLP.LossInForward` | |||
:param sampler: Batch数据生成的顺序, :class:`~fastNLP.Sampler` 类型。如果为None,默认使用 :class:`~fastNLP.RandomSampler` | |||
:param update_every: int, 多少步更新一次梯度。用于希望累计梯度的场景,比如需要128的batch_size, 但是直接设为128 | |||
会导致内存不足,通过设置batch_size=32, update_every=4达到目的。当optimizer为None时,该参数无效。 | |||
:param int n_epochs: 需要优化迭代多少次。 | |||
:param int print_every: 多少次反向传播更新tqdm显示的loss; 如果use_tqdm=False, 则多少次反向传播打印loss。 | |||
:param dev_data: 用于做验证的DataSet, :class:`~fastNLP.DataSet` 类型。 | |||
:param metrics: 验证的评估函数。可以只使用一个 :class:`Metric<fastNLP.core.metrics.MetricBase>` , | |||
也可以使用多个 :class:`Metric<fastNLP.core.metrics.MetricBase>` ,通过列表传入。 | |||
如验证时取得了更好的验证结果(如果有多个Metric,以列表中第一个Metric为准),且save_path不为None, | |||
则保存当前模型。Metric种类详见 :doc:`metrics模块 <fastNLP.core.metrics>` 。仅在传入dev_data时有效。 | |||
:param str,None metric_key: :class:`Metric<fastNLP.core.metrics.MetricBase>` 有时会有多个指标, | |||
比如 :class:`~fastNLP.core.metrics.SpanFPreRecMetric` 中包含了'f', 'pre', 'rec'。此时需 | |||
要指定以哪个指标为准。另外有些指标是越小效果越好,比如语言模型的困惑度,这种情况下,在key前面增加一个'-'来表 | |||
明验证时,值越小越好(比如: "-ppl")。仅在传入dev_data时有效。 | |||
:param int validate_every: 多少个step在验证集上验证一次; 如果为-1,则每个epoch结束验证一次。仅在传入dev_data时有效。 | |||
:param str,None save_path: 将模型保存路径。如果为None,则不保存模型。如果dev_data为None,则保存最后一次迭代的模型。 | |||
保存的时候不仅保存了参数,还保存了模型结构。即便使用DataParallel,这里也只保存模型。 | |||
:param prefetch: bool, 是否使用额外的进程来产生batch数据。理论上会使得Batch迭代更快。
:param bool use_tqdm: 是否使用tqdm来显示训练进度; 如果为False,则将loss打印在终端中。 | |||
:param str,int,torch.device,list(int) device: 将模型load到哪个设备。默认为None,即Trainer不对模型 | |||
的计算位置进行管理。支持以下的输入: | |||
1. str: ['cpu', 'cuda', 'cuda:0', 'cuda:1', ...] 依次为'cpu'中, 可见的第一个GPU中, 可见的第一个GPU中, | |||
可见的第二个GPU中; | |||
2. torch.device:将模型装载到torch.device上。 | |||
3. int: 将使用device_id为该值的gpu进行训练 | |||
4. list(int):如果多于1个device,将使用torch.nn.DataParallel包裹model, 并使用传入的device。 | |||
5. None. 为None则不对模型进行任何处理,如果传入的model为torch.nn.DataParallel该值必须为None。 | |||
已知可能会出现的问题:Adagrad优化器可能无法正常使用这个参数,请手动管理模型位置。 | |||
:param list(callbacks) callbacks: 用于在train过程中起调节作用的回调函数。比如early stop,negative sampling等可以 | |||
通过callback机制实现。 可使用的callback参见 :doc:`callback模块 <fastNLP.core.callback>` | |||
:param int check_code_level: 模型检查等级. -1: 不进行检查; 0: 仅出现错误时停止; 1: 如果有field没有被使用, | |||
报告警告信息; 2: 有任何field没有被使用都报错. 检查的原理是通过使用很小的batch(默认2个sample)来运行代码,但是 | |||
这个过程理论上不会修改任何参数,只是会检查能否运行。但如果(1)模型中存在将batch_size写为某个固定值的情况; | |||
(2)模型中存在累加前向计算次数的,可能会多计算1次。以上情况建议将check_code_level设置为-1。 | |||
""" | |||
def __init__(self, train_data, model, optimizer=None, loss=None, | |||
batch_size=32, sampler=None, update_every=1, | |||
n_epochs=10, print_every=5, | |||
dev_data=None, metrics=None, metric_key=None, | |||
validate_every=-1, save_path=None, | |||
prefetch=False, use_tqdm=True, device=None, | |||
callbacks=None, | |||
check_code_level=0): | |||
super(Trainer, self).__init__() | |||
if not isinstance(train_data, DataSet): | |||
raise TypeError(f"The type of train_data must be fastNLP.DataSet, got {type(train_data)}.") | |||
if not isinstance(model, nn.Module): | |||
raise TypeError(f"The type of model must be torch.nn.Module, got {type(model)}.") | |||
# check metrics and dev_data | |||
if (not metrics) and dev_data is not None: | |||
raise ValueError("No metric for dev_data evaluation.") | |||
if metrics and (dev_data is None): | |||
raise ValueError("No dev_data for evaluations, pass dev_data or set metrics to None. ") | |||
# check update every | |||
assert update_every >= 1, "update_every must be no less than 1." | |||
self.update_every = int(update_every) | |||
# check save_path | |||
if not (save_path is None or isinstance(save_path, str)): | |||
raise ValueError("save_path can only be None or `str`.") | |||
# prepare evaluate | |||
metrics = _prepare_metrics(metrics) | |||
# parse metric_key | |||
# increase_better is True. It means the exp result gets better if the indicator increases. | |||
# It is true by default. | |||
@@ -91,19 +432,20 @@ class Trainer(object): | |||
self.metric_key = metric_key[1:] if metric_key[0] == "+" or metric_key[0] == "-" else metric_key | |||
elif len(metrics) > 0: | |||
self.metric_key = metrics[0].__class__.__name__.lower().strip('metric') | |||
# prepare loss | |||
losser = _prepare_losser(loss) | |||
# sampler check | |||
if sampler is not None and not isinstance(sampler, Sampler):
raise ValueError("The type of sampler should be fastNLP.Sampler, got {}.".format(type(sampler)))
if check_code_level > -1: | |||
_check_code(dataset=train_data, model=model, losser=losser, metrics=metrics, dev_data=dev_data, | |||
metric_key=metric_key, check_level=check_code_level, | |||
batch_size=min(batch_size, DEFAULT_CHECK_BATCH_SIZE)) | |||
# _check_code 是 fastNLP 帮助你检查代码是否正确的方法 。如果你在错误栈中看到这行注释,请认真检查你的代码 | |||
self.train_data = train_data | |||
self.dev_data = dev_data # If None, No validation. | |||
self.model = model | |||
@@ -111,73 +453,61 @@ class Trainer(object): | |||
self.metrics = metrics | |||
self.n_epochs = int(n_epochs) | |||
self.batch_size = int(batch_size) | |||
self.save_path = save_path | |||
self.print_every = int(print_every) | |||
self.validate_every = int(validate_every) if validate_every != 0 else -1
self.best_metric_indicator = None | |||
self.best_dev_epoch = None | |||
self.best_dev_step = None | |||
self.best_dev_perf = None | |||
self.sampler = sampler if sampler is not None else RandomSampler()
self.prefetch = prefetch | |||
self.n_steps = (len(self.train_data) // self.batch_size + int( | |||
len(self.train_data) % self.batch_size != 0)) * self.n_epochs | |||
self.model = _move_model_to_device(self.model, device=device) | |||
if isinstance(optimizer, torch.optim.Optimizer): | |||
self.optimizer = optimizer | |||
elif isinstance(optimizer, Optimizer): | |||
self.optimizer = optimizer.construct_from_pytorch(model.parameters()) | |||
elif optimizer is None: | |||
self.optimizer = torch.optim.Adam(model.parameters(), lr=4e-3) | |||
else: | |||
raise TypeError("optimizer can only be torch.optim.Optimizer type, not {}.".format(type(optimizer)))
self.use_tqdm = use_tqdm | |||
self.pbar = None | |||
self.print_every = abs(self.print_every) | |||
if self.dev_data is not None: | |||
self.tester = Tester(model=self.model, | |||
data=self.dev_data, | |||
metrics=self.metrics, | |||
batch_size=self.batch_size, | |||
device=None,  # 由上面的部分处理device
verbose=0) | |||
self.step = 0 | |||
self.start_time = None # start timestamp | |||
self.callback_manager = CallbackManager(env={"trainer": self}, | |||
callbacks=callbacks) | |||
def train(self, load_best_model=True): | |||
""" | |||
开始训练过程。主要有以下几个步骤:: | |||
for epoch in range(num_epochs): | |||
# 使用Batch从DataSet中按批取出数据,并自动对DataSet中dtype为(float, int)的fields进行padding,转换为Tensor。
非float、int类型的field将不会被转换为Tensor,且不进行padding。
for batch_x, batch_y in Batch(DataSet) | |||
# batch_x是一个dict, 被设为input的field会出现在这个dict中, | |||
key为DataSet中的field_name, value为该field的value | |||
# batch_y也是一个dict,被设为target的field会出现在这个dict中, | |||
key为DataSet中的field_name, value为该field的value | |||
2. 将batch_x的数据送入到model.forward函数中,并获取结果。这里我们就是通过匹配batch_x中的key与forward函数的形 | |||
参完成参数传递。例如, | |||
forward(self, x, seq_lens) # fastNLP会在batch_x中找到key为"x"的value传递给x,key为"seq_lens"的 | |||
value传递给seq_lens。若在batch_x中没有找到所有必须要传递的参数,就会报错。如果forward存在默认参数 | |||
而且默认参数这个key没有在batch_x中,则使用默认参数。 | |||
3. 将batch_y与model.forward的结果一并送入loss中计算loss。loss计算时一般都涉及到pred与target。但是在不同情况 | |||
中,可能pred称为output或prediction, target称为y或label。fastNLP通过初始化loss时传入的映射找到pred或 | |||
target。比如在初始化Trainer时初始化loss为CrossEntropyLoss(pred='output', target='y'), 那么fastNLP计 | |||
算loss时,就会使用"output"在batch_y与forward的结果中找到pred;使用"y"在batch_y与forward的结果中找target | |||
, 并完成loss的计算。 | |||
4. 获取到loss之后,进行反向求导并更新梯度 | |||
根据需要适时进行验证集测试
根据metrics进行evaluation,并根据是否提供了save_path判断是否存储模型 | |||
使用该函数使Trainer开始训练。 | |||
:param bool load_best_model: 该参数只有在初始化提供了dev_data的情况下有效,如果True, trainer将在返回之前重新加载dev表现 | |||
最好的模型参数。 | |||
:return dict: 返回一个字典类型的数据, | |||
内含以下内容:: | |||
seconds: float, 表示训练时长 | |||
以下三个内容只有在提供了dev_data的情况下会有。 | |||
best_eval: Dict of Dict, 表示evaluation的结果。第一层的key为Metric的名称,第二层的key为具体的Metric | |||
best_epoch: int,在第几个epoch取得的最佳值 | |||
best_step: int, 在第几个step(batch)更新取得的最佳值 | |||
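上面第4步中配合update_every的梯度累积,可以用下面的纯Python代码示意(`train_steps`为假设的函数名,对应_train/_update中按step取模的判断):

```python
def train_steps(n_batches, update_every):
    # 示意:梯度累积——每个batch都backward,但每update_every个batch才真正更新一次参数
    updates = []
    for step in range(n_batches):           # step从0开始计数
        # loss = loss / update_every; loss.backward()  在每个batch都执行
        if (step + 1) % update_every == 0:  # 对应_update()中的判断
            updates.append(step)            # 此时才调用optimizer.step()
    return updates

print(train_steps(8, 4))   # [3, 7]:在第4、8个batch处更新参数
```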
""" | |||
results = {} | |||
@@ -186,25 +516,24 @@ class Trainer(object): | |||
results['seconds'] = 0. | |||
return results | |||
try: | |||
self._model_device = _get_model_device(self.model)
self._mode(self.model, is_test=False) | |||
self._load_best_model = load_best_model | |||
self.start_time = str(datetime.now().strftime('%Y-%m-%d-%H-%M-%S')) | |||
start_time = time.time() | |||
print("training epochs started " + self.start_time, flush=True) | |||
try: | |||
self.callback_manager.on_train_begin() | |||
self._train() | |||
self.callback_manager.on_train_end()
except (CallbackException, KeyboardInterrupt) as e: | |||
self.callback_manager.on_exception(e) | |||
if self.dev_data is not None and hasattr(self, 'best_dev_perf'): | |||
print( | |||
"\nIn Epoch:{}/Step:{}, got best dev performance:".format(self.best_dev_epoch, self.best_dev_step) + | |||
self.tester._format_eval_results(self.best_dev_perf), ) | |||
results['best_eval'] = self.best_dev_perf | |||
results['best_epoch'] = self.best_dev_epoch | |||
results['best_step'] = self.best_dev_step | |||
@@ -218,49 +547,55 @@ class Trainer(object): | |||
finally: | |||
pass | |||
results['seconds'] = round(time.time() - start_time, 2) | |||
return results | |||
def _train(self): | |||
if not self.use_tqdm: | |||
from fastNLP.core.utils import _pseudo_tqdm as inner_tqdm
else: | |||
inner_tqdm = tqdm | |||
self.step = 0 | |||
self.epoch = 0 | |||
start = time.time() | |||
with inner_tqdm(total=self.n_steps, postfix='loss:{0:<6.5f}', leave=False, dynamic_ncols=True) as pbar:
self.pbar = pbar | |||
avg_loss = 0 | |||
data_iterator = Batch(self.train_data, batch_size=self.batch_size, sampler=self.sampler, as_numpy=False, | |||
prefetch=self.prefetch) | |||
self.batch_per_epoch = data_iterator.num_batches | |||
for epoch in range(1, self.n_epochs + 1): | |||
self.epoch = epoch | |||
pbar.set_description_str(desc="Epoch {}/{}".format(epoch, self.n_epochs)) | |||
# early stopping | |||
self.callback_manager.on_epoch_begin()
for batch_x, batch_y in data_iterator: | |||
_move_dict_value_to_device(batch_x, batch_y, device=self._model_device) | |||
indices = data_iterator.get_batch_indices() | |||
# negative sampling; replace unknown; re-weight batch_y | |||
self.callback_manager.on_batch_begin(batch_x, batch_y, indices) | |||
prediction = self._data_forward(self.model, batch_x) | |||
# edit prediction | |||
self.callback_manager.on_loss_begin(batch_y, prediction) | |||
loss = self._compute_loss(prediction, batch_y).mean()
avg_loss += loss.item() | |||
loss = loss / self.update_every | |||
# Is loss NaN or inf? requires_grad = False | |||
self.callback_manager.on_backward_begin(loss)
self._grad_backward(loss) | |||
self.callback_manager.on_backward_end()
self._update() | |||
self.callback_manager.on_step_end()
if self.step % self.print_every == 0:
avg_loss = float(avg_loss) / self.print_every | |||
if self.use_tqdm: | |||
print_output = "loss:{0:<6.5f}".format(avg_loss)
pbar.update(self.print_every) | |||
else: | |||
end = time.time() | |||
@@ -269,43 +604,45 @@ class Trainer(object): | |||
epoch, self.step, avg_loss, diff) | |||
pbar.set_postfix_str(print_output) | |||
avg_loss = 0 | |||
self.step += 1 | |||
self.callback_manager.on_batch_end() | |||
if ((self.validate_every > 0 and self.step % self.validate_every == 0) or | |||
(self.validate_every < 0 and self.step % len(data_iterator) == 0)) \ | |||
and self.dev_data is not None: | |||
eval_res = self._do_validation(epoch=epoch, step=self.step) | |||
eval_str = "Evaluation at Epoch {}/{}. Step:{}/{}. ".format(epoch, self.n_epochs, self.step, | |||
self.n_steps) + \
self.tester._format_eval_results(eval_res) | |||
pbar.write(eval_str + '\n')
# ================= mini-batch end ==================== # | |||
# lr decay; early stopping | |||
self.callback_manager.on_epoch_end()
# =============== epochs end =================== # | |||
pbar.close() | |||
self.pbar = None | |||
# ============ tqdm end ============== # | |||
def _do_validation(self, epoch, step): | |||
self.callback_manager.on_valid_begin() | |||
res = self.tester.test() | |||
is_better_eval = False | |||
if self._better_eval_result(res): | |||
if self.save_path is not None: | |||
self._save_model(self.model, | |||
"best_" + "_".join([self.model.__class__.__name__, self.metric_key, self.start_time]))
elif self._load_best_model:
self._best_model_states = {name: param.cpu().clone() for name, param in self.model.named_parameters()} | |||
self.best_dev_perf = res | |||
self.best_dev_epoch = epoch | |||
self.best_dev_step = step | |||
is_better_eval = True | |||
# get validation results; adjust optimizer | |||
self.callback_manager.on_valid_end(res, self.metric_key, self.optimizer, is_better_eval)
return res | |||
def _mode(self, model, is_test=False): | |||
"""Train mode or Test mode. This is for PyTorch currently. | |||
@@ -317,20 +654,22 @@ class Trainer(object): | |||
model.eval() | |||
else: | |||
model.train() | |||
def _update(self): | |||
"""Perform weight update on a model. | |||
""" | |||
if self.optimizer is not None and (self.step + 1) % self.update_every == 0:
self.optimizer.step()
def _data_forward(self, network, x): | |||
x = _build_args(network.forward, **x) | |||
y = network(**x) | |||
if not isinstance(y, dict): | |||
raise TypeError(f"The return value of {get_func_signature(network.forward)} should be dict, got {type(y)}.") | |||
raise TypeError( | |||
f"The return value of {_get_func_signature(network.forward)} should be dict, got {type(y)}.") | |||
return y | |||
    def _grad_backward(self, loss):
        """Compute gradient with link rules.
@@ -338,9 +677,10 @@ class Trainer(object):
        For PyTorch, just do "loss.backward()"
        """
        if self.step % self.update_every == 0:
            self.model.zero_grad()
        loss.backward()

    def _compute_loss(self, predict, truth):
        """Compute loss given prediction and ground truth.
@@ -349,7 +689,7 @@ class Trainer(object):
        :return: a scalar
        """
        return self.losser(predict, truth)
    def _save_model(self, model, model_name, only_param=False):
        """ Save a state_dict or model stripped of GPU-specific information.

        :param model:
@@ -359,6 +699,10 @@ class Trainer(object):
        """
        if self.save_path is not None:
            model_path = os.path.join(self.save_path, model_name)
            if not os.path.exists(self.save_path):
                os.makedirs(self.save_path, exist_ok=True)
            if isinstance(model, nn.DataParallel):
                model = model.module
            if only_param:
                state_dict = model.state_dict()
                for key in state_dict:
@@ -367,8 +711,8 @@ class Trainer(object):
            else:
                model.cpu()
                torch.save(model, model_path)
                model.to(self._model_device)
    def _load_model(self, model, model_name, only_param=False):
        # returns a bool indicating whether the model was successfully reloaded
        if self.save_path is not None:
@@ -377,13 +721,16 @@ class Trainer(object):
                states = torch.load(model_path)
            else:
                states = torch.load(model_path).state_dict()
            if isinstance(model, nn.DataParallel):
                model.module.load_state_dict(states)
            else:
                model.load_state_dict(states)
        elif hasattr(self, "_best_model_states"):
            model.load_state_dict(self._best_model_states)
        else:
            return False
        return True
    def _better_eval_result(self, metrics):
        """Check if the current epoch yields better validation results.
@@ -411,6 +758,7 @@ class Trainer(object):
DEFAULT_CHECK_BATCH_SIZE = 2
DEFAULT_CHECK_NUM_BATCH = 2


def _get_value_info(_dict):
    # given a dict value, return information about this dict's value. Return list of str
    strs = []
@@ -427,27 +775,28 @@ def _get_value_info(_dict):
        strs.append(_str)
    return strs
def _check_code(dataset, model, losser, metrics, batch_size=DEFAULT_CHECK_BATCH_SIZE,
                dev_data=None, metric_key=None,
                check_level=0):
    # check the get_loss method
    model_device = model.parameters().__next__().device
    batch = Batch(dataset=dataset, batch_size=batch_size, sampler=SequentialSampler())
    for batch_count, (batch_x, batch_y) in enumerate(batch):
        _move_dict_value_to_device(batch_x, batch_y, device=model_device)
        # forward check
        if batch_count == 0:
            info_str = ""
            input_fields = _get_value_info(batch_x)
            target_fields = _get_value_info(batch_y)
            if len(input_fields) > 0:
                info_str += "input fields after batch(if batch size is {}):\n".format(batch_size)
                info_str += "\n".join(input_fields)
                info_str += '\n'
            else:
                raise RuntimeError("There is no input field.")
            if len(target_fields) > 0:
                info_str += "target fields after batch(if batch size is {}):\n".format(batch_size)
                info_str += "\n".join(target_fields)
                info_str += '\n'
@@ -455,14 +804,14 @@ def _check_code(dataset, model, losser, metrics, batch_size=DEFAULT_CHECK_BATCH_
                info_str += 'There is no target field.'
            print(info_str)
            _check_forward_error(forward_func=model.forward, dataset=dataset,
                                 batch_x=batch_x, check_level=check_level)

        refined_batch_x = _build_args(model.forward, **batch_x)
        pred_dict = model(**refined_batch_x)
        func_signature = _get_func_signature(model.forward)
        if not isinstance(pred_dict, dict):
            raise TypeError(f"The return value of {func_signature} should be `dict`, not `{type(pred_dict)}`.")
try: | |||
loss = losser(pred_dict, batch_y) | |||
@@ -470,23 +819,23 @@ def _check_code(dataset, model, losser, metrics, batch_size=DEFAULT_CHECK_BATCH_ | |||
if batch_count == 0: | |||
if not isinstance(loss, torch.Tensor): | |||
raise TypeError( | |||
f"The return value of {get_func_signature(losser.get_loss)} should be `torch.Tensor`, " | |||
f"The return value of {_get_func_signature(losser.get_loss)} should be `torch.Tensor`, " | |||
f"but got `{type(loss)}`.") | |||
if len(loss.size()) != 0: | |||
raise ValueError( | |||
f"The size of return value of {get_func_signature(losser.get_loss)} is {loss.size()}, " | |||
f"The size of return value of {_get_func_signature(losser.get_loss)} is {loss.size()}, " | |||
f"should be torch.size([])") | |||
loss.backward() | |||
except CheckError as e: | |||
# TODO: another error raised if CheckError caught | |||
pre_func_signature = get_func_signature(model.forward) | |||
except _CheckError as e: | |||
# TODO: another error raised if _CheckError caught | |||
pre_func_signature = _get_func_signature(model.forward) | |||
_check_loss_evaluate(prev_func_signature=pre_func_signature, func_signature=e.func_signature, | |||
check_res=e.check_res, pred_dict=pred_dict, target_dict=batch_y, | |||
dataset=dataset, check_level=check_level) | |||
model.zero_grad() | |||
if batch_count + 1 >= DEFAULT_CHECK_NUM_BATCH: | |||
break | |||
if dev_data is not None: | |||
tester = Tester(data=dev_data[:batch_size * DEFAULT_CHECK_NUM_BATCH], model=model, metrics=metrics, | |||
batch_size=batch_size, verbose=-1) | |||
@@ -500,7 +849,7 @@ def _check_eval_results(metrics, metric_key, metric_list):
    # metric_list: the metrics used for evaluation, from the Trainer's initialization
    if isinstance(metrics, tuple):
        loss, metrics = metrics

    if isinstance(metrics, dict):
        if len(metrics) == 1:
            # only single metric, just use it
@@ -511,7 +860,7 @@ def _check_eval_results(metrics, metric_key, metric_list):
        if metrics_name not in metrics:
            raise RuntimeError(f"{metrics_name} is chosen to do validation, but got {metrics}")
        metric_dict = metrics[metrics_name]

        if len(metric_dict) == 1:
            indicator_val, indicator = list(metric_dict.values())[0], list(metric_dict.keys())[0]
        elif len(metric_dict) > 1 and metric_key is None:
@@ -1,59 +1,274 @@
"""
The utils module implements many tools used inside and outside fastNLP. The one intended for users is the
:func:`cache_results` decorator.
"""
__all__ = [
    "cache_results",
    "seq_len_to_mask"
]

import _pickle
import inspect
import os
import warnings
from collections import Counter, namedtuple

import numpy as np
import torch
import torch.nn as nn

_CheckRes = namedtuple('_CheckRes', ['missing', 'unused', 'duplicated', 'required', 'all_needed',
                                     'varargs'])
def _prepare_cache_filepath(filepath):
    """
    Check whether filepath can serve as a valid cache file path. If it can, the directory is created automatically.

    :param filepath: str.
    :return: None; if the path is invalid, this function raises an error.
    """
    _cache_filepath = os.path.abspath(filepath)
    if os.path.isdir(_cache_filepath):
        raise RuntimeError("The cache_file_path must be a file, not a directory.")
    cache_dir = os.path.dirname(_cache_filepath)
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)


# TODO: also save the arguments used when caching; warn on load if they differ.
def cache_results(_cache_fp, _refresh=False, _verbose=1):
    """
    Alias :class:`fastNLP.cache_results` :class:`fastNLP.core.utils.cache_results`

    cache_results is the decorator fastNLP uses to cache data. The example below shows how to use it::

        import time
        import numpy as np
        from fastNLP import cache_results

        @cache_results('cache.pkl')
        def process_data():
            # some time-consuming work, e.g. reading and preprocessing data; time.sleep() stands in for it here
            time.sleep(1)
            return np.random.randint(10, size=(5,))

        start_time = time.time()
        print("res =", process_data())
        print(time.time() - start_time)

        start_time = time.time()
        print("res =", process_data())
        print(time.time() - start_time)

        # The output is as follows: the two results are identical, and the second call takes almost no time
        # Save cache to cache.pkl.
        # res = [5 4 9 1 8]
        # 1.0042750835418701
        # Read cache from cache.pkl.
        # res = [5 4 9 1 8]
        # 0.0040721893310546875

    The second run takes only about 0.004s because it reads the data directly from cache.pkl instead of running the
    preprocessing again.

    Example::

        # Continuing the example above: to generate another cache, e.g. for another dataset, call it like this
        process_data(_cache_fp='cache2.pkl')  # does not affect the previous 'cache.pkl' at all

    _cache_fp above is a parameter recognized by cache_results: it caches/reads the data at 'cache2.pkl', i.e.
    'cache2.pkl' here overrides the default 'cache.pkl'. If you decorate your function with @cache_results(), it gains
    three extra parameters [_cache_fp, _refresh, _verbose]; the example above shows the _cache_fp case. These three
    parameters are not passed into your function, so your own parameter names must not include these three names.

    Example::

        process_data(_cache_fp='cache2.pkl', _refresh=True)  # force regeneration of the preprocessing cache
        # _verbose controls the output: 0 prints nothing; 1 reports whether the cache was read or newly generated

    :param str _cache_fp: where to cache the returned result, or where to read the cache from. If None, cache_results
        has no effect unless _cache_fp is passed at call time.
    :param bool _refresh: whether to regenerate the cache.
    :param int _verbose: whether to print cache information.
    :return:
    """
    def wrapper_(func):
        signature = inspect.signature(func)
        for key, _ in signature.parameters.items():
            if key in ('_cache_fp', '_refresh', '_verbose'):
                raise RuntimeError("The function decorated by cache_results cannot have keyword `{}`.".format(key))

        def wrapper(*args, **kwargs):
            if '_cache_fp' in kwargs:
                cache_filepath = kwargs.pop('_cache_fp')
                assert isinstance(cache_filepath, str), "_cache_fp can only be str."
            else:
                cache_filepath = _cache_fp
            if '_refresh' in kwargs:
                refresh = kwargs.pop('_refresh')
                assert isinstance(refresh, bool), "_refresh can only be bool."
            else:
                refresh = _refresh
            if '_verbose' in kwargs:
                verbose = kwargs.pop('_verbose')
                assert isinstance(verbose, int), "_verbose can only be integer."
            else:
                verbose = _verbose
            refresh_flag = True

            if cache_filepath is not None and refresh is False:
                # load data
                if os.path.exists(cache_filepath):
                    with open(cache_filepath, 'rb') as f:
                        results = _pickle.load(f)
                    if verbose == 1:
                        print("Read cache from {}.".format(cache_filepath))
                    refresh_flag = False

            if refresh_flag:
                results = func(*args, **kwargs)
                if cache_filepath is not None:
                    if results is None:
                        raise RuntimeError("The return value is None. Delete the decorator.")
                    _prepare_cache_filepath(cache_filepath)
                    with open(cache_filepath, 'wb') as f:
                        _pickle.dump(results, f)
                    print("Save cache to {}.".format(cache_filepath))

            return results

        return wrapper

    return wrapper_
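The wrapper above boils down to a classic pickle-backed memoization pattern. Below is a minimal, self-contained sketch of that same pattern; `simple_cache` and `slow_compute` are illustrative names, not part of fastNLP, and the sketch omits the `_refresh`/`_verbose` handling:

```python
# Sketch of the caching pattern cache_results implements: pickle the wrapped
# function's return value on the first call, read it back on later calls.
import os
import pickle
import tempfile
from functools import wraps


def simple_cache(cache_fp):
    def deco(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if os.path.exists(cache_fp):
                with open(cache_fp, 'rb') as f:
                    return pickle.load(f)       # cache hit: skip the computation
            result = func(*args, **kwargs)      # cache miss: compute and store
            with open(cache_fp, 'wb') as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return deco


cache_path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')
calls = []


@simple_cache(cache_path)
def slow_compute():
    calls.append(1)                             # track how often the body runs
    return [1, 2, 3]


first = slow_compute()
second = slow_compute()                         # served from the pickle file
```

The real decorator adds argument validation and cache-directory creation on top of this core, but the read-or-recompute branch is the same.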
# def save_pickle(obj, pickle_path, file_name):
#     """Save an object into a pickle file.
#
#     :param obj: an object
#     :param pickle_path: str, the directory where the pickle file is to be saved
#     :param file_name: str, the name of the pickle file. In general, it should be ended by "pkl".
#     """
#     if not os.path.exists(pickle_path):
#         os.mkdir(pickle_path)
#         print("make dir {} before saving pickle file".format(pickle_path))
#     with open(os.path.join(pickle_path, file_name), "wb") as f:
#         _pickle.dump(obj, f)
#     print("{} saved in {}".format(file_name, pickle_path))
#
#
# def load_pickle(pickle_path, file_name):
#     """Load an object from a given pickle file.
#
#     :param pickle_path: str, the directory where the pickle file is.
#     :param file_name: str, the name of the pickle file.
#     :return obj: an object stored in the pickle
#     """
#     with open(os.path.join(pickle_path, file_name), "rb") as f:
#         obj = _pickle.load(f)
#     print("{} loaded from {}".format(file_name, pickle_path))
#     return obj
#
#
# def pickle_exist(pickle_path, pickle_name):
#     """Check if a given pickle file exists in the directory.
#
#     :param pickle_path: the directory of target pickle file
#     :param pickle_name: the filename of target pickle file
#     :return: True if file exists else False
#     """
#     if not os.path.exists(pickle_path):
#         os.makedirs(pickle_path)
#     file_name = os.path.join(pickle_path, pickle_name)
#     if os.path.exists(file_name):
#         return True
#     else:
#         return False
def _move_model_to_device(model, device):
    """
    Move model to device.

    :param model: torch.nn.DataParallel or torch.nn.Module. When it is torch.nn.DataParallel, cuda() is simply called
        once; device must be None.
    :param str,int,torch.device,list(int),list(torch.device) device: the device to load the model onto. Default None,
        i.e. the Trainer does not manage where the model computes. The following inputs are supported:

        1. str: ['cpu', 'cuda', 'cuda:0', 'cuda:1', ...] meaning the CPU, the first visible GPU, the first visible
        GPU, and the second visible GPU, respectively;

        2. torch.device: load the model onto that torch.device.

        3. int: train on the GPU whose device_id equals this value.

        4. list(int): if more than one device is given, wrap model in torch.nn.DataParallel using the given devices.

        5. None: do nothing to the model; if the given model is torch.nn.DataParallel, this value must be None.

    :return: torch.nn.DataParallel or torch.nn.Module
    """
    if isinstance(model, torch.nn.parallel.DistributedDataParallel):
        raise RuntimeError("model of `torch.nn.parallel.DistributedDataParallel` is not supported right now.")

    if device is None:
        if isinstance(model, torch.nn.DataParallel):
            model.cuda()
        return model
    else:
        if not torch.cuda.is_available() and (
                device != 'cpu' or (isinstance(device, torch.device) and device.type != 'cpu')):
            raise ValueError("There is no usable gpu. set `device` as `cpu` or `None`.")

    if isinstance(model, torch.nn.DataParallel):
        raise RuntimeError("When model is `torch.nn.DataParallel`, the device has to be `None`.")

    if isinstance(device, int):
        assert device > -1, "device can only be non-negative integer"
        assert torch.cuda.device_count() > device, "Only has {} gpus, cannot use device {}.".format(
            torch.cuda.device_count(),
            device)
        device = torch.device('cuda:{}'.format(device))
    elif isinstance(device, str):
        device = torch.device(device)
        if device.type == 'cuda' and device.index is not None:
            assert device.index < torch.cuda.device_count(), "Only has {} gpus, cannot use device cuda:{}.".format(
                torch.cuda.device_count(),
                device)
    elif isinstance(device, torch.device):
        if device.type == 'cuda' and device.index is not None:
            assert device.index < torch.cuda.device_count(), "Only has {} gpus, cannot use device cuda:{}.".format(
                torch.cuda.device_count(),
                device)
    elif isinstance(device, list):
        types = set([type(d) for d in device])
        assert len(types) == 1, "Mixed type in device, only `int` allowed."
        assert list(types)[0] == int, "Only int supported for multiple devices."
        assert len(set(device)) == len(device), "Duplicated device id found in device."
        for d in device:
            assert d > -1, "Only non-negative device id allowed."
        if len(device) > 1:
            output_device = device[0]
            model = nn.DataParallel(model, device_ids=device, output_device=output_device)
        device = torch.device(device[0])
    else:
        raise TypeError("Unsupported device type.")
    model = model.to(device)
    return model
def _get_model_device(model):
    """
    Given an nn.Module model, return the device it resides on.

    :param model: nn.Module
    :return: torch.device or None; None means the model has no parameters at all.
    """
    assert isinstance(model, nn.Module)

    parameters = list(model.parameters())
    if len(parameters) == 0:
        return None
    else:
        return parameters[0].device
def _build_args(func, **kwargs):
@@ -126,30 +341,35 @@ def _check_arg_dict_list(func, args):
    missing = list(require_args - input_args)
    unused = list(input_args - all_args)
    varargs = [] if not spect.varargs else [spect.varargs]
    return _CheckRes(missing=missing,
                     unused=unused,
                     duplicated=duplicated,
                     required=list(require_args),
                     all_needed=list(all_args),
                     varargs=varargs)
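The bookkeeping that fills `_CheckRes` can be sketched with the standard `inspect` module. `check_args` below is a hypothetical stand-in for `_check_arg_dict_list`, reduced to the `missing`/`unused` fields:

```python
# Compare the keys a caller supplies against a function's signature to find
# required parameters that are missing and supplied keys the function ignores.
import inspect


def check_args(func, supplied):
    spec = inspect.getfullargspec(func)
    args = [a for a in spec.args if a != 'self']
    defaults = spec.defaults or ()
    required = set(args[:len(args) - len(defaults)])  # params without defaults
    all_args = set(args)
    supplied = set(supplied)
    return {'missing': sorted(required - supplied),
            'unused': sorted(supplied - all_args)}


def forward(words, seq_len, mask=None):
    pass


res = check_args(forward, {'words', 'target'})
```

Here `seq_len` is required but not supplied, and `target` is supplied but unknown to `forward`, which is exactly the situation the error messages further down diagnose.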
def _get_func_signature(func):
    """

    Given a function or method, return its signature.
    For example:

    1 function::

        def func(a, b='a', *args):
            xxxx
        _get_func_signature(func) # 'func(a, b='a', *args)'

    2 method::

        class Demo:
            def __init__(self):
                xxx
            def forward(self, a, b='a', **args)
        demo = Demo()
        _get_func_signature(demo.forward) # 'Demo.forward(self, a, b='a', **args)'

    :param func: a function or a method
    :return: str or None
    """
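Signature strings like those in the docstring above can be produced with `inspect`. `build_signature` below is a hypothetical helper, not the actual `_get_func_signature` implementation; note that `inspect.signature` drops `self` for bound methods, unlike the docstring examples:

```python
# Build a 'name(params)' string for a function or bound method using inspect.
import inspect


def build_signature(func):
    sig = str(inspect.signature(func))               # e.g. "(a, b='a', *args)"
    if inspect.ismethod(func):
        qualname = func.__self__.__class__.__name__ + '.' + func.__name__
    else:
        qualname = func.__name__
    return qualname + sig


def func(a, b='a', *args):
    pass


class Demo:
    def forward(self, a, b='a'):
        pass


s1 = build_signature(func)          # "func(a, b='a', *args)"
s2 = build_signature(Demo().forward)  # "Demo.forward(a, b='a')"
```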
@@ -195,9 +415,12 @@ def _move_dict_value_to_device(*args, device: torch.device, non_blocking=False):
    :param args:
    :return:
    """
    if not torch.cuda.is_available():
        return

    if not isinstance(device, torch.device):
        raise TypeError(f"device must be `torch.device`, got `{type(device)}`")

    for arg in args:
        if isinstance(arg, dict):
            for key, value in arg.items():
@@ -207,15 +430,15 @@ def _move_dict_value_to_device(*args, device: torch.device, non_blocking=False):
            raise TypeError("Only support `dict` type right now.")
class _CheckError(Exception):
    """
    _CheckError. Used in losses.LossBase, metrics.MetricBase.
    """

    def __init__(self, check_res: _CheckRes, func_signature: str):
        errs = [f'Problems occurred when calling `{func_signature}`']

        if check_res.varargs:
            errs.append(f"\tvarargs: {check_res.varargs}(Does not support pass positional arguments, please delete it)")
        if check_res.missing:
@@ -224,9 +447,9 @@ class CheckError(Exception):
            errs.append(f"\tduplicated param: {check_res.duplicated}")
        if check_res.unused:
            errs.append(f"\tunused param: {check_res.unused}")

        Exception.__init__(self, '\n'.join(errs))

        self.check_res = check_res
        self.func_signature = func_signature

@@ -236,7 +459,7 @@ WARNING_CHECK_LEVEL = 1
STRICT_CHECK_LEVEL = 2


def _check_loss_evaluate(prev_func_signature: str, func_signature: str, check_res: _CheckRes,
                         pred_dict: dict, target_dict: dict, dataset, check_level=0):
    errs = []
    unuseds = []
@@ -246,7 +469,7 @@ def _check_loss_evaluate(prev_func_signature: str, func_signature: str, check_re
    # if check_res.varargs:
    #     errs.append(f"\tvarargs: *{check_res.varargs}")
    #     suggestions.append(f"Does not support pass positional arguments, please delete *{check_res.varargs}.")

    if check_res.unused:
        for _unused in check_res.unused:
            if _unused in target_dict:
@@ -256,20 +479,19 @@ def _check_loss_evaluate(prev_func_signature: str, func_signature: str, check_re
        if _unused_field:
            unuseds.append(f"\tunused field: {_unused_field}")
        if _unused_param:
            unuseds.append(f"\tunused param: {_unused_param}")  # output from predict or forward

    module_name = func_signature.split('.')[0]
    if check_res.missing:
        errs.append(f"\tmissing param: {check_res.missing}")
        import re
        mapped_missing = []  # parameters for which a mapping was provided
        unmapped_missing = []  # parameters with no mapping specified
        input_func_map = {}
        for _miss_ in check_res.missing:
            # they should look like 'SomeParam(assign to xxx)'
            _miss = _miss_.split('(')[0]
            matches = re.findall("(?<=`)[a-zA-Z0-9]*?(?=`)", _miss_)
            if len(matches) == 2:
                fun_arg, module_name = matches
                input_func_map[_miss] = fun_arg
@@ -279,50 +501,50 @@ def _check_loss_evaluate(prev_func_signature: str, func_signature: str, check_re
                mapped_missing.append(_miss)
            else:
                unmapped_missing.append(_miss)

        for _miss in mapped_missing + unmapped_missing:
            if _miss in dataset:
                suggestions.append(f"Set `{_miss}` as target.")
            else:
                _tmp = ''
                if check_res.unused:
                    _tmp = f"Check key assignment for `{input_func_map.get(_miss,_miss)}` when initialize {module_name}."
                if _tmp:
                    _tmp += f' Or provide `{_miss}` in DataSet or output of {prev_func_signature}.'
                else:
                    _tmp = f'Provide `{_miss}` in DataSet or output of {prev_func_signature}.'
                suggestions.append(_tmp)
        # for _miss in unmapped_missing:
        #     if _miss in dataset:
        #         suggestions.append(f"Set `{_miss}` as target.")
        #     else:
        #         _tmp = ''
        #         if check_res.unused:
        #             _tmp = f"Specify your assignment for `{input_func_map.get(_miss, _miss)}` when initialize {module_name}."
        #         if _tmp:
        #             _tmp += f' Or provide `{_miss}` in DataSet or output of {prev_func_signature}.'
        #         else:
        #             _tmp = f'Provide `{_miss}` in output of {prev_func_signature} or DataSet.'
        #         suggestions.append(_tmp)

    if check_res.duplicated:
        errs.append(f"\tduplicated param: {check_res.duplicated}.")
        suggestions.append(f"Delete {check_res.duplicated} in the output of "
                           f"{prev_func_signature} or do not set {check_res.duplicated} as targets. ")

    if len(errs) > 0:
        errs.extend(unuseds)
    elif check_level == STRICT_CHECK_LEVEL:
        errs.extend(unuseds)

    if len(errs) > 0:
        errs.insert(0, f'Problems occurred when calling {func_signature}')
        sugg_str = ""
        if len(suggestions) > 1:
            for idx, sugg in enumerate(suggestions):
                if idx > 0:
                    sugg_str += '\t\t\t'
                sugg_str += f'({idx + 1}). {sugg}\n'
            sugg_str = sugg_str[:-1]
        else:
            sugg_str += suggestions[0]
@@ -337,14 +559,15 @@ def _check_loss_evaluate(prev_func_signature: str, func_signature: str, check_re
        _unused_warn = f'{check_res.unused} is not used by {module_name}.'
        warnings.warn(message=_unused_warn)


def _check_forward_error(forward_func, batch_x, dataset, check_level):
    check_res = _check_arg_dict_list(forward_func, batch_x)
    func_signature = _get_func_signature(forward_func)

    errs = []
    suggestions = []
    _unused = []

    # if check_res.varargs:
    #     errs.append(f"\tvarargs: {check_res.varargs}")
    #     suggestions.append(f"Does not support pass positional arguments, please delete *{check_res.varargs}.")
@@ -365,20 +588,20 @@ def _check_forward_error(forward_func, batch_x, dataset, check_level):
            #     _tmp += f"Or you might find it in `unused field:`, you can use DataSet.rename_field() to " \
            #             f"rename the field in `unused field:`."
            suggestions.append(_tmp)

    if check_res.unused:
        _unused = [f"\tunused field: {check_res.unused}"]
    if len(errs) > 0:
        errs.extend(_unused)
    elif check_level == STRICT_CHECK_LEVEL:
        errs.extend(_unused)

    if len(errs) > 0:
        errs.insert(0, f'Problems occurred when calling {func_signature}')
        sugg_str = ""
        if len(suggestions) > 1:
            for idx, sugg in enumerate(suggestions):
                sugg_str += f'({idx + 1}). {sugg}'
        else:
            sugg_str += suggestions[0]
        err_str = '\n' + '\n'.join(errs) + '\n\tSuggestion: ' + sugg_str
@@ -389,72 +612,66 @@ def _check_forward_error(forward_func, batch_x, dataset, check_level):
        warnings.warn(message=_unused_warn)
def seq_len_to_mask(seq_len):
    """
    Convert a 1-d array of sequence lengths into a 2-d mask; positions beyond each length are 0.

    Example::

        >>> seq_len = torch.arange(2, 16)
        >>> mask = seq_len_to_mask(seq_len)
        >>> print(mask.size())
        torch.Size([14, 15])
        >>> seq_len = np.arange(2, 16)
        >>> mask = seq_len_to_mask(seq_len)
        >>> print(mask.shape)
        (14, 15)

    :param np.ndarray,torch.LongTensor seq_len: shape (B,)
    :return: np.ndarray or torch.Tensor of shape (B, max_length); elements are bool or torch.uint8
    """
    if isinstance(seq_len, np.ndarray):
        assert len(np.shape(seq_len)) == 1, f"seq_len can only have one dimension, got {len(np.shape(seq_len))}."
        max_len = int(seq_len.max())
        broad_cast_seq_len = np.tile(np.arange(max_len), (len(seq_len), 1))
        mask = broad_cast_seq_len < seq_len.reshape(-1, 1)

    elif isinstance(seq_len, torch.Tensor):
        assert seq_len.dim() == 1, f"seq_len can only have one dimension, got {seq_len.dim()}."
        batch_size = seq_len.size(0)
        max_len = seq_len.max().long()
        broad_cast_seq_len = torch.arange(max_len).expand(batch_size, -1).to(seq_len)
        mask = broad_cast_seq_len.lt(seq_len.unsqueeze(1))
    else:
        raise TypeError("Only support 1-d numpy.ndarray or 1-d torch.Tensor.")

    return mask
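Both branches above implement the same broadcast-compare rule: position j of row i is valid iff j < seq_len[i]. The dependency-free sketch below mirrors the numpy branch with plain lists; `seq_len_to_mask_list` is an illustrative name, not part of fastNLP:

```python
# Pure-Python version of the masking rule: row i gets seq_len[i] leading True
# values followed by False padding, out to the longest sequence in the batch.
def seq_len_to_mask_list(seq_len):
    max_len = max(seq_len)
    return [[j < n for j in range(max_len)] for n in seq_len]


mask = seq_len_to_mask_list([1, 3, 2])
# row 0: [True, False, False]
# row 1: [True, True,  True]
# row 2: [True, True,  False]
```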
class _pseudo_tqdm:
    """
    When tqdm cannot be imported, or use_tqdm is set to False in Trainer, use this class to print instead.
    """

    def __init__(self, **kwargs):
        pass

    def write(self, info):
        print(info)

    def set_postfix_str(self, info):
        print(info)

    def __getattr__(self, item):
        def pass_func(*args, **kwargs):
            pass

        return pass_func

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        del self
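The class above supports a common graceful-degradation pattern: try the real tqdm, fall back to a stand-in with the same surface. The sketch below is self-contained; `FakeBar` mirrors the idea of `_pseudo_tqdm` (capturing instead of printing so it can be checked), and the try/except import is the assumed usage pattern, not fastNLP's exact code:

```python
# A print-free stand-in that swallows any tqdm-style call it does not define.
class FakeBar:
    def __init__(self, **kwargs):
        self.lines = []

    def write(self, info):
        self.lines.append(info)           # stand-in for print(info)

    def __getattr__(self, item):          # absorb update(), set_postfix_str(), ...
        return lambda *args, **kwargs: None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        pass


try:
    from tqdm import tqdm as progress_bar  # real progress bar when available
except ImportError:
    progress_bar = FakeBar                 # silent fallback otherwise

with FakeBar(total=100) as bar:
    bar.write("epoch 1 done")
    bar.update(1)                          # absorbed by __getattr__
    captured = list(bar.lines)
```

Because `__getattr__` returns a no-op callable, caller code written against the tqdm API keeps working unchanged when the fallback is active.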
@@ -1,24 +1,33 @@
__all__ = [
    "Vocabulary"
]

from functools import wraps
from collections import Counter

from .dataset import DataSet


def _check_build_vocab(func):
    """A decorator to make sure the indexing is built before used.

    """

    @wraps(func)  # to solve missing docstring
    def _wrapper(self, *args, **kwargs):
        if self.word2idx is None or self.rebuild is True:
            self.build_vocab()
        return func(self, *args, **kwargs)

    return _wrapper


def _check_build_status(func):
    """A decorator to check whether the vocabulary updates after the last build.

    """

    @wraps(func)  # to solve missing docstring
    def _wrapper(self, *args, **kwargs):
        if self.rebuild is False:
            self.rebuild = True
@@ -27,27 +36,38 @@ def _check_build_status(func):
                    "Adding more words may cause unexpected behaviour of Vocabulary. ".format(
                        self.max_size, func.__name__))
        return func(self, *args, **kwargs)

    return _wrapper
class Vocabulary(object): | |||
""" | |||
别名::class:`fastNLP.Vocabulary` :class:`fastNLP.core.vocabulary.Vocabulary` | |||
用于构建, 存储和使用 `str` 到 `int` 的一一映射 | |||
Example:: | |||
vocab = Vocabulary() | |||
word_list = "this is a word list".split() | |||
vocab.update(word_list) | |||
vocab["word"] # str to int | |||
vocab.to_word(5) # int to str | |||
:param int max_size: `Vocabulary` 的最大大小, 即能存储词的最大数量 | |||
若为 ``None`` , 则不限制大小. Default: ``None`` | |||
:param int min_freq: 能被记录下的词在文本中的最小出现频率, 应大于或等于 1. | |||
若小于该频率, 词语将被视为 `unknown`. 若为 ``None`` , 所有文本中的词都被记录. Default: ``None`` | |||
:param str optional padding: padding的字符. 如果设置为 ``None`` , | |||
则vocabulary中不考虑padding, 也不计入词表大小,为 ``None`` 的情况多在为label建立Vocabulary的情况. | |||
Default: '<pad>' | |||
:param str optional unknown: unknown的字符,所有未被记录的词在转为 `int` 时将被视为unknown. | |||
如果设置为 ``None`` ,则vocabulary中不考虑unknow, 也不计入词表大小. | |||
为 ``None`` 的情况多在为label建立Vocabulary的情况. | |||
Default: '<unk>' | |||
""" | |||
def __init__(self, max_size=None, min_freq=None, padding='<pad>', unknown='<unk>'): | |||
self.max_size = max_size | |||
self.min_freq = min_freq | |||
self.word_count = Counter() | |||
@@ -56,51 +76,55 @@ class Vocabulary(object): | |||
self.word2idx = None | |||
self.idx2word = None | |||
self.rebuild = True | |||
@_check_build_status | |||
def update(self, word_lst): | |||
"""依次增加序列中词在词典中的出现频率 | |||
:param list word_lst: a list of strings | |||
""" | |||
self.word_count.update(word_lst) | |||
@_check_build_status | |||
def add(self, word): | |||
""" | |||
增加一个新词在词典中的出现频率 | |||
:param str word: 新词 | |||
""" | |||
self.word_count[word] += 1 | |||
@_check_build_status | |||
def add_word(self, word): | |||
""" | |||
增加一个新词在词典中的出现频率 | |||
:param str word: 新词 | |||
""" | |||
self.add(word) | |||
@_check_build_status | |||
def add_word_lst(self, word_lst): | |||
""" | |||
依次增加序列中词在词典中的出现频率 | |||
:param list[str] word_lst: 词的序列 | |||
""" | |||
self.update(word_lst) | |||
def build_vocab(self): | |||
""" | |||
根据已经出现的词和出现频率构建词典. 注意: 重复构建可能会改变词典的大小, | |||
但已经记录在词典中的词, 不会改变对应的 `int` | |||
""" | |||
if self.word2idx is None: | |||
self.word2idx = {} | |||
if self.padding is not None: | |||
self.word2idx[self.padding] = len(self.word2idx) | |||
if self.unknown is not None: | |||
self.word2idx[self.unknown] = len(self.word2idx) | |||
max_size = min(self.max_size, len(self.word_count)) if self.max_size else None | |||
words = self.word_count.most_common(max_size) | |||
if self.min_freq is not None: | |||
@@ -111,32 +135,47 @@ class Vocabulary(object): | |||
self.word2idx.update({w: i + start_idx for i, (w, _) in enumerate(words)}) | |||
self.build_reverse_vocab() | |||
self.rebuild = False | |||
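What build_vocab produces can be sketched with toy data (a hypothetical standalone example, assuming padding='<pad>' and unknown='<unk>' occupy the first indices):

```python
from collections import Counter

word_count = Counter("this is a test this is".split())
word2idx = {'<pad>': 0, '<unk>': 1}   # special tokens come first
start_idx = len(word2idx)
words = word_count.most_common()      # ordered by frequency, ties keep insertion order
word2idx.update({w: i + start_idx for i, (w, _) in enumerate(words)})
idx2word = {i: w for w, i in word2idx.items()}
```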
def build_reverse_vocab(self): | |||
""" | |||
基于 "word to index" dict, 构建 "index to word" dict. | |||
""" | |||
self.idx2word = {i: w for w, i in self.word2idx.items()} | |||
@_check_build_vocab | |||
def __len__(self): | |||
return len(self.word2idx) | |||
@_check_build_vocab | |||
def __contains__(self, item): | |||
""" | |||
检查词是否被记录 | |||
:param item: the word | |||
:return: True or False | |||
""" | |||
return item in self.word2idx | |||
def has_word(self, w): | |||
""" | |||
检查词是否被记录 | |||
Example:: | |||
has_abc = vocab.has_word('abc') | |||
# equals to | |||
has_abc = 'abc' in vocab | |||
:param str w: the word
:return: ``True`` or ``False`` | |||
""" | |||
return self.__contains__(w) | |||
@_check_build_vocab | |||
def __getitem__(self, w): | |||
""" | |||
To support usage like:: | |||
vocab[w] | |||
""" | |||
@@ -146,49 +185,174 @@ class Vocabulary(object): | |||
return self.word2idx[self.unknown] | |||
else: | |||
raise ValueError("word {} not in vocabulary".format(w)) | |||
@_check_build_vocab | |||
def index_dataset(self, *datasets, field_name, new_field_name=None): | |||
""" | |||
将DataSet中对应field的词转为数字. | |||
Example:: | |||
# remember to use `field_name` | |||
vocab.index_dataset(train_data, dev_data, test_data, field_name='words') | |||
:param datasets: 需要转index的 class:`~fastNLP.DataSet` , 支持一个或多个(list) | |||
:param str field_name: 需要转index的field, 若有多个 DataSet, 每个DataSet都必须有此 field. | |||
目前仅支持 ``str`` , ``list(str)`` , ``list(list(str))`` | |||
:param str new_field_name: 保存结果的field_name. 若为 ``None`` , 将覆盖原field. | |||
Default: ``None`` | |||
""" | |||
def index_instance(ins): | |||
""" | |||
有几种情况, str, 1d-list, 2d-list | |||
:param ins: | |||
:return: | |||
""" | |||
field = ins[field_name] | |||
if isinstance(field, str): | |||
return self.to_index(field) | |||
elif isinstance(field, list): | |||
if not isinstance(field[0], list): | |||
return [self.to_index(w) for w in field] | |||
else: | |||
if isinstance(field[0][0], list): | |||
raise RuntimeError("Only support field with 2 dimensions.") | |||
return [[self.to_index(c) for c in w] for w in field] | |||
if new_field_name is None: | |||
new_field_name = field_name | |||
for idx, dataset in enumerate(datasets): | |||
if isinstance(dataset, DataSet): | |||
try: | |||
dataset.apply(index_instance, new_field_name=new_field_name) | |||
except Exception as e: | |||
print("When processing the `{}` dataset, the following error occurred.".format(idx)) | |||
raise e | |||
else: | |||
raise RuntimeError("Only DataSet type is allowed.") | |||
def from_dataset(self, *datasets, field_name): | |||
""" | |||
使用dataset的对应field中词构建词典 | |||
Example:: | |||
# remember to use `field_name` | |||
vocab.from_dataset(train_data1, train_data2, field_name='words') | |||
:param datasets: 需要转index的 class:`~fastNLP.DataSet` , 支持一个或多个(list) | |||
:param field_name: 可为 ``str`` 或 ``list(str)`` . | |||
构建词典所使用的 field(s), 支持一个或多个field | |||
若有多个 DataSet, 每个DataSet都必须有这些field. | |||
目前仅支持的field结构: ``str`` , ``list(str)`` , ``list(list(str))`` | |||
:return self: | |||
""" | |||
if isinstance(field_name, str): | |||
field_name = [field_name] | |||
elif not isinstance(field_name, list): | |||
raise TypeError('invalid argument field_name: {}'.format(field_name)) | |||
def construct_vocab(ins): | |||
for fn in field_name: | |||
field = ins[fn] | |||
if isinstance(field, str): | |||
self.add_word(field) | |||
elif isinstance(field, list): | |||
if not isinstance(field[0], list): | |||
self.add_word_lst(field) | |||
else: | |||
if isinstance(field[0][0], list): | |||
raise RuntimeError("Only support field with 2 dimensions.") | |||
[self.add_word_lst(w) for w in field] | |||
for idx, dataset in enumerate(datasets): | |||
if isinstance(dataset, DataSet): | |||
try: | |||
dataset.apply(construct_vocab) | |||
except Exception as e: | |||
print("When processing the `{}` dataset, the following error occurred.".format(idx)) | |||
raise e | |||
else: | |||
raise RuntimeError("Only DataSet type is allowed.") | |||
return self | |||
def to_index(self, w): | |||
""" | |||
将词转为数字. 若词不在词典中被记录, 将视为 unknown, 若 ``unknown=None`` , 将抛出
``ValueError`` | |||
Example:: | |||
index = vocab.to_index('abc') | |||
# equals to | |||
index = vocab['abc'] | |||
:param str w: a word | |||
:return int index: the number | |||
""" | |||
return self.__getitem__(w) | |||
@property | |||
@_check_build_vocab | |||
def unknown_idx(self): | |||
""" | |||
unknown 对应的数字. | |||
""" | |||
if self.unknown is None: | |||
return None | |||
return self.word2idx[self.unknown] | |||
@property | |||
@_check_build_vocab | |||
def padding_idx(self): | |||
""" | |||
padding 对应的数字 | |||
""" | |||
if self.padding is None: | |||
return None | |||
return self.word2idx[self.padding] | |||
@_check_build_vocab | |||
def to_word(self, idx): | |||
""" | |||
给定一个数字, 将其转为对应的词. | |||
:param int idx: the index | |||
:return str word: the word | |||
""" | |||
return self.idx2word[idx] | |||
def clear(self): | |||
""" | |||
删除Vocabulary中的词表数据。相当于重新初始化一下。 | |||
:return: | |||
""" | |||
self.word_count.clear() | |||
self.word2idx = None | |||
self.idx2word = None | |||
self.rebuild = True | |||
def __getstate__(self): | |||
"""Use to prepare data for pickle. | |||
""" | |||
len(self) # make sure vocab has been built | |||
state = self.__dict__.copy() | |||
# no need to pickle idx2word as it can be constructed from word2idx | |||
del state['idx2word'] | |||
return state | |||
def __setstate__(self, state): | |||
"""Use to restore state from pickle. | |||
""" | |||
self.__dict__.update(state) | |||
self.build_reverse_vocab() | |||
def __repr__(self): | |||
return "Vocabulary({}...)".format(list(self.word_count.keys())[:5]) | |||
def __iter__(self): | |||
return iter(list(self.word_count.keys())) |
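The __getstate__/__setstate__ pattern used above (drop the derived idx2word before pickling, rebuild it after restoring) can be sketched on a minimal stand-in class; TinyVocab is hypothetical and only illustrates the hooks:

```python
class TinyVocab:
    # idx2word is derived from word2idx, so it is dropped before pickling
    # and rebuilt after unpickling, exactly like Vocabulary above.
    def __init__(self, words):
        self.word2idx = {w: i for i, w in enumerate(words)}
        self.idx2word = {i: w for w, i in self.word2idx.items()}

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['idx2word']  # no need to pickle the derived dict
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.idx2word = {i: w for w, i in self.word2idx.items()}

v = TinyVocab(['<pad>', '<unk>', 'hello'])
state = v.__getstate__()           # what pickle would store
restored = TinyVocab.__new__(TinyVocab)
restored.__setstate__(state)       # what pickle would restore
```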
@@ -0,0 +1,31 @@ | |||
""" | |||
用于IO的模块, 具体包括: | |||
1. 用于读入 embedding 的 :doc:`EmbedLoader <fastNLP.io.embed_loader>` 类, | |||
2. 用于读入数据的 :doc:`DataSetLoader <fastNLP.io.dataset_loader>` 类 | |||
3. 用于保存和载入模型的类, 参考 :doc:`/fastNLP.io.model_io` | |||
这些类的使用方法如下: | |||
""" | |||
__all__ = [ | |||
'EmbedLoader', | |||
'DataSetLoader', | |||
'CSVLoader', | |||
'JsonLoader', | |||
'ConllLoader', | |||
'SNLILoader', | |||
'SSTLoader', | |||
'PeopleDailyCorpusLoader', | |||
'Conll2003Loader', | |||
'ModelLoader', | |||
'ModelSaver', | |||
] | |||
from .embed_loader import EmbedLoader | |||
from .dataset_loader import DataSetLoader, CSVLoader, JsonLoader, ConllLoader, SNLILoader, SSTLoader, \ | |||
PeopleDailyCorpusLoader, Conll2003Loader | |||
from .model_io import ModelLoader, ModelSaver |
@@ -1,30 +1,42 @@ | |||
__all__ = [ | |||
"BaseLoader" | |||
] | |||
import _pickle as pickle | |||
import os | |||
class BaseLoader(object): | |||
"""Base loader for all loaders. | |||
""" | |||
各个 Loader 的基类,提供了 API 的参考。 | |||
""" | |||
def __init__(self): | |||
super(BaseLoader, self).__init__() | |||
@staticmethod | |||
def load_lines(data_path): | |||
""" | |||
按行读取,舍弃每行两侧空白字符,返回list of str | |||
:param data_path: 读取数据的路径 | |||
""" | |||
with open(data_path, "r", encoding="utf-8") as f:
text = f.readlines() | |||
return [line.strip() for line in text] | |||
@classmethod | |||
def load(cls, data_path): | |||
""" | |||
先按行读取,去除一行两侧空白,再提取每行的字符。返回list of list of str | |||
:param data_path: | |||
""" | |||
with open(data_path, "r", encoding="utf-8") as f: | |||
text = f.readlines() | |||
return [[word for word in sent.strip()] for sent in text] | |||
@classmethod | |||
def load_with_cache(cls, data_path, cache_path): | |||
"""缓存版的load | |||
@@ -40,22 +52,23 @@ class BaseLoader(object): | |||
class DataLoaderRegister: | |||
"""Register for all data sets. | |||
""" | |||
_readers = {} | |||
@classmethod | |||
def set_reader(cls, reader_cls, read_fn_name): | |||
# def wrapper(reader_cls): | |||
if read_fn_name in cls._readers: | |||
raise KeyError( | |||
'duplicate reader: {} and {} for read_func: {}'.format(cls._readers[read_fn_name], reader_cls, | |||
read_fn_name)) | |||
if hasattr(reader_cls, 'load'): | |||
cls._readers[read_fn_name] = reader_cls().load | |||
return reader_cls | |||
@classmethod | |||
def get_reader(cls, read_fn_name): | |||
if read_fn_name in cls._readers: | |||
return cls._readers[read_fn_name] | |||
raise AttributeError('no read function: {}'.format(read_fn_name)) | |||
# TODO 这个类使用在何处? |
@@ -1,31 +1,48 @@ | |||
""" | |||
用于读入和处理和保存 config 文件 | |||
.. todo:: | |||
这个模块中的类可能被抛弃? | |||
""" | |||
__all__ = [ | |||
"ConfigLoader", | |||
"ConfigSection", | |||
"ConfigSaver" | |||
] | |||
import configparser | |||
import json | |||
import os | |||
from fastNLP.io.base_loader import BaseLoader | |||
from .base_loader import BaseLoader | |||
class ConfigLoader(BaseLoader): | |||
""" | |||
别名::class:`fastNLP.io.ConfigLoader` :class:`fastNLP.io.config_io.ConfigLoader` | |||
读取配置文件的Loader | |||
:param str data_path: 配置文件的路径 | |||
""" | |||
def __init__(self, data_path=None): | |||
super(ConfigLoader, self).__init__() | |||
if data_path is not None: | |||
self.config = self.parse(super(ConfigLoader, self).load(data_path)) | |||
@staticmethod | |||
def parse(string): | |||
raise NotImplementedError | |||
@staticmethod | |||
def load_config(file_path, sections): | |||
""" | |||
把配置文件的section 存入提供的 ``sections`` 中 | |||
:param str file_path: 配置文件的路径 | |||
:param dict sections: 符合如下键值对组成的字典 `section_name(string)` : :class:`~fastNLP.io.ConfigSection` | |||
Example:: | |||
test_args = ConfigSection() | |||
@@ -65,13 +82,16 @@ class ConfigLoader(BaseLoader): | |||
class ConfigSection(object): | |||
""" | |||
别名::class:`fastNLP.io.ConfigSection` :class:`fastNLP.io.config_io.ConfigSection` | |||
ConfigSection是一个存储了一个section中所有键值对的数据结构,推荐使用此类的实例来配合 :meth:`ConfigLoader.load_config` 使用 | |||
""" | |||
def __init__(self): | |||
super(ConfigSection, self).__init__() | |||
def __getitem__(self, key): | |||
""" | |||
:param key: str, the name of the attribute | |||
@@ -84,7 +104,7 @@ class ConfigSection(object): | |||
if key in self.__dict__.keys(): | |||
return getattr(self, key) | |||
raise AttributeError("do NOT have attribute %s" % key) | |||
def __setitem__(self, key, value): | |||
""" | |||
:param key: str, the name of the attribute | |||
@@ -99,14 +119,14 @@ class ConfigSection(object): | |||
raise AttributeError("attr %s except %s but got %s" % | |||
(key, str(type(getattr(self, key))), str(type(value)))) | |||
setattr(self, key, value) | |||
def __contains__(self, item): | |||
""" | |||
:param item: The key of item. | |||
:return: True if the key in self.__dict__.keys() else False. | |||
""" | |||
return item in self.__dict__.keys() | |||
def __eq__(self, other): | |||
"""Overwrite the == operator | |||
@@ -118,15 +138,15 @@ class ConfigSection(object): | |||
return False | |||
if getattr(self, k) != getattr(other, k):
return False | |||
for k in other.__dict__.keys(): | |||
if k not in self.__dict__.keys(): | |||
return False | |||
if getattr(self, k) != getattr(other, k):
return False | |||
return True | |||
def __ne__(self, other): | |||
"""Overwrite the != operator | |||
@@ -134,25 +154,30 @@ class ConfigSection(object): | |||
:return: | |||
""" | |||
return not self.__eq__(other) | |||
@property | |||
def data(self): | |||
return self.__dict__ | |||
class ConfigSaver(object): | |||
""" | |||
别名::class:`fastNLP.io.ConfigSaver` :class:`fastNLP.io.config_io.ConfigSaver` | |||
ConfigSaver 是用来存储配置文件并解决相关冲突的类 | |||
:param str file_path: 配置文件的路径 | |||
""" | |||
def __init__(self, file_path): | |||
self.file_path = file_path | |||
if not os.path.exists(self.file_path): | |||
raise FileNotFoundError("file {} NOT found!".__format__(self.file_path)) | |||
def _get_section(self, sect_name): | |||
""" | |||
This is the function to get the section with the section name. | |||
:param sect_name: the name of the section to load.
:return: The section. | |||
@@ -160,25 +185,26 @@ class ConfigSaver(object): | |||
sect = ConfigSection() | |||
ConfigLoader().load_config(self.file_path, {sect_name: sect}) | |||
return sect | |||
def _read_section(self): | |||
""" | |||
This is the function to read sections from the config file. | |||
:return: sect_list, sect_key_list | |||
sect_list: A list of ConfigSection(). | |||
sect_key_list: A list of names in sect_list. | |||
""" | |||
sect_name = None | |||
sect_list = {} | |||
sect_key_list = [] | |||
single_section = {} | |||
single_section_key = [] | |||
with open(self.file_path, 'r') as f: | |||
lines = f.readlines() | |||
for line in lines: | |||
if line.startswith('[') and line.endswith(']\n'): | |||
if sect_name is None: | |||
@@ -190,31 +216,32 @@ class ConfigSaver(object): | |||
sect_key_list.append(sect_name) | |||
sect_name = line[1: -2] | |||
continue | |||
if line.startswith('#'): | |||
single_section[line] = '#' | |||
single_section_key.append(line) | |||
continue | |||
if line.startswith('\n'): | |||
single_section_key.append('\n') | |||
continue | |||
if '=' not in line: | |||
raise RuntimeError("can NOT load config file {}".__format__(self.file_path)) | |||
key = line.split('=', maxsplit=1)[0].strip() | |||
value = line.split('=', maxsplit=1)[1].strip() + '\n' | |||
single_section[key] = value | |||
single_section_key.append(key) | |||
if sect_name is not None: | |||
sect_list[sect_name] = single_section, single_section_key | |||
sect_key_list.append(sect_name) | |||
return sect_list, sect_key_list | |||
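The core of _read_section above is splitting an ini-style file into per-section line groups; a minimal self-contained sketch of that idea (simplified: no key/value parsing, hypothetical section names):

```python
import io

def read_sections(f):
    # split an ini-style file into {section_name: list of raw body lines}
    sections, name = {}, None
    for line in f:
        if line.startswith('[') and line.rstrip('\n').endswith(']'):
            name = line.rstrip('\n')[1:-1]   # strip the surrounding brackets
            sections[name] = []
        elif name is not None:
            sections[name].append(line.rstrip('\n'))
    return sections

sects = read_sections(io.StringIO("[model]\nlr = 0.1\n[data]\npath = ./x\n"))
```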
def _write_section(self, sect_list, sect_key_list): | |||
""" | |||
This is the function to write config file with section list and name list. | |||
:param sect_list: A list of ConfigSection() to be written into the file.
:param sect_key_list: A list of names of sect_list.
@@ -233,12 +260,13 @@ class ConfigSaver(object): | |||
continue | |||
f.write(key + ' = ' + single_section[key]) | |||
f.write('\n') | |||
def save_config_file(self, section_name, section): | |||
""" | |||
这个方法可以用来修改并保存配置文件中单独的一个 section | |||
:param str section_name: 需要保存的 section 的名字. | |||
:param section: 你需要修改并保存的 section, :class:`~fastNLP.io.ConfigSection` 类型
""" | |||
section_file = self._get_section(section_name) | |||
if len(section_file.__dict__.keys()) == 0: # the section not in the file before | |||
@@ -264,11 +292,11 @@ class ConfigSaver(object): | |||
break | |||
if not change_file: | |||
return | |||
sect_list, sect_key_list = self._read_section() | |||
if section_name not in sect_key_list: | |||
raise AttributeError() | |||
sect, sect_key = sect_list[section_name] | |||
for k in section.__dict__.keys(): | |||
if k not in sect_key: | |||
@@ -1,126 +1,155 @@ | |||
__all__ = [ | |||
"EmbedLoader" | |||
] | |||
import os | |||
import warnings | |||
import numpy as np | |||
import torch | |||
from fastNLP.core.vocabulary import Vocabulary | |||
from fastNLP.io.base_loader import BaseLoader | |||
from ..core.vocabulary import Vocabulary | |||
from .base_loader import BaseLoader | |||
class EmbedLoader(BaseLoader): | |||
""" | |||
别名::class:`fastNLP.io.EmbedLoader` :class:`fastNLP.io.embed_loader.EmbedLoader` | |||
用于读取预训练的embedding, 读取结果可直接载入为模型参数。 | |||
""" | |||
def __init__(self): | |||
super(EmbedLoader, self).__init__() | |||
@staticmethod | |||
def load_with_vocab(embed_filepath, vocab, dtype=np.float32, normalize=True, error='ignore'): | |||
""" | |||
从embed_filepath这个预训练的词向量中抽取出vocab这个词表的词的embedding。EmbedLoader将自动判断embed_filepath是 | |||
word2vec(第一行只有两个元素)还是glove格式的数据。 | |||
:param str embed_filepath: 预训练的embedding的路径。 | |||
:param vocab: 词表 :class:`~fastNLP.Vocabulary` 类型,读取出现在vocab中的词的embedding。 | |||
没有出现在vocab中的词的embedding将通过找到的词的embedding的正态分布采样出来,以使得整个Embedding是同分布的。 | |||
:param dtype: 读出的embedding的类型 | |||
:param bool normalize: 是否将每个vector归一化到norm为1 | |||
:param str error: `ignore` , `strict` ; 如果 `ignore` ,错误将自动跳过; 如果 `strict` , 错误将抛出。 | |||
这里主要可能出错的地方在于词表有空行或者词表出现了维度不一致。 | |||
:return numpy.ndarray: shape为 [len(vocab), dimension], dimension由pretrain的embedding决定。 | |||
""" | |||
assert isinstance(vocab, Vocabulary), "Only fastNLP.Vocabulary is supported." | |||
if not os.path.exists(embed_filepath): | |||
raise FileNotFoundError("`{}` does not exist.".format(embed_filepath)) | |||
with open(embed_filepath, 'r', encoding='utf-8') as f: | |||
hit_flags = np.zeros(len(vocab), dtype=bool) | |||
line = f.readline().strip() | |||
parts = line.split() | |||
start_idx = 0 | |||
if len(parts) == 2: | |||
dim = int(parts[1]) | |||
start_idx += 1 | |||
else: | |||
dim = len(parts) - 1 | |||
f.seek(0) | |||
matrix = np.random.randn(len(vocab), dim).astype(dtype) | |||
for idx, line in enumerate(f, start_idx): | |||
try: | |||
parts = line.strip().split() | |||
if parts[0] in vocab: | |||
index = vocab.to_index(parts[0]) | |||
matrix[index] = np.fromstring(' '.join(parts[1:]), sep=' ', dtype=dtype, count=dim) | |||
hit_flags[index] = True | |||
except Exception as e: | |||
if error == 'ignore': | |||
warnings.warn("Error occurred at the {} line.".format(idx)) | |||
else: | |||
print("Error occurred at the {} line.".format(idx)) | |||
raise e | |||
total_hits = sum(hit_flags) | |||
print("Found {} out of {} words in the pre-training embedding.".format(total_hits, len(vocab))) | |||
found_vectors = matrix[hit_flags] | |||
if len(found_vectors) != 0: | |||
mean = np.mean(found_vectors, axis=0, keepdims=True) | |||
std = np.std(found_vectors, axis=0, keepdims=True) | |||
unfound_vec_num = len(vocab) - total_hits | |||
r_vecs = np.random.randn(unfound_vec_num, dim).astype(dtype) * std + mean | |||
matrix[hit_flags == False] = r_vecs | |||
if normalize: | |||
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True) | |||
return matrix | |||
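The fallback used above, re-sampling the rows that were not found in the pre-trained file from the mean/std of the rows that were found, then normalizing, can be sketched with toy numpy data (all values here are assumptions for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
matrix = rng.randn(4, 3).astype(np.float32)      # 4 words, dim 3
hit_flags = np.array([True, False, True, False])  # which rows came from the file

found = matrix[hit_flags]
mean = found.mean(axis=0, keepdims=True)
std = found.std(axis=0, keepdims=True)
# missing rows sampled so the whole matrix stays roughly同分布
matrix[~hit_flags] = rng.randn(int((~hit_flags).sum()), 3).astype(np.float32) * std + mean
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)  # normalize each row to unit norm
```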
@staticmethod | |||
def load_without_vocab(embed_filepath, dtype=np.float32, padding='<pad>', unknown='<unk>', normalize=True, | |||
error='ignore'): | |||
""" | |||
从embed_filepath中读取预训练的word vector。根据预训练的词表读取embedding并生成一个对应的Vocabulary。 | |||
:param str embed_filepath: 预训练的embedding的路径。 | |||
:param dtype: 读出的embedding的类型 | |||
:param str padding: the padding tag for vocabulary. | |||
:param str unknown: the unknown tag for vocabulary. | |||
:param bool normalize: 是否将每个vector归一化到norm为1 | |||
:param str error: `ignore` , `strict` ; 如果 `ignore` ,错误将自动跳过; 如果 `strict` , 错误将抛出。这里主要可能出错的地 | |||
方在于词表有空行或者词表出现了维度不一致。 | |||
:return numpy.ndarray: Vocabulary Embedding的shape是[词表大小+x, 词表维度], "词表大小+x"是由于最终的大小还取决与 | |||
是否使用padding, 以及unknown有没有在词表中找到对应的词。 Vocabulary中的词的顺序与Embedding的顺序是一一对应的。 | |||
""" | |||
vocab = Vocabulary(padding=padding, unknown=unknown) | |||
vec_dict = {} | |||
found_unknown = False | |||
found_pad = False | |||
with open(embed_filepath, 'r', encoding='utf-8') as f: | |||
line = f.readline() | |||
start = 1 | |||
dim = -1 | |||
if len(line.strip().split()) != 2: | |||
f.seek(0) | |||
start = 0 | |||
for idx, line in enumerate(f, start=start): | |||
try: | |||
parts = line.strip().split() | |||
word = parts[0] | |||
if dim == -1: | |||
dim = len(parts) - 1 | |||
vec = np.fromstring(' '.join(parts[1:]), sep=' ', dtype=dtype, count=dim) | |||
vec_dict[word] = vec | |||
vocab.add_word(word) | |||
if unknown is not None and unknown == word: | |||
found_unknown = True | |||
if padding is not None and padding == word:
found_pad = True | |||
except Exception as e: | |||
if error == 'ignore': | |||
warnings.warn("Error occurred at the {} line.".format(idx)) | |||
pass | |||
else: | |||
print("Error occurred at the {} line.".format(idx)) | |||
raise e | |||
if dim == -1: | |||
raise RuntimeError("{} is an empty file.".format(embed_filepath)) | |||
matrix = np.random.randn(len(vocab), dim).astype(dtype) | |||
if (unknown is not None and not found_unknown) or (padding is not None and not found_pad): | |||
start_idx = 0 | |||
if padding is not None: | |||
start_idx += 1 | |||
if unknown is not None: | |||
start_idx += 1 | |||
mean = np.mean(matrix[start_idx:], axis=0, keepdims=True) | |||
std = np.std(matrix[start_idx:], axis=0, keepdims=True) | |||
if (unknown is not None and not found_unknown): | |||
matrix[start_idx - 1] = np.random.randn(1, dim).astype(dtype) * std + mean | |||
if (padding is not None and not found_pad): | |||
matrix[0] = np.random.randn(1, dim).astype(dtype) * std + mean | |||
for key, vec in vec_dict.items(): | |||
index = vocab.to_index(key) | |||
matrix[index] = vec | |||
if normalize: | |||
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True) | |||
return matrix, vocab |
@@ -0,0 +1,118 @@ | |||
""" | |||
此模块用于给其它模块提供读取文件的函数,没有为用户提供 API | |||
""" | |||
import json | |||
def _read_csv(path, encoding='utf-8', headers=None, sep=',', dropna=True): | |||
""" | |||
Construct a generator to read csv items. | |||
:param path: file path | |||
:param encoding: file's encoding, default: utf-8 | |||
:param headers: file's headers, if None, make file's first line as headers. default: None | |||
:param sep: separator for each column. default: ',' | |||
:param dropna: whether to ignore and drop invalid data;
    if False, raise ValueError when reading invalid data. default: True
:return: generator, every time yield (line number, csv item) | |||
""" | |||
with open(path, 'r', encoding=encoding) as f: | |||
start_idx = 0 | |||
if headers is None: | |||
headers = f.readline().rstrip('\r\n') | |||
headers = headers.split(sep) | |||
start_idx += 1 | |||
elif not isinstance(headers, (list, tuple)): | |||
raise TypeError("headers should be list or tuple, not {}." \ | |||
.format(type(headers))) | |||
for line_idx, line in enumerate(f, start_idx): | |||
contents = line.rstrip('\r\n').split(sep) | |||
if len(contents) != len(headers): | |||
if dropna: | |||
continue | |||
else: | |||
raise ValueError("Line {} has {} parts, while header has {} parts." \ | |||
.format(line_idx, len(contents), len(headers))) | |||
_dict = {} | |||
for header, content in zip(headers, contents): | |||
_dict[header] = content | |||
yield line_idx, _dict | |||
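A usage sketch of the `(line number, dict)` protocol that `_read_csv` yields. The file name and contents are hypothetical, and the compact `read_csv` below inlines the same header-splitting logic so the example is self-contained:

```python
import os
import tempfile

# a tiny CSV written to a temp file purely for illustration
path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write("word,count\nhello,3\nworld,5\n")

def read_csv(path, sep=","):
    # first line is the header; each following line yields (line number, {header: value})
    with open(path, encoding="utf-8") as f:
        headers = f.readline().rstrip("\r\n").split(sep)
        for line_idx, line in enumerate(f, 1):
            contents = line.rstrip("\r\n").split(sep)
            yield line_idx, dict(zip(headers, contents))

rows = list(read_csv(path))
# rows == [(1, {'word': 'hello', 'count': '3'}), (2, {'word': 'world', 'count': '5'})]
```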
def _read_json(path, encoding='utf-8', fields=None, dropna=True): | |||
""" | |||
Construct a generator to read json items. | |||
:param path: file path | |||
:param encoding: file's encoding, default: utf-8 | |||
:param fields: json object's fields that needed, if None, all fields are needed. default: None | |||
:param dropna: whether to ignore and drop invalid data; | |||
if False, raise ValueError when reading invalid data. default: True | |||
:return: generator, every time yield (line number, json item) | |||
""" | |||
if fields: | |||
fields = set(fields) | |||
with open(path, 'r', encoding=encoding) as f: | |||
for line_idx, line in enumerate(f): | |||
data = json.loads(line) | |||
if fields is None: | |||
yield line_idx, data | |||
continue | |||
_res = {} | |||
for k, v in data.items(): | |||
if k in fields: | |||
_res[k] = v | |||
if len(_res) < len(fields): | |||
if dropna: | |||
continue | |||
else: | |||
raise ValueError('invalid instance at line: {}'.format(line_idx)) | |||
yield line_idx, _res | |||
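The same idea for JSON-lines input: one JSON object per line, optionally filtered down to the requested fields. The records below are hypothetical, and the inline `read_json` mirrors the field-filtering logic of `_read_json` above:

```python
import json
import os
import tempfile

# two hypothetical JSON-lines records, one with an extra field
path = os.path.join(tempfile.mkdtemp(), "toy.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"text": "hello", "label": 1, "extra": "x"}) + "\n")
    f.write(json.dumps({"text": "world", "label": 0}) + "\n")

def read_json(path, fields=None):
    # yield (line number, record), keeping only the requested fields when given
    fields = set(fields) if fields else None
    with open(path, encoding="utf-8") as f:
        for line_idx, line in enumerate(f):
            data = json.loads(line)
            if fields is None:
                yield line_idx, data
            else:
                yield line_idx, {k: v for k, v in data.items() if k in fields}

items = list(read_json(path, fields=["text", "label"]))
# items == [(0, {'text': 'hello', 'label': 1}), (1, {'text': 'world', 'label': 0})]
```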
def _read_conll(path, encoding='utf-8', indexes=None, dropna=True): | |||
""" | |||
Construct a generator to read conll items. | |||
:param path: file path | |||
:param encoding: file's encoding, default: utf-8 | |||
:param indexes: conll object's column indexes that needed, if None, all columns are needed. default: None | |||
:param dropna: whether to ignore and drop invalid data; | |||
if False, raise ValueError when reading invalid data. default: True | |||
:return: generator, every time yield (line number, conll item) | |||
""" | |||
def parse_conll(sample): | |||
sample = list(map(list, zip(*sample))) | |||
sample = [sample[i] for i in indexes] if indexes is not None else sample  # keep all columns when indexes is None | |||
for f in sample: | |||
if len(f) <= 0: | |||
raise ValueError('empty field') | |||
return sample | |||
with open(path, 'r', encoding=encoding) as f: | |||
sample = [] | |||
start = next(f) | |||
if '-DOCSTART-' not in start: | |||
sample.append(start.split()) | |||
for line_idx, line in enumerate(f, 1): | |||
if line.startswith('\n'): | |||
if len(sample): | |||
try: | |||
res = parse_conll(sample) | |||
except Exception as e: | |||
sample = []  # reset the buffer so a bad sentence does not leak into the next one | |||
if dropna: | |||
continue | |||
raise ValueError('invalid instance at line: {}'.format(line_idx)) from e | |||
sample = [] | |||
yield line_idx, res | |||
elif line.startswith('#'): | |||
continue | |||
else: | |||
sample.append(line.split()) | |||
if len(sample) > 0: | |||
try: | |||
res = parse_conll(sample) | |||
yield line_idx, res | |||
except Exception as e: | |||
if dropna: | |||
return | |||
raise ValueError('invalid instance at line: {}'.format(line_idx)) from e | |||
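CoNLL input is blank-line-separated sentences with one token per line, and `parse_conll` transposes the token rows into column lists via `zip(*sample)`. A self-contained sketch of that grouping and transposition on toy data (the word/tag columns are illustrative):

```python
# illustrative CoNLL-style input: two sentences, columns = (word, POS tag)
lines = ["I PRP", "run VBP", "", "He PRP", "runs VBZ", ""]

def read_conll(lines):
    # group token lines into sentences, then transpose rows into column lists,
    # mirroring the zip(*sample) trick in parse_conll above
    sample = []
    for line in lines:
        if not line.strip():          # blank line ends the current sentence
            if sample:
                yield list(map(list, zip(*sample)))
                sample = []
        else:
            sample.append(line.split())
    if sample:                        # flush a trailing sentence without a final blank line
        yield list(map(list, zip(*sample)))

sentences = list(read_conll(lines))
# sentences[0] == [['I', 'run'], ['PRP', 'VBP']]  (one list per column)
```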
@@ -1,35 +0,0 @@ | |||
import logging | |||
import os | |||
def create_logger(logger_name, log_path, log_format=None, log_level=logging.INFO): | |||
"""Create a logger. | |||
:param str logger_name: | |||
:param str log_path: | |||
:param log_format: | |||
:param log_level: | |||
:return: logger | |||
To use a logger:: | |||
logger.debug("this is a debug message") | |||
logger.info("this is a info message") | |||
logger.warning("this is a warning message") | |||
logger.error("this is an error message") | |||
""" | |||
logger = logging.getLogger(logger_name) | |||
logger.setLevel(log_level) | |||
if log_path is None: | |||
handler = logging.StreamHandler() | |||
else: | |||
os.stat(os.path.dirname(os.path.abspath(log_path))) | |||
handler = logging.FileHandler(log_path) | |||
handler.setLevel(log_level) | |||
if log_format is None: | |||
log_format = "[%(asctime)s %(name)-13s %(levelname)s %(process)d %(thread)d " \ | |||
"%(filename)s:%(lineno)-5d] %(message)s" | |||
formatter = logging.Formatter(log_format) | |||
handler.setFormatter(formatter) | |||
logger.addHandler(handler) | |||
return logger |
@@ -1,53 +1,72 @@ | |||
""" | |||
Utilities for loading and saving models | |||
""" | |||
__all__ = [ | |||
"ModelLoader", | |||
"ModelSaver" | |||
] | |||
import torch | |||
from fastNLP.io.base_loader import BaseLoader | |||
from .base_loader import BaseLoader | |||
class ModelLoader(BaseLoader): | |||
""" | |||
Loader for models. | |||
""" | |||
Alias: :class:`fastNLP.io.ModelLoader` :class:`fastNLP.io.model_io.ModelLoader` | |||
Loader for models | |||
""" | |||
def __init__(self): | |||
super(ModelLoader, self).__init__() | |||
@staticmethod | |||
def load_pytorch(empty_model, model_path): | |||
"""Load model parameters from ".pkl" files into the empty PyTorch model. | |||
""" | |||
Load PyTorch model parameters from a ".pkl" file | |||
:param empty_model: a PyTorch model with initialized parameters. | |||
:param str model_path: the path to the saved model. | |||
:param empty_model: a PyTorch model with initialized parameters | |||
:param str model_path: path to the saved model | |||
""" | |||
empty_model.load_state_dict(torch.load(model_path)) | |||
@staticmethod | |||
def load_pytorch_model(model_path): | |||
"""Load the entire model. | |||
""" | |||
Load the entire model | |||
:param str model_path: the path to the saved model. | |||
:param str model_path: path to the saved model | |||
""" | |||
return torch.load(model_path) | |||
class ModelSaver(object): | |||
"""Save a model | |||
""" | |||
Alias: :class:`fastNLP.io.ModelSaver` :class:`fastNLP.io.model_io.ModelSaver` | |||
:param str save_path: the path to the saving directory. | |||
Example:: | |||
Saver for models | |||
Example:: | |||
saver = ModelSaver("./save/model_ckpt_100.pkl") | |||
saver.save_pytorch(model) | |||
saver = ModelSaver("./save/model_ckpt_100.pkl") | |||
saver.save_pytorch(model) | |||
""" | |||
def __init__(self, save_path): | |||
self.save_path = save_path | |||
""" | |||
:param save_path: path where the model will be saved | |||
""" | |||
self.save_path = save_path | |||
def save_pytorch(self, model, param_only=True): | |||
"""Save a pytorch model into ".pkl" file. | |||
""" | |||
Save a PyTorch model to a ".pkl" file | |||
:param model: a PyTorch model | |||
:param bool param_only: whether only to save the model parameters or the entire model. | |||
:param model: a PyTorch model | |||
:param bool param_only: whether to save only the model parameters (otherwise save the entire model) | |||
""" | |||
if param_only is True: | |||
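`ModelSaver` and `ModelLoader` pair into a save/load round trip, and `param_only=True` amounts to pickling the parameter dict (what `torch.save(model.state_dict(), path)` does). A stand-in sketch of that round trip using plain `pickle` and a toy parameter dict, so it needs no PyTorch; the path and parameter names are illustrative:

```python
import os
import pickle
import tempfile

# a stand-in "model": its parameters as a dict, like PyTorch's state_dict()
params = {"weight": [1.0, 2.0], "bias": [0.5]}

path = os.path.join(tempfile.mkdtemp(), "model_ckpt.pkl")
# param_only=True: persist just the parameters, not the whole model object
with open(path, "wb") as f:
    pickle.dump(params, f)

# the load side restores the same parameter dict to fill an "empty" model
with open(path, "rb") as f:
    restored = pickle.load(f)
```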
@@ -1,6 +1,34 @@ | |||
""" | |||
The :mod:`~fastNLP.models` module of fastNLP ships complete, ready-to-use models such as :class:`~fastNLP.models.CNNText` | |||
and :class:`~fastNLP.models.SeqLabeling`. | |||
.. todo:: | |||
Introduce these models (consistent with the homepage) | |||
""" | |||
__all__ = [ | |||
"CNNText", | |||
"SeqLabeling", | |||
"AdvSeqLabel", | |||
"ESIM", | |||
"StarTransEnc", | |||
"STSeqLabel", | |||
"STNLICls", | |||
"STSeqCls", | |||
"BiaffineParser", | |||
"GraphParser" | |||
] | |||
from .base_model import BaseModel | |||
from .bert import BertForMultipleChoice, BertForQuestionAnswering, BertForSequenceClassification, \ | |||
BertForTokenClassification | |||
from .biaffine_parser import BiaffineParser, GraphParser | |||
from .char_language_model import CharLM | |||
from .cnn_text_classification import CNNText | |||
from .sequence_modeling import SeqLabeling, AdvSeqLabel | |||
from .sequence_labeling import SeqLabeling, AdvSeqLabel | |||
from .snli import ESIM | |||
from .star_transformer import StarTransEnc, STSeqCls, STNLICls, STSeqLabel |
@@ -1,18 +1,18 @@ | |||
import torch | |||
from fastNLP.modules.decoder.MLP import MLP | |||
from ..modules.decoder.mlp import MLP | |||
class BaseModel(torch.nn.Module): | |||
"""Base PyTorch model for all models. | |||
""" | |||
def __init__(self): | |||
super(BaseModel, self).__init__() | |||
def fit(self, train_data, dev_data=None, **train_args): | |||
pass | |||
def predict(self, *args, **kwargs): | |||
raise NotImplementedError | |||
@@ -21,9 +21,9 @@ class NaiveClassifier(BaseModel): | |||
def __init__(self, in_feature_dim, out_feature_dim): | |||
super(NaiveClassifier, self).__init__() | |||
self.mlp = MLP([in_feature_dim, in_feature_dim, out_feature_dim]) | |||
def forward(self, x): | |||
return {"predict": torch.sigmoid(self.mlp(x))} | |||
def predict(self, x): | |||
return {"predict": torch.sigmoid(self.mlp(x)) > 0.5} |
@@ -2,361 +2,292 @@ | |||
bert.py is modified from huggingface/pytorch-pretrained-BERT, which is licensed under the Apache License 2.0. | |||
""" | |||
import copy | |||
import json | |||
import math | |||
import os | |||
import torch | |||
from torch import nn | |||
CONFIG_FILE = 'bert_config.json' | |||
MODEL_WEIGHTS = 'pytorch_model.bin' | |||
def gelu(x): | |||
return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0))) | |||
def swish(x): | |||
return x * torch.sigmoid(x) | |||
ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish} | |||
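The `gelu` entry in `ACT2FN` uses the exact erf formulation above; a pure-`math` check of its two characteristic properties (zero at the origin, approaching the identity for large positive inputs):

```python
import math

def gelu(x):
    # same formula as the torch version above: x * 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

at_zero = gelu(0.0)    # exactly 0.0
at_large = gelu(10.0)  # ~10.0, since erf(10/sqrt(2)) is ~1
```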
class BertLayerNorm(nn.Module): | |||
def __init__(self, hidden_size, eps=1e-12): | |||
super(BertLayerNorm, self).__init__() | |||
self.weight = nn.Parameter(torch.ones(hidden_size)) | |||
self.bias = nn.Parameter(torch.zeros(hidden_size)) | |||
self.variance_epsilon = eps | |||
def forward(self, x): | |||
u = x.mean(-1, keepdim=True) | |||
s = (x - u).pow(2).mean(-1, keepdim=True) | |||
x = (x - u) / torch.sqrt(s + self.variance_epsilon) | |||
return self.weight * x + self.bias | |||
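`BertLayerNorm.forward` normalizes over the last dimension with the *biased* variance (`.pow(2).mean(-1)`) and puts the epsilon inside the square root. A pure-Python mirror of that forward pass for a single vector, with illustrative weight/bias values:

```python
import math

def layer_norm(xs, weight, bias, eps=1e-12):
    # pure-Python version of BertLayerNorm.forward for one vector
    u = sum(xs) / len(xs)                         # mean over the last dimension
    s = sum((x - u) ** 2 for x in xs) / len(xs)   # biased variance, matching .pow(2).mean(-1)
    return [w * ((x - u) / math.sqrt(s + eps)) + b
            for x, w, b in zip(xs, weight, bias)]

out = layer_norm([1.0, 3.0], weight=[1.0, 1.0], bias=[0.0, 0.0])
# mean 2, variance 1 -> out is approximately [-1.0, 1.0]
```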
class BertEmbeddings(nn.Module): | |||
def __init__(self, vocab_size, hidden_size, max_position_embeddings, type_vocab_size, hidden_dropout_prob): | |||
super(BertEmbeddings, self).__init__() | |||
self.word_embeddings = nn.Embedding(vocab_size, hidden_size) | |||
self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size) | |||
self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size) | |||
# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load | |||
# any TensorFlow checkpoint file | |||
self.LayerNorm = BertLayerNorm(hidden_size, eps=1e-12) | |||
self.dropout = nn.Dropout(hidden_dropout_prob) | |||
def forward(self, input_ids, token_type_ids=None): | |||
seq_length = input_ids.size(1) | |||
position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device) | |||
position_ids = position_ids.unsqueeze(0).expand_as(input_ids) | |||
if token_type_ids is None: | |||
token_type_ids = torch.zeros_like(input_ids) | |||
words_embeddings = self.word_embeddings(input_ids) | |||
position_embeddings = self.position_embeddings(position_ids) | |||
token_type_embeddings = self.token_type_embeddings(token_type_ids) | |||
embeddings = words_embeddings + position_embeddings + token_type_embeddings | |||
embeddings = self.LayerNorm(embeddings) | |||
embeddings = self.dropout(embeddings) | |||
return embeddings | |||
class BertSelfAttention(nn.Module): | |||
def __init__(self, hidden_size, num_attention_heads, attention_probs_dropout_prob): | |||
super(BertSelfAttention, self).__init__() | |||
if hidden_size % num_attention_heads != 0: | |||
raise ValueError( | |||
"The hidden size (%d) is not a multiple of the number of attention " | |||
"heads (%d)" % (hidden_size, num_attention_heads)) | |||
self.num_attention_heads = num_attention_heads | |||
self.attention_head_size = int(hidden_size / num_attention_heads) | |||
self.all_head_size = self.num_attention_heads * self.attention_head_size | |||
self.query = nn.Linear(hidden_size, self.all_head_size) | |||
self.key = nn.Linear(hidden_size, self.all_head_size) | |||
self.value = nn.Linear(hidden_size, self.all_head_size) | |||
self.dropout = nn.Dropout(attention_probs_dropout_prob) | |||
def transpose_for_scores(self, x): | |||
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) | |||
x = x.view(*new_x_shape) | |||
return x.permute(0, 2, 1, 3) | |||
def forward(self, hidden_states, attention_mask): | |||
mixed_query_layer = self.query(hidden_states) | |||
mixed_key_layer = self.key(hidden_states) | |||
mixed_value_layer = self.value(hidden_states) | |||
query_layer = self.transpose_for_scores(mixed_query_layer) | |||
key_layer = self.transpose_for_scores(mixed_key_layer) | |||
value_layer = self.transpose_for_scores(mixed_value_layer) | |||
# Take the dot product between "query" and "key" to get the raw attention scores. | |||
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) | |||
attention_scores = attention_scores / math.sqrt(self.attention_head_size) | |||
# Apply the attention mask (precomputed for all layers in the BertModel forward() function) | |||
attention_scores = attention_scores + attention_mask | |||
# Normalize the attention scores to probabilities. | |||
attention_probs = nn.Softmax(dim=-1)(attention_scores) | |||
# This is actually dropping out entire tokens to attend to, which might | |||
# seem a bit unusual, but is taken from the original Transformer paper. | |||
attention_probs = self.dropout(attention_probs) | |||
context_layer = torch.matmul(attention_probs, value_layer) | |||
context_layer = context_layer.permute(0, 2, 1, 3).contiguous() | |||
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) | |||
context_layer = context_layer.view(*new_context_layer_shape) | |||
return context_layer | |||
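The core of `BertSelfAttention.forward` is scaled dot-product attention: scores are `q·k / sqrt(head_size)`, softmaxed into probabilities, then used to weight the values. A minimal single-head, single-query sketch with toy numbers (no batching or head reshaping):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# one query attending over two keys; head size d = 2 (toy values)
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
d = len(q)

# raw scores q.k / sqrt(d), as in the forward pass above
scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
probs = softmax(scores)
# context vector = probability-weighted sum of the values
context = [sum(p * v[i] for p, v in zip(probs, values)) for i in range(d)]
```

Because the query matches the first key more closely, the first attention weight dominates and the context vector leans toward the first value.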
class BertSelfOutput(nn.Module): | |||
def __init__(self, hidden_size, hidden_dropout_prob): | |||
super(BertSelfOutput, self).__init__() | |||
self.dense = nn.Linear(hidden_size, hidden_size) | |||
self.LayerNorm = BertLayerNorm(hidden_size, eps=1e-12) | |||
self.dropout = nn.Dropout(hidden_dropout_prob) | |||
def forward(self, hidden_states, input_tensor): | |||
hidden_states = self.dense(hidden_states) | |||
hidden_states = self.dropout(hidden_states) | |||
hidden_states = self.LayerNorm(hidden_states + input_tensor) | |||
return hidden_states | |||
class BertAttention(nn.Module): | |||
def __init__(self, hidden_size, num_attention_heads, attention_probs_dropout_prob, hidden_dropout_prob): | |||
super(BertAttention, self).__init__() | |||
self.self = BertSelfAttention(hidden_size, num_attention_heads, attention_probs_dropout_prob) | |||
self.output = BertSelfOutput(hidden_size, hidden_dropout_prob) | |||
def forward(self, input_tensor, attention_mask): | |||
self_output = self.self(input_tensor, attention_mask) | |||
attention_output = self.output(self_output, input_tensor) | |||
return attention_output | |||
class BertIntermediate(nn.Module): | |||
def __init__(self, hidden_size, intermediate_size, hidden_act): | |||
super(BertIntermediate, self).__init__() | |||
self.dense = nn.Linear(hidden_size, intermediate_size) | |||
self.intermediate_act_fn = ACT2FN[hidden_act] \ | |||
if isinstance(hidden_act, str) else hidden_act | |||
def forward(self, hidden_states): | |||
hidden_states = self.dense(hidden_states) | |||
hidden_states = self.intermediate_act_fn(hidden_states) | |||
return hidden_states | |||
class BertOutput(nn.Module): | |||
def __init__(self, hidden_size, intermediate_size, hidden_dropout_prob): | |||
super(BertOutput, self).__init__() | |||
self.dense = nn.Linear(intermediate_size, hidden_size) | |||
self.LayerNorm = BertLayerNorm(hidden_size, eps=1e-12) | |||
self.dropout = nn.Dropout(hidden_dropout_prob) | |||
def forward(self, hidden_states, input_tensor): | |||
hidden_states = self.dense(hidden_states) | |||
hidden_states = self.dropout(hidden_states) | |||
hidden_states = self.LayerNorm(hidden_states + input_tensor) | |||
return hidden_states | |||
class BertLayer(nn.Module): | |||
def __init__(self, hidden_size, num_attention_heads, attention_probs_dropout_prob, hidden_dropout_prob, | |||
intermediate_size, hidden_act): | |||
super(BertLayer, self).__init__() | |||
self.attention = BertAttention(hidden_size, num_attention_heads, attention_probs_dropout_prob, | |||
hidden_dropout_prob) | |||
self.intermediate = BertIntermediate(hidden_size, intermediate_size, hidden_act) | |||
self.output = BertOutput(hidden_size, intermediate_size, hidden_dropout_prob) | |||
def forward(self, hidden_states, attention_mask): | |||
attention_output = self.attention(hidden_states, attention_mask) | |||
intermediate_output = self.intermediate(attention_output) | |||
layer_output = self.output(intermediate_output, attention_output) | |||
return layer_output | |||
class BertEncoder(nn.Module): | |||
def __init__(self, num_hidden_layers, hidden_size, num_attention_heads, attention_probs_dropout_prob, | |||
hidden_dropout_prob, | |||
intermediate_size, hidden_act): | |||
super(BertEncoder, self).__init__() | |||
layer = BertLayer(hidden_size, num_attention_heads, attention_probs_dropout_prob, hidden_dropout_prob, | |||
intermediate_size, hidden_act) | |||
self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(num_hidden_layers)]) | |||
def forward(self, hidden_states, attention_mask, output_all_encoded_layers=True): | |||
all_encoder_layers = [] | |||
for layer_module in self.layer: | |||
hidden_states = layer_module(hidden_states, attention_mask) | |||
if output_all_encoded_layers: | |||
all_encoder_layers.append(hidden_states) | |||
if not output_all_encoded_layers: | |||
all_encoder_layers.append(hidden_states) | |||
return all_encoder_layers | |||
class BertPooler(nn.Module): | |||
def __init__(self, hidden_size): | |||
super(BertPooler, self).__init__() | |||
self.dense = nn.Linear(hidden_size, hidden_size) | |||
self.activation = nn.Tanh() | |||
def forward(self, hidden_states): | |||
# We "pool" the model by simply taking the hidden state corresponding | |||
# to the first token. | |||
first_token_tensor = hidden_states[:, 0] | |||
pooled_output = self.dense(first_token_tensor) | |||
pooled_output = self.activation(pooled_output) | |||
return pooled_output | |||
class BertModel(nn.Module): | |||
"""Bidirectional Embedding Representations from Transformers. | |||
If you want to use pre-trained weights, please download from the following sources provided by pytorch-pretrained-BERT. | |||
sources:: | |||
'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz", | |||
'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz", | |||
'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz", | |||
'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz", | |||
'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz", | |||
'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz", | |||
'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz", | |||
Construct a BERT model with pre-trained weights:: | |||
model = BertModel.from_pretrained("path/to/weights/directory") | |||
from .base_model import BaseModel | |||
from ..core.const import Const | |||
from ..modules.encoder import BertModel | |||
class BertForSequenceClassification(BaseModel): | |||
"""BERT model for classification. | |||
This module is composed of the BERT model with a linear layer on top of | |||
the pooled output. | |||
Params: | |||
`config`: a BertConfig class instance with the configuration to build a new model. | |||
`num_labels`: the number of classes for the classifier. Default = 2. | |||
Inputs: | |||
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] | |||
with the word token indices in the vocabulary. Items in the batch should begin with the special "CLS" token. (see the tokens preprocessing logic in the scripts | |||
`extract_features.py`, `run_classifier.py` and `run_squad.py`) | |||
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token | |||
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to | |||
a `sentence B` token (see BERT paper for more details). | |||
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices | |||
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max | |||
input sequence length in the current batch. It's the mask that we typically use for attention when | |||
a batch has varying length sentences. | |||
`labels`: labels for the classification output: torch.LongTensor of shape [batch_size] | |||
with indices selected in [0, ..., num_labels]. | |||
Outputs: | |||
if `labels` is not `None`: | |||
Outputs the CrossEntropy classification loss of the output with the labels. | |||
if `labels` is `None`: | |||
Outputs the classification logits of shape [batch_size, num_labels]. | |||
Example usage: | |||
```python | |||
# Already been converted into WordPiece token ids | |||
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]]) | |||
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]]) | |||
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]]) | |||
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768, | |||
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072) | |||
num_labels = 2 | |||
model = BertForSequenceClassification(config, num_labels) | |||
logits = model(input_ids, token_type_ids, input_mask) | |||
``` | |||
""" | |||
def __init__(self, vocab_size, | |||
hidden_size=768, | |||
num_hidden_layers=12, | |||
num_attention_heads=12, | |||
intermediate_size=3072, | |||
hidden_act="gelu", | |||
hidden_dropout_prob=0.1, | |||
attention_probs_dropout_prob=0.1, | |||
max_position_embeddings=512, | |||
type_vocab_size=2, | |||
initializer_range=0.02, **kwargs): | |||
super(BertModel, self).__init__() | |||
self.embeddings = BertEmbeddings(vocab_size, hidden_size, max_position_embeddings, | |||
type_vocab_size, hidden_dropout_prob) | |||
self.encoder = BertEncoder(num_hidden_layers, hidden_size, num_attention_heads, | |||
attention_probs_dropout_prob, hidden_dropout_prob, intermediate_size, | |||
hidden_act) | |||
self.pooler = BertPooler(hidden_size) | |||
self.initializer_range = initializer_range | |||
self.apply(self.init_bert_weights) | |||
def init_bert_weights(self, module): | |||
if isinstance(module, (nn.Linear, nn.Embedding)): | |||
# Slightly different from the TF version which uses truncated_normal for initialization | |||
# cf https://github.com/pytorch/pytorch/pull/5617 | |||
module.weight.data.normal_(mean=0.0, std=self.initializer_range) | |||
elif isinstance(module, BertLayerNorm): | |||
module.bias.data.zero_() | |||
module.weight.data.fill_(1.0) | |||
if isinstance(module, nn.Linear) and module.bias is not None: | |||
module.bias.data.zero_() | |||
def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=True): | |||
if attention_mask is None: | |||
attention_mask = torch.ones_like(input_ids) | |||
if token_type_ids is None: | |||
token_type_ids = torch.zeros_like(input_ids) | |||
# We create a 3D attention mask from a 2D tensor mask. | |||
# Sizes are [batch_size, 1, 1, to_seq_length] | |||
# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length] | |||
# this attention mask is more simple than the triangular masking of causal attention | |||
# used in OpenAI GPT, we just need to prepare the broadcast dimension here. | |||
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2) | |||
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for | |||
# masked positions, this operation will create a tensor which is 0.0 for | |||
# positions we want to attend and -10000.0 for masked positions. | |||
# Since we are adding it to the raw scores before the softmax, this is | |||
# effectively the same as removing these entirely. | |||
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility | |||
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 | |||
embedding_output = self.embeddings(input_ids, token_type_ids) | |||
encoded_layers = self.encoder(embedding_output, | |||
extended_attention_mask, | |||
output_all_encoded_layers=output_all_encoded_layers) | |||
sequence_output = encoded_layers[-1] | |||
pooled_output = self.pooler(sequence_output) | |||
if not output_all_encoded_layers: | |||
encoded_layers = encoded_layers[-1] | |||
return encoded_layers, pooled_output | |||
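The additive mask construction in `BertModel.forward` maps keep-positions (1) to 0.0 and masked positions (0) to -10000.0, so adding it to the raw scores before the softmax effectively removes masked tokens. The same transform on a plain Python list:

```python
# mirror of `(1.0 - extended_attention_mask) * -10000.0` above, on one sequence
attention_mask = [1, 1, 0]          # 1 = attend, 0 = padding
extended = [(1.0 - m) * -10000.0 for m in attention_mask]
# extended == [0.0, 0.0, -10000.0]: masked positions get a huge negative score
```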
@classmethod | |||
def from_pretrained(cls, pretrained_model_dir, state_dict=None, *inputs, **kwargs): | |||
# Load config | |||
config_file = os.path.join(pretrained_model_dir, CONFIG_FILE) | |||
config = json.load(open(config_file, "r")) | |||
# config = BertConfig.from_json_file(config_file) | |||
# logger.info("Model config {}".format(config)) | |||
# Instantiate model. | |||
model = cls(*inputs, **config, **kwargs) | |||
if state_dict is None: | |||
weights_path = os.path.join(pretrained_model_dir, MODEL_WEIGHTS) | |||
state_dict = torch.load(weights_path) | |||
old_keys = [] | |||
new_keys = [] | |||
for key in state_dict.keys(): | |||
new_key = None | |||
if 'gamma' in key: | |||
new_key = key.replace('gamma', 'weight') | |||
if 'beta' in key: | |||
new_key = key.replace('beta', 'bias') | |||
if new_key: | |||
old_keys.append(key) | |||
new_keys.append(new_key) | |||
for old_key, new_key in zip(old_keys, new_keys): | |||
state_dict[new_key] = state_dict.pop(old_key) | |||
missing_keys = [] | |||
unexpected_keys = [] | |||
error_msgs = [] | |||
# copy state_dict so _load_from_state_dict can modify it | |||
metadata = getattr(state_dict, '_metadata', None) | |||
state_dict = state_dict.copy() | |||
if metadata is not None: | |||
state_dict._metadata = metadata | |||
def load(module, prefix=''): | |||
local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {}) | |||
module._load_from_state_dict( | |||
state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs) | |||
for name, child in module._modules.items(): | |||
if child is not None: | |||
load(child, prefix + name + '.') | |||
load(model, prefix='' if hasattr(model, 'bert') else 'bert.') | |||
if len(missing_keys) > 0: | |||
print("Weights of {} not initialized from pretrained model: {}".format( | |||
model.__class__.__name__, missing_keys)) | |||
if len(unexpected_keys) > 0: | |||
print("Weights from pretrained model not used in {}: {}".format( | |||
model.__class__.__name__, unexpected_keys)) | |||
return model | |||
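The gamma/beta renaming in `from_pretrained` exists because TensorFlow-era checkpoints call the LayerNorm parameters `gamma`/`beta`, while PyTorch calls them `weight`/`bias`. A compact, standalone version of that dict transform (the checkpoint keys below are hypothetical, and chained `replace` stands in for the old-key/new-key lists above):

```python
# hypothetical TF-style checkpoint keys with gamma/beta LayerNorm names
state_dict = {
    "bert.encoder.layer.0.LayerNorm.gamma": 1,
    "bert.encoder.layer.0.LayerNorm.beta": 2,
    "bert.pooler.dense.weight": 3,
}

# rename gamma -> weight and beta -> bias; all other keys pass through untouched
renamed = {
    key.replace("gamma", "weight").replace("beta", "bias"): value
    for key, value in state_dict.items()
}
```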
def __init__(self, config, num_labels, bert_dir): | |||
super(BertForSequenceClassification, self).__init__() | |||
self.num_labels = num_labels | |||
self.bert = BertModel.from_pretrained(bert_dir) | |||
self.dropout = nn.Dropout(config.hidden_dropout_prob) | |||
self.classifier = nn.Linear(config.hidden_size, num_labels) | |||
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None): | |||
_, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False) | |||
pooled_output = self.dropout(pooled_output) | |||
logits = self.classifier(pooled_output) | |||
if labels is not None: | |||
loss_fct = nn.CrossEntropyLoss() | |||
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) | |||
return {Const.OUTPUT: logits, Const.LOSS: loss} | |||
else: | |||
return {Const.OUTPUT: logits} | |||
def predict(self, input_ids, token_type_ids=None, attention_mask=None): | |||
logits = self.forward(input_ids, token_type_ids, attention_mask)[Const.OUTPUT] | |||
return {Const.OUTPUT: torch.argmax(logits, dim=-1)} | |||
class BertForMultipleChoice(BaseModel): | |||
"""BERT model for multiple choice tasks. | |||
This module is composed of the BERT model with a linear layer on top of | |||
the pooled output. | |||
Params: | |||
`config`: a BertConfig class instance with the configuration to build a new model. | |||
`num_choices`: the number of classes for the classifier. Default = 2. | |||
Inputs: | |||
`input_ids`: a torch.LongTensor of shape [batch_size, num_choices, sequence_length] | |||
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts | |||
`extract_features.py`, `run_classifier.py` and `run_squad.py`) | |||
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, num_choices, sequence_length] | |||
with the token types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` | |||
and type 1 corresponds to a `sentence B` token (see BERT paper for more details). | |||
`attention_mask`: an optional torch.LongTensor of shape [batch_size, num_choices, sequence_length] with indices | |||
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max | |||
input sequence length in the current batch. It's the mask that we typically use for attention when | |||
a batch has varying length sentences. | |||
`labels`: labels for the classification output: torch.LongTensor of shape [batch_size] | |||
with indices selected in [0, ..., num_choices]. | |||
Outputs: | |||
if `labels` is not `None`: | |||
Outputs the CrossEntropy classification loss of the output with the labels. | |||
if `labels` is `None`: | |||
Outputs the classification logits of shape [batch_size, num_labels]. | |||
Example usage: | |||
```python | |||
# Already been converted into WordPiece token ids | |||
input_ids = torch.LongTensor([[[31, 51, 99], [15, 5, 0]], [[12, 16, 42], [14, 28, 57]]]) | |||
input_mask = torch.LongTensor([[[1, 1, 1], [1, 1, 0]],[[1,1,0], [1, 0, 0]]]) | |||
token_type_ids = torch.LongTensor([[[0, 0, 1], [0, 1, 0]],[[0, 1, 1], [0, 0, 1]]]) | |||
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768, | |||
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072) | |||
num_choices = 2 | |||
model = BertForMultipleChoice(config, num_choices, bert_dir) | |||
logits = model(input_ids, token_type_ids, input_mask) | |||
``` | |||
""" | |||
def __init__(self, config, num_choices, bert_dir): | |||
super(BertForMultipleChoice, self).__init__() | |||
self.num_choices = num_choices | |||
self.bert = BertModel.from_pretrained(bert_dir) | |||
self.dropout = nn.Dropout(config.hidden_dropout_prob) | |||
self.classifier = nn.Linear(config.hidden_size, 1) | |||
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None): | |||
flat_input_ids = input_ids.view(-1, input_ids.size(-1)) | |||
flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) | |||
flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) | |||
_, pooled_output = self.bert(flat_input_ids, flat_token_type_ids, flat_attention_mask, output_all_encoded_layers=False) | |||
pooled_output = self.dropout(pooled_output) | |||
logits = self.classifier(pooled_output) | |||
reshaped_logits = logits.view(-1, self.num_choices) | |||
if labels is not None: | |||
loss_fct = nn.CrossEntropyLoss() | |||
loss = loss_fct(reshaped_logits, labels) | |||
return {Const.OUTPUT: reshaped_logits, Const.LOSS: loss} | |||
else: | |||
return {Const.OUTPUT: reshaped_logits} | |||
def predict(self, input_ids, token_type_ids=None, attention_mask=None): | |||
logits = self.forward(input_ids, token_type_ids, attention_mask)[Const.OUTPUT] | |||
return {Const.OUTPUT: torch.argmax(logits, dim=-1)} | |||
class BertForTokenClassification(BaseModel):
    """BERT model for token-level classification.
    This module is composed of the BERT model with a linear layer on top of
    the full hidden state of the last layer.

    Params:
        `config`: a BertConfig class instance with the configuration to build a new model.
        `num_labels`: the number of classes for the classifier. Default = 2.
        `bert_dir`: a directory that contains the BERT parameters in the file `pytorch_model.bin`

    Inputs:
        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
            with the word token indices in the vocabulary (see the token preprocessing logic in the scripts
            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
            type indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
            a `sentence B` token (see the BERT paper for more details).
        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
            input sequence length in the current batch. It's the mask that we typically use for attention when
            a batch has varying length sentences.
        `labels`: labels for the classification output: torch.LongTensor of shape [batch_size, sequence_length]
            with indices selected in [0, ..., num_labels-1].

    Outputs:
        if `labels` is not `None`:
            Outputs the CrossEntropy classification loss of the output with the labels.
        if `labels` is `None`:
            Outputs the classification logits of shape [batch_size, sequence_length, num_labels].

    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
                        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    num_labels = 2
    bert_dir = 'your-bert-file-dir'
    model = BertForTokenClassification(config, num_labels, bert_dir)
    logits = model(input_ids, token_type_ids, input_mask)[Const.OUTPUT]
    ```
    """
    def __init__(self, config, num_labels, bert_dir):
        super(BertForTokenClassification, self).__init__()
        self.num_labels = num_labels
        self.bert = BertModel.from_pretrained(bert_dir)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
        sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)

        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            # Only keep active parts of the loss
            if attention_mask is not None:
                active_loss = attention_mask.view(-1) == 1
                active_logits = logits.view(-1, self.num_labels)[active_loss]
                active_labels = labels.view(-1)[active_loss]
                loss = loss_fct(active_logits, active_labels)
            else:
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            return {Const.OUTPUT: logits, Const.LOSS: loss}
        else:
            return {Const.OUTPUT: logits}

    def predict(self, input_ids, token_type_ids=None, attention_mask=None):
        logits = self.forward(input_ids, token_type_ids, attention_mask)[Const.OUTPUT]
        return {Const.OUTPUT: torch.argmax(logits, dim=-1)}
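As a quick illustration of the active-loss masking above, the sketch below (with made-up logits, labels, and a batch of 2 sentences of length 3) shows how `attention_mask` drops the padded position before the cross-entropy is computed, exactly as in `forward()`:

```python
import torch
import torch.nn as nn

# Hypothetical data: batch=2, seq_len=3, num_labels=2; last token of sample 2 is padding.
logits = torch.tensor([[[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
                       [[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]]])
labels = torch.tensor([[0, 1, 0], [0, 0, 1]])
attention_mask = torch.tensor([[1, 1, 1], [1, 1, 0]])

loss_fct = nn.CrossEntropyLoss()
# Flatten, then keep only positions where the mask is 1.
active = attention_mask.view(-1) == 1
active_logits = logits.view(-1, 2)[active]   # 5 of the 6 positions survive
active_labels = labels.view(-1)[active]
masked_loss = loss_fct(active_logits, active_labels)

# Without masking, the padded position would also contribute to the loss.
full_loss = loss_fct(logits.view(-1, 2), labels.view(-1))
```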
class BertForQuestionAnswering(BaseModel):
    """BERT model for Question Answering (span extraction).
    This module is composed of the BERT model with a linear layer on top of
    the sequence output that computes start_logits and end_logits

    Params:
        `config`: a BertConfig class instance with the configuration to build a new model.
        `bert_dir`: a directory that contains the BERT parameters in the file `pytorch_model.bin`

    Inputs:
        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
            with the word token indices in the vocabulary (see the token preprocessing logic in the scripts
            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
            type indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
            a `sentence B` token (see the BERT paper for more details).
        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
            input sequence length in the current batch. It's the mask that we typically use for attention when
            a batch has varying length sentences.
        `start_positions`: position of the first token for the labeled span: torch.LongTensor of shape [batch_size].
            Positions are clamped to the length of the sequence and positions outside of the sequence are not taken
            into account for computing the loss.
        `end_positions`: position of the last token for the labeled span: torch.LongTensor of shape [batch_size].
            Positions are clamped to the length of the sequence and positions outside of the sequence are not taken
            into account for computing the loss.

    Outputs:
        if `start_positions` and `end_positions` are not `None`:
            Outputs the total_loss which is the sum of the CrossEntropy loss for the start and end token positions.
        if `start_positions` or `end_positions` is `None`:
            Outputs start_logits and end_logits, which are the logits respectively for the start and end
            position tokens, each of shape [batch_size, sequence_length].

    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
                        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    bert_dir = 'your-bert-file-dir'
    model = BertForQuestionAnswering(config, bert_dir)
    pred = model(input_ids, token_type_ids, input_mask)
    start_logits, end_logits = pred[Const.OUTPUTS(0)], pred[Const.OUTPUTS(1)]
    ```
    """
    def __init__(self, config, bert_dir):
        super(BertForQuestionAnswering, self).__init__()
        self.bert = BertModel.from_pretrained(bert_dir)
        # TODO check with Google if it's normal there is no dropout on the token classifier of SQuAD in the TF version
        # self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.qa_outputs = nn.Linear(config.hidden_size, 2)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, start_positions=None, end_positions=None):
        sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        if start_positions is not None and end_positions is not None:
            # If we are on multi-GPU, split adds a dimension
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)
            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.size(1)
            start_positions.clamp_(0, ignored_index)
            end_positions.clamp_(0, ignored_index)

            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2
            return {Const.OUTPUTS(0): start_logits, Const.OUTPUTS(1): end_logits, Const.LOSS: total_loss}
        else:
            return {Const.OUTPUTS(0): start_logits, Const.OUTPUTS(1): end_logits}

    def predict(self, input_ids, token_type_ids=None, attention_mask=None, **kwargs):
        logits = self.forward(input_ids, token_type_ids, attention_mask)
        start_logits = logits[Const.OUTPUTS(0)]
        end_logits = logits[Const.OUTPUTS(1)]
        return {Const.OUTPUTS(0): torch.argmax(start_logits, dim=-1),
                Const.OUTPUTS(1): torch.argmax(end_logits, dim=-1)}
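The span prediction in `predict` above reduces to one `argmax` per logit head. A minimal sketch with hand-picked toy logits (batch=1, seq_len=5, hypothetical values):

```python
import torch

# Toy start/end logits; the true span is tokens 1..3 by construction.
start_logits = torch.tensor([[0.1, 2.0, 0.3, 0.2, 0.1]])
end_logits = torch.tensor([[0.1, 0.2, 0.3, 3.0, 0.1]])

start = torch.argmax(start_logits, dim=-1)  # position of the span's first token
end = torch.argmax(end_logits, dim=-1)      # position of the span's last token
# The predicted answer span covers tokens start..end inclusive.
span = (start.item(), end.item())
```

Note that independent argmaxes can yield `end < start`; SQuAD-style decoders usually search for the best valid (start, end) pair instead, which the model above leaves to post-processing.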
@@ -1,22 +1,31 @@
"""
Biaffine Dependency Parser 的 Pytorch 实现.
"""
__all__ = [
    "BiaffineParser",
    "GraphParser"
]

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

from collections import defaultdict

from ..core.const import Const as C
from ..core.losses import LossFunc
from ..core.metrics import MetricBase
from ..core.utils import seq_len_to_mask
from ..modules.dropout import TimestepDropout
from ..modules.encoder.transformer import TransformerEncoder
from ..modules.encoder.variational_rnn import VarLSTM
from ..modules.utils import initial_parameter
from ..modules.utils import get_embeddings
from .base_model import BaseModel
def _mst(scores):
    """
    with some modification to support parser output for MST decoding
    https://github.com/tdozat/Parser/blob/0739216129cd39d69997d28cbc4133b360ea3934/lib/models/nn.py#L692
@@ -42,7 +51,7 @@ def mst(scores):
                               scores[roots, new_heads] / root_scores)]
        heads[roots] = new_heads
        heads[new_root] = 0

    edges = defaultdict(set)
    vertices = set((0,))
    for dep, head in enumerate(heads[tokens]):
@@ -71,7 +80,7 @@ def mst(scores):
            heads[changed_cycle] = new_head
            edges[new_head].add(changed_cycle)
            edges[old_head].remove(changed_cycle)

    return heads
@@ -86,7 +95,7 @@ def _find_cycle(vertices, edges):
    _lowlinks = {}
    _onstack = defaultdict(lambda: False)
    _SCCs = []

    def _strongconnect(v):
        nonlocal _index
        _indices[v] = _index
@@ -94,38 +103,49 @@ def _find_cycle(vertices, edges):
        _index += 1
        _stack.append(v)
        _onstack[v] = True

        for w in edges[v]:
            if w not in _indices:
                _strongconnect(w)
                _lowlinks[v] = min(_lowlinks[v], _lowlinks[w])
            elif _onstack[w]:
                _lowlinks[v] = min(_lowlinks[v], _indices[w])

        if _lowlinks[v] == _indices[v]:
            SCC = set()
            while True:
                w = _stack.pop()
                _onstack[w] = False
                SCC.add(w)
                if not (w != v):
                    break
            _SCCs.append(SCC)

    for v in vertices:
        if v not in _indices:
            _strongconnect(v)

    return [SCC for SCC in _SCCs if len(SCC) > 1]
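`_find_cycle` is Tarjan's strongly-connected-components algorithm, keeping only components with more than one vertex (i.e. cycles in the head graph). A self-contained sketch of the same logic, with a hypothetical 3-token head assignment where tokens 1 and 2 point at each other:

```python
from collections import defaultdict

def find_cycles(vertices, edges):
    """Return SCCs with more than one vertex (cycles), Tarjan-style."""
    index = 0
    stack, indices, lowlinks, onstack, sccs = [], {}, {}, defaultdict(bool), []

    def strongconnect(v):
        nonlocal index
        indices[v] = lowlinks[v] = index
        index += 1
        stack.append(v)
        onstack[v] = True
        for w in edges[v]:
            if w not in indices:
                strongconnect(w)
                lowlinks[v] = min(lowlinks[v], lowlinks[w])
            elif onstack[w]:
                lowlinks[v] = min(lowlinks[v], indices[w])
        if lowlinks[v] == indices[v]:  # v is the root of an SCC
            scc = set()
            while True:
                w = stack.pop()
                onstack[w] = False
                scc.add(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in vertices:
        if v not in indices:
            strongconnect(v)
    return [s for s in sccs if len(s) > 1]

# edges[head] holds that head's dependents: 1 -> 2 and 2 -> 1 form a cycle.
edges = defaultdict(set)
for dep, head in [(1, 2), (2, 1)]:
    edges[head].add(dep)
cycles = find_cycles({0, 1, 2}, edges)  # [{1, 2}]
```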
class GraphParser(BaseModel):
    """
    基于图的parser base class, 支持贪婪解码和最大生成树解码
    """
    def __init__(self):
        super(GraphParser, self).__init__()

    @staticmethod
    def greedy_decoder(arc_matrix, mask=None):
        """
        贪心解码方式, 输入图, 输出贪心解码的parsing结果, 不保证合法的构成树

        :param arc_matrix: [batch, seq_len, seq_len] 输入图矩阵
        :param mask: [batch, seq_len] 输入图的padding mask, 有内容的部分为 1, 否则为 0.
            若为 ``None`` 时, 默认为全1向量. Default: ``None``
        :return heads: [batch, seq_len] 每个元素在树中对应的head(parent)预测结果
        """
        _, seq_len, _ = arc_matrix.shape
        matrix = arc_matrix + torch.diag(arc_matrix.new(seq_len).fill_(-np.inf))
        flip_mask = (mask == 0).byte()
@@ -134,24 +154,37 @@ class GraphParser(BaseModel):
        if mask is not None:
            heads *= mask.long()
        return heads

    @staticmethod
    def mst_decoder(arc_matrix, mask=None):
        """
        用最大生成树算法, 计算parsing结果, 保证输出合法的树结构

        :param arc_matrix: [batch, seq_len, seq_len] 输入图矩阵
        :param mask: [batch, seq_len] 输入图的padding mask, 有内容的部分为 1, 否则为 0.
            若为 ``None`` 时, 默认为全1向量. Default: ``None``
        :return heads: [batch, seq_len] 每个元素在树中对应的head(parent)预测结果
        """
        batch_size, seq_len, _ = arc_matrix.shape
        matrix = arc_matrix.clone()
        ans = matrix.new_zeros(batch_size, seq_len).long()
        lens = (mask.long()).sum(1) if mask is not None else torch.zeros(batch_size) + seq_len
        batch_idx = torch.arange(batch_size, dtype=torch.long, device=lens.device)
        for i, graph in enumerate(matrix):
            len_i = lens[i]
            ans[i, :len_i] = torch.as_tensor(_mst(graph.detach()[:len_i, :len_i].cpu().numpy()), device=ans.device)
        if mask is not None:
            ans *= mask.long()
        return ans
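The greedy path of `greedy_decoder` is just a row-wise argmax after forbidding self-loops. A sketch with a hypothetical 4-position score matrix (position 0 is the `<root>`; its own head entry is meaningless and gets masked downstream):

```python
import torch

# scores[0, i, j] = score of token i choosing token j as its head (made-up numbers).
scores = torch.tensor([[[0.0, 0.3, 0.2, 0.1],
                        [9.0, 0.0, 1.0, 1.0],
                        [1.0, 8.0, 0.0, 1.0],
                        [1.0, 1.0, 7.0, 0.0]]])
seq_len = scores.size(1)
# Forbid self-loops by putting -inf on the diagonal, as greedy_decoder does.
matrix = scores + torch.diag(scores.new(seq_len).fill_(-float('inf')))
heads = matrix.argmax(dim=2)  # token 1 -> root, token 2 -> 1, token 3 -> 2
```

Greedy decoding picks each token's head independently, so the result may contain cycles; `mst_decoder` guarantees a well-formed tree at the cost of running `_mst` per sentence on CPU.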
class ArcBiaffine(nn.Module):
    """
    Biaffine Dependency Parser 的子模块, 用于构建预测边的图

    :param hidden_size: 输入的特征维度
    :param bias: 是否使用bias. Default: ``True``
    """
    def __init__(self, hidden_size, bias=True):
        super(ArcBiaffine, self).__init__()
        self.U = nn.Parameter(torch.Tensor(hidden_size, hidden_size), requires_grad=True)
@@ -161,13 +194,13 @@ class ArcBiaffine(nn.Module):
        else:
            self.register_parameter("bias", None)
        initial_parameter(self)

    def forward(self, head, dep):
        """
        :param head: arc-head tensor [batch, length, hidden]
        :param dep: arc-dependent tensor [batch, length, hidden]
        :return output: tensor [batch, length, length]
        """
        output = dep.matmul(self.U)
        output = output.bmm(head.transpose(-1, -2))
@@ -177,41 +210,72 @@ class ArcBiaffine(nn.Module):
class LabelBilinear(nn.Module):
    """
    Biaffine Dependency Parser 的子模块, 用于构建预测边类别的图

    :param in1_features: 输入的特征1维度
    :param in2_features: 输入的特征2维度
    :param num_label: 边类别的个数
    :param bias: 是否使用bias. Default: ``True``
    """
    def __init__(self, in1_features, in2_features, num_label, bias=True):
        super(LabelBilinear, self).__init__()
        self.bilinear = nn.Bilinear(in1_features, in2_features, num_label, bias=bias)
        self.lin = nn.Linear(in1_features + in2_features, num_label, bias=False)

    def forward(self, x1, x2):
        """
        :param x1: [batch, seq_len, hidden] 输入特征1, 即label-head
        :param x2: [batch, seq_len, hidden] 输入特征2, 即label-dep
        :return output: [batch, seq_len, num_cls] 每个元素对应类别的概率图
        """
        output = self.bilinear(x1, x2)
        output += self.lin(torch.cat([x1, x2], dim=2))
        return output
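The biaffine arc score in `ArcBiaffine.forward` is `dep_i^T U head_j` for every (dependent, head) pair, computed as one batched matmul. The sketch below reproduces that with random tensors; the bias line follows one common biaffine formulation (the exact bias handling is elided by the hunk above, so treat it as an assumption):

```python
import torch

batch, length, hidden = 2, 5, 8
head = torch.randn(batch, length, hidden)  # arc-head representations
dep = torch.randn(batch, length, hidden)   # arc-dependent representations
U = torch.randn(hidden, hidden)
bias = torch.randn(hidden)

# dep_i^T U head_j for every (i, j) pair: [B, L, H] @ [H, H] -> [B, L, H],
# then bmm with head^T -> [B, L, L], the same two lines as ArcBiaffine.forward.
output = dep.matmul(U).bmm(head.transpose(-1, -2))
# Assumed bias term: adds head_j . bias to column j for every row.
output = output + head.matmul(bias).unsqueeze(1)
```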
class BiaffineParser(GraphParser):
    """
    别名::class:`fastNLP.models.BiaffineParser` :class:`fastNLP.models.biaffine_parser.BiaffineParser`

    Biaffine Dependency Parser 实现.
    论文参考 `Deep Biaffine Attention for Neural Dependency Parsing (Dozat and Manning, 2016) <https://arxiv.org/abs/1611.01734>`_ .

    :param init_embed: 单词词典, 可以是 tuple, 包括(num_embeddings, embedding_dim), 即
        embedding的大小和每个词的维度. 也可以传入 nn.Embedding 对象, 此时就以传入的对象作为embedding
    :param pos_vocab_size: part-of-speech 词典大小
    :param pos_emb_dim: part-of-speech 向量维度
    :param num_label: 边的类别个数
    :param rnn_layers: rnn encoder的层数
    :param rnn_hidden_size: rnn encoder 的隐状态维度
    :param arc_mlp_size: 边预测的MLP维度
    :param label_mlp_size: 类别预测的MLP维度
    :param dropout: dropout概率.
    :param encoder: encoder类别, 可选 ('lstm', 'var-lstm', 'transformer'). Default: lstm
    :param use_greedy_infer: 是否在inference时使用贪心算法.
        若 ``False`` , 使用更加精确但相对缓慢的MST算法. Default: ``False``
    """
    def __init__(self,
                 init_embed,
                 pos_vocab_size,
                 pos_emb_dim,
                 num_label,
                 rnn_layers=1,
                 rnn_hidden_size=200,
                 arc_mlp_size=100,
                 label_mlp_size=100,
                 dropout=0.3,
                 encoder='lstm',
                 use_greedy_infer=False):
        super(BiaffineParser, self).__init__()
        rnn_out_size = 2 * rnn_hidden_size
        word_hid_dim = pos_hid_dim = rnn_hidden_size
        self.word_embedding = get_embeddings(init_embed)
        word_emb_dim = self.word_embedding.embedding_dim
        self.pos_embedding = nn.Embedding(num_embeddings=pos_vocab_size, embedding_dim=pos_emb_dim)
        self.word_fc = nn.Linear(word_emb_dim, word_hid_dim)
        self.pos_fc = nn.Linear(pos_emb_dim, pos_hid_dim)
@@ -242,20 +306,20 @@ class BiaffineParser(GraphParser):
            if (d_k * n_head) != rnn_out_size:
                raise ValueError('unsupported rnn_out_size: {} for transformer'.format(rnn_out_size))
            self.position_emb = nn.Embedding(num_embeddings=self.max_len,
                                             embedding_dim=rnn_out_size)
            self.encoder = TransformerEncoder(num_layers=rnn_layers,
                                              model_size=rnn_out_size,
                                              inner_size=1024,
                                              key_size=d_k,
                                              value_size=d_v,
                                              num_head=n_head,
                                              dropout=dropout)
        else:
            raise ValueError('unsupported encoder type: {}'.format(encoder))

        self.mlp = nn.Sequential(nn.Linear(rnn_out_size, arc_mlp_size * 2 + label_mlp_size * 2),
                                 nn.ELU(),
                                 TimestepDropout(p=dropout))
        self.arc_mlp_size = arc_mlp_size
        self.label_mlp_size = label_mlp_size
        self.arc_predictor = ArcBiaffine(arc_mlp_size, bias=True)
@@ -263,7 +327,7 @@ class BiaffineParser(GraphParser):
        self.use_greedy_infer = use_greedy_infer
        self.reset_parameters()
        self.dropout = dropout

    def reset_parameters(self):
        for m in self.modules():
            if isinstance(m, nn.Embedding):
@@ -274,167 +338,210 @@ class BiaffineParser(GraphParser):
            else:
                for p in m.parameters():
                    nn.init.normal_(p, 0, 0.1)
    def forward(self, words1, words2, seq_len, target1=None):
        """模型forward阶段

        :param words1: [batch_size, seq_len] 输入word序列
        :param words2: [batch_size, seq_len] 输入pos序列
        :param seq_len: [batch_size, seq_len] 输入序列长度
        :param target1: [batch_size, seq_len] 输入真实标注的heads, 仅在训练阶段有效,
            用于训练label分类器. 若为 ``None`` , 使用预测的heads输入到label分类器
            Default: ``None``
        :return dict: parsing结果::

            pred1: [batch_size, seq_len, seq_len] 边预测logits
            pred2: [batch_size, seq_len, num_label] label预测logits
            pred3: [batch_size, seq_len] heads的预测结果, 在 ``target1=None`` 时预测
        """
        # prepare embeddings
        batch_size, length = words1.shape

        # get sequence mask
        mask = seq_len_to_mask(seq_len).long()

        word = self.word_embedding(words1)  # [N,L] -> [N,L,C_0]
        pos = self.pos_embedding(words2)  # [N,L] -> [N,L,C_1]

        word, pos = self.word_fc(word), self.pos_fc(pos)
        word, pos = self.word_norm(word), self.pos_norm(pos)
        x = torch.cat([word, pos], dim=2)  # -> [N,L,C]

        # encoder, extract features
        if self.encoder_name.endswith('lstm'):
            sort_lens, sort_idx = torch.sort(seq_len, dim=0, descending=True)
            x = x[sort_idx]
            x = nn.utils.rnn.pack_padded_sequence(x, sort_lens, batch_first=True)
            feat, _ = self.encoder(x)  # -> [N,L,C]
            feat, _ = nn.utils.rnn.pad_packed_sequence(feat, batch_first=True)
            _, unsort_idx = torch.sort(sort_idx, dim=0, descending=False)
            feat = feat[unsort_idx]
        else:
            seq_range = torch.arange(length, dtype=torch.long, device=x.device)[None, :]
            x = x + self.position_emb(seq_range)
            feat = self.encoder(x, mask.float())

        # for arc biaffine
        # mlp, reduce dim
        feat = self.mlp(feat)
        arc_sz, label_sz = self.arc_mlp_size, self.label_mlp_size
        arc_dep, arc_head = feat[:, :, :arc_sz], feat[:, :, arc_sz:2 * arc_sz]
        label_dep, label_head = feat[:, :, 2 * arc_sz:2 * arc_sz + label_sz], feat[:, :, 2 * arc_sz + label_sz:]

        # biaffine arc classifier
        arc_pred = self.arc_predictor(arc_head, arc_dep)  # [N, L, L]

        # use gold or predicted arc to predict label
        if target1 is None or not self.training:
            # use greedy decoding in training
            if self.training or self.use_greedy_infer:
                heads = self.greedy_decoder(arc_pred, mask)
            else:
                heads = self.mst_decoder(arc_pred, mask)
            head_pred = heads
        else:
            assert self.training  # must be training mode
            if target1 is None:
                heads = self.greedy_decoder(arc_pred, mask)
                head_pred = heads
            else:
                head_pred = None
                heads = target1

        batch_range = torch.arange(start=0, end=batch_size, dtype=torch.long, device=words1.device).unsqueeze(1)
        label_head = label_head[batch_range, heads].contiguous()
        label_pred = self.label_predictor(label_head, label_dep)  # [N, L, num_label]
        res_dict = {C.OUTPUTS(0): arc_pred, C.OUTPUTS(1): label_pred}
        if head_pred is not None:
            res_dict[C.OUTPUTS(2)] = head_pred
        return res_dict
    @staticmethod
    def loss(pred1, pred2, target1, target2, seq_len):
        """
        计算parser的loss

        :param pred1: [batch_size, seq_len, seq_len] 边预测logits
        :param pred2: [batch_size, seq_len, num_label] label预测logits
        :param target1: [batch_size, seq_len] 真实边的标注
        :param target2: [batch_size, seq_len] 真实类别的标注
        :param seq_len: [batch_size, seq_len] 真实目标的长度
        :return loss: scalar
        """
        batch_size, length, _ = pred1.shape
        mask = seq_len_to_mask(seq_len)
        flip_mask = (mask == 0)
        _arc_pred = pred1.clone()
        _arc_pred.masked_fill_(flip_mask.unsqueeze(1), -float('inf'))
        arc_logits = F.log_softmax(_arc_pred, dim=2)
        label_logits = F.log_softmax(pred2, dim=2)
        batch_index = torch.arange(batch_size, device=arc_logits.device, dtype=torch.long).unsqueeze(1)
        child_index = torch.arange(length, device=arc_logits.device, dtype=torch.long).unsqueeze(0)
        arc_loss = arc_logits[batch_index, child_index, target1]
        label_loss = label_logits[batch_index, child_index, target2]

        byte_mask = flip_mask.byte()
        arc_loss.masked_fill_(byte_mask, 0)
        label_loss.masked_fill_(byte_mask, 0)
        arc_nll = -arc_loss.mean()
        label_nll = -label_loss.mean()
        return arc_nll + label_nll
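The arc half of the loss above masks padding in two places: padded *heads* get `-inf` before the softmax (so they can never receive probability mass), while padded *children* are zeroed after gathering (so they contribute nothing to the mean). A sketch with hypothetical data (batch=1, length=4, last position padded):

```python
import torch
import torch.nn.functional as F

pred1 = torch.randn(1, 4, 4)            # toy arc logits
target1 = torch.tensor([[0, 0, 1, 0]])  # gold head index per position
mask = torch.tensor([[1, 1, 1, 0]])

flip_mask = mask == 0
arc = pred1.clone()
# Padded heads can never be selected: mask along the head (last) dimension.
arc.masked_fill_(flip_mask.unsqueeze(1), -float('inf'))
arc_logits = F.log_softmax(arc, dim=2)

batch_index = torch.arange(1).unsqueeze(1)
child_index = torch.arange(4).unsqueeze(0)
arc_loss = arc_logits[batch_index, child_index, target1]
# Padded children contribute zero to the loss.
arc_loss = arc_loss.masked_fill(flip_mask, 0)
nll = -arc_loss.mean()
```

Note the mean still divides by the full `batch * length`, so the zeroed padding positions dilute the average slightly; the method above inherits the same behavior.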
    def predict(self, words1, words2, seq_len):
        """模型预测API

        :param words1: [batch_size, seq_len] 输入word序列
        :param words2: [batch_size, seq_len] 输入pos序列
        :param seq_len: [batch_size, seq_len] 输入序列长度
        :return dict: parsing结果::

            pred1: [batch_size, seq_len] heads的预测结果
            pred2: [batch_size, seq_len, num_label] label预测logits
        """
        res = self(words1, words2, seq_len)
        output = {}
        output[C.OUTPUTS(0)] = res.pop(C.OUTPUTS(2))
        _, label_pred = res.pop(C.OUTPUTS(1)).max(2)
        output[C.OUTPUTS(1)] = label_pred
        return output
class ParserLoss(LossFunc):
    """
    别名::class:`fastNLP.models.ParserLoss` :class:`fastNLP.models.biaffine_parser.ParserLoss`

    计算parser的loss

    :param pred1: [batch_size, seq_len, seq_len] 边预测logits
    :param pred2: [batch_size, seq_len, num_label] label预测logits
    :param target1: [batch_size, seq_len] 真实边的标注
    :param target2: [batch_size, seq_len] 真实类别的标注
    :param seq_len: [batch_size, seq_len] 真实目标的长度
    :return loss: scalar
    """
    def __init__(self, pred1=None, pred2=None,
                 target1=None, target2=None,
                 seq_len=None):
        super(ParserLoss, self).__init__(BiaffineParser.loss,
                                         pred1=pred1,
                                         pred2=pred2,
                                         target1=target1,
                                         target2=target2,
                                         seq_len=seq_len)
class ParserMetric(MetricBase):
    """
    别名::class:`fastNLP.models.ParserMetric` :class:`fastNLP.models.biaffine_parser.ParserMetric`

    评估parser的性能

    :param pred1: 边预测logits
    :param pred2: label预测logits
    :param target1: 真实边的标注
    :param target2: 真实类别的标注
    :param seq_len: 序列长度
    :return dict: 评估结果::

        UAS: 不带label时, 边预测的准确率
        LAS: 同时预测边和label的准确率
    """
    def __init__(self, pred1=None, pred2=None,
                 target1=None, target2=None, seq_len=None):
        super().__init__()
        self._init_param_map(pred1=pred1, pred2=pred2,
                             target1=target1, target2=target2,
                             seq_len=seq_len)
        self.num_arc = 0
        self.num_label = 0
        self.num_sample = 0

    def get_metric(self, reset=True):
        res = {'UAS': self.num_arc * 1.0 / self.num_sample, 'LAS': self.num_label * 1.0 / self.num_sample}
        if reset:
            self.num_sample = self.num_label = self.num_arc = 0
        return res

    def evaluate(self, pred1, pred2, target1, target2, seq_len=None):
        """Evaluate the performance of prediction.
        """
        if seq_len is None:
            seq_mask = pred1.new_ones(pred1.size(), dtype=torch.long)
        else:
            seq_mask = seq_len_to_mask(seq_len.long()).long()
        # mask out <root> tag
        seq_mask[:, 0] = 0
        head_pred_correct = (pred1 == target1).long() * seq_mask
        label_pred_correct = (pred2 == target2).long() * head_pred_correct
        self.num_arc += head_pred_correct.sum().item()
        self.num_label += label_pred_correct.sum().item()
        self.num_sample += seq_mask.sum().item()
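The UAS/LAS bookkeeping above is easy to verify by hand: UAS counts arcs with the correct head, while LAS additionally requires the correct label *on a correct arc*. A sketch with a hypothetical length-4 sentence (index 0 is `<root>` and is excluded, as in `evaluate()`):

```python
import torch

head_pred = torch.tensor([[0, 0, 1, 1]])   # predicted heads (made-up)
head_true = torch.tensor([[0, 0, 1, 2]])   # token 3's head is wrong
label_pred = torch.tensor([[0, 3, 5, 7]])  # token 2's label is wrong
label_true = torch.tensor([[0, 3, 4, 7]])
mask = torch.tensor([[0, 1, 1, 1]])        # <root> excluded

arc_correct = (head_pred == head_true).long() * mask      # tokens 1 and 2
# A label only counts when its arc is also correct.
label_correct = (label_pred == label_true).long() * arc_correct  # token 1 only
uas = arc_correct.sum().item() / mask.sum().item()  # 2/3
las = label_correct.sum().item() / mask.sum().item()  # 1/3
```

Note token 3's correct label (7) does not count toward LAS because its head is wrong, which is exactly the standard LAS definition.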
@@ -1,131 +0,0 @@
import torch
import torch.nn as nn
import torch.nn.functional as F

from fastNLP.modules.encoder.lstm import LSTM


class Highway(nn.Module):
    """Highway network"""

    def __init__(self, input_size):
        super(Highway, self).__init__()
        self.fc1 = nn.Linear(input_size, input_size, bias=True)
        self.fc2 = nn.Linear(input_size, input_size, bias=True)

    def forward(self, x):
        t = torch.sigmoid(self.fc1(x))  # transform gate
        return torch.mul(t, F.relu(self.fc2(x))) + torch.mul(1 - t, x)
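The highway combination `t * relu(fc2(x)) + (1 - t) * x` interpolates between a transformed and an untouched copy of the input, gated per dimension. A functional sketch with hand-picked values (pretending `fc1`/`fc2` produced the given gate and hidden activations):

```python
import torch

x = torch.tensor([1.0, -2.0])
t = torch.tensor([1.0, 0.0])               # gate fully open / fully closed (hypothetical)
h = torch.relu(torch.tensor([3.0, 5.0]))   # pretend fc2(x) output

# Open gate passes the transform, closed gate carries x through unchanged.
out = t * h + (1 - t) * x  # [3.0, -2.0]
```

The carry path `(1 - t) * x` is what lets gradients flow through many stacked layers, which is why the character encoder below stacks two of them.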
class CharLM(nn.Module): | |||
"""CNN + highway network + LSTM | |||
# Input: | |||
4D tensor with shape [batch_size, in_channel, height, width] | |||
# Output: | |||
2D Tensor with shape [batch_size, vocab_size] | |||
# Arguments: | |||
char_emb_dim: the size of each character's attention | |||
word_emb_dim: the size of each word's attention | |||
vocab_size: num of unique words | |||
num_char: num of characters | |||
use_gpu: True or False | |||
""" | |||
def __init__(self, char_emb_dim, word_emb_dim, | |||
vocab_size, num_char): | |||
super(CharLM, self).__init__() | |||
self.char_emb_dim = char_emb_dim | |||
self.word_emb_dim = word_emb_dim | |||
self.vocab_size = vocab_size | |||
# char attention layer | |||
self.char_embed = nn.Embedding(num_char, char_emb_dim) | |||
# convolutions of filters with different sizes | |||
self.convolutions = [] | |||
# list of tuples: (the number of filter, width) | |||
self.filter_num_width = [(25, 1), (50, 2), (75, 3), (100, 4), (125, 5), (150, 6)] | |||
for out_channel, filter_width in self.filter_num_width: | |||
self.convolutions.append( | |||
nn.Conv2d( | |||
1, # in_channel | |||
out_channel, # out_channel | |||
kernel_size=(char_emb_dim, filter_width), # (height, width) | |||
bias=True | |||
) | |||
) | |||
self.highway_input_dim = sum([x for x, y in self.filter_num_width]) | |||
self.batch_norm = nn.BatchNorm1d(self.highway_input_dim, affine=False) | |||
# highway net | |||
self.highway1 = Highway(self.highway_input_dim) | |||
self.highway2 = Highway(self.highway_input_dim) | |||
# LSTM | |||
self.lstm_num_layers = 2 | |||
self.lstm = LSTM(self.highway_input_dim, hidden_size=self.word_emb_dim, num_layers=self.lstm_num_layers, | |||
dropout=0.5) | |||
# output layer | |||
self.dropout = nn.Dropout(p=0.5) | |||
self.linear = nn.Linear(self.word_emb_dim, self.vocab_size) | |||
def forward(self, x): | |||
# Input: Variable of Tensor with shape [num_seq, seq_len, max_word_len+2] | |||
# Return: Variable of Tensor with shape [num_words, len(word_dict)] | |||
lstm_batch_size = x.size()[0] | |||
lstm_seq_len = x.size()[1] | |||
x = x.contiguous().view(-1, x.size()[2]) | |||
# [num_seq*seq_len, max_word_len+2] | |||
x = self.char_embed(x) | |||
# [num_seq*seq_len, max_word_len+2, char_emb_dim] | |||
x = torch.transpose(x.view(x.size()[0], 1, x.size()[1], -1), 2, 3) | |||
# [num_seq*seq_len, 1, max_word_len+2, char_emb_dim] | |||
x = self.conv_layers(x) | |||
# [num_seq*seq_len, total_num_filters] | |||
x = self.batch_norm(x) | |||
# [num_seq*seq_len, total_num_filters] | |||
x = self.highway1(x) | |||
x = self.highway2(x) | |||
# [num_seq*seq_len, total_num_filters] | |||
x = x.contiguous().view(lstm_batch_size, lstm_seq_len, -1) | |||
# [num_seq, seq_len, total_num_filters] | |||
x = self.lstm(x) | |||
# [seq_len, num_seq, hidden_size] | |||
x = self.dropout(x) | |||
# [seq_len, num_seq, hidden_size] | |||
x = x.contiguous().view(lstm_batch_size * lstm_seq_len, -1) | |||
# [num_seq*seq_len, hidden_size] | |||
x = self.linear(x) | |||
# [num_seq*seq_len, vocab_size] | |||
return x | |||
def conv_layers(self, x): | |||
chosen_list = list() | |||
for conv in self.convolutions: | |||
feature_map = torch.tanh(conv(x))  # F.tanh is deprecated | |||
# (batch_size, out_channel, 1, max_word_len-width+1) | |||
chosen = torch.max(feature_map, 3)[0] | |||
# (batch_size, out_channel, 1) | |||
chosen = chosen.squeeze(-1)  # squeeze only the last dim; a plain squeeze() would also collapse batch_size == 1 | |||
# (batch_size, out_channel) | |||
chosen_list.append(chosen) | |||
# (batch_size, total_num_filers) | |||
return torch.cat(chosen_list, 1) |
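conv_layers keeps, for every filter, only its strongest response along the word — max-over-time pooling, the torch.max(feature_map, 3) step above. A sketch of that pooling on plain lists (the function name and data are illustrative only):

```python
def max_over_time(feature_maps):
    """feature_maps: one activation sequence (list of floats) per
    filter. Returns one scalar per filter: its strongest response
    anywhere along the sequence."""
    return [max(fm) for fm in feature_maps]

# Two filters sliding over a 4-step input:
pooled = max_over_time([[0.1, 0.9, 0.3, 0.2],
                        [0.5, 0.4, 0.8, 0.7]])
# pooled == [0.9, 0.8]
```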
@@ -1,57 +1,68 @@ | |||
# python: 3.6 | |||
# encoding: utf-8 | |||
__all__ = [ | |||
"CNNText" | |||
] | |||
import torch | |||
import torch.nn as nn | |||
# import torch.nn.functional as F | |||
import fastNLP.modules.encoder as encoder | |||
from ..core.const import Const as C | |||
from ..modules import encoder | |||
class CNNText(torch.nn.Module): | |||
""" | |||
Text classification model by character CNN, the implementation of paper | |||
'Yoon Kim. 2014. Convolution Neural Networks for Sentence | |||
Classification.' | |||
""" | |||
别名::class:`fastNLP.models.CNNText` :class:`fastNLP.models.cnn_text_classification.CNNText` | |||
def __init__(self, embed_num, | |||
embed_dim, | |||
使用CNN进行文本分类的模型 | |||
'Yoon Kim. 2014. Convolution Neural Networks for Sentence Classification.' | |||
:param tuple(int,int),torch.FloatTensor,nn.Embedding,numpy.ndarray init_embed: Embedding的大小(传入tuple(int, int), | |||
第一个int为vocab_zie, 第二个int为embed_dim); 如果为Tensor, Embedding, ndarray等则直接使用该值初始化Embedding | |||
:param int num_classes: 一共有多少类 | |||
:param int,tuple(int) out_channels: 输出channel的数量。如果为list,则需要与kernel_sizes的数量保持一致 | |||
:param int,tuple(int) kernel_sizes: 输出channel的kernel大小。 | |||
:param int padding: 对句子前后的pad的大小, 用0填充。 | |||
:param float dropout: Dropout的大小 | |||
""" | |||
def __init__(self, init_embed, | |||
num_classes, | |||
kernel_nums=(3, 4, 5), | |||
kernel_sizes=(3, 4, 5), | |||
padding=0, | |||
dropout=0.5): | |||
super(CNNText, self).__init__() | |||
self.embed = encoder.Embedding(init_embed) | |||
self.conv_pool = encoder.ConvMaxpool( | |||
in_channels=self.embed.embedding_dim, | |||
out_channels=kernel_nums, | |||
kernel_sizes=kernel_sizes, | |||
padding=padding) | |||
self.dropout = nn.Dropout(dropout) | |||
self.fc = nn.Linear(sum(kernel_nums), num_classes) | |||
def forward(self, words, seq_len=None): | |||
""" | |||
:param torch.LongTensor words: [batch_size, seq_len], word indices of the sentences | |||
:param torch.LongTensor seq_len: [batch,] length of each sentence | |||
:return output: dict of torch.LongTensor, [batch_size, num_classes] | |||
""" | |||
x = self.embed(words) # [N,L] -> [N,L,C] | |||
x = self.conv_pool(x) # [N,L,C] -> [N,C] | |||
x = self.dropout(x) | |||
x = self.fc(x) # [N,C] -> [N, N_class] | |||
return {C.OUTPUT: x} | |||
def predict(self, words, seq_len=None): | |||
""" | |||
:param torch.LongTensor words: [batch_size, seq_len], word indices of the sentences | |||
:param torch.LongTensor seq_len: [batch,] length of each sentence | |||
:return predict: dict of torch.LongTensor, [batch_size, ] | |||
""" | |||
output = self(words, seq_len) | |||
_, predict = output[C.OUTPUT].max(dim=1) | |||
return {C.OUTPUT: predict} | |||
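predict() collapses the class scores to one label per sentence with an argmax over dim 1. The same reduction on plain lists (a hedged sketch; predict_labels is a hypothetical helper, not part of fastNLP):

```python
def predict_labels(scores):
    """scores: [batch_size][num_classes] raw logits. Returns the
    index of the highest-scoring class per row, mirroring
    output.max(dim=1) in the model's predict()."""
    return [row.index(max(row)) for row in scores]

labels = predict_labels([[0.2, 1.5, -0.3],
                         [2.0, 0.1, 0.4]])
# labels == [1, 0]
```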
@@ -5,9 +5,9 @@ import os | |||
import torch | |||
import torch.nn.functional as F | |||
import fastNLP | |||
from . import enas_utils as utils | |||
from .enas_utils import Node | |||
def _construct_dags(prev_nodes, activations, func_names, num_blocks): | |||
@@ -1,17 +1,18 @@ | |||
""" | |||
Module containing the shared RNN model. | |||
Code Modified from https://github.com/carpedm20/ENAS-pytorch | |||
""" | |||
import collections | |||
import numpy as np | |||
import torch | |||
import torch.nn as nn | |||
import torch.nn.functional as F | |||
from torch.autograd import Variable | |||
from . import enas_utils as utils | |||
from .base_model import BaseModel | |||
def _get_dropped_weights(w_raw, dropout_p, is_training): | |||
"""Drops out weights to implement DropConnect. | |||
@@ -36,12 +37,13 @@ def _get_dropped_weights(w_raw, dropout_p, is_training): | |||
The above TODO is the reason for the hacky check for `torch.nn.Parameter`. | |||
""" | |||
dropped_w = F.dropout(w_raw, p=dropout_p, training=is_training) | |||
if isinstance(dropped_w, torch.nn.Parameter): | |||
dropped_w = dropped_w.clone() | |||
return dropped_w | |||
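_get_dropped_weights implements DropConnect: unlike ordinary dropout, the random zeroing (with 1/(1-p) rescaling of the survivors) is applied to the weights rather than the activations, and is skipped entirely at eval time. A plain-Python sketch of that behavior (drop_connect is an illustrative name, not the library API):

```python
import random

def drop_connect(weights, p, training, rng=random.Random(0)):
    """Zero each weight with probability p and rescale the
    survivors by 1/(1-p); identity when training is False."""
    if not training or p == 0.0:
        return list(weights)
    keep = 1.0 - p
    return [w / keep if rng.random() < keep else 0.0 for w in weights]
```

The rescaling keeps the expected value of each weight unchanged between train and eval, which is why no correction is needed at test time.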
class EmbeddingDropout(torch.nn.Embedding): | |||
"""Class for dropping out embeddings by zero'ing out parameters in the | |||
embedding matrix. | |||
@@ -54,6 +56,7 @@ class EmbeddingDropout(torch.nn.Embedding): | |||
See 'A Theoretically Grounded Application of Dropout in Recurrent Neural | |||
Networks', (Gal and Ghahramani, 2016). | |||
""" | |||
def __init__(self, | |||
num_embeddings, | |||
embedding_dim, | |||
@@ -84,14 +87,14 @@ class EmbeddingDropout(torch.nn.Embedding): | |||
assert (dropout >= 0.0) and (dropout < 1.0), ('Dropout must be >= 0.0 ' | |||
'and < 1.0') | |||
self.scale = scale | |||
def forward(self, inputs): # pylint:disable=arguments-differ | |||
"""Embeds `inputs` with the dropped out embedding weight matrix.""" | |||
if self.training: | |||
dropout = self.dropout | |||
else: | |||
dropout = 0 | |||
if dropout: | |||
mask = self.weight.data.new(self.weight.size(0), 1) | |||
mask.bernoulli_(1 - dropout) | |||
@@ -102,7 +105,7 @@ class EmbeddingDropout(torch.nn.Embedding): | |||
masked_weight = self.weight | |||
if self.scale and self.scale != 1: | |||
masked_weight = masked_weight * self.scale | |||
return F.embedding(inputs, | |||
masked_weight, | |||
max_norm=self.max_norm, | |||
@@ -115,7 +118,7 @@ class LockedDropout(nn.Module): | |||
# code from https://github.com/salesforce/awd-lstm-lm/blob/master/locked_dropout.py | |||
def __init__(self): | |||
super().__init__() | |||
def forward(self, x, dropout=0.5): | |||
if not self.training or not dropout: | |||
return x | |||
@@ -127,11 +130,12 @@ class LockedDropout(nn.Module): | |||
class ENASModel(BaseModel): | |||
"""Shared RNN model.""" | |||
def __init__(self, embed_num, num_classes, num_blocks=4, cuda=False, shared_hid=1000, shared_embed=1000): | |||
super(ENASModel, self).__init__() | |||
self.use_cuda = cuda | |||
self.shared_hid = shared_hid | |||
self.num_blocks = num_blocks | |||
self.decoder = nn.Linear(self.shared_hid, num_classes) | |||
@@ -140,16 +144,16 @@ class ENASModel(BaseModel): | |||
dropout=0.1) | |||
self.lockdrop = LockedDropout() | |||
self.dag = None | |||
# Tie weights | |||
# self.decoder.weight = self.encoder.weight | |||
# Since W^{x, c} and W^{h, c} are always summed, there | |||
# is no point duplicating their bias offset parameter. Likewise for | |||
# W^{x, h} and W^{h, h}. | |||
self.w_xc = nn.Linear(shared_embed, self.shared_hid) | |||
self.w_xh = nn.Linear(shared_embed, self.shared_hid) | |||
# The raw weights are stored here because the hidden-to-hidden weights | |||
# are weight dropped on the forward pass. | |||
self.w_hc_raw = torch.nn.Parameter( | |||
@@ -158,10 +162,10 @@ class ENASModel(BaseModel): | |||
torch.Tensor(self.shared_hid, self.shared_hid)) | |||
self.w_hc = None | |||
self.w_hh = None | |||
self.w_h = collections.defaultdict(dict) | |||
self.w_c = collections.defaultdict(dict) | |||
for idx in range(self.num_blocks): | |||
for jdx in range(idx + 1, self.num_blocks): | |||
self.w_h[idx][jdx] = nn.Linear(self.shared_hid, | |||
@@ -170,48 +174,47 @@ class ENASModel(BaseModel): | |||
self.w_c[idx][jdx] = nn.Linear(self.shared_hid, | |||
self.shared_hid, | |||
bias=False) | |||
self._w_h = nn.ModuleList([self.w_h[idx][jdx] | |||
for idx in self.w_h | |||
for jdx in self.w_h[idx]]) | |||
self._w_c = nn.ModuleList([self.w_c[idx][jdx] | |||
for idx in self.w_c | |||
for jdx in self.w_c[idx]]) | |||
self.batch_norm = None | |||
# if args.mode == 'train': | |||
# self.batch_norm = nn.BatchNorm1d(self.shared_hid) | |||
# else: | |||
# self.batch_norm = None | |||
self.reset_parameters() | |||
self.static_init_hidden = utils.keydefaultdict(self.init_hidden) | |||
def setDAG(self, dag): | |||
if self.dag is None: | |||
self.dag = dag | |||
def forward(self, word_seq, hidden=None): | |||
inputs = torch.transpose(word_seq, 0, 1) | |||
time_steps = inputs.size(0) | |||
batch_size = inputs.size(1) | |||
self.w_hh = _get_dropped_weights(self.w_hh_raw, | |||
0.5, | |||
self.training) | |||
self.w_hc = _get_dropped_weights(self.w_hc_raw, | |||
0.5, | |||
self.training) | |||
# hidden = self.static_init_hidden[batch_size] if hidden is None else hidden | |||
hidden = self.static_init_hidden[batch_size] | |||
embed = self.encoder(inputs) | |||
embed = self.lockdrop(embed, 0.65 if self.training else 0) | |||
# The norm of hidden states are clipped here because | |||
# otherwise ENAS is especially prone to exploding activations on the | |||
# forward pass. This could probably be fixed in a more elegant way, but | |||
@@ -227,7 +230,7 @@ class ENASModel(BaseModel): | |||
for step in range(time_steps): | |||
x_t = embed[step] | |||
logit, hidden = self.cell(x_t, hidden, self.dag) | |||
hidden_norms = hidden.norm(dim=-1) | |||
max_norm = 25.0 | |||
if hidden_norms.data.max() > max_norm: | |||
@@ -238,60 +241,60 @@ class ENASModel(BaseModel): | |||
# because the PyTorch slicing and slice assignment is too | |||
# flaky. | |||
hidden_norms = hidden_norms.data.cpu().numpy() | |||
clipped_num += 1 | |||
if hidden_norms.max() > max_clipped_norm: | |||
max_clipped_norm = hidden_norms.max() | |||
clip_select = hidden_norms > max_norm | |||
clip_norms = hidden_norms[clip_select] | |||
mask = np.ones(hidden.size()) | |||
normalizer = max_norm / clip_norms | |||
normalizer = normalizer[:, np.newaxis] | |||
mask[clip_select] = normalizer | |||
if self.use_cuda: | |||
hidden *= torch.autograd.Variable( | |||
torch.FloatTensor(mask).cuda(), requires_grad=False) | |||
else: | |||
hidden *= torch.autograd.Variable( | |||
torch.FloatTensor(mask), requires_grad=False) | |||
logits.append(logit) | |||
h1tohT.append(hidden) | |||
h1tohT = torch.stack(h1tohT) | |||
output = torch.stack(logits) | |||
raw_output = output | |||
output = self.lockdrop(output, 0.4 if self.training else 0) | |||
# Pooling | |||
output = torch.mean(output, 0) | |||
decoded = self.decoder(output) | |||
extra_out = {'dropped': decoded, | |||
'hiddens': h1tohT, | |||
'raw': raw_output} | |||
return {'pred': decoded, 'hidden': hidden, 'extra_out': extra_out} | |||
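The forward pass above rescales any hidden row whose L2 norm exceeds max_norm back down to exactly max_norm (the mask/normalizer block), since ENAS is prone to exploding activations. The same renormalization sketched on plain lists (clip_row_norms is an illustrative name):

```python
def clip_row_norms(rows, max_norm=25.0):
    """Scale each row whose L2 norm exceeds max_norm down to
    exactly max_norm; rows under the threshold pass unchanged."""
    clipped = []
    for row in rows:
        norm = sum(v * v for v in row) ** 0.5
        scale = max_norm / norm if norm > max_norm else 1.0
        clipped.append([v * scale for v in row])
    return clipped

rows = clip_row_norms([[30.0, 40.0], [3.0, 4.0]])
# first row (norm 50) is rescaled to norm 25; second row is untouched
```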
def cell(self, x, h_prev, dag): | |||
"""Computes a single pass through the discovered RNN cell.""" | |||
c = {} | |||
h = {} | |||
f = {} | |||
f[0] = self.get_f(dag[-1][0].name) | |||
c[0] = torch.sigmoid(self.w_xc(x) + F.linear(h_prev, self.w_hc, None)) | |||
h[0] = (c[0] * f[0](self.w_xh(x) + F.linear(h_prev, self.w_hh, None)) + | |||
(1 - c[0]) * h_prev) | |||
leaf_node_ids = [] | |||
q = collections.deque() | |||
q.append(0) | |||
# Computes connections from the parent nodes `node_id` | |||
# to their child nodes `next_id` recursively, skipping leaf nodes. A | |||
# leaf node is a node whose id == `self.num_blocks`. | |||
@@ -307,10 +310,10 @@ class ENASModel(BaseModel): | |||
while True: | |||
if len(q) == 0: | |||
break | |||
node_id = q.popleft() | |||
nodes = dag[node_id] | |||
for next_node in nodes: | |||
next_id = next_node.id | |||
if next_id == self.num_blocks: | |||
@@ -318,38 +321,38 @@ class ENASModel(BaseModel): | |||
assert len(nodes) == 1, ('parent of leaf node should have ' | |||
'only one child') | |||
continue | |||
w_h = self.w_h[node_id][next_id] | |||
w_c = self.w_c[node_id][next_id] | |||
f[next_id] = self.get_f(next_node.name) | |||
c[next_id] = torch.sigmoid(w_c(h[node_id])) | |||
h[next_id] = (c[next_id] * f[next_id](w_h(h[node_id])) + | |||
(1 - c[next_id]) * h[node_id]) | |||
q.append(next_id) | |||
# Instead of averaging loose ends, perhaps there should | |||
# be a set of separate unshared weights for each "loose" connection | |||
# between each node in a cell and the output. | |||
# | |||
# As it stands, all weights W^h_{ij} are doing double duty by | |||
# connecting both from i to j, as well as from i to the output. | |||
# average all the loose ends | |||
leaf_nodes = [h[node_id] for node_id in leaf_node_ids] | |||
output = torch.mean(torch.stack(leaf_nodes, 2), -1) | |||
# stabilizing the Updates of omega | |||
if self.batch_norm is not None: | |||
output = self.batch_norm(output) | |||
return output, h[self.num_blocks - 1] | |||
def init_hidden(self, batch_size): | |||
zeros = torch.zeros(batch_size, self.shared_hid) | |||
return utils.get_variable(zeros, self.use_cuda, requires_grad=False) | |||
def get_f(self, name): | |||
name = name.lower() | |||
if name == 'relu': | |||
@@ -361,22 +364,21 @@ class ENASModel(BaseModel): | |||
elif name == 'sigmoid': | |||
f = torch.sigmoid | |||
return f | |||
@property | |||
def num_parameters(self): | |||
def size(p): | |||
return np.prod(p.size()) | |||
return sum([size(param) for param in self.parameters()]) | |||
def reset_parameters(self): | |||
init_range = 0.025 | |||
# init_range = 0.025 if self.args.mode == 'train' else 0.04 | |||
for param in self.parameters(): | |||
param.data.uniform_(-init_range, init_range) | |||
self.decoder.bias.data.fill_(0) | |||
def predict(self, word_seq): | |||
""" | |||
@@ -1,30 +1,25 @@ | |||
# Code Modified from https://github.com/carpedm20/ENAS-pytorch | |||
import os | |||
import math | |||
import time | |||
from datetime import datetime, timedelta | |||
import numpy as np | |||
import torch | |||
from torch import nn | |||
from torch.optim import Adam | |||
try: | |||
from tqdm.auto import tqdm | |||
except: | |||
from ..core.utils import _pseudo_tqdm as tqdm | |||
from ..core.trainer import Trainer | |||
from ..core.batch import Batch | |||
from ..core.callback import CallbackManager, CallbackException | |||
from ..core.dataset import DataSet | |||
from ..core.utils import _move_dict_value_to_device | |||
from . import enas_utils as utils | |||
from ..core.utils import _build_args | |||
def _get_no_grad_ctx_mgr(): | |||
@@ -34,8 +29,9 @@ def _get_no_grad_ctx_mgr(): | |||
return torch.no_grad() | |||
class ENASTrainer(Trainer): | |||
"""A class to wrap training code.""" | |||
def __init__(self, train_data, model, controller, **kwargs): | |||
"""Constructor for training algorithm. | |||
:param DataSet train_data: the training data | |||
@@ -48,30 +44,31 @@ class ENASTrainer(fastNLP.Trainer): | |||
self.controller_step = 0 | |||
self.shared_step = 0 | |||
self.max_length = 35 | |||
self.shared = model | |||
self.controller = controller | |||
self.shared_optim = Adam( | |||
self.shared.parameters(), | |||
lr=20.0, | |||
weight_decay=1e-7) | |||
self.controller_optim = Adam( | |||
self.controller.parameters(), | |||
lr=3.5e-4) | |||
def train(self, load_best_model=True): | |||
""" | |||
:param bool load_best_model: effective only when dev_data was provided at construction; if True, the trainer | |||
reloads the model parameters that performed best on dev before returning. | |||
:return results: a dict with the following entries:: | |||
seconds: float, training time in seconds | |||
(the following three entries are present only when dev_data was provided) | |||
best_eval: Dict of Dict, the evaluation results | |||
best_epoch: int, the epoch at which the best value was reached | |||
best_step: int, the step (batch update) at which the best value was reached | |||
""" | |||
results = {} | |||
@@ -80,25 +77,26 @@ class ENASTrainer(fastNLP.Trainer): | |||
results['seconds'] = 0. | |||
return results | |||
try: | |||
if torch.cuda.is_available() and "cuda" in self.device: | |||
self.model = self.model.cuda() | |||
self._model_device = self.model.parameters().__next__().device | |||
self._mode(self.model, is_test=False) | |||
self.start_time = str(datetime.now().strftime('%Y-%m-%d-%H-%M-%S')) | |||
start_time = time.time() | |||
print("training epochs started " + self.start_time, flush=True) | |||
try: | |||
self.callback_manager.on_train_begin() | |||
self._train() | |||
self.callback_manager.on_train_end() | |||
except (CallbackException, KeyboardInterrupt) as e: | |||
self.callback_manager.on_exception(e) | |||
if self.dev_data is not None: | |||
print("\nIn Epoch:{}/Step:{}, got best dev performance:".format(self.best_dev_epoch, self.best_dev_step) + | |||
self.tester._format_eval_results(self.best_dev_perf),) | |||
print( | |||
"\nIn Epoch:{}/Step:{}, got best dev performance:".format(self.best_dev_epoch, self.best_dev_step) + | |||
self.tester._format_eval_results(self.best_dev_perf), ) | |||
results['best_eval'] = self.best_dev_perf | |||
results['best_epoch'] = self.best_dev_epoch | |||
results['best_step'] = self.best_dev_step | |||
@@ -112,12 +110,12 @@ class ENASTrainer(fastNLP.Trainer): | |||
finally: | |||
pass | |||
results['seconds'] = round(time.time() - start_time, 2) | |||
return results | |||
def _train(self): | |||
if not self.use_tqdm: | |||
from fastNLP.core.utils import _pseudo_tqdm as inner_tqdm | |||
else: | |||
inner_tqdm = tqdm | |||
self.step = 0 | |||
@@ -128,21 +126,21 @@ class ENASTrainer(fastNLP.Trainer): | |||
avg_loss = 0 | |||
data_iterator = Batch(self.train_data, batch_size=self.batch_size, sampler=self.sampler, as_numpy=False, | |||
prefetch=self.prefetch) | |||
for epoch in range(1, self.n_epochs + 1): | |||
pbar.set_description_str(desc="Epoch {}/{}".format(epoch, self.n_epochs)) | |||
last_stage = (epoch > self.n_epochs + 1 - self.final_epochs) | |||
if epoch == self.n_epochs + 1 - self.final_epochs: | |||
print('Entering the final stage. (Only train the selected structure)') | |||
# early stopping | |||
self.callback_manager.on_epoch_begin() | |||
# 1. Training the shared parameters omega of the child models | |||
self.train_shared(pbar) | |||
# 2. Training the controller parameters theta | |||
if not last_stage: | |||
self.train_controller() | |||
if ((self.validate_every > 0 and self.step % self.validate_every == 0) or | |||
(self.validate_every < 0 and self.step % len(data_iterator) == 0)) \ | |||
and self.dev_data is not None: | |||
@@ -151,16 +149,15 @@ class ENASTrainer(fastNLP.Trainer): | |||
eval_res = self._do_validation(epoch=epoch, step=self.step) | |||
eval_str = "Evaluation at Epoch {}/{}. Step:{}/{}. ".format(epoch, self.n_epochs, self.step, | |||
total_steps) + \ | |||
self.tester._format_eval_results(eval_res) | |||
pbar.write(eval_str) | |||
# lr decay; early stopping | |||
self.callback_manager.on_epoch_end() | |||
# =============== epochs end =================== # | |||
pbar.close() | |||
# ============ tqdm end ============== # | |||
def get_loss(self, inputs, targets, hidden, dags): | |||
"""Computes the loss for the same batch for M models. | |||
@@ -169,7 +166,7 @@ class ENASTrainer(fastNLP.Trainer): | |||
""" | |||
if not isinstance(dags, list): | |||
dags = [dags] | |||
loss = 0 | |||
for dag in dags: | |||
self.shared.setDAG(dag) | |||
@@ -177,14 +174,14 @@ class ENASTrainer(fastNLP.Trainer): | |||
inputs['hidden'] = hidden | |||
result = self.shared(**inputs) | |||
output, hidden, extra_out = result['pred'], result['hidden'], result['extra_out'] | |||
self.callback_manager.on_loss_begin(targets, result) | |||
sample_loss = self._compute_loss(result, targets) | |||
loss += sample_loss | |||
assert len(dags) == 1, 'there are multiple `hidden` for multiple `dags`' | |||
return loss, hidden, extra_out | |||
def train_shared(self, pbar=None, max_step=None, dag=None): | |||
"""Train the language model for 400 steps of minibatches of 64 | |||
examples. | |||
@@ -202,9 +199,9 @@ class ENASTrainer(fastNLP.Trainer): | |||
model = self.shared | |||
model.train() | |||
self.controller.eval() | |||
hidden = self.shared.init_hidden(self.batch_size) | |||
abs_max_grad = 0 | |||
abs_max_hidden_norm = 0 | |||
step = 0 | |||
@@ -213,15 +210,15 @@ class ENASTrainer(fastNLP.Trainer): | |||
train_idx = 0 | |||
avg_loss = 0 | |||
data_iterator = Batch(self.train_data, batch_size=self.batch_size, sampler=self.sampler, as_numpy=False, | |||
prefetch=self.prefetch) | |||
for batch_x, batch_y in data_iterator: | |||
_move_dict_value_to_device(batch_x, batch_y, device=self._model_device) | |||
indices = data_iterator.get_batch_indices() | |||
# negative sampling; replace unknown; re-weight batch_y | |||
self.callback_manager.on_batch_begin(batch_x, batch_y, indices) | |||
# prediction = self._data_forward(self.model, batch_x) | |||
dags = self.controller.sample(1) | |||
inputs, targets = batch_x, batch_y | |||
# self.callback_manager.on_loss_begin(batch_y, prediction) | |||
@@ -230,18 +227,18 @@ class ENASTrainer(fastNLP.Trainer): | |||
hidden, | |||
dags) | |||
hidden.detach_() | |||
avg_loss += loss.item() | |||
# Is loss NaN or inf? requires_grad = False | |||
self.callback_manager.on_backward_begin(loss) | |||
self._grad_backward(loss) | |||
self.callback_manager.on_backward_end() | |||
self._update() | |||
self.callback_manager.on_step_end() | |||
if (self.step + 1) % self.print_every == 0: | |||
if self.use_tqdm: | |||
print_output = "loss:{0:<6.5f}".format(avg_loss / self.print_every) | |||
pbar.update(self.print_every) | |||
@@ -257,30 +254,29 @@ class ENASTrainer(fastNLP.Trainer): | |||
self.shared_step += 1 | |||
self.callback_manager.on_batch_end() | |||
# ================= mini-batch end ==================== # | |||
def get_reward(self, dag, entropies, hidden, valid_idx=0): | |||
"""Computes the perplexity of a single sampled model on a minibatch of | |||
validation data. | |||
""" | |||
if not isinstance(entropies, np.ndarray): | |||
entropies = entropies.data.cpu().numpy() | |||
data_iterator = Batch(self.dev_data, batch_size=self.batch_size, sampler=self.sampler, as_numpy=False, | |||
prefetch=self.prefetch) | |||
for inputs, targets in data_iterator: | |||
valid_loss, hidden, _ = self.get_loss(inputs, targets, hidden, dag) | |||
valid_loss = utils.to_item(valid_loss.data) | |||
valid_ppl = math.exp(valid_loss) | |||
R = 80 / valid_ppl | |||
rewards = R + 1e-4 * entropies | |||
return rewards, hidden | |||
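get_reward turns the validation loss into a reward by way of perplexity, R = 80 / exp(valid_loss), then adds a small exploration bonus of 1e-4 per unit of controller entropy. The arithmetic as a standalone sketch (enas_reward is a hypothetical name for illustration):

```python
import math

def enas_reward(valid_loss, entropies):
    """valid_loss: mean cross-entropy on a validation batch.
    Returns per-step rewards R + 1e-4 * entropy, mirroring
    get_reward above (lower perplexity -> higher reward)."""
    valid_ppl = math.exp(valid_loss)
    R = 80.0 / valid_ppl
    return [R + 1e-4 * e for e in entropies]
```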
def train_controller(self): | |||
"""Fixes the shared parameters and updates the controller parameters. | |||
@@ -298,13 +294,13 @@ class ENASTrainer(fastNLP.Trainer): | |||
# Why can't we call shared.eval() here? Leads to loss | |||
# being uniformly zero for the controller. | |||
# self.shared.eval() | |||
avg_reward_base = None | |||
baseline = None | |||
adv_history = [] | |||
entropy_history = [] | |||
reward_history = [] | |||
hidden = self.shared.init_hidden(self.batch_size) | |||
total_loss = 0 | |||
valid_idx = 0 | |||
@@ -312,7 +308,7 @@ class ENASTrainer(fastNLP.Trainer): | |||
# sample models | |||
dags, log_probs, entropies = self.controller.sample( | |||
with_details=True) | |||
# calculate reward | |||
np_entropies = entropies.data.cpu().numpy() | |||
# No gradients should be backpropagated to the | |||
@@ -322,40 +318,39 @@ class ENASTrainer(fastNLP.Trainer): | |||
np_entropies, | |||
hidden, | |||
valid_idx) | |||
reward_history.extend(rewards) | |||
entropy_history.extend(np_entropies) | |||
# moving average baseline | |||
if baseline is None: | |||
baseline = rewards | |||
else: | |||
decay = 0.95 | |||
baseline = decay * baseline + (1 - decay) * rewards | |||
adv = rewards - baseline | |||
adv_history.extend(adv) | |||
# policy loss | |||
loss = -log_probs * utils.get_variable(adv, | |||
'cuda' in self.device, | |||
requires_grad=False) | |||
loss = loss.sum() # or loss.mean() | |||
# update | |||
self.controller_optim.zero_grad() | |||
loss.backward() | |||
self.controller_optim.step() | |||
total_loss += utils.to_item(loss.data) | |||
if ((step % 50) == 0) and (step > 0): | |||
reward_history, adv_history, entropy_history = [], [], [] | |||
total_loss = 0 | |||
self.controller_step += 1 | |||
# prev_valid_idx = valid_idx | |||
# valid_idx = ((valid_idx + self.max_length) % | |||
@@ -364,16 +359,16 @@ class ENASTrainer(fastNLP.Trainer): | |||
# # validation data, we reset the hidden states. | |||
# if prev_valid_idx > valid_idx: | |||
# hidden = self.shared.init_hidden(self.batch_size) | |||
def derive(self, sample_num=10, valid_idx=0): | |||
"""We are always deriving based on the very first batch | |||
of validation data? This seems wrong... | |||
""" | |||
hidden = self.shared.init_hidden(self.batch_size) | |||
dags, _, entropies = self.controller.sample(sample_num, | |||
with_details=True) | |||
max_R = 0 | |||
best_dag = None | |||
for dag in dags: | |||
@@ -381,5 +376,5 @@ class ENASTrainer(fastNLP.Trainer): | |||
if R.max() > max_R: | |||
max_R = R.max() | |||
best_dag = dag | |||
self.model.setDAG(best_dag) |
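train_controller is REINFORCE with an exponential moving-average baseline: baseline ← 0.95·baseline + 0.05·reward, advantage = reward − baseline, and the policy loss is −log_prob · advantage. A sketch of the baseline/advantage bookkeeping (advantage_stream is an illustrative helper, not part of the trainer):

```python
def advantage_stream(rewards, decay=0.95):
    """Returns the advantage of each reward against a moving-average
    baseline, mirroring the baseline update in train_controller.
    The first reward initializes the baseline, so its advantage is 0."""
    baseline = None
    advantages = []
    for r in rewards:
        if baseline is None:
            baseline = r
        else:
            baseline = decay * baseline + (1 - decay) * r
        advantages.append(r - baseline)
    return advantages
```

Subtracting the baseline does not change the expected policy gradient but greatly reduces its variance, which is why the controller trains on advantages rather than raw rewards.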
@@ -1,24 +1,20 @@ | |||
# Code Modified from https://github.com/carpedm20/ENAS-pytorch | |||
from __future__ import print_function | |||
import collections | |||
from datetime import datetime | |||
import os | |||
import json | |||
import numpy as np | |||
import torch | |||
from torch.autograd import Variable | |||
def detach(h): | |||
if isinstance(h, Variable): | |||
return Variable(h.data) | |||
else: | |||
return tuple(detach(v) for v in h) | |||
def get_variable(inputs, cuda=False, **kwargs): | |||
if type(inputs) in [list, np.ndarray]: | |||
inputs = torch.Tensor(inputs) | |||
@@ -28,10 +24,12 @@ def get_variable(inputs, cuda=False, **kwargs): | |||
out = Variable(inputs, **kwargs) | |||
return out | |||
def update_lr(optimizer, lr): | |||
for param_group in optimizer.param_groups: | |||
param_group['lr'] = lr | |||
Node = collections.namedtuple('Node', ['id', 'name']) | |||
@@ -48,9 +46,9 @@ def to_item(x): | |||
"""Converts x, possibly scalar and possibly tensor, to a Python scalar.""" | |||
if isinstance(x, (float, int)): | |||
return x | |||
if float(torch.__version__[0:3]) < 0.4: | |||
assert (x.dim() == 1) and (len(x) == 1) | |||
return x[0] | |||
return x.item() |
@@ -0,0 +1,233 @@ | |||
""" | |||
This module implements two sequence labeling models | |||
""" | |||
__all__ = [ | |||
"SeqLabeling", | |||
"AdvSeqLabel" | |||
] | |||
import torch | |||
import torch.nn as nn | |||
from .base_model import BaseModel | |||
from ..modules import decoder, encoder | |||
from ..modules.decoder.crf import allowed_transitions | |||
from ..core.utils import seq_len_to_mask | |||
from ..core.const import Const as C | |||
class SeqLabeling(BaseModel): | |||
""" | |||
Alias: :class:`fastNLP.models.SeqLabeling` :class:`fastNLP.models.sequence_labeling.SeqLabeling` | |||
A basic sequence labeling model. | |||
Base class for sequence labeling. The structure is one Embedding layer, one LSTM (unidirectional, single layer), one FC layer, and one CRF layer. | |||
:param tuple(int,int),torch.FloatTensor,nn.Embedding,numpy.ndarray init_embed: the embedding to use (a | |||
tuple(int, int) gives (vocab_size, embed_dim); a Tensor, Embedding or ndarray is used to initialize the Embedding directly) | |||
:param int hidden_size: size of the LSTM hidden layer | |||
:param int num_classes: number of target classes | |||
""" | |||
def __init__(self, init_embed, hidden_size, num_classes): | |||
super(SeqLabeling, self).__init__() | |||
self.Embedding = encoder.embedding.Embedding(init_embed) | |||
self.Rnn = encoder.lstm.LSTM(self.Embedding.embedding_dim, hidden_size) | |||
self.Linear = nn.Linear(hidden_size, num_classes) | |||
self.Crf = decoder.crf.ConditionalRandomField(num_classes) | |||
self.mask = None | |||
def forward(self, words, seq_len, target): | |||
""" | |||
:param torch.LongTensor words: [batch_size, max_len], indices of the sequence | |||
:param torch.LongTensor seq_len: [batch_size,], length of each sequence | |||
:param torch.LongTensor target: [batch_size, max_len], target tags of the sequence | |||
:return: {C.LOSS: loss}, where loss is a scalar CRF negative log-likelihood, used in training | |||
""" | |||
assert words.shape[0] == seq_len.shape[0] | |||
assert target.shape == words.shape | |||
self.mask = self._make_mask(words, seq_len) | |||
x = self.Embedding(words) | |||
# [batch_size, max_len, word_emb_dim] | |||
x, _ = self.Rnn(x, seq_len) | |||
# [batch_size, max_len, hidden_size * direction] | |||
x = self.Linear(x) | |||
# [batch_size, max_len, num_classes] | |||
return {C.LOSS: self._internal_loss(x, target)} | |||
def predict(self, words, seq_len): | |||
""" | |||
用于在预测时使用 | |||
:param torch.LongTensor words: [batch_size, max_len] | |||
:param torch.LongTensor seq_len: [batch_size,] | |||
:return: {'pred': xx}, [batch_size, max_len] | |||
""" | |||
self.mask = self._make_mask(words, seq_len) | |||
x = self.Embedding(words) | |||
# [batch_size, max_len, word_emb_dim] | |||
x, _ = self.Rnn(x, seq_len) | |||
# [batch_size, max_len, hidden_size * direction] | |||
x = self.Linear(x) | |||
# [batch_size, max_len, num_classes] | |||
pred = self._decode(x) | |||
return {C.OUTPUT: pred} | |||
def _internal_loss(self, x, y): | |||
""" | |||
Negative log likelihood loss. | |||
:param x: Tensor, [batch_size, max_len, tag_size] | |||
:param y: Tensor, [batch_size, max_len] | |||
:return loss: a scalar Tensor | |||
""" | |||
x = x.float() | |||
y = y.long() | |||
assert x.shape[:2] == y.shape | |||
assert y.shape == self.mask.shape | |||
total_loss = self.Crf(x, y, self.mask) | |||
return torch.mean(total_loss) | |||
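The negative log-likelihood that `_internal_loss` delegates to `self.Crf` can be illustrated with a small pure-Python sketch. This is a hypothetical single-sentence helper, not fastNLP's `ConditionalRandomField` (masking, batching, and start/end transitions are omitted):

```python
import math

def crf_nll(emissions, transitions, tags):
    # Negative log-likelihood of one tag path under a linear-chain CRF:
    # log Z (computed with the forward algorithm) minus the gold path score.
    n = len(emissions[0])
    # score of the gold path: emission scores plus pairwise transition scores
    gold = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        gold += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    # log-space forward scores over all paths
    alpha = list(emissions[0])
    for emit in emissions[1:]:
        alpha = [math.log(sum(math.exp(alpha[i] + transitions[i][j]) for i in range(n))) + emit[j]
                 for j in range(n)]
    log_z = math.log(sum(math.exp(a) for a in alpha))
    return log_z - gold
```

With uniform scores the loss reduces to log of the number of paths, which gives a quick sanity check of the partition computation.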
def _make_mask(self, x, seq_len): | |||
batch_size, max_len = x.size(0), x.size(1) | |||
mask = seq_len_to_mask(seq_len) | |||
mask = mask.view(batch_size, max_len) | |||
mask = mask.to(x).float() | |||
return mask | |||
def _decode(self, x): | |||
""" | |||
:param torch.FloatTensor x: [batch_size, max_len, tag_size] | |||
:return prediction: [batch_size, max_len] | |||
""" | |||
tag_seq, _ = self.Crf.viterbi_decode(x, self.mask) | |||
return tag_seq | |||
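`_make_mask` builds the padding mask from sequence lengths via `seq_len_to_mask`. A pure-Python sketch of that behavior, operating on lists rather than tensors (illustrative, not the fastNLP utility itself):

```python
def seq_len_to_mask(seq_len, max_len=None):
    # 1 for real tokens, 0 for padding; one row per sequence in the batch
    max_len = max_len if max_len is not None else max(seq_len)
    return [[1 if pos < length else 0 for pos in range(max_len)]
            for length in seq_len]
```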
class AdvSeqLabel(nn.Module): | |||
""" | |||
别名::class:`fastNLP.models.AdvSeqLabel` :class:`fastNLP.models.sequence_labeling.AdvSeqLabel` | |||
更复杂的Sequence Labelling模型。结构为Embedding, LayerNorm, 双向LSTM(两层),FC,LayerNorm,DropOut,FC,CRF。 | |||
:param tuple(int,int),torch.FloatTensor,nn.Embedding,numpy.ndarray init_embed: Embedding的大小(传入tuple(int, int), | |||
第一个int为vocab_zie, 第二个int为embed_dim); 如果为Tensor, Embedding, ndarray等则直接使用该值初始化Embedding | |||
:param int hidden_size: LSTM的隐层大小 | |||
:param int num_classes: 有多少个类 | |||
:param float dropout: LSTM中以及DropOut层的drop概率 | |||
:param dict id2words: tag id转为其tag word的表。用于在CRF解码时防止解出非法的顺序,比如'BMES'这个标签规范中,'S' | |||
不能出现在'B'之后。这里也支持类似与'B-NN',即'-'前为标签类型的指示,后面为具体的tag的情况。这里不但会保证 | |||
'B-NN'后面不为'S-NN'还会保证'B-NN'后面不会出现'M-xx'(任何非'M-NN'和'E-NN'的情况。) | |||
:param str encoding_type: 支持"BIO", "BMES", "BEMSO", 只有在id2words不为None的情况有用。 | |||
""" | |||
def __init__(self, init_embed, hidden_size, num_classes, dropout=0.3, id2words=None, encoding_type='bmes'): | |||
super().__init__() | |||
self.Embedding = encoder.embedding.Embedding(init_embed) | |||
self.norm1 = torch.nn.LayerNorm(self.Embedding.embedding_dim) | |||
self.Rnn = encoder.LSTM(input_size=self.Embedding.embedding_dim, hidden_size=hidden_size, num_layers=2, | |||
dropout=dropout, | |||
bidirectional=True, batch_first=True) | |||
self.Linear1 = nn.Linear(hidden_size * 2, hidden_size * 2 // 3) | |||
self.norm2 = torch.nn.LayerNorm(hidden_size * 2 // 3) | |||
self.relu = torch.nn.LeakyReLU() | |||
self.drop = torch.nn.Dropout(dropout) | |||
self.Linear2 = nn.Linear(hidden_size * 2 // 3, num_classes) | |||
if id2words is None: | |||
self.Crf = decoder.crf.ConditionalRandomField(num_classes, include_start_end_trans=False) | |||
else: | |||
self.Crf = decoder.crf.ConditionalRandomField(num_classes, include_start_end_trans=False, | |||
allowed_transitions=allowed_transitions(id2words, | |||
encoding_type=encoding_type)) | |||
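The constraint that `id2words` enables can be re-derived for the BMES scheme in a few lines. This is an illustrative sketch of the rule, not fastNLP's `allowed_transitions()` (start/end transitions and the other encodings are ignored here):

```python
def is_transition_allowed(from_tag, to_tag):
    # BMES rule: tags look like 'B-NN'; the part before '-' is the position
    # indicator, the part after it is the label type (may be empty, e.g. 'S').
    f_pos, _, f_type = from_tag.partition('-')
    t_pos, _, t_type = to_tag.partition('-')
    if f_pos in ('B', 'M'):
        # inside a span: may only continue that same span
        return t_pos in ('M', 'E') and f_type == t_type
    # after 'E' or 'S', a new span ('B') or a singleton ('S') of any type may start
    return t_pos in ('B', 'S')
```

This is exactly the behavior described in the docstring above: after 'B-NN', only 'M-NN' or 'E-NN' is legal.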
def _decode(self, x): | |||
""" | |||
:param torch.FloatTensor x: [batch_size, max_len, tag_size] | |||
:return torch.LongTensor, [batch_size, max_len] | |||
""" | |||
tag_seq, _ = self.Crf.viterbi_decode(x, self.mask) | |||
return tag_seq | |||
def _internal_loss(self, x, y): | |||
""" | |||
Negative log likelihood loss. | |||
:param x: Tensor, [batch_size, max_len, tag_size] | |||
:param y: Tensor, [batch_size, max_len] | |||
:return loss: a scalar Tensor | |||
""" | |||
x = x.float() | |||
y = y.long() | |||
assert x.shape[:2] == y.shape | |||
assert y.shape == self.mask.shape | |||
total_loss = self.Crf(x, y, self.mask) | |||
return torch.mean(total_loss) | |||
def _make_mask(self, x, seq_len): | |||
batch_size, max_len = x.size(0), x.size(1) | |||
mask = seq_len_to_mask(seq_len) | |||
mask = mask.view(batch_size, max_len) | |||
mask = mask.to(x).float() | |||
return mask | |||
def _forward(self, words, seq_len, target=None): | |||
""" | |||
:param torch.LongTensor words: [batch_size, mex_len] | |||
:param torch.LongTensor seq_len:[batch_size, ] | |||
:param torch.LongTensor target: [batch_size, max_len] | |||
:return y: If truth is None, return list of [decode path(list)]. Used in testing and predicting. | |||
If truth is not None, return loss, a scalar. Used in training. | |||
""" | |||
words = words.long() | |||
seq_len = seq_len.long() | |||
self.mask = self._make_mask(words, seq_len) | |||
# seq_len = seq_len.long() | |||
target = target.long() if target is not None else None | |||
if next(self.parameters()).is_cuda: | |||
words = words.cuda() | |||
self.mask = self.mask.cuda() | |||
x = self.Embedding(words) | |||
x = self.norm1(x) | |||
# [batch_size, max_len, word_emb_dim] | |||
x, _ = self.Rnn(x, seq_len=seq_len) | |||
x = self.Linear1(x) | |||
x = self.norm2(x) | |||
x = self.relu(x) | |||
x = self.drop(x) | |||
x = self.Linear2(x) | |||
if target is not None: | |||
return {"loss": self._internal_loss(x, target)} | |||
else: | |||
return {"pred": self._decode(x)} | |||
def forward(self, words, seq_len, target): | |||
""" | |||
:param torch.LongTensor words: [batch_size, mex_len] | |||
:param torch.LongTensor seq_len: [batch_size, ] | |||
:param torch.LongTensor target: [batch_size, max_len], 目标 | |||
:return torch.Tensor: a scalar loss | |||
""" | |||
return self._forward(words, seq_len, target) | |||
def predict(self, words, seq_len): | |||
""" | |||
:param torch.LongTensor words: [batch_size, mex_len] | |||
:param torch.LongTensor seq_len: [batch_size, ] | |||
:return torch.LongTensor: [batch_size, max_len] | |||
""" | |||
return self._forward(words, seq_len) |
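Both models decode with `Crf.viterbi_decode`. The dynamic program behind it can be sketched in pure Python; this is a simplified single-sentence version (no masking, batching, or start/end transitions), not fastNLP's implementation:

```python
def viterbi_decode(emissions, transitions):
    # emissions: [seq_len][n_tags] local scores; transitions: [n_tags][n_tags]
    # score of moving from tag i to tag j. Returns the best-scoring tag path.
    n_tags = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, ptrs = [], []
        for j in range(n_tags):
            # best previous tag for ending at tag j now
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            ptrs.append(best_i)
        score = new_score
        backpointers.append(ptrs)
    # follow backpointers from the best final tag
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for ptrs in reversed(backpointers):
        path.append(ptrs[path[-1]])
    return path[::-1]
```

With zero transition scores the decoder reduces to a per-step argmax; strong transition scores can override the emissions, which is what the `allowed_transitions` constraints (as `-inf` transitions) rely on.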
@@ -1,225 +0,0 @@ | |||
import torch | |||
from fastNLP.models.base_model import BaseModel | |||
from fastNLP.modules import decoder, encoder | |||
from fastNLP.modules.decoder.CRF import allowed_transitions | |||
from fastNLP.modules.utils import seq_mask | |||
class SeqLabeling(BaseModel): | |||
""" | |||
PyTorch Network for sequence labeling | |||
""" | |||
def __init__(self, args): | |||
super(SeqLabeling, self).__init__() | |||
vocab_size = args["vocab_size"] | |||
word_emb_dim = args["word_emb_dim"] | |||
hidden_dim = args["rnn_hidden_units"] | |||
num_classes = args["num_classes"] | |||
self.Embedding = encoder.embedding.Embedding(vocab_size, word_emb_dim) | |||
self.Rnn = encoder.lstm.LSTM(word_emb_dim, hidden_dim) | |||
self.Linear = encoder.linear.Linear(hidden_dim, num_classes) | |||
self.Crf = decoder.CRF.ConditionalRandomField(num_classes) | |||
self.mask = None | |||
def forward(self, word_seq, word_seq_origin_len, truth=None): | |||
""" | |||
:param word_seq: LongTensor, [batch_size, max_len]
:param word_seq_origin_len: LongTensor, [batch_size,], the origin lengths of the sequences. | |||
:param truth: LongTensor, [batch_size, max_len] | |||
:return y: If truth is None, return list of [decode path(list)]. Used in testing and predicting. | |||
If truth is not None, return loss, a scalar. Used in training. | |||
""" | |||
assert word_seq.shape[0] == word_seq_origin_len.shape[0] | |||
if truth is not None: | |||
assert truth.shape == word_seq.shape | |||
self.mask = self.make_mask(word_seq, word_seq_origin_len) | |||
x = self.Embedding(word_seq) | |||
# [batch_size, max_len, word_emb_dim] | |||
x = self.Rnn(x) | |||
# [batch_size, max_len, hidden_size * direction] | |||
x = self.Linear(x) | |||
# [batch_size, max_len, num_classes] | |||
return {"loss": self._internal_loss(x, truth) if truth is not None else None, | |||
"predict": self.decode(x)} | |||
def loss(self, x, y): | |||
""" Since the loss has been computed in forward(), this function simply returns x.""" | |||
return x | |||
def _internal_loss(self, x, y): | |||
""" | |||
Negative log likelihood loss. | |||
:param x: Tensor, [batch_size, max_len, tag_size] | |||
:param y: Tensor, [batch_size, max_len] | |||
:return loss: a scalar Tensor | |||
""" | |||
x = x.float() | |||
y = y.long() | |||
assert x.shape[:2] == y.shape | |||
assert y.shape == self.mask.shape | |||
total_loss = self.Crf(x, y, self.mask) | |||
return torch.mean(total_loss) | |||
def make_mask(self, x, seq_len): | |||
batch_size, max_len = x.size(0), x.size(1) | |||
mask = seq_mask(seq_len, max_len) | |||
mask = mask.view(batch_size, max_len) | |||
mask = mask.to(x).float() | |||
return mask | |||
def decode(self, x, pad=True): | |||
""" | |||
:param x: FloatTensor, [batch_size, max_len, tag_size] | |||
:param pad: pad the output sequence to equal lengths | |||
:return prediction: list of [decode path(list)] | |||
""" | |||
max_len = x.shape[1] | |||
tag_seq = self.Crf.viterbi_decode(x, self.mask) | |||
# pad prediction to equal length | |||
if pad is True: | |||
for pred in tag_seq: | |||
if len(pred) < max_len: | |||
pred += [0] * (max_len - len(pred)) | |||
return tag_seq | |||
class AdvSeqLabel(SeqLabeling): | |||
""" | |||
Advanced Sequence Labeling Model | |||
""" | |||
def __init__(self, args, emb=None, id2words=None): | |||
super(AdvSeqLabel, self).__init__(args) | |||
vocab_size = args["vocab_size"] | |||
word_emb_dim = args["word_emb_dim"] | |||
hidden_dim = args["rnn_hidden_units"] | |||
num_classes = args["num_classes"] | |||
dropout = args['dropout'] | |||
self.Embedding = encoder.embedding.Embedding(vocab_size, word_emb_dim, init_emb=emb) | |||
self.norm1 = torch.nn.LayerNorm(word_emb_dim) | |||
# self.Rnn = encoder.lstm.LSTM(word_emb_dim, hidden_dim, num_layers=2, dropout=dropout, bidirectional=True) | |||
self.Rnn = torch.nn.LSTM(input_size=word_emb_dim, hidden_size=hidden_dim, num_layers=2, dropout=dropout, | |||
bidirectional=True, batch_first=True) | |||
self.Linear1 = encoder.Linear(hidden_dim * 2, hidden_dim * 2 // 3) | |||
self.norm2 = torch.nn.LayerNorm(hidden_dim * 2 // 3) | |||
# self.batch_norm = torch.nn.BatchNorm1d(hidden_dim * 2 // 3) | |||
self.relu = torch.nn.LeakyReLU() | |||
self.drop = torch.nn.Dropout(dropout) | |||
self.Linear2 = encoder.Linear(hidden_dim * 2 // 3, num_classes) | |||
if id2words is None: | |||
self.Crf = decoder.CRF.ConditionalRandomField(num_classes, include_start_end_trans=False) | |||
else: | |||
self.Crf = decoder.CRF.ConditionalRandomField(num_classes, include_start_end_trans=False, | |||
allowed_transitions=allowed_transitions(id2words, | |||
encoding_type="bmes")) | |||
def forward(self, word_seq, word_seq_origin_len, truth=None): | |||
""" | |||
:param word_seq: LongTensor, [batch_size, max_len]
:param word_seq_origin_len: LongTensor, [batch_size, ] | |||
:param truth: LongTensor, [batch_size, max_len] | |||
:return y: If truth is None, return list of [decode path(list)]. Used in testing and predicting. | |||
If truth is not None, return loss, a scalar. Used in training. | |||
""" | |||
word_seq = word_seq.long() | |||
word_seq_origin_len = word_seq_origin_len.long() | |||
self.mask = self.make_mask(word_seq, word_seq_origin_len) | |||
sent_len, idx_sort = torch.sort(word_seq_origin_len, descending=True) | |||
_, idx_unsort = torch.sort(idx_sort, descending=False) | |||
# word_seq_origin_len = word_seq_origin_len.long() | |||
truth = truth.long() if truth is not None else None | |||
batch_size = word_seq.size(0) | |||
max_len = word_seq.size(1) | |||
if next(self.parameters()).is_cuda: | |||
word_seq = word_seq.cuda() | |||
idx_sort = idx_sort.cuda() | |||
idx_unsort = idx_unsort.cuda() | |||
self.mask = self.mask.cuda() | |||
x = self.Embedding(word_seq) | |||
x = self.norm1(x) | |||
# [batch_size, max_len, word_emb_dim] | |||
sent_variable = x[idx_sort] | |||
sent_packed = torch.nn.utils.rnn.pack_padded_sequence(sent_variable, sent_len, batch_first=True) | |||
x, _ = self.Rnn(sent_packed) | |||
# print(x) | |||
# [batch_size, max_len, hidden_size * direction] | |||
sent_output = torch.nn.utils.rnn.pad_packed_sequence(x, batch_first=True)[0] | |||
x = sent_output[idx_unsort] | |||
x = x.contiguous() | |||
# x = x.view(batch_size * max_len, -1) | |||
x = self.Linear1(x) | |||
# x = self.batch_norm(x) | |||
x = self.norm2(x) | |||
x = self.relu(x) | |||
x = self.drop(x) | |||
x = self.Linear2(x) | |||
# x = x.view(batch_size, max_len, -1) | |||
# [batch_size, max_len, num_classes] | |||
# TODO using this key for seq_lens is not reasonable
return {"loss": self._internal_loss(x, truth) if truth is not None else None, | |||
"predict": self.decode(x), | |||
'word_seq_origin_len': word_seq_origin_len} | |||
def predict(self, **x): | |||
out = self.forward(**x) | |||
return {"predict": out["predict"]} | |||
def loss(self, **kwargs): | |||
assert 'loss' in kwargs | |||
return kwargs['loss'] | |||
if __name__ == '__main__':
args = {
'vocab_size': 20,
'word_emb_dim': 100,
'rnn_hidden_units': 100,
'num_classes': 10,
'dropout': 0.3,  # AdvSeqLabel.__init__ reads args['dropout']
}
model = AdvSeqLabel(args) | |||
data = [] | |||
for i in range(20): | |||
word_seq = torch.randint(20, (15,)).long() | |||
word_seq_len = torch.LongTensor([15]) | |||
truth = torch.randint(10, (15,)).long() | |||
data.append((word_seq, word_seq_len, truth)) | |||
optimizer = torch.optim.Adam(model.parameters(), lr=0.01) | |||
print(model) | |||
curidx = 0 | |||
for i in range(1000): | |||
endidx = min(len(data), curidx + 5) | |||
b_word, b_len, b_truth = [], [], [] | |||
for word_seq, word_seq_len, truth in data[curidx: endidx]: | |||
b_word.append(word_seq) | |||
b_len.append(word_seq_len) | |||
b_truth.append(truth) | |||
word_seq = torch.stack(b_word, dim=0) | |||
word_seq_len = torch.cat(b_len, dim=0) | |||
truth = torch.stack(b_truth, dim=0) | |||
res = model(word_seq, word_seq_len, truth) | |||
loss = res['loss'] | |||
pred = res['predict'] | |||
print('loss: {} acc {}'.format(loss.item(), | |||
((pred.data == truth).long().sum().float() / word_seq_len.sum().float()))) | |||
optimizer.zero_grad() | |||
loss.backward() | |||
optimizer.step() | |||
curidx = endidx | |||
if curidx == len(data): | |||
curidx = 0 |
@@ -1,114 +1,152 @@ | |||
__all__ = [ | |||
"ESIM" | |||
] | |||
import torch | |||
import torch.nn as nn | |||
import torch.nn.functional as F | |||
from .base_model import BaseModel | |||
from ..core.const import Const | |||
from ..modules import decoder as Decoder | |||
from ..modules import encoder as Encoder | |||
from ..modules import aggregator as Aggregator | |||
from ..core.utils import seq_len_to_mask | |||
my_inf = 10e12 | |||
class ESIM(BaseModel):
"""
Alias: :class:`fastNLP.models.ESIM` :class:`fastNLP.models.snli.ESIM`
A PyTorch implementation of the ESIM model.
ESIM paper: Enhanced LSTM for Natural Language Inference (arXiv: 1609.06038)
:param int vocab_size: vocabulary size
:param int embed_dim: dimension of the word embeddings
:param int hidden_size: hidden size of the LSTM
:param float dropout: dropout probability, 0 by default
:param int num_classes: number of labels, 3 by default
:param numpy.array init_embedding: initial embedding matrix of shape (vocab_size, embed_dim); None by default, i.e. the embedding matrix is randomly initialized
"""
def __init__(self, vocab_size, embed_dim, hidden_size, dropout=0.0, num_classes=3, init_embedding=None):
super(ESIM, self).__init__()
self.vocab_size = vocab_size
self.embed_dim = embed_dim
self.hidden_size = hidden_size
self.dropout = dropout
self.n_labels = num_classes
self.drop = nn.Dropout(self.dropout)
self.embedding = Encoder.Embedding(
(self.vocab_size, self.embed_dim), dropout=self.dropout,
)
self.embedding_layer = nn.Linear(self.embed_dim, self.hidden_size)
self.encoder = Encoder.LSTM(
input_size=self.embed_dim, hidden_size=self.hidden_size, num_layers=1, bias=True,
batch_first=True, bidirectional=True
)
self.bi_attention = Aggregator.BiAttention()
self.mean_pooling = Aggregator.AvgPoolWithMask()
self.max_pooling = Aggregator.MaxPoolWithMask()
self.inference_layer = nn.Linear(self.hidden_size * 4, self.hidden_size)
self.decoder = Encoder.LSTM(
input_size=self.hidden_size, hidden_size=self.hidden_size, num_layers=1, bias=True,
batch_first=True, bidirectional=True
)
self.output = Decoder.MLP([4 * self.hidden_size, self.hidden_size, self.n_labels], 'tanh', dropout=self.dropout)
def forward(self, words1, words2, seq_len1=None, seq_len2=None, target=None):
""" Forward function
:param torch.Tensor words1: [batch size(B), premise seq len(PL)] token indices of the premise
:param torch.Tensor words2: [B, hypothesis seq len(HL)] token indices of the hypothesis
:param torch.LongTensor seq_len1: [B] lengths of the premises
:param torch.LongTensor seq_len2: [B] lengths of the hypotheses
:param torch.LongTensor target: [B] ground-truth labels
:return: dict prediction: [B, n_labels(N)] predicted results
"""
premise0 = self.embedding_layer(self.embedding(words1))
hypothesis0 = self.embedding_layer(self.embedding(words2))
if seq_len1 is not None:
seq_len1 = seq_len_to_mask(seq_len1)
else:
seq_len1 = torch.ones(premise0.size(0), premise0.size(1))
seq_len1 = (seq_len1.long()).to(device=premise0.device)
if seq_len2 is not None:
seq_len2 = seq_len_to_mask(seq_len2)
else:
seq_len2 = torch.ones(hypothesis0.size(0), hypothesis0.size(1))
seq_len2 = (seq_len2.long()).to(device=hypothesis0.device)
_BP, _PSL, _HP = premise0.size()
_BH, _HSL, _HH = hypothesis0.size()
_BPL, _PLL = seq_len1.size()
_HPL, _HLL = seq_len2.size()
assert _BP == _BH and _BPL == _HPL and _BP == _BPL
assert _HP == _HH
assert _PSL == _PLL and _HSL == _HLL
B, PL, H = premise0.size()
B, HL, H = hypothesis0.size()
a0 = self.encoder(self.drop(premise0))  # a0: [B, PL, H * 2]
b0 = self.encoder(self.drop(hypothesis0))  # b0: [B, HL, H * 2]
a = torch.mean(a0.view(B, PL, -1, H), dim=2)  # a: [B, PL, H]
b = torch.mean(b0.view(B, HL, -1, H), dim=2)  # b: [B, HL, H]
ai, bi = self.bi_attention(a, b, seq_len1, seq_len2)
ma = torch.cat((a, ai, a - ai, a * ai), dim=2)  # ma: [B, PL, 4 * H]
mb = torch.cat((b, bi, b - bi, b * bi), dim=2)  # mb: [B, HL, 4 * H]
f_ma = self.inference_layer(ma)
f_mb = self.inference_layer(mb)
vat = self.decoder(self.drop(f_ma))
vbt = self.decoder(self.drop(f_mb))
va = torch.mean(vat.view(B, PL, -1, H), dim=2)  # va: [B, PL, H]
vb = torch.mean(vbt.view(B, HL, -1, H), dim=2)  # vb: [B, HL, H]
va_ave = self.mean_pooling(va, seq_len1, dim=1)  # va_ave: [B, H]
va_max, va_arg_max = self.max_pooling(va, seq_len1, dim=1)  # va_max: [B, H]
vb_ave = self.mean_pooling(vb, seq_len2, dim=1)  # vb_ave: [B, H]
vb_max, vb_arg_max = self.max_pooling(vb, seq_len2, dim=1)  # vb_max: [B, H]
v = torch.cat((va_ave, va_max, vb_ave, vb_max), dim=1)  # v: [B, 4 * H]
prediction = torch.tanh(self.output(v))  # prediction: [B, N]
if target is not None:
func = nn.CrossEntropyLoss()
loss = func(prediction, target)
return {Const.OUTPUT: prediction, Const.LOSS: loss}
return {Const.OUTPUT: prediction}
def predict(self, words1, words2, seq_len1=None, seq_len2=None, target=None):
""" Predict function
:param torch.Tensor words1: [batch size(B), premise seq len(PL)] token indices of the premise
:param torch.Tensor words2: [B, hypothesis seq len(HL)] token indices of the hypothesis
:param torch.LongTensor seq_len1: [B] lengths of the premises
:param torch.LongTensor seq_len2: [B] lengths of the hypotheses
:param torch.LongTensor target: [B] ground-truth labels
:return: dict prediction: [B] predicted label indices
"""
prediction = self.forward(words1, words2, seq_len1, seq_len2)[Const.OUTPUT]
return {Const.OUTPUT: torch.argmax(prediction, dim=-1)}
@@ -0,0 +1,307 @@ | |||
""" | |||
Star-Transformer 的 Pytorch 实现。 | |||
""" | |||
__all__ = [ | |||
"StarTransEnc", | |||
"STNLICls", | |||
"STSeqCls", | |||
"STSeqLabel", | |||
] | |||
import torch | |||
from torch import nn | |||
from ..modules.encoder.star_transformer import StarTransformer | |||
from ..core.utils import seq_len_to_mask | |||
from ..modules.utils import get_embeddings | |||
from ..core.const import Const | |||
class StarTransEnc(nn.Module): | |||
""" | |||
别名::class:`fastNLP.models.StarTransEnc` :class:`fastNLP.models.star_transformer.StarTransEnc` | |||
带word embedding的Star-Transformer Encoder | |||
:param init_embed: 单词词典, 可以是 tuple, 包括(num_embedings, embedding_dim), 即 | |||
embedding的大小和每个词的维度. 也可以传入 nn.Embedding 对象, | |||
此时就以传入的对象作为embedding | |||
:param hidden_size: 模型中特征维度. | |||
:param num_layers: 模型层数. | |||
:param num_head: 模型中multi-head的head个数. | |||
:param head_dim: 模型中multi-head中每个head特征维度. | |||
:param max_len: 模型能接受的最大输入长度. | |||
:param emb_dropout: 词嵌入的dropout概率. | |||
:param dropout: 模型除词嵌入外的dropout概率. | |||
""" | |||
def __init__(self, init_embed, | |||
hidden_size, | |||
num_layers, | |||
num_head, | |||
head_dim, | |||
max_len, | |||
emb_dropout, | |||
dropout): | |||
super(StarTransEnc, self).__init__() | |||
self.embedding = get_embeddings(init_embed) | |||
emb_dim = self.embedding.embedding_dim | |||
self.emb_fc = nn.Linear(emb_dim, hidden_size) | |||
self.emb_drop = nn.Dropout(emb_dropout) | |||
self.encoder = StarTransformer(hidden_size=hidden_size, | |||
num_layers=num_layers, | |||
num_head=num_head, | |||
head_dim=head_dim, | |||
dropout=dropout, | |||
max_len=max_len) | |||
def forward(self, x, mask): | |||
""" | |||
:param FloatTensor x: [batch, length, hidden] 输入的序列 | |||
:param ByteTensor mask: [batch, length] 输入序列的padding mask, 在没有内容(padding 部分) 为 0, | |||
否则为 1 | |||
:return: [batch, length, hidden] 编码后的输出序列 | |||
[batch, hidden] 全局 relay 节点, 详见论文 | |||
""" | |||
x = self.embedding(x) | |||
x = self.emb_fc(self.emb_drop(x)) | |||
nodes, relay = self.encoder(x, mask) | |||
return nodes, relay | |||
class _Cls(nn.Module): | |||
def __init__(self, in_dim, num_cls, hid_dim, dropout=0.1): | |||
super(_Cls, self).__init__() | |||
self.fc = nn.Sequential( | |||
nn.Linear(in_dim, hid_dim), | |||
nn.LeakyReLU(), | |||
nn.Dropout(dropout), | |||
nn.Linear(hid_dim, num_cls), | |||
) | |||
def forward(self, x): | |||
h = self.fc(x) | |||
return h | |||
class _NLICls(nn.Module): | |||
def __init__(self, in_dim, num_cls, hid_dim, dropout=0.1): | |||
super(_NLICls, self).__init__() | |||
self.fc = nn.Sequential( | |||
nn.Dropout(dropout), | |||
nn.Linear(in_dim * 4, hid_dim), # 4 | |||
nn.LeakyReLU(), | |||
nn.Dropout(dropout), | |||
nn.Linear(hid_dim, num_cls), | |||
) | |||
def forward(self, x1, x2): | |||
x = torch.cat([x1, x2, torch.abs(x1 - x2), x1 * x2], 1) | |||
h = self.fc(x) | |||
return h | |||
class STSeqLabel(nn.Module): | |||
""" | |||
别名::class:`fastNLP.models.STSeqLabel` :class:`fastNLP.models.star_transformer.STSeqLabel` | |||
用于序列标注的Star-Transformer模型 | |||
:param init_embed: 单词词典, 可以是 tuple, 包括(num_embedings, embedding_dim), 即 | |||
embedding的大小和每个词的维度. 也可以传入 nn.Embedding 对象, | |||
此时就以传入的对象作为embedding | |||
:param num_cls: 输出类别个数 | |||
:param hidden_size: 模型中特征维度. Default: 300 | |||
:param num_layers: 模型层数. Default: 4 | |||
:param num_head: 模型中multi-head的head个数. Default: 8 | |||
:param head_dim: 模型中multi-head中每个head特征维度. Default: 32 | |||
:param max_len: 模型能接受的最大输入长度. Default: 512 | |||
:param cls_hidden_size: 分类器隐层维度. Default: 600 | |||
:param emb_dropout: 词嵌入的dropout概率. Default: 0.1 | |||
:param dropout: 模型除词嵌入外的dropout概率. Default: 0.1 | |||
""" | |||
def __init__(self, init_embed, num_cls, | |||
hidden_size=300, | |||
num_layers=4, | |||
num_head=8, | |||
head_dim=32, | |||
max_len=512, | |||
cls_hidden_size=600, | |||
emb_dropout=0.1, | |||
dropout=0.1, ): | |||
super(STSeqLabel, self).__init__() | |||
self.enc = StarTransEnc(init_embed=init_embed, | |||
hidden_size=hidden_size, | |||
num_layers=num_layers, | |||
num_head=num_head, | |||
head_dim=head_dim, | |||
max_len=max_len, | |||
emb_dropout=emb_dropout, | |||
dropout=dropout) | |||
self.cls = _Cls(hidden_size, num_cls, cls_hidden_size) | |||
def forward(self, words, seq_len): | |||
""" | |||
:param words: [batch, seq_len] 输入序列 | |||
:param seq_len: [batch,] 输入序列的长度 | |||
:return output: [batch, num_cls, seq_len] 输出序列中每个元素的分类的概率 | |||
""" | |||
mask = seq_len_to_mask(seq_len) | |||
nodes, _ = self.enc(words, mask) | |||
output = self.cls(nodes) | |||
output = output.transpose(1, 2) # make hidden to be dim 1 | |||
return {Const.OUTPUT: output} # [bsz, n_cls, seq_len] | |||
def predict(self, words, seq_len): | |||
""" | |||
:param words: [batch, seq_len] 输入序列 | |||
:param seq_len: [batch,] 输入序列的长度 | |||
:return output: [batch, seq_len] 输出序列中每个元素的分类 | |||
""" | |||
y = self.forward(words, seq_len) | |||
_, pred = y[Const.OUTPUT].max(1) | |||
return {Const.OUTPUT: pred} | |||
class STSeqCls(nn.Module): | |||
""" | |||
别名::class:`fastNLP.models.STSeqCls` :class:`fastNLP.models.star_transformer.STSeqCls` | |||
用于分类任务的Star-Transformer | |||
:param init_embed: 单词词典, 可以是 tuple, 包括(num_embedings, embedding_dim), 即 | |||
embedding的大小和每个词的维度. 也可以传入 nn.Embedding 对象, | |||
此时就以传入的对象作为embedding | |||
:param num_cls: 输出类别个数 | |||
:param hidden_size: 模型中特征维度. Default: 300 | |||
:param num_layers: 模型层数. Default: 4 | |||
:param num_head: 模型中multi-head的head个数. Default: 8 | |||
:param head_dim: 模型中multi-head中每个head特征维度. Default: 32 | |||
:param max_len: 模型能接受的最大输入长度. Default: 512 | |||
:param cls_hidden_size: 分类器隐层维度. Default: 600 | |||
:param emb_dropout: 词嵌入的dropout概率. Default: 0.1 | |||
:param dropout: 模型除词嵌入外的dropout概率. Default: 0.1 | |||
""" | |||
def __init__(self, init_embed, num_cls, | |||
hidden_size=300, | |||
num_layers=4, | |||
num_head=8, | |||
head_dim=32, | |||
max_len=512, | |||
cls_hidden_size=600, | |||
emb_dropout=0.1, | |||
dropout=0.1, ): | |||
super(STSeqCls, self).__init__() | |||
self.enc = StarTransEnc(init_embed=init_embed, | |||
hidden_size=hidden_size, | |||
num_layers=num_layers, | |||
num_head=num_head, | |||
head_dim=head_dim, | |||
max_len=max_len, | |||
emb_dropout=emb_dropout, | |||
dropout=dropout) | |||
self.cls = _Cls(hidden_size, num_cls, cls_hidden_size) | |||
def forward(self, words, seq_len): | |||
""" | |||
:param words: [batch, seq_len] 输入序列 | |||
:param seq_len: [batch,] 输入序列的长度 | |||
:return output: [batch, num_cls] 输出序列的分类的概率 | |||
""" | |||
mask = seq_len_to_mask(seq_len) | |||
nodes, relay = self.enc(words, mask) | |||
y = 0.5 * (relay + nodes.max(1)[0]) | |||
output = self.cls(y) # [bsz, n_cls] | |||
return {Const.OUTPUT: output} | |||
def predict(self, words, seq_len): | |||
""" | |||
:param words: [batch, seq_len] 输入序列 | |||
:param seq_len: [batch,] 输入序列的长度 | |||
:return output: [batch, num_cls] 输出序列的分类 | |||
""" | |||
y = self.forward(words, seq_len) | |||
_, pred = y[Const.OUTPUT].max(1) | |||
return {Const.OUTPUT: pred} | |||
class STNLICls(nn.Module): | |||
""" | |||
别名::class:`fastNLP.models.STNLICls` :class:`fastNLP.models.star_transformer.STNLICls` | |||
用于自然语言推断(NLI)的Star-Transformer | |||
:param init_embed: 单词词典, 可以是 tuple, 包括(num_embedings, embedding_dim), 即 | |||
embedding的大小和每个词的维度. 也可以传入 nn.Embedding 对象, | |||
此时就以传入的对象作为embedding | |||
:param num_cls: 输出类别个数 | |||
:param hidden_size: 模型中特征维度. Default: 300 | |||
:param num_layers: 模型层数. Default: 4 | |||
:param num_head: 模型中multi-head的head个数. Default: 8 | |||
:param head_dim: 模型中multi-head中每个head特征维度. Default: 32 | |||
:param max_len: 模型能接受的最大输入长度. Default: 512 | |||
:param cls_hidden_size: 分类器隐层维度. Default: 600 | |||
:param emb_dropout: 词嵌入的dropout概率. Default: 0.1 | |||
:param dropout: 模型除词嵌入外的dropout概率. Default: 0.1 | |||
""" | |||
def __init__(self, init_embed, num_cls, | |||
hidden_size=300, | |||
num_layers=4, | |||
num_head=8, | |||
head_dim=32, | |||
max_len=512, | |||
cls_hidden_size=600, | |||
emb_dropout=0.1, | |||
dropout=0.1, ): | |||
super(STNLICls, self).__init__() | |||
self.enc = StarTransEnc(init_embed=init_embed, | |||
hidden_size=hidden_size, | |||
num_layers=num_layers, | |||
num_head=num_head, | |||
head_dim=head_dim, | |||
max_len=max_len, | |||
emb_dropout=emb_dropout, | |||
dropout=dropout) | |||
self.cls = _NLICls(hidden_size, num_cls, cls_hidden_size) | |||
def forward(self, words1, words2, seq_len1, seq_len2): | |||
""" | |||
:param words1: [batch, seq_len] 输入序列1 | |||
:param words2: [batch, seq_len] 输入序列2 | |||
:param seq_len1: [batch,] 输入序列1的长度 | |||
:param seq_len2: [batch,] 输入序列2的长度 | |||
:return output: [batch, num_cls] 输出分类的概率 | |||
""" | |||
mask1 = seq_len_to_mask(seq_len1) | |||
mask2 = seq_len_to_mask(seq_len2) | |||
def enc(seq, mask): | |||
nodes, relay = self.enc(seq, mask) | |||
return 0.5 * (relay + nodes.max(1)[0]) | |||
y1 = enc(words1, mask1) | |||
y2 = enc(words2, mask2) | |||
output = self.cls(y1, y2) # [bsz, n_cls] | |||
return {Const.OUTPUT: output} | |||
def predict(self, words1, words2, seq_len1, seq_len2): | |||
""" | |||
:param words1: [batch, seq_len] 输入序列1 | |||
:param words2: [batch, seq_len] 输入序列2 | |||
:param seq_len1: [batch,] 输入序列1的长度 | |||
:param seq_len2: [batch,] 输入序列2的长度 | |||
:return output: [batch, num_cls] 输出分类的概率 | |||
""" | |||
y = self.forward(words1, words2, seq_len1, seq_len2) | |||
_, pred = y[Const.OUTPUT].max(1) | |||
return {Const.OUTPUT: pred} |
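Both STSeqCls and STNLICls form their sentence representation as `0.5 * (relay + nodes.max(1)[0])`, averaging the global relay node with the element-wise max over token states. A list-based sketch of that readout (an illustrative helper, not part of fastNLP):

```python
def st_readout(nodes, relay):
    # nodes: [seq_len][hidden] token states; relay: [hidden] global relay node.
    # Average the relay state with the element-wise max over token states.
    hidden = len(relay)
    max_nodes = [max(tok[k] for tok in nodes) for k in range(hidden)]
    return [0.5 * (relay[k] + max_nodes[k]) for k in range(hidden)]
```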
@@ -1,3 +1,51 @@ | |||
""" | |||
大部分用于的 NLP 任务神经网络都可以看做由编码 :mod:`~fastNLP.modules.encoder` 、 | |||
聚合 :mod:`~fastNLP.modules.aggregator` 、解码 :mod:`~fastNLP.modules.decoder` 三种模块组成。 | |||
.. image:: figures/text_classification.png | |||
:mod:`~fastNLP.modules` 中实现了 fastNLP 提供的诸多模块组件,可以帮助用户快速搭建自己所需的网络。 | |||
三种模块的功能和常见组件如下: | |||
+-----------------------+-----------------------+-----------------------+ | |||
| module type | functionality | example | | |||
+=======================+=======================+=======================+ | |||
| encoder | 将输入编码为具有具 | embedding, RNN, CNN, | | |||
| | 有表示能力的向量 | transformer | | |||
+-----------------------+-----------------------+-----------------------+ | |||
| aggregator | 从多个向量中聚合信息 | self-attention, | | |||
| | | max-pooling | | |||
+-----------------------+-----------------------+-----------------------+ | |||
| decoder | 将具有某种表示意义的 | MLP, CRF | | |||
| | 向量解码为需要的输出 | | | |||
| | 形式 | | | |||
+-----------------------+-----------------------+-----------------------+ | |||
""" | |||
__all__ = [ | |||
# "BertModel", | |||
"ConvolutionCharEncoder", | |||
"LSTMCharEncoder", | |||
"ConvMaxpool", | |||
"Embedding", | |||
"LSTM", | |||
"StarTransformer", | |||
"TransformerEncoder", | |||
"VarRNN", | |||
"VarLSTM", | |||
"VarGRU", | |||
"MaxPool", | |||
"MaxPoolWithMask", | |||
"AvgPool", | |||
"MultiHeadAttention", | |||
"MLP", | |||
"ConditionalRandomField", | |||
"viterbi_decode", | |||
"allowed_transitions", | |||
] | |||
from . import aggregator | |||
from . import decoder | |||
from . import encoder | |||
@@ -5,9 +53,4 @@ from .aggregator import * | |||
from .decoder import * | |||
from .dropout import TimestepDropout | |||
from .encoder import * | |||
from .utils import get_embeddings |