
merge from dev0.8.0

tags/v1.0.0alpha
yhcc committed 2 years ago
commit 842fb7ae30
100 changed files with 16056 additions and 78 deletions
  1. +4 -3 docs/source/conf.py
  2. +1 -0 docs/source/fastNLP.core.callbacks.rst
  3. +7 -0 docs/source/fastNLP.core.callbacks.timer_callback.rst
  4. +7 -0 docs/source/fastNLP.core.collators.padders.oneflow_padder.rst
  5. +1 -0 docs/source/fastNLP.core.collators.padders.rst
  6. +0 -7 docs/source/fastNLP.core.dataloaders.mix_dataloader.rst
  7. +7 -0 docs/source/fastNLP.core.dataloaders.oneflow_dataloader.fdl.rst
  8. +15 -0 docs/source/fastNLP.core.dataloaders.oneflow_dataloader.rst
  9. +1 -1 docs/source/fastNLP.core.dataloaders.rst
  10. +7 -0 docs/source/fastNLP.core.dataloaders.torch_dataloader.mix_dataloader.rst
  11. +1 -0 docs/source/fastNLP.core.dataloaders.torch_dataloader.rst
  12. +7 -0 docs/source/fastNLP.core.drivers.oneflow_driver.ddp.rst
  13. +7 -0 docs/source/fastNLP.core.drivers.oneflow_driver.dist_utils.rst
  14. +7 -0 docs/source/fastNLP.core.drivers.oneflow_driver.initialize_oneflow_driver.rst
  15. +7 -0 docs/source/fastNLP.core.drivers.oneflow_driver.oneflow_driver.rst
  16. +20 -0 docs/source/fastNLP.core.drivers.oneflow_driver.rst
  17. +7 -0 docs/source/fastNLP.core.drivers.oneflow_driver.single_device.rst
  18. +7 -0 docs/source/fastNLP.core.drivers.oneflow_driver.utils.rst
  19. +1 -0 docs/source/fastNLP.core.drivers.rst
  20. +7 -0 docs/source/fastNLP.core.drivers.torch_driver.deepspeed.rst
  21. +7 -0 docs/source/fastNLP.core.drivers.torch_driver.fairscale.rst
  22. +0 -7 docs/source/fastNLP.core.drivers.torch_driver.fairscale_sharded.rst
  23. +3 -1 docs/source/fastNLP.core.drivers.torch_driver.rst
  24. +7 -0 docs/source/fastNLP.core.drivers.torch_driver.torch_fsdp.rst
  25. +7 -0 docs/source/fastNLP.core.metrics.backend.oneflow_backend.backend.rst
  26. +15 -0 docs/source/fastNLP.core.metrics.backend.oneflow_backend.rst
  27. +1 -0 docs/source/fastNLP.core.metrics.backend.rst
  28. +7 -0 docs/source/fastNLP.core.utils.oneflow_utils.rst
  29. +3 -0 docs/source/fastNLP.core.utils.rst
  30. +7 -0 docs/source/fastNLP.core.utils.seq_len_to_mask.rst
  31. +7 -0 docs/source/fastNLP.core.utils.tqdm_progress.rst
  32. +15 -0 docs/source/fastNLP.embeddings.rst
  33. +7 -0 docs/source/fastNLP.embeddings.torch.char_embedding.rst
  34. +7 -0 docs/source/fastNLP.embeddings.torch.embedding.rst
  35. +19 -0 docs/source/fastNLP.embeddings.torch.rst
  36. +7 -0 docs/source/fastNLP.embeddings.torch.stack_embedding.rst
  37. +7 -0 docs/source/fastNLP.embeddings.torch.static_embedding.rst
  38. +7 -0 docs/source/fastNLP.embeddings.torch.utils.rst
  39. +0 -1 docs/source/fastNLP.io.loader.rst
  40. +0 -7 docs/source/fastNLP.io.model_io.rst
  41. +0 -1 docs/source/fastNLP.io.pipe.rst
  42. +0 -1 docs/source/fastNLP.io.rst
  43. +15 -0 docs/source/fastNLP.models.rst
  44. +7 -0 docs/source/fastNLP.models.torch.biaffine_parser.rst
  45. +7 -0 docs/source/fastNLP.models.torch.cnn_text_classification.rst
  46. +19 -0 docs/source/fastNLP.models.torch.rst
  47. +7 -0 docs/source/fastNLP.models.torch.seq2seq_generator.rst
  48. +7 -0 docs/source/fastNLP.models.torch.seq2seq_model.rst
  49. +7 -0 docs/source/fastNLP.models.torch.sequence_labeling.rst
  50. +1 -0 docs/source/fastNLP.modules.rst
  51. +7 -0 docs/source/fastNLP.modules.torch.attention.rst
  52. +7 -0 docs/source/fastNLP.modules.torch.decoder.crf.rst
  53. +7 -0 docs/source/fastNLP.modules.torch.decoder.mlp.rst
  54. +18 -0 docs/source/fastNLP.modules.torch.decoder.rst
  55. +7 -0 docs/source/fastNLP.modules.torch.decoder.seq2seq_decoder.rst
  56. +7 -0 docs/source/fastNLP.modules.torch.decoder.seq2seq_state.rst
  57. +2 -2 docs/source/fastNLP.modules.torch.dropout.rst
  58. +7 -0 docs/source/fastNLP.modules.torch.encoder.conv_maxpool.rst
  59. +7 -0 docs/source/fastNLP.modules.torch.encoder.lstm.rst
  60. +20 -0 docs/source/fastNLP.modules.torch.encoder.rst
  61. +7 -0 docs/source/fastNLP.modules.torch.encoder.seq2seq_encoder.rst
  62. +7 -0 docs/source/fastNLP.modules.torch.encoder.star_transformer.rst
  63. +7 -0 docs/source/fastNLP.modules.torch.encoder.transformer.rst
  64. +7 -0 docs/source/fastNLP.modules.torch.encoder.variational_rnn.rst
  65. +15 -0 docs/source/fastNLP.modules.torch.generator.rst
  66. +7 -0 docs/source/fastNLP.modules.torch.generator.seq2seq_generator.rst
  67. +26 -0 docs/source/fastNLP.modules.torch.rst
  68. +3 -0 docs/source/fastNLP.rst
  69. +14 -0 docs/source/fastNLP.transformers.rst
  70. +2 -2 docs/source/fastNLP.transformers.torch.rst
  71. +4 -4 docs/source/index.rst
  72. +8 -0 docs/source/tutorials.rst
  73. +869 -0 docs/source/tutorials/fastnlp_torch_tutorial.ipynb
  74. +1352 -0 docs/source/tutorials/fastnlp_tutorial_0.ipynb
  75. +1333 -0 docs/source/tutorials/fastnlp_tutorial_1.ipynb
  76. +884 -0 docs/source/tutorials/fastnlp_tutorial_2.ipynb
  77. +621 -0 docs/source/tutorials/fastnlp_tutorial_3.ipynb
  78. +2614 -0 docs/source/tutorials/fastnlp_tutorial_4.ipynb
  79. +1242 -0 docs/source/tutorials/fastnlp_tutorial_5.ipynb
  80. +1646 -0 docs/source/tutorials/fastnlp_tutorial_6.ipynb
  81. +1280 -0 docs/source/tutorials/fastnlp_tutorial_e1.ipynb
  82. +1082 -0 docs/source/tutorials/fastnlp_tutorial_e2.ipynb
  83. +1086 -0 docs/source/tutorials/fastnlp_tutorial_paddle_e1.ipynb
  84. +1510 -0 docs/source/tutorials/fastnlp_tutorial_paddle_e2.ipynb
  85. BIN docs/source/tutorials/figures/E1-fig-glue-benchmark.png
  86. BIN docs/source/tutorials/figures/E2-fig-p-tuning-v2-model.png
  87. BIN docs/source/tutorials/figures/E2-fig-pet-model.png
  88. BIN docs/source/tutorials/figures/T0-fig-parameter-matching.png
  89. BIN docs/source/tutorials/figures/T0-fig-trainer-and-evaluator.png
  90. BIN docs/source/tutorials/figures/T0-fig-training-structure.png
  91. BIN docs/source/tutorials/figures/T1-fig-dataset-and-vocabulary.png
  92. BIN docs/source/tutorials/figures/paddle-ernie-1.0-masking-levels.png
  93. BIN docs/source/tutorials/figures/paddle-ernie-1.0-masking.png
  94. BIN docs/source/tutorials/figures/paddle-ernie-2.0-continual-pretrain.png
  95. BIN docs/source/tutorials/figures/paddle-ernie-3.0-framework.png
  96. +1 -1 fastNLP/__init__.py
  97. +3 -27 fastNLP/core/callbacks/callback_event.py
  98. +1 -0 fastNLP/core/collators/padders/oneflow_padder.py
  99. +7 -7 fastNLP/core/controllers/trainer.py
  100. +5 -6 fastNLP/core/dataset/dataset.py

+4 -3 docs/source/conf.py

@@ -24,9 +24,9 @@ copyright = '2022, fastNLP'
author = 'fastNLP'

# The short X.Y version
- version = '0.8'
+ version = '1.0'
# The full version, including alpha/beta/rc tags
- release = '0.8.0'
+ release = '1.0.0-alpha'

# -- General configuration ---------------------------------------------------

@@ -45,6 +45,7 @@ extensions = [
'sphinx.ext.todo',
'sphinx_autodoc_typehints',
'sphinx_multiversion',
'nbsphinx',
]

autodoc_default_options = {
@@ -169,7 +170,7 @@ man_pages = [
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'fastNLP', 'fastNLP Documentation',
- author, 'fastNLP', 'One line description of project.',
+ author, 'fastNLP', 'A fast NLP tool for programming.',
'Miscellaneous'),
]



+1 -0 docs/source/fastNLP.core.callbacks.rst

@@ -31,5 +31,6 @@ Submodules
fastNLP.core.callbacks.lr_scheduler_callback
fastNLP.core.callbacks.more_evaluate_callback
fastNLP.core.callbacks.progress_callback
fastNLP.core.callbacks.timer_callback
fastNLP.core.callbacks.topk_saver
fastNLP.core.callbacks.utils

+7 -0 docs/source/fastNLP.core.callbacks.timer_callback.rst

@@ -0,0 +1,7 @@
fastNLP.core.callbacks.timer\_callback module
=============================================

.. automodule:: fastNLP.core.callbacks.timer_callback
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.core.collators.padders.oneflow_padder.rst

@@ -0,0 +1,7 @@
fastNLP.core.collators.padders.oneflow\_padder module
=====================================================

.. automodule:: fastNLP.core.collators.padders.oneflow_padder
:members:
:undoc-members:
:show-inheritance:

+1 -0 docs/source/fastNLP.core.collators.padders.rst

@@ -16,6 +16,7 @@ Submodules
fastNLP.core.collators.padders.get_padder
fastNLP.core.collators.padders.jittor_padder
fastNLP.core.collators.padders.numpy_padder
fastNLP.core.collators.padders.oneflow_padder
fastNLP.core.collators.padders.padder
fastNLP.core.collators.padders.paddle_padder
fastNLP.core.collators.padders.raw_padder


+0 -7 docs/source/fastNLP.core.dataloaders.mix_dataloader.rst

@@ -1,7 +0,0 @@
fastNLP.core.dataloaders.mix\_dataloader module
===============================================

.. automodule:: fastNLP.core.dataloaders.mix_dataloader
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.core.dataloaders.oneflow_dataloader.fdl.rst

@@ -0,0 +1,7 @@
fastNLP.core.dataloaders.oneflow\_dataloader.fdl module
=======================================================

.. automodule:: fastNLP.core.dataloaders.oneflow_dataloader.fdl
:members:
:undoc-members:
:show-inheritance:

+15 -0 docs/source/fastNLP.core.dataloaders.oneflow_dataloader.rst

@@ -0,0 +1,15 @@
fastNLP.core.dataloaders.oneflow\_dataloader package
====================================================

.. automodule:: fastNLP.core.dataloaders.oneflow_dataloader
:members:
:undoc-members:
:show-inheritance:

Submodules
----------

.. toctree::
:maxdepth: 4

fastNLP.core.dataloaders.oneflow_dataloader.fdl

+1 -1 docs/source/fastNLP.core.dataloaders.rst

@@ -13,6 +13,7 @@ Subpackages
:maxdepth: 4

fastNLP.core.dataloaders.jittor_dataloader
fastNLP.core.dataloaders.oneflow_dataloader
fastNLP.core.dataloaders.paddle_dataloader
fastNLP.core.dataloaders.torch_dataloader

@@ -22,6 +23,5 @@ Submodules
.. toctree::
:maxdepth: 4

fastNLP.core.dataloaders.mix_dataloader
fastNLP.core.dataloaders.prepare_dataloader
fastNLP.core.dataloaders.utils

+7 -0 docs/source/fastNLP.core.dataloaders.torch_dataloader.mix_dataloader.rst

@@ -0,0 +1,7 @@
fastNLP.core.dataloaders.torch\_dataloader.mix\_dataloader module
=================================================================

.. automodule:: fastNLP.core.dataloaders.torch_dataloader.mix_dataloader
:members:
:undoc-members:
:show-inheritance:

+1 -0 docs/source/fastNLP.core.dataloaders.torch_dataloader.rst

@@ -13,3 +13,4 @@ Submodules
:maxdepth: 4

fastNLP.core.dataloaders.torch_dataloader.fdl
fastNLP.core.dataloaders.torch_dataloader.mix_dataloader

+7 -0 docs/source/fastNLP.core.drivers.oneflow_driver.ddp.rst

@@ -0,0 +1,7 @@
fastNLP.core.drivers.oneflow\_driver.ddp module
===============================================

.. automodule:: fastNLP.core.drivers.oneflow_driver.ddp
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.core.drivers.oneflow_driver.dist_utils.rst

@@ -0,0 +1,7 @@
fastNLP.core.drivers.oneflow\_driver.dist\_utils module
=======================================================

.. automodule:: fastNLP.core.drivers.oneflow_driver.dist_utils
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.core.drivers.oneflow_driver.initialize_oneflow_driver.rst

@@ -0,0 +1,7 @@
fastNLP.core.drivers.oneflow\_driver.initialize\_oneflow\_driver module
=======================================================================

.. automodule:: fastNLP.core.drivers.oneflow_driver.initialize_oneflow_driver
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.core.drivers.oneflow_driver.oneflow_driver.rst

@@ -0,0 +1,7 @@
fastNLP.core.drivers.oneflow\_driver.oneflow\_driver module
===========================================================

.. automodule:: fastNLP.core.drivers.oneflow_driver.oneflow_driver
:members:
:undoc-members:
:show-inheritance:

+20 -0 docs/source/fastNLP.core.drivers.oneflow_driver.rst

@@ -0,0 +1,20 @@
fastNLP.core.drivers.oneflow\_driver package
============================================

.. automodule:: fastNLP.core.drivers.oneflow_driver
:members:
:undoc-members:
:show-inheritance:

Submodules
----------

.. toctree::
:maxdepth: 4

fastNLP.core.drivers.oneflow_driver.ddp
fastNLP.core.drivers.oneflow_driver.dist_utils
fastNLP.core.drivers.oneflow_driver.initialize_oneflow_driver
fastNLP.core.drivers.oneflow_driver.oneflow_driver
fastNLP.core.drivers.oneflow_driver.single_device
fastNLP.core.drivers.oneflow_driver.utils

+7 -0 docs/source/fastNLP.core.drivers.oneflow_driver.single_device.rst

@@ -0,0 +1,7 @@
fastNLP.core.drivers.oneflow\_driver.single\_device module
==========================================================

.. automodule:: fastNLP.core.drivers.oneflow_driver.single_device
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.core.drivers.oneflow_driver.utils.rst

@@ -0,0 +1,7 @@
fastNLP.core.drivers.oneflow\_driver.utils module
=================================================

.. automodule:: fastNLP.core.drivers.oneflow_driver.utils
:members:
:undoc-members:
:show-inheritance:

+1 -0 docs/source/fastNLP.core.drivers.rst

@@ -13,6 +13,7 @@ Subpackages
:maxdepth: 4

fastNLP.core.drivers.jittor_driver
fastNLP.core.drivers.oneflow_driver
fastNLP.core.drivers.paddle_driver
fastNLP.core.drivers.torch_driver



+7 -0 docs/source/fastNLP.core.drivers.torch_driver.deepspeed.rst

@@ -0,0 +1,7 @@
fastNLP.core.drivers.torch\_driver.deepspeed module
===================================================

.. automodule:: fastNLP.core.drivers.torch_driver.deepspeed
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.core.drivers.torch_driver.fairscale.rst

@@ -0,0 +1,7 @@
fastNLP.core.drivers.torch\_driver.fairscale module
===================================================

.. automodule:: fastNLP.core.drivers.torch_driver.fairscale
:members:
:undoc-members:
:show-inheritance:

+0 -7 docs/source/fastNLP.core.drivers.torch_driver.fairscale_sharded.rst

@@ -1,7 +0,0 @@
fastNLP.core.drivers.torch\_driver.fairscale\_sharded module
============================================================

.. automodule:: fastNLP.core.drivers.torch_driver.fairscale_sharded
:members:
:undoc-members:
:show-inheritance:

+3 -1 docs/source/fastNLP.core.drivers.torch_driver.rst

@@ -13,9 +13,11 @@ Submodules
:maxdepth: 4

fastNLP.core.drivers.torch_driver.ddp
fastNLP.core.drivers.torch_driver.deepspeed
fastNLP.core.drivers.torch_driver.dist_utils
fastNLP.core.drivers.torch_driver.fairscale_sharded
fastNLP.core.drivers.torch_driver.fairscale
fastNLP.core.drivers.torch_driver.initialize_torch_driver
fastNLP.core.drivers.torch_driver.single_device
fastNLP.core.drivers.torch_driver.torch_driver
fastNLP.core.drivers.torch_driver.torch_fsdp
fastNLP.core.drivers.torch_driver.utils

+7 -0 docs/source/fastNLP.core.drivers.torch_driver.torch_fsdp.rst

@@ -0,0 +1,7 @@
fastNLP.core.drivers.torch\_driver.torch\_fsdp module
=====================================================

.. automodule:: fastNLP.core.drivers.torch_driver.torch_fsdp
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.core.metrics.backend.oneflow_backend.backend.rst

@@ -0,0 +1,7 @@
fastNLP.core.metrics.backend.oneflow\_backend.backend module
============================================================

.. automodule:: fastNLP.core.metrics.backend.oneflow_backend.backend
:members:
:undoc-members:
:show-inheritance:

+15 -0 docs/source/fastNLP.core.metrics.backend.oneflow_backend.rst

@@ -0,0 +1,15 @@
fastNLP.core.metrics.backend.oneflow\_backend package
=====================================================

.. automodule:: fastNLP.core.metrics.backend.oneflow_backend
:members:
:undoc-members:
:show-inheritance:

Submodules
----------

.. toctree::
:maxdepth: 4

fastNLP.core.metrics.backend.oneflow_backend.backend

+1 -0 docs/source/fastNLP.core.metrics.backend.rst

@@ -13,6 +13,7 @@ Subpackages
:maxdepth: 4

fastNLP.core.metrics.backend.jittor_backend
fastNLP.core.metrics.backend.oneflow_backend
fastNLP.core.metrics.backend.paddle_backend
fastNLP.core.metrics.backend.torch_backend



+7 -0 docs/source/fastNLP.core.utils.oneflow_utils.rst

@@ -0,0 +1,7 @@
fastNLP.core.utils.oneflow\_utils module
========================================

.. automodule:: fastNLP.core.utils.oneflow_utils
:members:
:undoc-members:
:show-inheritance:

+3 -0 docs/source/fastNLP.core.utils.rst

@@ -16,7 +16,10 @@ Submodules
fastNLP.core.utils.dummy_class
fastNLP.core.utils.exceptions
fastNLP.core.utils.jittor_utils
fastNLP.core.utils.oneflow_utils
fastNLP.core.utils.paddle_utils
fastNLP.core.utils.rich_progress
fastNLP.core.utils.seq_len_to_mask
fastNLP.core.utils.torch_utils
fastNLP.core.utils.tqdm_progress
fastNLP.core.utils.utils

+7 -0 docs/source/fastNLP.core.utils.seq_len_to_mask.rst

@@ -0,0 +1,7 @@
fastNLP.core.utils.seq\_len\_to\_mask module
============================================

.. automodule:: fastNLP.core.utils.seq_len_to_mask
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.core.utils.tqdm_progress.rst

@@ -0,0 +1,7 @@
fastNLP.core.utils.tqdm\_progress module
========================================

.. automodule:: fastNLP.core.utils.tqdm_progress
:members:
:undoc-members:
:show-inheritance:

+15 -0 docs/source/fastNLP.embeddings.rst

@@ -0,0 +1,15 @@
fastNLP.embeddings package
==========================

.. automodule:: fastNLP.embeddings
:members:
:undoc-members:
:show-inheritance:

Subpackages
-----------

.. toctree::
:maxdepth: 4

fastNLP.embeddings.torch

+7 -0 docs/source/fastNLP.embeddings.torch.char_embedding.rst

@@ -0,0 +1,7 @@
fastNLP.embeddings.torch.char\_embedding module
===============================================

.. automodule:: fastNLP.embeddings.torch.char_embedding
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.embeddings.torch.embedding.rst

@@ -0,0 +1,7 @@
fastNLP.embeddings.torch.embedding module
=========================================

.. automodule:: fastNLP.embeddings.torch.embedding
:members:
:undoc-members:
:show-inheritance:

+19 -0 docs/source/fastNLP.embeddings.torch.rst

@@ -0,0 +1,19 @@
fastNLP.embeddings.torch package
================================

.. automodule:: fastNLP.embeddings.torch
:members:
:undoc-members:
:show-inheritance:

Submodules
----------

.. toctree::
:maxdepth: 4

fastNLP.embeddings.torch.char_embedding
fastNLP.embeddings.torch.embedding
fastNLP.embeddings.torch.stack_embedding
fastNLP.embeddings.torch.static_embedding
fastNLP.embeddings.torch.utils

+7 -0 docs/source/fastNLP.embeddings.torch.stack_embedding.rst

@@ -0,0 +1,7 @@
fastNLP.embeddings.torch.stack\_embedding module
================================================

.. automodule:: fastNLP.embeddings.torch.stack_embedding
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.embeddings.torch.static_embedding.rst

@@ -0,0 +1,7 @@
fastNLP.embeddings.torch.static\_embedding module
=================================================

.. automodule:: fastNLP.embeddings.torch.static_embedding
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.embeddings.torch.utils.rst

@@ -0,0 +1,7 @@
fastNLP.embeddings.torch.utils module
=====================================

.. automodule:: fastNLP.embeddings.torch.utils
:members:
:undoc-members:
:show-inheritance:

+0 -1 docs/source/fastNLP.io.loader.rst

@@ -14,7 +14,6 @@ Submodules

fastNLP.io.loader.classification
fastNLP.io.loader.conll
fastNLP.io.loader.coreference
fastNLP.io.loader.csv
fastNLP.io.loader.cws
fastNLP.io.loader.json


+0 -7 docs/source/fastNLP.io.model_io.rst

@@ -1,7 +0,0 @@
fastNLP.io.model\_io module
===========================

.. automodule:: fastNLP.io.model_io
:members:
:undoc-members:
:show-inheritance:

+0 -1 docs/source/fastNLP.io.pipe.rst

@@ -15,7 +15,6 @@ Submodules
fastNLP.io.pipe.classification
fastNLP.io.pipe.conll
fastNLP.io.pipe.construct_graph
fastNLP.io.pipe.coreference
fastNLP.io.pipe.cws
fastNLP.io.pipe.matching
fastNLP.io.pipe.pipe


+0 -1 docs/source/fastNLP.io.rst

@@ -25,5 +25,4 @@ Submodules
fastNLP.io.embed_loader
fastNLP.io.file_reader
fastNLP.io.file_utils
fastNLP.io.model_io
fastNLP.io.utils

+15 -0 docs/source/fastNLP.models.rst

@@ -0,0 +1,15 @@
fastNLP.models package
======================

.. automodule:: fastNLP.models
:members:
:undoc-members:
:show-inheritance:

Subpackages
-----------

.. toctree::
:maxdepth: 4

fastNLP.models.torch

+7 -0 docs/source/fastNLP.models.torch.biaffine_parser.rst

@@ -0,0 +1,7 @@
fastNLP.models.torch.biaffine\_parser module
============================================

.. automodule:: fastNLP.models.torch.biaffine_parser
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.models.torch.cnn_text_classification.rst

@@ -0,0 +1,7 @@
fastNLP.models.torch.cnn\_text\_classification module
=====================================================

.. automodule:: fastNLP.models.torch.cnn_text_classification
:members:
:undoc-members:
:show-inheritance:

+19 -0 docs/source/fastNLP.models.torch.rst

@@ -0,0 +1,19 @@
fastNLP.models.torch package
============================

.. automodule:: fastNLP.models.torch
:members:
:undoc-members:
:show-inheritance:

Submodules
----------

.. toctree::
:maxdepth: 4

fastNLP.models.torch.biaffine_parser
fastNLP.models.torch.cnn_text_classification
fastNLP.models.torch.seq2seq_generator
fastNLP.models.torch.seq2seq_model
fastNLP.models.torch.sequence_labeling

+7 -0 docs/source/fastNLP.models.torch.seq2seq_generator.rst

@@ -0,0 +1,7 @@
fastNLP.models.torch.seq2seq\_generator module
==============================================

.. automodule:: fastNLP.models.torch.seq2seq_generator
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.models.torch.seq2seq_model.rst

@@ -0,0 +1,7 @@
fastNLP.models.torch.seq2seq\_model module
==========================================

.. automodule:: fastNLP.models.torch.seq2seq_model
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.models.torch.sequence_labeling.rst

@@ -0,0 +1,7 @@
fastNLP.models.torch.sequence\_labeling module
==============================================

.. automodule:: fastNLP.models.torch.sequence_labeling
:members:
:undoc-members:
:show-inheritance:

+1 -0 docs/source/fastNLP.modules.rst

@@ -13,3 +13,4 @@ Subpackages
:maxdepth: 4

fastNLP.modules.mix_modules
fastNLP.modules.torch

+7 -0 docs/source/fastNLP.modules.torch.attention.rst

@@ -0,0 +1,7 @@
fastNLP.modules.torch.attention module
======================================

.. automodule:: fastNLP.modules.torch.attention
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.modules.torch.decoder.crf.rst

@@ -0,0 +1,7 @@
fastNLP.modules.torch.decoder.crf module
========================================

.. automodule:: fastNLP.modules.torch.decoder.crf
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.modules.torch.decoder.mlp.rst

@@ -0,0 +1,7 @@
fastNLP.modules.torch.decoder.mlp module
========================================

.. automodule:: fastNLP.modules.torch.decoder.mlp
:members:
:undoc-members:
:show-inheritance:

+18 -0 docs/source/fastNLP.modules.torch.decoder.rst

@@ -0,0 +1,18 @@
fastNLP.modules.torch.decoder package
=====================================

.. automodule:: fastNLP.modules.torch.decoder
:members:
:undoc-members:
:show-inheritance:

Submodules
----------

.. toctree::
:maxdepth: 4

fastNLP.modules.torch.decoder.crf
fastNLP.modules.torch.decoder.mlp
fastNLP.modules.torch.decoder.seq2seq_decoder
fastNLP.modules.torch.decoder.seq2seq_state

+7 -0 docs/source/fastNLP.modules.torch.decoder.seq2seq_decoder.rst

@@ -0,0 +1,7 @@
fastNLP.modules.torch.decoder.seq2seq\_decoder module
=====================================================

.. automodule:: fastNLP.modules.torch.decoder.seq2seq_decoder
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.modules.torch.decoder.seq2seq_state.rst

@@ -0,0 +1,7 @@
fastNLP.modules.torch.decoder.seq2seq\_state module
===================================================

.. automodule:: fastNLP.modules.torch.decoder.seq2seq_state
:members:
:undoc-members:
:show-inheritance:

docs/source/fastNLP.io.loader.coreference.rst → docs/source/fastNLP.modules.torch.dropout.rst

@@ -1,7 +1,7 @@
- fastNLP.io.loader.coreference module
+ fastNLP.modules.torch.dropout module
====================================

- .. automodule:: fastNLP.io.loader.coreference
+ .. automodule:: fastNLP.modules.torch.dropout
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.modules.torch.encoder.conv_maxpool.rst

@@ -0,0 +1,7 @@
fastNLP.modules.torch.encoder.conv\_maxpool module
==================================================

.. automodule:: fastNLP.modules.torch.encoder.conv_maxpool
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.modules.torch.encoder.lstm.rst

@@ -0,0 +1,7 @@
fastNLP.modules.torch.encoder.lstm module
=========================================

.. automodule:: fastNLP.modules.torch.encoder.lstm
:members:
:undoc-members:
:show-inheritance:

+20 -0 docs/source/fastNLP.modules.torch.encoder.rst

@@ -0,0 +1,20 @@
fastNLP.modules.torch.encoder package
=====================================

.. automodule:: fastNLP.modules.torch.encoder
:members:
:undoc-members:
:show-inheritance:

Submodules
----------

.. toctree::
:maxdepth: 4

fastNLP.modules.torch.encoder.conv_maxpool
fastNLP.modules.torch.encoder.lstm
fastNLP.modules.torch.encoder.seq2seq_encoder
fastNLP.modules.torch.encoder.star_transformer
fastNLP.modules.torch.encoder.transformer
fastNLP.modules.torch.encoder.variational_rnn

+7 -0 docs/source/fastNLP.modules.torch.encoder.seq2seq_encoder.rst

@@ -0,0 +1,7 @@
fastNLP.modules.torch.encoder.seq2seq\_encoder module
=====================================================

.. automodule:: fastNLP.modules.torch.encoder.seq2seq_encoder
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.modules.torch.encoder.star_transformer.rst

@@ -0,0 +1,7 @@
fastNLP.modules.torch.encoder.star\_transformer module
======================================================

.. automodule:: fastNLP.modules.torch.encoder.star_transformer
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.modules.torch.encoder.transformer.rst

@@ -0,0 +1,7 @@
fastNLP.modules.torch.encoder.transformer module
================================================

.. automodule:: fastNLP.modules.torch.encoder.transformer
:members:
:undoc-members:
:show-inheritance:

+7 -0 docs/source/fastNLP.modules.torch.encoder.variational_rnn.rst

@@ -0,0 +1,7 @@
fastNLP.modules.torch.encoder.variational\_rnn module
=====================================================

.. automodule:: fastNLP.modules.torch.encoder.variational_rnn
:members:
:undoc-members:
:show-inheritance:

+15 -0 docs/source/fastNLP.modules.torch.generator.rst

@@ -0,0 +1,15 @@
fastNLP.modules.torch.generator package
=======================================

.. automodule:: fastNLP.modules.torch.generator
:members:
:undoc-members:
:show-inheritance:

Submodules
----------

.. toctree::
:maxdepth: 4

fastNLP.modules.torch.generator.seq2seq_generator

+7 -0 docs/source/fastNLP.modules.torch.generator.seq2seq_generator.rst

@@ -0,0 +1,7 @@
fastNLP.modules.torch.generator.seq2seq\_generator module
=========================================================

.. automodule:: fastNLP.modules.torch.generator.seq2seq_generator
:members:
:undoc-members:
:show-inheritance:

+26 -0 docs/source/fastNLP.modules.torch.rst

@@ -0,0 +1,26 @@
fastNLP.modules.torch package
=============================

.. automodule:: fastNLP.modules.torch
:members:
:undoc-members:
:show-inheritance:

Subpackages
-----------

.. toctree::
:maxdepth: 4

fastNLP.modules.torch.decoder
fastNLP.modules.torch.encoder
fastNLP.modules.torch.generator

Submodules
----------

.. toctree::
:maxdepth: 4

fastNLP.modules.torch.attention
fastNLP.modules.torch.dropout

+3 -0 docs/source/fastNLP.rst

@@ -13,6 +13,9 @@ Subpackages
:maxdepth: 4

fastNLP.core
fastNLP.embeddings
fastNLP.envs
fastNLP.io
fastNLP.models
fastNLP.modules
fastNLP.transformers

+14 -0 docs/source/fastNLP.transformers.rst

@@ -0,0 +1,14 @@
fastNLP.transformers package
============================
.. automodule:: fastNLP.transformers
:members:
:undoc-members:
:show-inheritance:

Subpackages
-----------

.. toctree::
:maxdepth: 4

fastNLP.transformers.torch

docs/source/fastNLP.io.pipe.coreference.rst → docs/source/fastNLP.transformers.torch.rst

@@ -1,7 +1,7 @@
- fastNLP.io.pipe.coreference module
+ fastNLP.transformers.torch package
==================================

- .. automodule:: fastNLP.io.pipe.coreference
+ .. automodule:: fastNLP.transformers.torch
:members:
:undoc-members:
:show-inheritance:

+4 -4 docs/source/index.rst

@@ -2,18 +2,18 @@ fastNLP 中文文档
=====================


- 用户手册
+ 快速上手
----------------

.. toctree::
:maxdepth: 1
:maxdepth: 2

- 语法样例 </user/example>
+ tutorials

API 文档
-------------

- 除了用户手册之外,你还可以通过查阅 API 文档来找到你所需要的工具。
+ 可以通过查阅 API 文档来找到你所需要的工具。

.. toctree::
:titlesonly:


+8 -0 docs/source/tutorials.rst

@@ -0,0 +1,8 @@
fastNLP 教程系列
================

.. toctree::
:maxdepth: 1
:glob:

tutorials/*

+869 -0 docs/source/tutorials/fastnlp_torch_tutorial.ipynb

@@ -0,0 +1,869 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6011adf8",
"metadata": {},
"source": [
"# A 10-minute quick start with fastNLP torch\n",
"\n",
"In this example, we will use BERT to solve the named entity recognition task on the conll2003 dataset."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "e166c051",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2022-07-07 10:12:29-- https://data.deepai.org/conll2003.zip\n",
"Resolving data.deepai.org (data.deepai.org)... 138.201.36.183\n",
"Connecting to data.deepai.org (data.deepai.org)|138.201.36.183|:443... connected.\n",
"WARNING: cannot verify data.deepai.org's certificate, issued by ‘CN=R3,O=Let's Encrypt,C=US’:\n",
" Issued certificate has expired.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 982975 (960K) [application/x-zip-compressed]\n",
"Saving to: ‘conll2003.zip’\n",
"\n",
"conll2003.zip 100%[===================>] 959.94K 653KB/s in 1.5s \n",
"\n",
"2022-07-07 10:12:32 (653 KB/s) - ‘conll2003.zip’ saved [982975/982975]\n",
"\n",
"Archive: conll2003.zip\n",
" inflating: conll2003/metadata \n",
" inflating: conll2003/test.txt \n",
" inflating: conll2003/train.txt \n",
" inflating: conll2003/valid.txt \n"
]
}
],
"source": [
"# Download and unzip the data (Linux/Mac)\n",
"import platform\n",
"if platform.system() != \"Windows\":\n",
" !wget https://data.deepai.org/conll2003.zip --no-check-certificate -O conll2003.zip\n",
" !unzip conll2003.zip -d conll2003\n",
"# Windows users: download the data by opening the URL in a browser, then unzip it"
]
},
{
"cell_type": "markdown",
"id": "f7acbf1f",
"metadata": {},
"source": [
"## Contents\n",
"The sections below show how fastNLP cuts down the amount of boilerplate engineering code you have to write \n",
"- 1. Data loading\n",
"- 2. Data preprocessing and caching\n",
"- 3. DataLoader\n",
"- 4. Model preparation\n",
"- 5. Using the Trainer\n",
"- 6. Using the Evaluator\n",
"- 7. Others [to be added]\n",
"    - 7.1 Training and evaluating on multiple GPUs\n",
"    - 7.2 Using ZeRO optimization\n",
"    - 7.3 Quickly sanity-checking a model with an overfit test\n",
"    - 7.4 Using complex Monitors\n",
"    - 7.5 Using different evaluation functions during training\n",
"    - 7.6 More efficient Samplers\n",
"    - 7.7 Saving models\n",
"    - 7.8 Resuming training from a checkpoint\n",
"    - 7.9 Using huggingface datasets\n",
"    - 7.10 Using torchmetrics as the metric\n",
"    - 7.11 Writing predictions to a file\n",
"    - 7.12 Training on mixed datasets\n",
"    - 7.13 Using the logger\n",
"    - 7.14 Custom distributed Metrics\n",
"    - 7.15 Implementing R-Drop via batch_step_fn"
]
},
{
"cell_type": "markdown",
"id": "0657dfba",
"metadata": {},
"source": [
"#### 1. Data loading\n",
"The ``conll2003`` directory now contains three files, ``train.txt``, ``test.txt`` and ``valid.txt``, in the [conll format](https://universaldependencies.org/format.html) with [BIO](https://blog.csdn.net/HappyRocking/article/details/79716212) encoding. Loading can be simplified by subclassing fastNLP.io.Loader: after inheriting from Loader, you only need to implement the _load() method, which reads a single file."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c557f0ba",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append('../..')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6f59e438",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
"</pre>\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"In total 3 datasets:\n",
"\ttrain has 14987 instances.\n",
"\ttest has 3684 instances.\n",
"\tdev has 3466 instances.\n",
"\n"
]
}
],
"source": [
"from fastNLP import DataSet, Instance\n",
"from fastNLP.io import Loader\n",
"\n",
"\n",
"# After subclassing Loader we only need to implement its _load() method, which takes a file path, reads that single file, and returns a fastNLP DataSet object.\n",
"class ConllLoader(Loader):\n",
"    def _load(self, path):\n",
"        ds = DataSet()\n",
"        with open(path, 'r') as f:\n",
"            segments = []\n",
"            for line in f:\n",
"                line = line.strip()\n",
"                if line == '':  # an empty line marks the end of the current sentence\n",
"                    if segments:\n",
"                        raw_words = [s[0] for s in segments]\n",
"                        raw_target = [s[1] for s in segments]\n",
"                        # append one sample to the DataSet\n",
"                        ds.append(Instance(raw_words=raw_words, raw_target=raw_target))\n",
"                        segments = []\n",
"                else:\n",
"                    parts = line.split()\n",
"                    assert len(parts) == 4\n",
"                    segments.append([parts[0], parts[-1]])\n",
"        return ds\n",
"\n",
"\n",
"# Load the datasets with load(); the returned data_bundle is a fastNLP.io.DataBundle object, which groups several datasets together\n",
"# to simplify later preprocessing. The interfaces supported by DataBundle can be found at !!!.\n",
"data_bundle = ConllLoader().load({\n",
"    'train': 'conll2003/train.txt',\n",
"    'test': 'conll2003/test.txt',\n",
"    'dev': 'conll2003/valid.txt'\n",
"})\n",
"\"\"\"\n",
"The data can also be read via ConllLoader().load('conll2003/'): load() searches the 'conll2003/' folder for files whose\n",
"names contain 'train', 'test' or 'dev' and reads them under those names (if the same keyword appears in several file\n",
"names, an error is raised; pass a dict of paths in that case). Since none of our files contains 'dev' in its name, the\n",
"folder form cannot be used here, so we pass a dict of paths instead; its keys become the dataset names and its values\n",
"the corresponding file paths.\n",
"\"\"\"\n",
"\n",
"print(data_bundle)  # printing data_bundle shows the DataSets it contains\n",
"# data_bundle.get_dataset('train')  # fetch a single dataset"
]
},
{
"cell_type": "markdown",
"id": "57ae314d",
"metadata": {},
"source": [
"#### 2. Data preprocessing\n",
"Next we show how fastNLP's apply functions make preprocessing quick and convenient. Two preprocessing steps are needed: \n",
"(1) use BertTokenizer to convert the text to indices, recording the index of the first bpe token of each word so that each word's hidden state can be recovered later; \n",
"(2) use [Vocabulary](../fastNLP) to convert raw_target to indices. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "96389988",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
],
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "c3bd41a323c94a41b409d29a5d4079b6",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
],
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[10:48:13] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> Save cache to <span style=\"color: #800080; text-decoration-color: #800080\">/remote-home/hyan01/exps/fastNLP/fastN</span> <a href=\"file://../../fastNLP/core/utils/cache_results.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">cache_results.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file://../../fastNLP/core/utils/cache_results.py#332\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">332</span></a>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> <span style=\"color: #800080; text-decoration-color: #800080\">LP/demo/torch_tutorial/caches/</span><span style=\"color: #ff00ff; text-decoration-color: #ff00ff\">c7f74559_cache.pkl.</span> <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m[10:48:13]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Save cache to \u001b[35m/remote-home/hyan01/exps/fastNLP/fastN\u001b[0m \u001b]8;id=831330;file://../../fastNLP/core/utils/cache_results.py\u001b\\\u001b[2mcache_results.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=609545;file://../../fastNLP/core/utils/cache_results.py#332\u001b\\\u001b[2m332\u001b[0m\u001b]8;;\u001b\\\n",
"\u001b[2;36m \u001b[0m \u001b[35mLP/demo/torch_tutorial/caches/\u001b[0m\u001b[95mc7f74559_cache.pkl.\u001b[0m \u001b[2m \u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# fastNLP ships BERT, RoBERTa, GPT and BART models; for other pretrained models, use transformers directly\n",
"from fastNLP.transformers.torch import BertTokenizer\n",
"from fastNLP import cache_results, Vocabulary\n",
"\n",
"# Decorating a function with cache_results caches its return value at 'caches/{param_hash_id}_cache.pkl', where {param_hash_id}\n",
"# is derived from the arguments passed to process_data, so a new cache file is generated whenever the arguments change. To force\n",
"# a fresh cache, either (a) pass an extra _refresh=True argument when calling process_data, or (b) delete the cache file. When\n",
"# saving, cache_results also records a hash of process_data's source code by default; if the source changes, reading the cache\n",
"# emits a warning, so you do not accidentally keep a stale cache after editing the preprocessing code.\n",
"@cache_results('caches/cache.pkl')\n",
"def process_data(data_bundle, model_name):\n",
"    tokenizer = BertTokenizer.from_pretrained(model_name)\n",
"    def bpe(raw_words):\n",
"        bpes = [tokenizer.cls_token_id]\n",
"        first = [0]\n",
"        first_index = 1  # index of the first bpe token of the next word\n",
"        for word in raw_words:\n",
"            bpe = tokenizer.encode(word, add_special_tokens=False)\n",
"            bpes.extend(bpe)\n",
"            first.append(first_index)\n",
"            first_index += len(bpe)\n",
"        bpes.append(tokenizer.sep_token_id)\n",
"        first.append(first_index)\n",
"        return {'input_ids': bpes, 'input_len': len(bpes), 'first': first, 'first_len': len(raw_words)}\n",
"    # Apply the bpe function to the raw_words field of every instance in every dataset of data_bundle, adding the\n",
"    # returned fields to each instance.\n",
"    data_bundle.apply_field_more(bpe, field_name='raw_words', num_proc=4)\n",
"    # There is also apply_field(), which differs from apply_field_more() in that the function passed to apply_field()\n",
"    # should return the content of a single field (i.e. without the dict wrapper). In addition, data_bundle.apply()\n",
"    # takes a function that receives an Instance object; see the documentation for details.\n",
"\n",
"    # vocabulary for the tags; since it is a tag vocabulary, no padding or unk tokens are needed\n",
"    tag_vocab = Vocabulary(padding=None, unknown=None)\n",
"    # build the vocabulary from the raw_target field of the train split\n",
"    tag_vocab.from_dataset(data_bundle.get_dataset('train'), field_name='raw_target')\n",
"    # use the vocabulary to convert raw_target in every dataset to indices, written into the new field target\n",
"    tag_vocab.index_dataset(data_bundle.datasets.values(), field_name='raw_target', new_field_name='target')\n",
"\n",
"    # the vocabulary can be bound to the data_bundle for convenient access later\n",
"    data_bundle.set_vocab(tag_vocab, field_name='target')\n",
"\n",
"    return data_bundle, tokenizer\n",
"\n",
"data_bundle, tokenizer = process_data(data_bundle, 'bert-base-cased', _refresh=True)  # the first call is slow; later calls read directly from the cache\n",
"# data_bundle = process_data(data_bundle, 'bert-base-uncased')  # since the arguments changed, fastNLP generates a new cache file."
]
},
{
"cell_type": "markdown",
"id": "80036fcd",
"metadata": {},
"source": [
"### 3. DataLoader \n",
"Most deep-learning algorithms optimize over mini-batches, so several samples must be combined into one batch before being fed to the model. In NLP the samples in a batch usually differ in length, so a padding step is required. In fastNLP, fastNLP.TorchDataLoader handles padding for you; it uses a !!!fastNLP.Collator!!! object to pad. During iteration the Collator inspects the first batch to decide automatically whether each field can be padded, and the padding behavior of a field can be changed with Collator.set_pad()."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "09494695",
"metadata": {},
"outputs": [],
"source": [
"from fastNLP import prepare_dataloader\n",
"\n",
"# Take each dataset out of data_bundle and build the corresponding DataLoader. The returned dls is a dict containing\n",
"# three fastNLP.TorchDataLoader objects: 'train', 'test' and 'dev'.\n",
"dls = prepare_dataloader(data_bundle, batch_size=24)\n",
"\n",
"\n",
"# By default fastNLP tries to pad every field: fields whose type cannot be padded are left alone, and paddable fields\n",
"# are padded with 0.\n",
"for dl in dls.values():\n",
"    # set_pad changes the padding behavior of a field.\n",
"    dl.set_pad('input_ids', pad_val=tokenizer.pad_token_id)\n",
"    # to ignore a field entirely, use set_ignore\n",
"    dl.set_ignore('raw_target')\n",
"    dl.set_pad('target', pad_val=-100)\n",
"# Alternatively, call data_bundle.set_pad('input_ids', pad_val=tokenizer.pad_token_id) and\n",
"# data_bundle.set_ignore('raw_target') before dls = prepare_dataloader(data_bundle, batch_size=32).\n",
"# DataSet supports these two methods as well.\n",
"# Calling batch = next(iter(dls['train'])) at this point yields a dict batch containing\n",
"# 'input_ids': torch.LongTensor([batch_size, max_len])\n",
"# 'input_len': torch.LongTensor([batch_size])\n",
"# 'first': torch.LongTensor([batch_size, max_len'])\n",
"# 'first_len': torch.LongTensor([batch_size])\n",
"# 'target': torch.LongTensor([batch_size, max_len'-2])\n",
"# 'raw_words': List[List[str]]  # the Collator cannot infer a pad strategy here, so it leaves this field untouched"
]
},
{
"cell_type": "markdown",
"id": "3583df6d",
"metadata": {},
"source": [
"### 4. Model preparation\n",
"A model passed to fastNLP needs two special methods, ``train_step`` and ``evaluate_step``: the former is called by fastNLP.Trainer by default, the latter by fastNLP.Evaluator. If the model has no ``train_step`` method, the Trainer falls back to the model's ``forward``; likewise, if there is no ``evaluate_step``, the Evaluator falls back to ``forward``. The return value of ``train_step`` (or of ``forward`` when ``train_step`` is absent) must be a dict and must contain the key ``loss``.\n",
"\n",
"In addition, fastNLP passes arguments by matching parameter names. For example, given the following model\n",
"```python\n",
"class Model(nn.Module):\n",
"    def train_step(self, x, y):\n",
"        return {'loss': (x-y).abs().mean()}\n",
"```\n",
"fastNLP will try to find the keys 'x' and 'y' in the batch returned by the DataLoader (suppose its keys are input_ids and target) and will raise an error if they are missing. The error can be resolved in any of the following ways:\n",
"- rename the parameters of train_step to (input_ids, target) so they match the keys of the batch returned by the DataLoader\n",
"- rename the keys of the batch returned by the DataLoader to (x, y)\n",
"- pass train_input_mapping={'input_ids': 'x', 'target': 'y'} to the Trainer to remap the inputs; train_input_mapping can also be a function, see the documentation for more details\n",
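"\n",
"As a rough illustrative sketch (not fastNLP's actual implementation), the name matching can be pictured like this, where call_with_matched_args is a hypothetical helper:\n",
"```python\n",
"import inspect\n",
"\n",
"def call_with_matched_args(fn, batch):\n",
"    # keep only the batch entries whose keys appear in fn's signature\n",
"    names = [p for p in inspect.signature(fn).parameters if p != 'self']\n",
"    missing = [n for n in names if n not in batch]\n",
"    if missing:\n",
"        raise RuntimeError(f'could not find keys {missing} in the batch')\n",
"    return fn(**{n: batch[n] for n in names})\n",
"```\n",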
"\n",
"``evaluate_step`` uses the same name-matching mechanism: the first two fixes above apply unchanged, and for the third you pass evaluate_input_mapping={'input_ids': 'x', 'target': 'y'} to the Evaluator."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f131c1a3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[10:48:21] </span><span style=\"color: #800000; text-decoration-color: #800000\">WARNING </span> Some weights of the model checkpoint at <a href=\"file://../../fastNLP/transformers/torch/modeling_utils.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">modeling_utils.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file://../../fastNLP/transformers/torch/modeling_utils.py#1490\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1490</span></a>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> bert-base-uncased were not used when initializing <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> BertModel: <span style=\"font-weight: bold\">[</span><span style=\"color: #008000; text-decoration-color: #008000\">'cls.predictions.bias'</span>, <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> <span style=\"color: #008000; text-decoration-color: #008000\">'cls.predictions.transform.LayerNorm.weight'</span>, <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> <span style=\"color: #008000; text-decoration-color: #008000\">'cls.seq_relationship.weight'</span>, <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> <span style=\"color: #008000; text-decoration-color: #008000\">'cls.predictions.decoder.weight'</span>, <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> <span style=\"color: #008000; text-decoration-color: #008000\">'cls.predictions.transform.dense.weight'</span>, <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> <span style=\"color: #008000; text-decoration-color: #008000\">'cls.predictions.transform.LayerNorm.bias'</span>, <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> <span style=\"color: #008000; text-decoration-color: #008000\">'cls.predictions.transform.dense.bias'</span>, <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> <span style=\"color: #008000; text-decoration-color: #008000\">'cls.seq_relationship.bias'</span><span style=\"font-weight: bold\">]</span> <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> - This IS expected if you are initializing <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> BertModel from the checkpoint of a model trained <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> on another task or with another architecture <span style=\"font-weight: bold\">(</span>e.g. <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> initializing a BertForSequenceClassification model <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> from a BertForPreTraining model<span style=\"font-weight: bold\">)</span>. <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> - This IS NOT expected if you are initializing <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> BertModel from the checkpoint of a model that you <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> expect to be exactly identical <span style=\"font-weight: bold\">(</span>initializing a <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> BertForSequenceClassification model from a <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> BertForSequenceClassification model<span style=\"font-weight: bold\">)</span>. <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m[10:48:21]\u001b[0m\u001b[2;36m \u001b[0m\u001b[31mWARNING \u001b[0m Some weights of the model checkpoint at \u001b]8;id=387614;file://../../fastNLP/transformers/torch/modeling_utils.py\u001b\\\u001b[2mmodeling_utils.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=648168;file://../../fastNLP/transformers/torch/modeling_utils.py#1490\u001b\\\u001b[2m1490\u001b[0m\u001b]8;;\u001b\\\n",
"\u001b[2;36m \u001b[0m bert-base-uncased were not used when initializing \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m BertModel: \u001b[1m[\u001b[0m\u001b[32m'cls.predictions.bias'\u001b[0m, \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m \u001b[32m'cls.predictions.transform.LayerNorm.weight'\u001b[0m, \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m \u001b[32m'cls.seq_relationship.weight'\u001b[0m, \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m \u001b[32m'cls.predictions.decoder.weight'\u001b[0m, \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m \u001b[32m'cls.predictions.transform.dense.weight'\u001b[0m, \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m \u001b[32m'cls.predictions.transform.LayerNorm.bias'\u001b[0m, \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m \u001b[32m'cls.predictions.transform.dense.bias'\u001b[0m, \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m \u001b[32m'cls.seq_relationship.bias'\u001b[0m\u001b[1m]\u001b[0m \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m - This IS expected if you are initializing \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m BertModel from the checkpoint of a model trained \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m on another task or with another architecture \u001b[1m(\u001b[0me.g. \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m initializing a BertForSequenceClassification model \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m from a BertForPreTraining model\u001b[1m)\u001b[0m. \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m - This IS NOT expected if you are initializing \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m BertModel from the checkpoint of a model that you \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m expect to be exactly identical \u001b[1m(\u001b[0minitializing a \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m BertForSequenceClassification model from a \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m BertForSequenceClassification model\u001b[1m)\u001b[0m. \u001b[2m \u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> All the weights of BertModel were initialized from <a href=\"file://../../fastNLP/transformers/torch/modeling_utils.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">modeling_utils.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file://../../fastNLP/transformers/torch/modeling_utils.py#1507\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1507</span></a>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> the model checkpoint at bert-base-uncased. <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> If your task is similar to the task the model of <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> the checkpoint was trained on, you can already use <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> BertModel for predictions without further <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> training. <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m All the weights of BertModel were initialized from \u001b]8;id=544687;file://../../fastNLP/transformers/torch/modeling_utils.py\u001b\\\u001b[2mmodeling_utils.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=934505;file://../../fastNLP/transformers/torch/modeling_utils.py#1507\u001b\\\u001b[2m1507\u001b[0m\u001b]8;;\u001b\\\n",
"\u001b[2;36m \u001b[0m the model checkpoint at bert-base-uncased. \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m If your task is similar to the task the model of \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m the checkpoint was trained on, you can already use \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m BertModel for predictions without further \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m training. \u001b[2m \u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import torch\n",
"from torch import nn\n",
"from torch.nn.utils.rnn import pad_sequence\n",
"from fastNLP.transformers.torch import BertModel\n",
"from fastNLP import seq_len_to_mask\n",
"import torch.nn.functional as F\n",
"\n",
"\n",
"class BertNER(nn.Module):\n",
"    def __init__(self, model_name, num_class, tag_vocab=None):\n",
"        super().__init__()\n",
"        self.bert = BertModel.from_pretrained(model_name)\n",
"        self.mlp = nn.Sequential(nn.Linear(self.bert.config.hidden_size, self.bert.config.hidden_size),\n",
"                                 nn.Dropout(0.3),\n",
"                                 nn.Linear(self.bert.config.hidden_size, num_class))\n",
"        self.tag_vocab = tag_vocab  # tag_vocab is passed in here to demonstrate constrained_decode\n",
"        if tag_vocab is not None:\n",
"            self._init_constrained_transition()\n",
"\n",
"    def forward(self, input_ids, input_len, first):\n",
"        attention_mask = seq_len_to_mask(input_len)\n",
"        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)\n",
"        last_hidden_state = outputs.last_hidden_state\n",
"        first = first.unsqueeze(-1).repeat(1, 1, last_hidden_state.size(-1))\n",
"        first_bpe_state = last_hidden_state.gather(dim=1, index=first)\n",
"        first_bpe_state = first_bpe_state[:, 1:-1]  # drop the cls and sep positions\n",
"\n",
"        pred = self.mlp(first_bpe_state)\n",
"        return {'pred': pred}\n",
"\n",
"    def train_step(self, input_ids, input_len, first, target):\n",
"        pred = self(input_ids, input_len, first)['pred']\n",
"        loss = F.cross_entropy(pred.transpose(1, 2), target)\n",
"        return {'loss': loss}\n",
"\n",
"    def evaluate_step(self, input_ids, input_len, first):\n",
"        pred = self(input_ids, input_len, first)['pred'].argmax(dim=-1)\n",
"        return {'pred': pred}\n",
"\n",
"    def constrained_decode(self, input_ids, input_len, first, first_len):\n",
"        # At inference time this method guarantees that a decoded tag never conflicts with the previous one\n",
"        # (e.g. B-person will never be followed by I-Location). This could also be implemented inside a Metric;\n",
"        # it lives in the model here to make it easy to demonstrate using different evaluation functions in fastNLP.\n",
"        pred = self(input_ids, input_len, first)['pred']\n",
"        cons_pred = []\n",
"        for _pred, _len in zip(pred, first_len):\n",
"            _pred = _pred[:_len]\n",
"            tags = [_pred[0].argmax(dim=-1).item()]  # ignore the case where the first position itself is illegal\n",
"            for i in range(1, _len):\n",
"                tags.append((_pred[i] + self.transition[tags[-1]]).argmax().item())\n",
"            cons_pred.append(torch.LongTensor(tags))\n",
"        cons_pred = pad_sequence(cons_pred, batch_first=True)\n",
"        return {'pred': cons_pred}\n",
"\n",
"    def _init_constrained_transition(self):\n",
"        from fastNLP.modules.torch import allowed_transitions\n",
"        allowed_trans = allowed_transitions(self.tag_vocab)\n",
"        transition = torch.ones((len(self.tag_vocab), len(self.tag_vocab))) * -100000.0\n",
"        for s, e in allowed_trans:\n",
"            transition[s, e] = 0\n",
"        self.register_buffer('transition', transition)\n",
"\n",
"model = BertNER('bert-base-uncased', len(data_bundle.get_vocab('target')), data_bundle.get_vocab('target'))"
]
},
{
"cell_type": "markdown",
"id": "5aeee1e9",
"metadata": {},
"source": [
"### 5. Using the Trainer\n",
"The fastNLP Trainer is the component that trains the model."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "f4250f0b",
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[10:49:22] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> Running evaluator sanity check for <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span> batches. <a href=\"file://../../fastNLP/core/controllers/trainer.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">trainer.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file://../../fastNLP/core/controllers/trainer.py#661\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">661</span></a>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m[10:49:22]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Running evaluator sanity check for \u001b[1;36m2\u001b[0m batches. \u001b]8;id=246773;file://../../fastNLP/core/controllers/trainer.py\u001b\\\u001b[2mtrainer.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=639347;file://../../fastNLP/core/controllers/trainer.py#661\u001b\\\u001b[2m661\u001b[0m\u001b]8;;\u001b\\\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
],
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
"</pre>\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #00d75f; text-decoration-color: #00d75f\">+++++++++++++++++++++++++++++ </span><span style=\"font-weight: bold\">Eval. results on Epoch:</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span><span style=\"font-weight: bold\">, Batch:</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0</span><span style=\"color: #00d75f; text-decoration-color: #00d75f\"> +++++++++++++++++++++++++++++</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[38;5;41m+++++++++++++++++++++++++++++ \u001b[0m\u001b[1mEval. results on Epoch:\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m, Batch:\u001b[0m\u001b[1;36m0\u001b[0m\u001b[38;5;41m +++++++++++++++++++++++++++++\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">{</span>\n",
" <span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">\"f#f\"</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.402447</span>,\n",
" <span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">\"pre#f\"</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.447906</span>,\n",
" <span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">\"rec#f\"</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.365365</span>\n",
"<span style=\"font-weight: bold\">}</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m{\u001b[0m\n",
" \u001b[1;34m\"f#f\"\u001b[0m: \u001b[1;36m0.402447\u001b[0m,\n",
" \u001b[1;34m\"pre#f\"\u001b[0m: \u001b[1;36m0.447906\u001b[0m,\n",
" \u001b[1;34m\"rec#f\"\u001b[0m: \u001b[1;36m0.365365\u001b[0m\n",
"\u001b[1m}\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[10:51:15] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> The best performance for monitor f#<span style=\"color: #00ff00; text-decoration-color: #00ff00; font-weight: bold\">f:0</span>.<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">402447</span> was <a href=\"file://../../fastNLP/core/callbacks/progress_callback.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">progress_callback.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file://../../fastNLP/core/callbacks/progress_callback.py#37\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">37</span></a>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> achieved in Epoch:<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>, Global Batch:<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">625</span>. The <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> evaluation result: <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> <span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'f#f'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.402447</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'pre#f'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.447906</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'rec#f'</span>: <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.365365</span><span style=\"font-weight: bold\">}</span> <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m[10:51:15]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m The best performance for monitor f#\u001b[1;92mf:0\u001b[0m.\u001b[1;36m402447\u001b[0m was \u001b]8;id=192029;file://../../fastNLP/core/callbacks/progress_callback.py\u001b\\\u001b[2mprogress_callback.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=994998;file://../../fastNLP/core/callbacks/progress_callback.py#37\u001b\\\u001b[2m37\u001b[0m\u001b]8;;\u001b\\\n",
"\u001b[2;36m \u001b[0m achieved in Epoch:\u001b[1;36m1\u001b[0m, Global Batch:\u001b[1;36m625\u001b[0m. The \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m evaluation result: \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m \u001b[1m{\u001b[0m\u001b[32m'f#f'\u001b[0m: \u001b[1;36m0.402447\u001b[0m, \u001b[32m'pre#f'\u001b[0m: \u001b[1;36m0.447906\u001b[0m, \u001b[32m'rec#f'\u001b[0m: \u001b[2m \u001b[0m\n",
"\u001b[2;36m \u001b[0m \u001b[1;36m0.365365\u001b[0m\u001b[1m}\u001b[0m \u001b[2m \u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
],
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
"</pre>\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> Loading best model from buffer with f#f: <a href=\"file://../../fastNLP/core/callbacks/load_best_model_callback.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">load_best_model_callback.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file://../../fastNLP/core/callbacks/load_best_model_callback.py#115\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">115</span></a>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.402447</span><span style=\"color: #808000; text-decoration-color: #808000\">...</span> <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Loading best model from buffer with f#f: \u001b]8;id=654516;file://../../fastNLP/core/callbacks/load_best_model_callback.py\u001b\\\u001b[2mload_best_model_callback.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=96586;file://../../fastNLP/core/callbacks/load_best_model_callback.py#115\u001b\\\u001b[2m115\u001b[0m\u001b]8;;\u001b\\\n",
"\u001b[2;36m \u001b[0m \u001b[1;36m0.402447\u001b[0m\u001b[33m...\u001b[0m \u001b[2m \u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from torch import optim\n",
"from fastNLP import Trainer, LoadBestModelCallback, TorchWarmupCallback\n",
"from fastNLP import SpanFPreRecMetric\n",
"\n",
"optimizer = optim.AdamW(model.parameters(), lr=2e-5)\n",
"callbacks = [\n",
"    LoadBestModelCallback(), # reload the weights of the best-performing model once training ends\n",
" TorchWarmupCallback()\n",
"] \n",
"\n",
"trainer = Trainer(model=model, train_dataloader=dls['train'], optimizers=optimizer, \n",
" evaluate_dataloaders=dls['dev'], \n",
" metrics={'f': SpanFPreRecMetric(tag_vocab=data_bundle.get_vocab('target'))}, \n",
" n_epochs=1, callbacks=callbacks, \n",
"                  # during evaluation, map the dataloader's first_len field to seq_len, since SpanFPreRecMetric.update expects a parameter named seq_len\n",
" evaluate_input_mapping={'first_len': 'seq_len'}, overfit_batches=0,\n",
"                  device=0, monitor='f#f', fp16=False) # with fp16=True, training runs in float16\n",
"trainer.run()"
]
},
{
"cell_type": "markdown",
"id": "c600a450",
"metadata": {},
"source": [
"### Using the Evaluator\n",
"The fastNLP object for evaluating a model on data."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "1b19f0ba",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
],
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'f#f'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.390326</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'pre#f'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.414741</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'rec#f'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.368626</span><span style=\"font-weight: bold\">}</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m{\u001b[0m\u001b[32m'f#f'\u001b[0m: \u001b[1;36m0.390326\u001b[0m, \u001b[32m'pre#f'\u001b[0m: \u001b[1;36m0.414741\u001b[0m, \u001b[32m'rec#f'\u001b[0m: \u001b[1;36m0.368626\u001b[0m\u001b[1m}\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"{'f#f': 0.390326, 'pre#f': 0.414741, 'rec#f': 0.368626}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from fastNLP import Evaluator\n",
"from fastNLP import SpanFPreRecMetric\n",
"\n",
"evaluator = Evaluator(model=model, dataloaders=dls['test'], \n",
" metrics={'f': SpanFPreRecMetric(tag_vocab=data_bundle.get_vocab('target'))}, \n",
" evaluate_input_mapping={'first_len': 'seq_len'}, \n",
" device=0)\n",
"evaluator.run()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "52f87770",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "f723fe399df34917875ad74c2542508c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# To evaluate the performance of constrained decoding, pass evaluate_fn to specify which method to call\n",
"def input_mapping(x):\n",
" x['seq_len'] = x['first_len']\n",
" return x\n",
"evaluator = Evaluator(model=model, dataloaders=dls['test'], device=0,\n",
" metrics={'f': SpanFPreRecMetric(tag_vocab=data_bundle.get_vocab('target'))},\n",
" evaluate_fn='constrained_decode',\n",
"                      # Renaming first_len to seq_len outright would leave constrained_decode without its first_len argument,\n",
"                      # so the mapping function above adds seq_len while keeping first_len in place.\n",
" evaluate_input_mapping=input_mapping)\n",
"evaluator.run()"
]
},
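{
"cell_type": "markdown",
"metadata": {},
"source": [
"An `input_mapping` is plain Python: a callable that receives the batch dict and returns the dict the downstream function should see. A minimal sketch, independent of fastNLP (the `first_len`/`seq_len` keys are just the ones this tutorial uses):\n",
"\n",
"```python\n",
"def input_mapping(x):\n",
"    # add seq_len while keeping first_len, so a consumer needing either key is satisfied\n",
"    x['seq_len'] = x['first_len']\n",
"    return x\n",
"\n",
"batch = {'first_len': [4, 7]}\n",
"mapped = input_mapping(batch)\n",
"assert mapped['seq_len'] == [4, 7] and 'first_len' in mapped\n",
"```\n",
"\n",
"A dict-style mapping such as `{'first_len': 'seq_len'}` renames the key instead, which is why the function form is used when the original key must survive."
]
},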
{
"cell_type": "code",
"execution_count": null,
"id": "419e718b",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

+ 1352
- 0
docs/source/tutorials/fastnlp_tutorial_0.ipynb
File diff suppressed because it is too large


+ 1333
- 0
docs/source/tutorials/fastnlp_tutorial_1.ipynb
File diff suppressed because it is too large


+ 884
- 0
docs/source/tutorials/fastnlp_tutorial_2.ipynb

@@ -0,0 +1,884 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# T2. Basic usage of DataBundle and tokenizer\n",
"\n",
"&emsp; 1 &ensp; Extending dataset in fastNLP\n",
"\n",
"&emsp; &emsp; 1.1 &ensp; The concept and usage of DataBundle\n",
"\n",
"&emsp; 2 &ensp; Tokenizers in fastNLP\n",
" \n",
"&emsp; &emsp; 2.1 &ensp; The concept of PreTrainedTokenizer\n",
"\n",
"&emsp; &emsp; 2.2 &ensp; Basic usage of BertTokenizer\n",
"<!-- \n",
"&emsp; &emsp; 2.3 &ensp; Supplement: using GloVe embeddings -->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Extending dataset in fastNLP\n",
"\n",
"### 1.1 The concept and usage of DataBundle\n",
"\n",
"In `fastNLP 1.0`, between the data-loading module `DataLoader` and the dataset module `DataSet` there is\n",
"\n",
"&emsp; an intermediate module, the **data bundle, DataBundle**, which can be imported from `fastNLP.io`\n",
"\n",
"In `fastNLP 1.0`, **a DataBundle packs several dataset instances together with several vocabulary instances**\n",
"\n",
"&emsp; stored in its `datasets` and `vocabs` attributes respectively, so before looking at `databundle`\n",
"\n",
"we first **review dataset and vocabulary**. **Can you roughly tell what the code below does?**\n",
"\n",
"<!-- Background: `NG20`, short for [`News Group 20`](http://qwone.com/~jason/20Newsgroups/), is a news text-classification dataset with 20 categories\n",
"\n",
"&emsp; It consists of a training split `'ng20_train.csv'` and a test split `'ng20_test.csv'`; each record\n",
"\n",
"&emsp; has a `'label'` field and a `'text'` field; `sample(frac=1)[:6]` shuffles the data and reads the first 6 records -->"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
"</pre>\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/6 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------------------------------+----------+\n",
"| text | label |\n",
"+------------------------------------------+----------+\n",
"| ['a', 'series', 'of', 'escapades', 'd... | negative |\n",
"| ['this', 'quiet', ',', 'introspective... | positive |\n",
"| ['even', 'fans', 'of', 'ismail', 'mer... | negative |\n",
"| ['the', 'importance', 'of', 'being', ... | neutral |\n",
"+------------------------------------------+----------+\n",
"+------------------------------------------+----------+\n",
"| text | label |\n",
"+------------------------------------------+----------+\n",
"| ['a', 'comedy-drama', 'of', 'nearly',... | positive |\n",
"| ['a', 'positively', 'thrilling', 'com... | neutral |\n",
"+------------------------------------------+----------+\n",
"{'<pad>': 0, '<unk>': 1, 'negative': 2, 'positive': 3, 'neutral': 4}\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"from fastNLP import DataSet\n",
"from fastNLP import Vocabulary\n",
"from fastNLP.io import DataBundle\n",
"\n",
"datasets = DataSet.from_pandas(pd.read_csv('./data/test4dataset.tsv', sep='\\t'))\n",
"datasets.rename_field('Sentence', 'text')\n",
"datasets.rename_field('Sentiment', 'label')\n",
"datasets.apply_more(lambda ins:{'label': ins['label'].lower(), \n",
" 'text': ins['text'].lower().split()},\n",
" progress_bar='tqdm')\n",
"datasets.delete_field('SentenceId')\n",
"train_ds, test_ds = datasets.split(ratio=0.7)\n",
"datasets = {'train': train_ds, 'test': test_ds}\n",
"print(datasets['train'])\n",
"print(datasets['test'])\n",
"\n",
"vocabs = {}\n",
"vocabs['label'] = Vocabulary().from_dataset(datasets['train'].concat(datasets['test'], inplace=False), field_name='label')\n",
"vocabs['text'] = Vocabulary().from_dataset(datasets['train'].concat(datasets['test'], inplace=False), field_name='text')\n",
"print(vocabs['label'].word2idx)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!-- The code above reads ten random training and ten random test records from the `NG20` dataset, lowercases the labels, and tokenizes the text\n",
" -->\n",
"The code above splits the 6 records of `test4dataset` into a 4-record training set (`int(6*0.7) = 4`) and a 2-record test set\n",
"\n",
"&emsp; &emsp; renames the relevant fields, drops the id field, lowercases the labels, and tokenizes the text\n",
"\n",
"&emsp; It then concatenates the two splits with `concat`; note `inplace=False`, which produces a temporary combined dataset\n",
"\n",
"&emsp; from which `from_dataset` extracts the vocabularies, in preparation for mapping each word to its index\n",
"\n",
"This yields the **dataset dict datasets** (**the train and test splits**) and the **vocabulary dict vocabs** (**one entry per field**)\n",
"\n",
"&emsp; With these, a `databundle` can be initialized; `print` reveals its overall structure, as shown below"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In total 2 datasets:\n",
"\ttrain has 4 instances.\n",
"\ttest has 2 instances.\n",
"In total 2 vocabs:\n",
"\tlabel has 5 entries.\n",
"\ttext has 96 entries.\n",
"\n",
"['train', 'test']\n",
"['label', 'text']\n"
]
}
],
"source": [
"data_bundle = DataBundle(datasets=datasets, vocabs=vocabs)\n",
"print(data_bundle)\n",
"print(data_bundle.get_dataset_names())\n",
"print(data_bundle.get_vocab_names())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition, the `data_bundle` properties `num_dataset` and `num_vocab` return the number of datasets and vocabularies\n",
"\n",
"&emsp; while its `iter_datasets` and `iter_vocabs` methods iterate over them"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In total 2 datasets:\n",
"\ttrain has 4 instances.\n",
"\ttest has 2 instances.\n",
"In total 2 vocabs:\n",
"\tlabel has 5 entries.\n",
"\ttext has 96 entries.\n"
]
}
],
"source": [
"print(\"In total %d datasets:\" % data_bundle.num_dataset)\n",
"for name, dataset in data_bundle.iter_datasets():\n",
"    print(\"\\t%s has %d instances.\" % (name, len(dataset)))\n",
"print(\"In total %d vocabs:\" % data_bundle.num_vocab)\n",
"for name, vocab in data_bundle.iter_vocabs():\n",
" print(\"\\t%s has %d entries.\" % (name, len(vocab)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data bundle `databundle` also has the four `apply` functions familiar from `dataset`, namely\n",
"\n",
"&emsp; `apply`, `apply_field`, `apply_field_more`, and `apply_more`\n",
"\n",
"&emsp; which preprocess its datasets; `apply_more` is demonstrated below, and the others work analogously\n",
"\n",
"In addition, the `get_dataset` function looks a dataset up by its name `name`\n",
"\n",
"&emsp; and the `get_vocab` function looks a vocabulary up by its field name `field_name`"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/4 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------------------+----------+-----+\n",
"| text | label | len |\n",
"+------------------------------+----------+-----+\n",
"| ['a', 'series', 'of', 'es... | negative | 37 |\n",
"| ['this', 'quiet', ',', 'i... | positive | 11 |\n",
"| ['even', 'fans', 'of', 'i... | negative | 21 |\n",
"| ['the', 'importance', 'of... | neutral | 20 |\n",
"+------------------------------+----------+-----+\n"
]
}
],
"source": [
"data_bundle.apply_more(lambda ins:{'len': len(ins['text'])}, progress_bar='tqdm')\n",
"print(data_bundle.get_dataset('train'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Tokenizers in fastNLP\n",
"\n",
"### 2.1 The concept of PreTrainedTokenizer\n",
"<!-- \n",
"*What word embeddings are and why they are no longer used directly*\n",
"\n",
"*What byte-pair encoding is; the introduction of BPE*\n",
"\n",
"*Taking BERT as an example, the introduction of WordPiece*\n",
" -->\n",
"In `fastNLP 1.0`, **the PreTrainedTokenizer module is used to tokenize the words of a dataset and map them to ids**\n",
"\n",
"&emsp; Note that downloading and importing `PreTrainedTokenizer` **requires the transformers package to be installed**\n",
"\n",
"&emsp; because the `fastNLP 1.0` implementation of `PreTrainedTokenizer` is built on the `Huggingface Transformers` library\n",
"\n",
"**Huggingface Transformers is an open-source library of pretrained language models built on the transformer architecture**\n",
"\n",
"&emsp; It covers many classic `transformer`-based pretrained models such as `BERT`, `BART`, `RoBERTa`, `GPT2`, and `CPT`\n",
"\n",
"&emsp; For more, see the `Huggingface Transformers` [paper](https://arxiv.org/pdf/1910.03771.pdf), [documentation](https://huggingface.co/transformers/), and [code repository](https://github.com/huggingface/transformers)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Basic usage of BertTokenizer\n",
"\n",
"In `fastNLP 1.0`, `PreTrainedTokenizer` serves as the base class of several subclasses that tokenize for `BERT` and other models\n",
"\n",
"&emsp; This section takes the `BertTokenizer` module as an example of how a `PreTrainedTokenizer` is used\n",
"\n",
"**Initializing a BertTokenizer takes two steps, importing the module and loading the weights**: first import\n",
"\n",
"&emsp; `BertTokenizer` from `fastNLP.transformers.torch`, then **download a tokenizer of the requested type via from_pretrained**\n",
"\n",
"&emsp; Here, **'bert-base-uncased' selects the pretrained BERT variant the tokenizer matches**: case-insensitive,\n",
"\n",
"&emsp; &emsp; **L=12 layers**, **hidden size H=768**, **A=12 attention heads**, **110M parameters in total**\n",
"\n",
"&emsp; The weights are downloaded automatically into `~/.cache/huggingface/transformers` under the home directory"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"from fastNLP.transformers.torch import BertTokenizer\n",
"\n",
"tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The attributes `vocab_size` and `vocab_files_names` give the size of the `BertTokenizer` vocabulary and its backing file\n",
"\n",
"&emsp; The attribute `vocab` exposes the pretrained vocabulary itself (too large to print here)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"30522 {'vocab_file': 'vocab.txt'}\n"
]
}
],
"source": [
"print(tokenizer.vocab_size, tokenizer.vocab_files_names)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The attributes `all_special_tokens` and `special_tokens_map` **list the special tokens built into BertTokenizer**\n",
"\n",
"&emsp; namely the **unknown token '[UNK]'**, the **separator '[SEP]'**, the **padding token '[PAD]'**, the **classification token '[CLS]'**, and the **mask '[MASK]'**\n",
"\n",
"The attribute `all_special_ids` **gives the vocabulary ids of these special tokens**; the same information\n",
"\n",
"&emsp; is also available from attributes such as `pad_token` (value `'[PAD]'`) and `pad_token_id` (value `0`)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"pad_token [PAD] 0\n",
"unk_token [UNK] 100\n",
"cls_token [CLS] 101\n",
"sep_token [SEP] 102\n",
"msk_token [MASK] 103\n",
"all_tokens ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'] [100, 102, 0, 101, 103]\n",
"{'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}\n"
]
}
],
"source": [
"print('pad_token', tokenizer.pad_token, tokenizer.pad_token_id) \n",
"print('unk_token', tokenizer.unk_token, tokenizer.unk_token_id) \n",
"print('cls_token', tokenizer.cls_token, tokenizer.cls_token_id) \n",
"print('sep_token', tokenizer.sep_token, tokenizer.sep_token_id)\n",
"print('msk_token', tokenizer.mask_token, tokenizer.mask_token_id)\n",
"print('all_tokens', tokenizer.all_special_tokens, tokenizer.all_special_ids)\n",
"print(tokenizer.special_tokens_map)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other special tokens can be set as well, e.g. a start token `[BOS]` and an end token `[EOS]`, after which the special-token list changes accordingly\n",
"\n",
"&emsp; *Open question: how can tokens beyond these two be added, and how can they be given ids other than that of [UNK]?*"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"bos_token [BOS] 100\n",
"eos_token [EOS] 100\n",
"all_tokens ['[BOS]', '[EOS]', '[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'] [100, 100, 100, 102, 0, 101, 103]\n",
"{'bos_token': '[BOS]', 'eos_token': '[EOS]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}\n"
]
}
],
"source": [
"tokenizer.bos_token = '[BOS]'\n",
"tokenizer.eos_token = '[EOS]'\n",
"# tokenizer.bos_token_id = 104\n",
"# tokenizer.eos_token_id = 105\n",
"print('bos_token', tokenizer.bos_token, tokenizer.bos_token_id)\n",
"print('eos_token', tokenizer.eos_token, tokenizer.eos_token_id)\n",
"print('all_tokens', tokenizer.all_special_tokens, tokenizer.all_special_ids)\n",
"print(tokenizer.special_tokens_map)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In `BertTokenizer`, **the tokenize and convert_tokens_to_string functions convert between text and a token list**\n",
"\n",
"&emsp; while **convert_tokens_to_ids and convert_ids_to_tokens convert between tokens and token ids**\n",
"\n",
"&emsp; The effect of these four functions is shown below; note how subword tokenization differs from conventional word splitting, e.g. `'##cap'`"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262, 3351, 2008, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036, 2204, 2005, 1996, 25957, 4063, 1010, 2070, 1997, 2029, 5681, 2572, 25581, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997, 1037, 2466, 1012]\n",
"['a', 'series', 'of', 'es', '##cap', '##ades', 'demonstrating', 'the', 'ada', '##ge', 'that', 'what', 'is', 'good', 'for', 'the', 'goose', 'is', 'also', 'good', 'for', 'the', 'gan', '##der', ',', 'some', 'of', 'which', 'occasionally', 'am', '##uses', 'but', 'none', 'of', 'which', 'amounts', 'to', 'much', 'of', 'a', 'story', '.']\n",
"a series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .\n"
]
}
],
"source": [
"text = \"a series of escapades demonstrating the adage that what is \" \\\n",
" \"good for the goose is also good for the gander , some of which \" \\\n",
" \"occasionally amuses but none of which amounts to much of a story .\" \n",
"tks = ['a', 'series', 'of', 'es', '##cap', '##ades', 'demonstrating', 'the', \n",
" 'ada', '##ge', 'that', 'what', 'is', 'good', 'for', 'the', 'goose', \n",
" 'is', 'also', 'good', 'for', 'the', 'gan', '##der', ',', 'some', 'of', \n",
" 'which', 'occasionally', 'am', '##uses', 'but', 'none', 'of', 'which', \n",
" 'amounts', 'to', 'much', 'of', 'a', 'story', '.']\n",
"ids = [ 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262, 3351, \n",
" 2008, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036, 2204,\n",
" 2005, 1996, 25957, 4063, 1010, 2070, 1997, 2029, 5681, 2572,\n",
" 25581, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997, 1037,\n",
" 2466, 1012]\n",
"\n",
"tokens = tokenizer.tokenize(text)\n",
"print(tokenizer.convert_tokens_to_ids(tokens))\n",
"\n",
"ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"print(tokenizer.convert_ids_to_tokens(ids))\n",
"\n",
"print(tokenizer.convert_tokens_to_string(tokens))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`BertTokenizer` provides two more functions, **encode and decode**, which **convert directly between**\n",
"\n",
"&emsp; **a text string and a list of token ids**; following `BERT`'s convention, encoding **adds [CLS] and [SEP] at the start and end of the sentence**"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[101, 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262, 3351, 2008, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036, 2204, 2005, 1996, 25957, 4063, 1010, 2070, 1997, 2029, 5681, 2572, 25581, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997, 1037, 2466, 1012, 102]\n",
"[CLS] a series of escapades demonstrating the adage that what is good for the goose is also good for the gander, some of which occasionally amuses but none of which amounts to much of a story. [SEP]\n"
]
}
],
"source": [
"enc = tokenizer.encode(text)\n",
"print(tokenizer.encode(text))\n",
"dec = tokenizer.decode(enc)\n",
"print(tokenizer.decode(enc))"
]
},
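{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this tokenizer, `encode` behaves like tokenizing, mapping to ids, and adding the two special tokens; a sketch of the equivalence (assuming the `tokenizer` and `text` defined above):\n",
"\n",
"```python\n",
"ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))\n",
"manual = [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]\n",
"assert manual == tokenizer.encode(text)\n",
"```"
]
},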
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On top of `encode` there is `encode_plus`, the `BertTokenizer` function most commonly used in data preprocessing\n",
"\n",
"&emsp; **encode_plus accepts every parameter that encode does**; **encode returns a list of token ids**, **while encode_plus returns a dict**\n",
"\n",
"In the return value of `encode_plus`, the `input_ids` field holds the token ids; the other two fields are explained in detail later\n",
"\n",
"&emsp; **the token_type_ids field is covered in the text_pairs example**, **the attention_mask field in the batch_text example**\n",
"\n",
"Among the parameters of `encode_plus`, `add_special_tokens` controls whether `BERT`'s special tokens are inserted\n",
"\n",
"&emsp; `max_length` caps the sentence length (special tokens included) and is applied when `truncation=True`\n",
"\n",
"&emsp; `return_attention_mask` controls whether the returned dict includes an `attention_mask` field; all of this is shown below"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'input_ids': [101, 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262, 3351, 2008, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036, 2204, 2005, 1996, 25957, 4063, 1010, 2070, 1997, 2029, 5681, 2572, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}\n"
]
}
],
"source": [
"text = \"a series of escapades demonstrating the adage that what is good for the goose is also good for \"\\\n",
" \"the gander , some of which occasionally amuses but none of which amounts to much of a story .\" \n",
"\n",
"encoded = tokenizer.encode_plus(text=text, add_special_tokens=True, max_length=32, \n",
" truncation=True, return_attention_mask=True)\n",
"print(encoded)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On top of `encode_plus` there is `batch_encode_plus` (and likewise `batch_decode` on top of `decode`)\n",
"\n",
"&emsp; Their parameters are similar; **batch_encode_plus handles either a batch of texts batch_text** **or a batch of sentence pairs text_pairs**\n",
"\n",
"In the batch-of-texts example `batch_text`, look at the `attention_mask` field of the dict `batch_encode_plus` returns\n",
"\n",
"&emsp; **attention_mask marks with 0/1 whether each position of the token sequence is padding**, so it can serve as a self-attention mask"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'input_ids': [[101, 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262, 3351, 2008, 102, 0, 0], [101, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036, 2204, 2005, 1996, 25957, 4063, 102], [101, 2070, 1997, 2029, 5681, 2572, 25581, 102, 0, 0, 0, 0, 0, 0, 0], [101, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997, 1037, 2466, 102, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]}\n"
]
}
],
"source": [
"batch_text = [\"a series of escapades demonstrating the adage that\",\n",
" \"what is good for the goose is also good for the gander\",\n",
" \"some of which occasionally amuses\",\n",
" \"but none of which amounts to much of a story\" ]\n",
"\n",
"encoded = tokenizer.batch_encode_plus(batch_text_or_text_pairs=batch_text, padding=True,\n",
" add_special_tokens=True, max_length=16, truncation=True, \n",
" return_attention_mask=True)\n",
"print(encoded)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the example with sentence pairs `text_pairs`, look instead at the `token_type_ids` field of the dict `batch_encode_plus` returns\n",
"\n",
"&emsp; **token_type_ids marks with 0/1 which sentence of the pair each position belongs to**; the two sentences are separated by [SEP]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'input_ids': [[101, 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262, 3351, 2008, 102, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036, 2204, 2005, 1996, 25957, 4063, 102], [101, 2070, 1997, 2029, 5681, 2572, 25581, 102, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997, 1037, 2466, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}\n"
]
}
],
"source": [
"text_pairs = [(\"a series of escapades demonstrating the adage that\",\n",
" \"what is good for the goose is also good for the gander\"),\n",
" (\"some of which occasionally amuses\",\n",
" \"but none of which amounts to much of a story\")]\n",
"\n",
"encoded = tokenizer.batch_encode_plus(batch_text_or_text_pairs=text_pairs, padding=True,\n",
" add_special_tokens=True, max_length=32, truncation=True, \n",
" return_attention_mask=True)\n",
"print(encoded)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Back to `encode_plus`: the next example **builds an encode function with the standard-library functools.partial**\n",
"\n",
"&emsp; and **uses it to preprocess the databundle**. Because `tokenizer.encode_plus` returns a dict\n",
"\n",
"&emsp; while reading a single field, `apply_field_more` is the right method here; its results are merged into the `databundle` automatically, as follows"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"functools.partial(<bound method PreTrainedTokenizerBase.encode_plus of PreTrainedTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=False, padding_side='right', special_tokens={'bos_token': '[BOS]', 'eos_token': '[EOS]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})>, max_length=32, truncation=True, return_attention_mask=True)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/4 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------+----------+-----+------------------+--------------------+--------------------+\n",
"| text | label | len | input_ids | token_type_ids | attention_mask |\n",
"+------------------+----------+-----+------------------+--------------------+--------------------+\n",
"| ['a', 'series... | negative | 37 | [101, 1037, 2... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... |\n",
"| ['this', 'qui... | positive | 11 | [101, 2023, 4... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... |\n",
"| ['even', 'fan... | negative | 21 | [101, 2130, 4... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... |\n",
"| ['the', 'impo... | neutral | 20 | [101, 1996, 5... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... |\n",
"+------------------+----------+-----+------------------+--------------------+--------------------+\n"
]
}
],
"source": [
"from functools import partial\n",
"\n",
"encode = partial(tokenizer.encode_plus, max_length=32, truncation=True,\n",
" return_attention_mask=True)\n",
"print(encode)\n",
"\n",
"data_bundle.apply_field_more(encode, field_name='text', progress_bar='tqdm')\n",
"print(data_bundle.datasets['train'])"
]
},
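{
"cell_type": "markdown",
"metadata": {},
"source": [
"`functools.partial` itself does no tokenization: it simply pre-binds keyword arguments, and each later call supplies the rest. A minimal sketch with a stand-in function (not the real `encode_plus`):\n",
"\n",
"```python\n",
"from functools import partial\n",
"\n",
"def encode_plus(text, max_length=512, truncation=False):\n",
"    # stand-in: split into tokens and optionally trim to max_length\n",
"    tokens = text.split()\n",
"    return tokens[:max_length] if truncation else tokens\n",
"\n",
"encode = partial(encode_plus, max_length=3, truncation=True)\n",
"assert encode('a series of escapades') == ['a', 'series', 'of']\n",
"```"
]
},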
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"After `tokenizer` processing, the raw text has been replaced by lists of token ids. At this point, call the `databundle`\n",
"\n",
"&emsp; **set_pad function to align the databundle's padding id pad_val with the tokenizer's pad_token_id**\n",
"\n",
"&emsp; This also registers the `'input_ids'` field with the `collator` of each dataset (see `tutorial 3`)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{}\n",
"{}\n",
"{'input_ids': {'pad_val': 0, 'dtype': None, 'backend': 'auto', 'pad_fn': None}}\n",
"{'input_ids': {'pad_val': 0, 'dtype': None, 'backend': 'auto', 'pad_fn': None}}\n"
]
}
],
"source": [
"print(data_bundle.get_dataset('train').collator.input_fields)\n",
"print(data_bundle.get_dataset('test').collator.input_fields)\n",
"data_bundle.set_pad('input_ids', pad_val=tokenizer.pad_token_id)\n",
"print(data_bundle.get_dataset('train').collator.input_fields)\n",
"print(data_bundle.get_dataset('test').collator.input_fields)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, use the `from_dataset`, `index_dataset` and `iter_datasets` methods to encode the `'label'` field of every dataset\n",
"\n",
"&emsp; Then **use the set_ignore method** to **mark some databundle fields**, such as `'text'`, **so that they no longer appear when batches are built**"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----------------+----------+-----+----------------+--------------------+--------------------+--------+\n",
"| text | label | len | input_ids | token_type_ids | attention_mask | target |\n",
"+----------------+----------+-----+----------------+--------------------+--------------------+--------+\n",
"| ['a', 'seri... | negative | 37 | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 0 |\n",
"| ['this', 'q... | positive | 11 | [101, 2023,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 1 |\n",
"| ['even', 'f... | negative | 21 | [101, 2130,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 0 |\n",
"| ['the', 'im... | neutral | 20 | [101, 1996,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 2 |\n",
"+----------------+----------+-----+----------------+--------------------+--------------------+--------+\n"
]
}
],
"source": [
"target_vocab = Vocabulary(padding=None, unknown=None)\n",
"\n",
"target_vocab.from_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='label')\n",
"target_vocab.index_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='label',\n",
" new_field_name='target')\n",
"\n",
"data_bundle.set_ignore('text', 'len', 'label') \n",
"print(data_bundle.datasets['train'])"
]
},
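The `from_dataset` / `index_dataset` pair used above can be sketched framework-free: collect the distinct labels in first-seen order, then write integer ids back into a new field. The helper names below are illustrative, not fastNLP API:

```python
# A minimal, framework-free sketch of what Vocabulary.from_dataset and
# index_dataset do for the 'label' field: collect the distinct labels in
# first-seen order, assign each an integer id, then write the ids back
# into a new 'target' field of each row.
def build_label_vocab(rows, field_name='label'):
    word2idx = {}
    for row in rows:
        label = row[field_name]
        if label not in word2idx:          # first occurrence gets the next id
            word2idx[label] = len(word2idx)
    return word2idx

def index_labels(rows, word2idx, field_name='label', new_field_name='target'):
    for row in rows:
        row[new_field_name] = word2idx[row[field_name]]
    return rows

rows = [{'label': 'negative'}, {'label': 'positive'},
        {'label': 'negative'}, {'label': 'neutral'}]
vocab = build_label_vocab(rows)
index_labels(rows, vocab)
print(vocab)                                # {'negative': 0, 'positive': 1, 'neutral': 2}
print([row['target'] for row in rows])      # [0, 1, 0, 2]
```

With `padding=None, unknown=None` as in the cell above, the real `Vocabulary` behaves much like this: no reserved ids, labels numbered in order of appearance.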
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above is the complete preprocessing pipeline for input text, reading, tokenizing and serializing the data\n",
"\n",
"&emsp; with `dataset`, `vocabulary`, `databundle` and `tokenizer`; the code below recaps the whole process in one place\n",
"\n",
"```python\n",
"# First, load the pretrained BertTokenizer, here the 'bert-base-uncased' checkpoint\n",
"tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n",
"\n",
"# Next, load the data: build a dataset, split it into a dataset dict, then wrap it as a databundle\n",
"datasets = DataSet.from_pandas(pd.read_csv('./data/test4dataset.tsv', sep='\\t'))\n",
"train_ds, test_ds = datasets.split(ratio=0.7)\n",
"data_bundle = DataBundle(datasets={'train': train_ds, 'test': test_ds})\n",
"\n",
"# Then, tokenize and annotate the text with tokenizer.encode_plus, extending the bundle with the encoded fields\n",
"encode = partial(tokenizer.encode_plus, max_length=100, truncation=True,\n",
"                 return_attention_mask=True)\n",
"data_bundle.apply_field_more(encode, field_name='Sentence', progress_bar='tqdm')\n",
"\n",
"# With the 'Sentence' field encoded, turn the 'Sentiment' field into prediction targets\n",
"target_vocab = Vocabulary(padding=None, unknown=None)\n",
"target_vocab.from_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='Sentiment')\n",
"target_vocab.index_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='Sentiment',\n",
"                           new_field_name='target')\n",
"\n",
"# Finally, tidy up with a few other data_bundle methods\n",
"data_bundle.set_pad('input_ids', pad_val=tokenizer.pad_token_id)\n",
"data_bundle.set_ignore('SentenceId', 'Sentiment', 'Sentence') \n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"<!-- ### 2.3 Supplement: using GloVe embeddings\n",
"\n",
"How to use traditional GloVe word embeddings\n",
"\n",
"from utils import get_from_cache\n",
"\n",
"filepath = get_from_cache(\"http://download.fastnlp.top/embedding/glove.6B.50d.zip\") -->\n",
"\n",
"The upcoming `tutorial 3.` introduces the `dataloader` module of `fastNLP v1.0`, which involves the\n",
"\n",
"&emsp; `collator` module mentioned in this chapter, fastNLP multi-framework support, and the complete data loading pipeline. Stay tuned"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}

+ 621
- 0
docs/source/tutorials/fastnlp_tutorial_3.ipynb View File

@@ -0,0 +1,621 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "213d538c",
"metadata": {},
"source": [
"# T3. The internal structure and basic usage of dataloader\n",
"\n",
"&emsp; 1 &ensp; dataloader in fastNLP\n",
" \n",
"&emsp; &emsp; 1.1 &ensp; An overview of dataloader\n",
"\n",
"&emsp; &emsp; 1.2 &ensp; Creating a dataloader with helper functions\n",
"\n",
"&emsp; 2 &ensp; Extensions of dataloader in fastNLP\n",
"\n",
"&emsp; &emsp; 2.1 &ensp; The collator: concept and usage\n",
"\n",
"&emsp; &emsp; 2.2 &ensp; Combining fastNLP with the datasets library"
]
},
{
"cell_type": "markdown",
"id": "85857115",
"metadata": {},
"source": [
"## 1. dataloader in fastNLP\n",
"\n",
"### 1.1 An overview of dataloader\n",
"\n",
"One of the central goals in developing `fastNLP 1.0` is **compatibility with the current mainstream machine learning frameworks**, namely\n",
"\n",
"&emsp; **the widely used pytorch**, as well as **the domestic frameworks paddle, jittor and oneflow**, broadening the audience while backing home-grown frameworks\n",
"\n",
"Following a divide-and-conquer approach, the compatibility of `fastNLP 1.0` with `pytorch`, `paddle`, `jittor` and `oneflow` can be split into\n",
"\n",
"&emsp; &emsp; **four parts**: **data preprocessing**, **batch partitioning and padding**, **model training**, and **model evaluation**\n",
"\n",
"&emsp; For data preprocessing, `tutorial-1` has already introduced the usage of `dataset` and `vocabulary`\n",
"\n",
"&emsp; &emsp; and, combined with `tutorial-0`, it shows that **the preprocessing stage is essentially framework-agnostic**\n",
"\n",
"&emsp; &emsp; since the raw data formats read under different frameworks differ little and convert easily between one another\n",
"\n",
"Only where tensors and models are involved do the frameworks show their individual traits: **what pytorch and oneflow call tensor and nn.Module**\n",
"\n",
"&emsp; &emsp; **is called tensor and nn.Layer in paddle**, **and Var and Module in jittor**\n",
"\n",
"&emsp; &emsp; Hence **model training and model evaluation are the hard parts of compatibility**, covered in detail in `tutorial-5`\n",
"\n",
"&emsp; Batch handling, the transition from the framework-agnostic part of `fastNLP 1.0` toward the framework-specific part,\n",
"\n",
"&emsp; &emsp; is the job of the `dataloader` module, and the focus of this tutorial, `tutorial-3`\n",
"\n",
"**The responsibilities of the dataloader module** break down into three parts: **sampling and partitioning, padding and alignment, framework matching**\n",
"\n",
"&emsp; &emsp; First, fix the `batch` size and the sampling strategy; after partitioning, iterating yields the sequence of batches\n",
"\n",
"&emsp; &emsp; Second, for sequence data, the case fastNLP mainly targets, align the samples within each `batch` by padding\n",
"\n",
"&emsp; &emsp; Third, **the data format inside a batch must match the framework**, **while the batch structure stays uniform**, enabling the **parameter matching mechanism**\n",
"\n",
"&emsp; To this end, `fastNLP 1.0` provides **TorchDataLoader, PaddleDataLoader, JittorDataLoader and OneflowDataLoader**\n",
"\n",
"&emsp; &emsp; each targeting one framework, yet sharing similar parameter names, attributes and methods; the first two are roughly as tabulated below\n",
"\n",
"Name|Parameter|Attribute|Function|Notes\n",
"----|----|----|----|----|\n",
" `dataset` | √ | √ | the dataset the `dataloader` serves | |\n",
" `batch_size` | √ | √ | the `batch` size of the `dataloader` | default `16` |\n",
" `shuffle` | √ | √ | whether the `dataloader` shuffles the data | default `False` |\n",
" `collate_fn` | √ | √ | how the `dataloader` packs a `batch` | framework-dependent |\n",
" `sampler` | √ | √ | implementation of the `dataloader` `__len__` and `__iter__` | default `None` |\n",
" `batch_sampler` | √ | √ | implementation of the `dataloader` `__len__` and `__iter__` | default `None` |\n",
" `drop_last` | √ | √ | whether the `dataloader` drops a trailing incomplete `batch` | default `False` |\n",
" `cur_batch_indices` | | √ | indices of the `batch` currently being iterated | |\n",
" `num_workers` | √ | √ | number of worker subprocesses the `dataloader` spawns | default `0` |\n",
" `worker_init_fn` | √ | √ | initializer for the worker subprocesses | default `None` |\n",
" `generator` | √ | √ | random seed generator for the worker subprocesses | default `None` |\n",
" `prefetch_factor` | | √ | number of batches loaded in advance per `worker` | default `2` |"
]
},
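The second responsibility listed above, padding and alignment within a batch, reduces to a few lines; a minimal framework-free sketch (function and variable names here are illustrative, not fastNLP API):

```python
# A framework-free sketch of the padding step: every sequence in a batch is
# extended with pad_val until all sequences share the batch's maximum length,
# which is what fastNLP's padders do before framework tensors are built.
def pad_batch(sequences, pad_val=0):
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_val] * (max_len - len(seq)) for seq in sequences]

batch = [[101, 2023, 102], [101, 2130, 4599, 1997, 102]]
print(pad_batch(batch))
# [[101, 2023, 102, 0, 0], [101, 2130, 4599, 1997, 102]]
```

Only the final step, stacking the padded lists into `torch.Tensor`, `paddle.Tensor`, jittor `Var` or oneflow tensors, is framework-specific.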
{
"cell_type": "markdown",
"id": "60a8a224",
"metadata": {},
"source": [
"&emsp; 论及`dataloader`的函数,其中,`get_batch_indices`用来获取当前遍历到的`batch`序号,其他函数\n",
"\n",
"&emsp; &emsp; 包括`set_ignore`、`set_pad`和`databundle`类似,请参考`tutorial-2`,此处不做更多介绍\n",
"\n",
"&emsp; &emsp; 以下是`tutorial-2`中已经介绍过的数据预处理流程,接下来是对相关数据进行`dataloader`处理"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "aca72b49",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[38;5;2m[i 0604 15:44:29.773860 92 log.cc:351] Load log_sync: 1\u001b[m\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
"</pre>\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/4 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+----------------+-----------+----------------+--------------------+--------------------+--------+\n",
"| SentenceId | Sentence | Sentiment | input_ids | token_type_ids | attention_mask | target |\n",
"+------------+----------------+-----------+----------------+--------------------+--------------------+--------+\n",
"| 1 | A series of... | negative | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 1 |\n",
"| 4 | A positivel... | neutral | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 2 |\n",
"| 3 | Even fans o... | negative | [101, 2130,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 1 |\n",
"| 5 | A comedy-dr... | positive | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 0 |\n",
"+------------+----------------+-----------+----------------+--------------------+--------------------+--------+\n"
]
}
],
"source": [
"import sys\n",
"sys.path.append('..')\n",
"\n",
"import pandas as pd\n",
"from functools import partial\n",
"from fastNLP.transformers.torch import BertTokenizer\n",
"\n",
"from fastNLP import DataSet\n",
"from fastNLP import Vocabulary\n",
"from fastNLP.io import DataBundle\n",
"\n",
"\n",
"class PipeDemo:\n",
" def __init__(self, tokenizer='bert-base-uncased'):\n",
" self.tokenizer = BertTokenizer.from_pretrained(tokenizer)\n",
"\n",
" def process_from_file(self, path='./data/test4dataset.tsv'):\n",
" datasets = DataSet.from_pandas(pd.read_csv(path, sep='\\t'))\n",
" train_ds, test_ds = datasets.split(ratio=0.7)\n",
" train_ds, dev_ds = datasets.split(ratio=0.8)\n",
" data_bundle = DataBundle(datasets={'train': train_ds, 'dev': dev_ds, 'test': test_ds})\n",
"\n",
" encode = partial(self.tokenizer.encode_plus, max_length=100, truncation=True,\n",
" return_attention_mask=True)\n",
" data_bundle.apply_field_more(encode, field_name='Sentence', progress_bar='tqdm')\n",
" \n",
" target_vocab = Vocabulary(padding=None, unknown=None)\n",
"\n",
" target_vocab.from_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='Sentiment')\n",
" target_vocab.index_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='Sentiment',\n",
" new_field_name='target')\n",
"\n",
" data_bundle.set_pad('input_ids', pad_val=self.tokenizer.pad_token_id)\n",
" data_bundle.set_ignore('SentenceId', 'Sentence', 'Sentiment') \n",
" return data_bundle\n",
"\n",
" \n",
"pipe = PipeDemo(tokenizer='bert-base-uncased')\n",
"\n",
"data_bundle = pipe.process_from_file('./data/test4dataset.tsv')\n",
"\n",
"print(data_bundle.get_dataset('train'))"
]
},
{
"cell_type": "markdown",
"id": "76e6b8ab",
"metadata": {},
"source": [
"### 1.2 Creating a dataloader with helper functions\n",
"\n",
"In `fastNLP 1.0`, **the more convenient, and probably more common, way to create a dataloader is through the prepare_xx_dataloader functions**\n",
"\n",
"&emsp; For example, `prepare_torch_dataloader` below takes the required arguments, reads the dataset, and builds the matching `dataloader`\n",
"\n",
"&emsp; of type `TorchDataLoader`, usable only with the `pytorch` framework, hence `driver='torch'` when initializing the corresponding `trainer`\n",
"\n",
"The output also shows that in `fastNLP 1.0` **a batch is represented as a dict**, **whose keys are the fields of the original dataset**\n",
"\n",
"&emsp; **minus those hidden via DataBundle.set_ignore**, and whose `value`s are of the `torch.Tensor` type matching the `pytorch` framework"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5fd60e42",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'fastNLP.core.dataloaders.torch_dataloader.fdl.TorchDataLoader'>\n",
"<class 'dict'> <class 'torch.Tensor'> ['input_ids', 'token_type_ids', 'attention_mask', 'target']\n",
"{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
" [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),\n",
" 'input_ids': tensor([[ 101, 1037, 4038, 1011, 3689, 1997, 3053, 8680, 19173, 15685,\n",
" 1999, 1037, 18006, 2836, 2011, 1996, 2516, 2839, 14996, 3054,\n",
" 15509, 5325, 1012, 102, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0],\n",
" [ 101, 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262,\n",
" 3351, 2008, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036,\n",
" 2204, 2005, 1996, 25957, 4063, 1010, 2070, 1997, 2029, 5681,\n",
" 2572, 25581, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997,\n",
" 1037, 2466, 1012, 102],\n",
" [ 101, 2130, 4599, 1997, 19214, 6432, 1005, 1055, 2147, 1010,\n",
" 1045, 8343, 1010, 2052, 2031, 1037, 2524, 2051, 3564, 2083,\n",
" 2023, 2028, 1012, 102, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0],\n",
" [ 101, 1037, 13567, 26162, 5257, 1997, 3802, 7295, 9888, 1998,\n",
" 2035, 1996, 20014, 27611, 1010, 14583, 1010, 11703, 20175, 1998,\n",
" 4028, 1997, 1037, 8101, 2319, 10576, 2030, 1037, 28900, 7815,\n",
" 3850, 1012, 102, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0]]),\n",
" 'target': tensor([0, 1, 1, 2]),\n",
" 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}\n"
]
}
],
"source": [
"from fastNLP import prepare_torch_dataloader\n",
"\n",
"train_dataset = data_bundle.get_dataset('train')\n",
"evaluate_dataset = data_bundle.get_dataset('dev')\n",
"\n",
"train_dataloader = prepare_torch_dataloader(train_dataset, batch_size=16, shuffle=True)\n",
"evaluate_dataloader = prepare_torch_dataloader(evaluate_dataset, batch_size=16)\n",
"\n",
"print(type(train_dataloader))\n",
"\n",
"import pprint\n",
"\n",
"for batch in train_dataloader:\n",
" print(type(batch), type(batch['input_ids']), list(batch))\n",
" pprint.pprint(batch, width=1)"
]
},
{
"cell_type": "markdown",
"id": "9f457a6e",
"metadata": {},
"source": [
"The `prepare_xx_dataloader` functions are called convenient because **their input can be not only a DataSet**, **but also**\n",
"\n",
"&emsp; **a DataBundle**, provided the dataset names are `'train'`, `'dev'` and `'test'` so that `fastNLP` can recognize them\n",
"\n",
"For example, the cell below **builds a dict of PaddleDataLoader objects directly through prepare_paddle_dataloader**\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7827557d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'fastNLP.core.dataloaders.paddle_dataloader.fdl.PaddleDataLoader'>\n"
]
}
],
"source": [
"from fastNLP import prepare_paddle_dataloader\n",
"\n",
"dl_bundle = prepare_paddle_dataloader(data_bundle, batch_size=16, shuffle=True)\n",
"\n",
"print(type(dl_bundle['train']))"
]
},
{
"cell_type": "markdown",
"id": "d898cf40",
"metadata": {},
"source": [
"&emsp; 而在接下来`trainer`的初始化过程中,按如下方式使用即可,除了初始化时`driver='paddle'`外\n",
"\n",
"&emsp; 这里也可以看出`trainer`模块中,**evaluate_dataloaders 的设计允许评测可以针对多个数据集**\n",
"\n",
"```python\n",
"trainer = Trainer(\n",
" model=model,\n",
" train_dataloader=dl_bundle['train'],\n",
" optimizers=optimizer,\n",
"\t...\n",
"\tdriver='paddle',\n",
"\tdevice='gpu',\n",
"\t...\n",
" evaluate_dataloaders={'dev': dl_bundle['dev'], 'test': dl_bundle['test']}, \n",
" metrics={'acc': Accuracy()},\n",
"\t...\n",
")\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "d74d0523",
"metadata": {},
"source": [
"## 2. Extensions of dataloader in fastNLP\n",
"\n",
"### 2.1 The collator: concept and usage\n",
"\n",
"Inside the data loading module `dataloader` of `fastNLP 1.0`, as the earlier table listed, a few further components live\n",
"\n",
"&emsp; For example, **the collator module, which implements the padding and alignment of sequences**; note: `collate vt. to collect and arrange (papers, a book, etc.); to check, to compare`\n",
"\n",
"Although the `dataloader` differs by framework, the `collator` module in `fastNLP 1.0` is shared; its main attributes and methods are tabulated below\n",
"\n",
"Name|Attribute|Method|Function|Notes\n",
" ----|----|----|----|----|\n",
" `backend` | √ | | the framework the `collator` targets | a string such as `'torch'` |\n",
" `padders` | √ | | the `padder` of each field, each handling that field's padding&emsp; | dict |\n",
" `ignore_fields` | √ | | the fields the `dataloader` skips when sampling a `batch` | set |\n",
" `input_fields` | √ | | the pad value, dtype, etc. of each `collator` field | dict |\n",
" `set_backend` | | √ | set the framework the `collator` targets | a string such as `'torch'` |\n",
" `set_ignore` | | √ | set the fields the `dataloader` skips when sampling a `batch` | strings, i.e. each `field_name`&emsp; |\n",
" `set_pad` | | √ | set the pad value, dtype, etc. of each `collator` field | |"
]
},
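As a rough mental model of the table above, a toy collator can register padding settings through `set_pad`, exclusions through `set_ignore`, and apply both when packing a batch of dict samples; this is a hedged, illustrative sketch, not fastNLP's actual `Collator` implementation:

```python
# A toy stand-in for the collator described above: set_pad records per-field
# padding settings in input_fields, set_ignore records fields to drop, and
# calling the collator pads the remaining list-valued fields of a batch.
class ToyCollator:
    def __init__(self, backend='raw'):
        self.backend = backend
        self.input_fields = {}      # field -> padding settings
        self.ignore_fields = set()  # fields excluded from the batch

    def set_pad(self, field_name, pad_val=0):
        self.input_fields[field_name] = {'pad_val': pad_val}

    def set_ignore(self, *field_names):
        self.ignore_fields.update(field_names)

    def __call__(self, samples):
        fields = [f for f in samples[0] if f not in self.ignore_fields]
        batch = {}
        for f in fields:
            values = [s[f] for s in samples]
            if f in self.input_fields and isinstance(values[0], list):
                pad_val = self.input_fields[f]['pad_val']
                max_len = max(len(v) for v in values)
                values = [v + [pad_val] * (max_len - len(v)) for v in values]
            batch[f] = values
        return batch

collator = ToyCollator()
collator.set_pad('input_ids', pad_val=0)
collator.set_ignore('text')
samples = [{'text': 'a b', 'input_ids': [101, 102]},
           {'text': 'c d e', 'input_ids': [101, 103, 102]}]
print(collator(samples))
# {'input_ids': [[101, 102, 0], [101, 103, 102]]}
```

The real collator additionally dispatches each field to a backend-specific `padder` so the padded lists come back as framework tensors.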
{
"cell_type": "code",
"execution_count": 4,
"id": "d0795b3e",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'function'>\n"
]
}
],
"source": [
"train_dataloader.collate_fn\n",
"\n",
"print(type(train_dataloader.collate_fn))"
]
},
{
"cell_type": "markdown",
"id": "5f816ef5",
"metadata": {},
"source": [
"In addition, you can **hand-write the collate_fn of a dataloader** instead of using the built-in `collator` module of `fastNLP 1.0`\n",
"\n",
"&emsp; Such a function can be defined roughly as follows; note that **writing a collate_fn requires knowing the dict layout of a batch**\n",
"\n",
"&emsp; The function is passed to the `dataloader` through the `collate_fn` argument and is **called when a batch is dispatched** (**not when batches are partitioned**)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ff8e405e",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"def collate_fn(batch):\n",
" input_ids, atten_mask, labels = [], [], []\n",
" max_length = [0] * 3\n",
" for each_item in batch:\n",
" input_ids.append(each_item['input_ids'])\n",
" max_length[0] = max(len(each_item['input_ids']), max_length[0])\n",
" atten_mask.append(each_item['token_type_ids'])\n",
" max_length[1] = max(len(each_item['token_type_ids']), max_length[1])\n",
" labels.append(each_item['attention_mask'])\n",
" max_length[2] = max(len(each_item['attention_mask']), max_length[2])\n",
"\n",
" for i in range(3):\n",
" each = (input_ids, atten_mask, labels)[i]\n",
" for item in each:\n",
" item.extend([0] * (max_length[i] - len(item)))\n",
" return {'input_ids': torch.cat([torch.tensor([item]) for item in input_ids], dim=0),\n",
" 'token_type_ids': torch.cat([torch.tensor([item]) for item in atten_mask], dim=0),\n",
" 'attention_mask': torch.cat([torch.tensor(item) for item in labels], dim=0)}"
]
},
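A framework-free sketch of the same hand-written collate pattern may make the core logic clearer (incidentally, in the cell above the `attention_mask` entries are concatenated without an extra wrapping list, which is why that key prints as a 1-D tensor in the output further down); the code below is illustrative only:

```python
# Framework-free sketch of a hand-written collate_fn: gather each field from
# every sample, pad to the batch maximum, and return a dict keyed by field.
# Stacking the padded lists into framework tensors is the only step left out.
def collate_fn(batch, pad_val=0):
    out = {}
    for field in batch[0]:
        values = [list(sample[field]) for sample in batch]
        max_len = max(len(v) for v in values)
        out[field] = [v + [pad_val] * (max_len - len(v)) for v in values]
    return out

batch = [{'input_ids': [101, 2023, 102], 'attention_mask': [1, 1, 1]},
         {'input_ids': [101, 102], 'attention_mask': [1, 1]}]
print(collate_fn(batch))
# {'input_ids': [[101, 2023, 102], [101, 102, 0]],
#  'attention_mask': [[1, 1, 1], [1, 1, 0]]}
```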
{
"cell_type": "markdown",
"id": "487b75fb",
"metadata": {},
"source": [
"Note: with a user-defined `collate_fn`, the `collate_fn` attribute of the `dataloader` is likewise of plain `function` type"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e916d1ac",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'fastNLP.core.dataloaders.torch_dataloader.fdl.TorchDataLoader'>\n",
"<class 'function'>\n",
"{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0]),\n",
" 'input_ids': tensor([[ 101, 1037, 4038, 1011, 3689, 1997, 3053, 8680, 19173, 15685,\n",
" 1999, 1037, 18006, 2836, 2011, 1996, 2516, 2839, 14996, 3054,\n",
" 15509, 5325, 1012, 102, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0],\n",
" [ 101, 1037, 2186, 1997, 9686, 17695, 18673, 14313, 1996, 15262,\n",
" 3351, 2008, 2054, 2003, 2204, 2005, 1996, 13020, 2003, 2036,\n",
" 2204, 2005, 1996, 25957, 4063, 1010, 2070, 1997, 2029, 5681,\n",
" 2572, 25581, 2021, 3904, 1997, 2029, 8310, 2000, 2172, 1997,\n",
" 1037, 2466, 1012, 102],\n",
" [ 101, 2130, 4599, 1997, 19214, 6432, 1005, 1055, 2147, 1010,\n",
" 1045, 8343, 1010, 2052, 2031, 1037, 2524, 2051, 3564, 2083,\n",
" 2023, 2028, 1012, 102, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0],\n",
" [ 101, 1037, 13567, 26162, 5257, 1997, 3802, 7295, 9888, 1998,\n",
" 2035, 1996, 20014, 27611, 1010, 14583, 1010, 11703, 20175, 1998,\n",
" 4028, 1997, 1037, 8101, 2319, 10576, 2030, 1037, 28900, 7815,\n",
" 3850, 1012, 102, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0]]),\n",
" 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}\n"
]
}
],
"source": [
"train_dataloader = prepare_torch_dataloader(train_dataset, collate_fn=collate_fn, shuffle=True)\n",
"evaluate_dataloader = prepare_torch_dataloader(evaluate_dataset, collate_fn=collate_fn, shuffle=True)\n",
"\n",
"print(type(train_dataloader))\n",
"print(type(train_dataloader.collate_fn))\n",
"\n",
"for batch in train_dataloader:\n",
" pprint.pprint(batch, width=1)"
]
},
{
"cell_type": "markdown",
"id": "0bd98365",
"metadata": {},
"source": [
"### 2.2 Combining fastNLP with the datasets library\n",
"\n",
"From `tutorial-1` through `tutorial-3`, we have now covered the whole `fastNLP v1.0` pipeline of reading, preprocessing and loading data\n",
"\n",
"&emsp; In practice, though, simpler ways of reading data are often preferred, for instance the `datasets` library from `huggingface`\n",
"\n",
"**Using the load_dataset function of the datasets library**, specify the two-level dataset name, here **the SST-2 dataset of the GLUE benchmark**\n",
"\n",
"&emsp; to quickly download and read in `SST-2`; a `pandas.DataFrame` then serves as the intermediary for conversion into a `fastNLP.DataSet`\n",
"\n",
"&emsp; The remaining steps are the same as introduced elsewhere for `dataset`, `databundle`, `vocabulary` and `dataloader`"
]
},
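Behind the DataFrame round trip, the conversion is just a pivot between row-oriented records and column-oriented storage; a pure-Python sketch of that idea (illustrative, not the pandas or fastNLP internals):

```python
# Pure-Python sketch of the conversion behind to_pandas()/from_pandas():
# a row-oriented list of records is pivoted into column-oriented storage,
# the layout that both pandas.DataFrame and fastNLP.DataSet build on.
def records_to_columns(records):
    return {key: [row[key] for row in records] for key in records[0]}

rows = [{'sentence': 'a fine film', 'label': 1},
        {'sentence': 'dull and slow', 'label': 0}]
print(records_to_columns(rows))
# {'sentence': ['a fine film', 'dull and slow'], 'label': [1, 0]}
```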
{
"cell_type": "code",
"execution_count": 7,
"id": "91879c30",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Reusing dataset glue (/remote-home/xrliu/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "639a0ad3c63944c6abef4e8ee1f7bf7c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from datasets import load_dataset\n",
"\n",
"sst2data = load_dataset('glue', 'sst2')\n",
"\n",
"dataset = DataSet.from_pandas(sst2data['train'].to_pandas())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}

+ 2614
- 0
docs/source/tutorials/fastnlp_tutorial_4.ipynb
File diff suppressed because it is too large
View File


+ 1242
- 0
docs/source/tutorials/fastnlp_tutorial_5.ipynb
File diff suppressed because it is too large
View File


+ 1646
- 0
docs/source/tutorials/fastnlp_tutorial_6.ipynb
File diff suppressed because it is too large
View File


+ 1280
- 0
docs/source/tutorials/fastnlp_tutorial_e1.ipynb
File diff suppressed because it is too large
View File


+ 1082
- 0
docs/source/tutorials/fastnlp_tutorial_e2.ipynb
File diff suppressed because it is too large
View File


+ 1086
- 0
docs/source/tutorials/fastnlp_tutorial_paddle_e1.ipynb
File diff suppressed because it is too large
View File


+ 1510
- 0
docs/source/tutorials/fastnlp_tutorial_paddle_e2.ipynb
File diff suppressed because it is too large
View File


BIN
docs/source/tutorials/figures/E1-fig-glue-benchmark.png View File

Width: 1040  |  Height: 545  |  Size: 159 kB

BIN
docs/source/tutorials/figures/E2-fig-p-tuning-v2-model.png View File

Width: 938  |  Height: 359  |  Size: 50 kB

BIN
docs/source/tutorials/figures/E2-fig-pet-model.png View File

Width: 582  |  Height: 521  |  Size: 57 kB

BIN
docs/source/tutorials/figures/T0-fig-parameter-matching.png View File

Width: 1265  |  Height: 736  |  Size: 96 kB

BIN
docs/source/tutorials/figures/T0-fig-trainer-and-evaluator.png View File

Width: 1290  |  Height: 821  |  Size: 71 kB

BIN
docs/source/tutorials/figures/T0-fig-training-structure.png View File

Width: 1160  |  Height: 732  |  Size: 80 kB

BIN
docs/source/tutorials/figures/T1-fig-dataset-and-vocabulary.png View File

Width: 1326  |  Height: 701  |  Size: 139 kB

BIN
docs/source/tutorials/figures/paddle-ernie-1.0-masking-levels.png View File

Width: 917  |  Height: 173  |  Size: 59 kB

BIN
docs/source/tutorials/figures/paddle-ernie-1.0-masking.png View File

Width: 779  |  Height: 452  |  Size: 47 kB

BIN
docs/source/tutorials/figures/paddle-ernie-2.0-continual-pretrain.png View File

Width: 1033  |  Height: 464  |  Size: 129 kB

BIN
docs/source/tutorials/figures/paddle-ernie-3.0-framework.png View File

Width: 1161  |  Height: 720  |  Size: 202 kB

+ 1
- 1
fastNLP/__init__.py View File

@@ -2,4 +2,4 @@
from fastNLP.envs import *
from fastNLP.core import *

__version__ = '0.8.0beta'
__version__ = '1.0.0alpha'

+ 3
- 27
fastNLP/core/callbacks/callback_event.py View File

@@ -35,14 +35,14 @@ class Event:

:param value: Trainer 的 callback 时机;
:param every: 每触发多少次才真正运行一次;
:param once: 在第一次运行后时候再次执行
:param once: 是否仅运行一次
:param filter_fn: 输入参数的应该为 ``(filter, trainer)``,其中 ``filter`` 对象中包含了 `filter.num_called` 和
`filter.num_executed` 两个变量分别获取当前被调用了多少次,真正执行了多少次;``trainer`` 对象即为当前正在运行的 Trainer;
"""
every: Optional[int]
once: Optional[int]
once: Optional[bool]

def __init__(self, value: str, every: Optional[int] = None, once: Optional[int] = None,
def __init__(self, value: str, every: Optional[int] = None, once: Optional[bool] = None,
filter_fn: Optional[Callable] = None):
self.every = every
self.once = once
@@ -68,7 +68,6 @@ class Event:
return Event(value='on_after_trainer_initialized', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_sanity_check_begin(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_sanity_check_begin` 时触发;
@@ -85,7 +84,6 @@ class Event:
return Event(value='on_sanity_check_begin', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_sanity_check_end(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_sanity_check_end` 时触发;
@@ -101,7 +99,6 @@ class Event:
return Event(value='on_sanity_check_end', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_train_begin(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_train_begin` 时触发;
@@ -117,7 +114,6 @@ class Event:
return Event(value='on_train_begin', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_train_end(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_train_end` 时触发;
@@ -133,7 +129,6 @@ class Event:
return Event(value='on_train_end', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_train_epoch_begin(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_train_epoch_begin` 时触发;
@@ -149,7 +144,6 @@ class Event:
return Event(value='on_train_epoch_begin', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_train_epoch_end(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_train_epoch_end` 时触发;
@@ -165,7 +159,6 @@ class Event:
return Event(value='on_train_epoch_end', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_fetch_data_begin(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_fetch_data_begin` 时触发;
@@ -181,7 +174,6 @@ class Event:
return Event(value='on_fetch_data_begin', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_fetch_data_end(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_fetch_data_end` 时触发;
@@ -197,7 +189,6 @@ class Event:
return Event(value='on_fetch_data_end', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_train_batch_begin(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_train_batch_begin` 时触发;
@@ -213,7 +204,6 @@ class Event:
return Event(value='on_train_batch_begin', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_train_batch_end(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_train_batch_end` 时触发;
@@ -229,7 +219,6 @@ class Event:
return Event(value='on_train_batch_end', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_exception(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_exception` 时触发;
@@ -245,7 +234,6 @@ class Event:
return Event(value='on_exception', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_save_model(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_save_model` 时触发;
@@ -261,7 +249,6 @@ class Event:
return Event(value='on_save_model', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_load_model(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_load_model` 时触发;
@@ -277,7 +264,6 @@ class Event:
return Event(value='on_load_model', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_save_checkpoint(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_save_checkpoint` 时触发;
@@ -293,7 +279,6 @@ class Event:
return Event(value='on_save_checkpoint', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_load_checkpoint(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_load_checkpoint` 时触发;
@@ -309,7 +294,6 @@ class Event:
return Event(value='on_load_checkpoint', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_load_checkpoint(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_load_checkpoint` 时触发;
@@ -325,7 +309,6 @@ class Event:
return Event(value='on_load_checkpoint', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_before_backward(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_before_backward` 时触发;
@@ -341,7 +324,6 @@ class Event:
return Event(value='on_before_backward', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_after_backward(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_after_backward` 时触发;
@@ -357,7 +339,6 @@ class Event:
return Event(value='on_after_backward', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_before_optimizers_step(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_before_optimizers_step` 时触发;
@@ -373,7 +354,6 @@ class Event:
return Event(value='on_before_optimizers_step', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_after_optimizers_step(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_after_optimizers_step` 时触发;
@@ -389,7 +369,6 @@ class Event:
return Event(value='on_after_optimizers_step', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_before_zero_grad(every=None, once=None, filter_fn=None):
"""
当 Trainer 运行到 :func:`on_before_zero_grad` 时触发;
@@ -405,7 +384,6 @@ class Event:
return Event(value='on_before_zero_grad', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_after_zero_grad(every=None, once=None, filter_fn=None):
"""
Triggered when the Trainer reaches :func:`on_after_zero_grad`;
@@ -421,7 +399,6 @@ class Event:
return Event(value='on_after_zero_grad', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_evaluate_begin(every=None, once=None, filter_fn=None):
"""
Triggered when the Trainer reaches :func:`on_evaluate_begin`;
@@ -437,7 +414,6 @@ class Event:
return Event(value='on_evaluate_begin', every=every, once=once, filter_fn=filter_fn)

@staticmethod
def on_evaluate_end(every=None, once=None, filter_fn=None):
"""
Triggered when the Trainer reaches :func:`on_evaluate_end`;
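The ``every`` / ``once`` / ``filter_fn`` arguments that every one of these ``Event`` factory methods accepts all gate when a registered callback actually fires. A rough, self-contained sketch of that gating logic (an illustration only, not fastNLP's actual implementation; the class name and ``filter_fn`` signature are assumptions):

```python
class EventFilter:
    """Toy model of the every/once/filter_fn gating on an Event."""

    def __init__(self, every=None, once=None, filter_fn=None):
        self.every = every          # run on every N-th occurrence
        self.once = once            # run only on the N-th occurrence
        self.filter_fn = filter_fn  # custom predicate on the call count
        self.num_called = 0         # how many times the event has fired so far

    def should_run(self):
        self.num_called += 1
        if self.filter_fn is not None:
            return self.filter_fn(self.num_called)
        if self.every is not None:
            return self.num_called % self.every == 0
        if self.once is not None:
            return self.num_called == self.once
        return True  # no filter given: run every time


f = EventFilter(every=3)
fired = [f.should_run() for _ in range(6)]  # fires on the 3rd and 6th call

g = EventFilter(once=2)
fired_once = [g.should_run() for _ in range(4)]  # fires only on the 2nd call
```

With ``filter_fn`` the same mechanism supports arbitrary predicates, e.g. ``EventFilter(filter_fn=lambda n: n > 100)`` to skip the warm-up phase.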


+1 -0  fastNLP/core/collators/padders/oneflow_padder.py

@@ -7,6 +7,7 @@ from inspect import isclass
import numpy as np

from fastNLP.envs.imports import _NEED_IMPORT_ONEFLOW
from fastNLP.envs.utils import _module_available

if _NEED_IMPORT_ONEFLOW:
import oneflow


+7 -7  fastNLP/core/controllers/trainer.py

@@ -84,13 +84,13 @@ class Trainer(TrainerEventTrigger):
.. warning::

When distributed training is used, **fastNLP** will by default process the ``Sampler`` in the ``dataloader`` so that, within one epoch, the data used for training on different cards does not overlap. If you handle the sampler specially yourself, set the ``use_dist_sampler`` parameter to ``False``; in that case you must ensure on your own that the data used on each card is different.

:param optimizers: The optimizer(s) required for training; either a single optimizer instance or a List of several optimizers;
:param device: Specifies the machine used for training; note that this parameter may be ``None`` only when you launch via ``torch.distributed.launch/run``. In that case fastNLP will not move the model or data between devices, but you can still move data between devices through the ``input_mapping`` and ``output_mapping`` parameters (by passing in two data-processing functions); you can also add a ``data_device`` argument to kwargs to let fastNLP move the data to the specified machine for you (note that this situation should only arise when you construct DDP yourself before instantiating the Trainer);

The accepted values for device are as follows:
@@ -196,7 +196,7 @@ class Trainer(TrainerEventTrigger):
3. If the batch is of any other type, an error is raised directly;
2. If ``input_mapping`` is a function, the fetched batch is not processed at all and is passed directly into that function;

Note that this parameter is also passed into the ``Evaluator``; you can therefore use it to move training-data batches onto the corresponding machine (for example when the ``device`` parameter is ``None``);
If ``Trainer`` and ``Evaluator`` need to use different ``input_mapping``\ s, set them separately with ``train_input_mapping`` and ``evaluate_input_mapping``.
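A minimal sketch of the two ``input_mapping`` forms described in this hunk (the helper name and the dict key-renaming convention are assumptions for illustration, not fastNLP internals):

```python
def apply_input_mapping(batch, input_mapping):
    """Hypothetical helper showing how a dict-or-function input_mapping
    could be applied to a fetched batch."""
    if callable(input_mapping):
        # function form: the batch is handed over untouched
        return input_mapping(batch)
    if isinstance(batch, dict):
        # dict form (assumed): rename keys via {old_name: new_name}
        return {input_mapping.get(key, key): value
                for key, value in batch.items()}
    # any other batch type is an error, matching item 3 above
    raise TypeError(f"unsupported batch type: {type(batch)!r}")


batch = {"input_ids": [1, 2, 3], "label": 0}
renamed = apply_input_mapping(batch, {"input_ids": "x"})
passed_through = apply_input_mapping(batch, lambda b: b)
```

The function form is the escape hatch: since the raw batch is passed straight in, it can also perform device moves, which is why the same mapping is reusable inside the ``Evaluator``.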

:param output_mapping: Should be a dict or a function. Its role is similar to ``input_mapping``, except that it converts the output:
@@ -367,7 +367,7 @@ class Trainer(TrainerEventTrigger):

.. note::
``Trainer`` performs validation by directly initializing an ``Evaluator`` internally;
The ``Evaluator`` inside ``Trainer`` defaults to None; if you need validation during training, make sure the following parameters are passed in correctly:

Required parameters: ``metrics`` and ``evaluate_dataloaders``;

@@ -898,7 +898,7 @@ class Trainer(TrainerEventTrigger):

This code means that ``fn1`` and ``fn2`` will be added to ``trainer1``, and ``fn3`` will be added to ``trainer2``;

Note that if you use this decorator to add callbacks to your training, make sure the code that adds the callback function runs before `Trainer` is instantiated;

For a supplementary explanation, see :meth:`~fastNLP.core.controllers.Trainer.add_callback_fn`;
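The "register before instantiation" rule can be pictured with a toy decorator registry; all names below are illustrative stand-ins, not fastNLP's actual ``Trainer.on`` machinery:

```python
_pending_callbacks = []  # (event_name, fn) pairs recorded at decoration time


def on(event_name):
    """Hypothetical stand-in for a Trainer.on-style decorator: it only
    records the callback; the trainer picks it up when constructed."""
    def wrapper(fn):
        _pending_callbacks.append((event_name, fn))
        return fn
    return wrapper


@on("on_after_backward")
def fn1(trainer):
    # would inspect gradients, log, etc. in a real callback
    return "after backward"


class MiniTrainer:
    def __init__(self):
        # callbacks decorated BEFORE this constructor runs are captured;
        # anything decorated afterwards is silently missed
        self.callbacks = list(_pending_callbacks)


trainer = MiniTrainer()
```

This makes the failure mode concrete: a callback decorated after ``MiniTrainer()`` never reaches ``trainer.callbacks``, which mirrors why the docstring insists the decoration happen before `Trainer` is instantiated.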



+5 -6  fastNLP/core/dataset/dataset.py

@@ -584,7 +584,7 @@ class DataSet:
Passes the field named ``field_name`` in each ``instance`` of the :class:`DataSet` to the function ``func`` and writes the result into ``new_field_name``.

- :param func: The function used to process the specified fiel`; note that its input should be the content of the field named ``field_name`` in the ``instance``;
+ :param func: The function used to process the specified field; note that its input should be the content of the field named ``field_name`` in the ``instance``;
:param field_name: The name of the field passed into ``func``;
:param new_field_name: The name of the ``field`` that the result is written to. This method puts the content returned by ``func`` into the ``field`` corresponding to ``new_field_name``; note that if the name is the same as an existing field, that field is overwritten. If ``None``, no field is overwritten or created;
@@ -624,10 +624,9 @@ class DataSet:
For the difference between ``apply_field_more`` and ``apply_field``, see the discussion of the difference between ``apply_more`` and ``apply`` in :meth:`~fastNLP.core.dataset.DataSet.apply_more`.

- :param func: The function used to process the specified fiel`; note that its input should be the content of the field named ``field_name`` in the ``instance``;
- :param field_name: The name of the fiel` passed into ``func``;
- :param new_field_name: The name of the ``field`` that the result is written to. This method puts the content returned by ``func`` into the ``field`` corresponding to ``new_field_name``; note that if the name is the same as an existing field, that field is overwritten. If ``None``, no field is overwritten or created;
+ :param func: The function used to process the specified field; note that its input should be the content of the field named ``field_name`` in the ``instance``;
+ :param field_name: The name of the field passed into ``func``;
:param modify_fields: Whether to modify the ``Field`` in the ``DataSet`` with the result; defaults to ``True``
:param num_proc: The number of processes to use.
.. note::
@@ -751,8 +750,8 @@ class DataSet:

3. ``apply_more`` modifies the fields in the ``DataSet`` by default; ``apply`` does not.

- :param modify_fields: Whether to modify the ``Field`` in the ``DataSet`` with the result; defaults to True
:param func: Its argument is an ``Instance`` from the ``DataSet``; the return value is a dict whose keys are field names and whose values are the corresponding results
+ :param modify_fields: Whether to modify the ``Field`` in the ``DataSet`` with the result; defaults to ``True``
:param num_proc: The number of processes to use.

.. note::


Some files were not shown because too many files changed in this diff
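The ``apply_field`` semantics documented in the dataset.py hunks above can be pictured with a toy stand-in operating on a plain list of dicts instead of the real :class:`DataSet` (names and structure here are illustrative only):

```python
def apply_field(dataset, func, field_name, new_field_name):
    """Toy stand-in for DataSet.apply_field: pass one field of every
    instance through func and write the result to new_field_name."""
    for instance in dataset:
        result = func(instance[field_name])
        if new_field_name is not None:
            # overwrites an existing field of the same name;
            # with new_field_name=None nothing is written back
            instance[new_field_name] = result
    return dataset


data = [{"words": ["a", "b"]}, {"words": ["c"]}]
apply_field(data, len, field_name="words", new_field_name="seq_len")
# each instance now also carries a "seq_len" field
```

Writing ``new_field_name="words"`` would overwrite the source field in place, which is the overwrite behavior the docstring warns about.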
