
[to #42322933] add/refactor nlp models source code and finetune

1. add sbert, veco, palm, space source code
2. support sbert sequence classification, token classification finetune
3. support veco sequence classification finetune
4. support palm nlg finetune
evaluation results: https://sheet.alibaba-inc.com/#/sheet/f7fdcc7f22bd5105 (sheet: Maas)
5. add ut for finetunes
6. add veco's taskdataset processor
7. add a common trainer for nlp, and a specific trainer for veco (see the usage sketch below)
8. merge some duplicate code of models, preprocessors, pipelines
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9574105
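A minimal usage sketch of the finetune entry point this change set adds (the trainer name, model id and dataset arguments below are assumptions for illustration, not taken from this change):

from modelscope.msdatasets import MsDataset
from modelscope.trainers import build_trainer

train_dataset = MsDataset.load('clue', subset_name='afqmc', split='train')      # assumed dataset
eval_dataset = MsDataset.load('clue', subset_name='afqmc', split='validation')  # assumed dataset

kwargs = dict(
    model='damo/nlp_structbert_sentence-similarity_chinese-base',  # assumed model id
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    work_dir='./work_dir')

trainer = build_trainer(name='nlp-base-trainer', default_args=kwargs)  # assumed trainer name
trainer.train()
print(trainer.evaluate())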

* add basic classes of hooks & metrics

* pre-commit passed

* change some comments

* pre commit passed

* 1. remove accuracy's groups 2. remove useless hooks 3. simplify priorities

* pre-commit passed

* fix a comment

* Merge branch 'master' into finetune_hooks_metrics

# Conflicts:
#	modelscope/metainfo.py

* pre-commit passed

* add basic classes of hooks & metrics

* pre-commit passed

* change some comments

* pre commit passed

* 1. remove accuracy's groups 2. remove useless hooks 3. simplify priorities

* pre-commit passed

* fix a comment

* Merge branch 'feat/finetune' of gitlab.alibaba-inc.com:Ali-MaaS/MaaS-lib into feat/finetune

* mv hooks related to modelscope/trainers/hooks

* mv priority back

* add torch model base and test

* update hooks, trainer, import_util

* add torch epoch based trainer and dis utils

* add hooks

* fix warmup

* format code style, fix warmup and add warmup unittest

* fix impls

* pre-commit check passed

* update hook and add EpochBasedTrainer

* add trainer unittest

* Merge branch 'feat/add_hooks' into feat/add_task

# Conflicts:
#	modelscope/models/base_torch.py
#	modelscope/trainers/hooks/hook.py
#	modelscope/trainers/trainer.py

* update unittest name

* rewrite taskdataset to trainer

* fix trainer and add unittest

* add unittest

* code: run to forward

* run through... but ugly code

* arrange some cls

* fix some errs

* revert some mistakes

* init check in

* Merge branch 'feat/add_hooks' into feat/add_task

# Conflicts:
#	modelscope/trainers/trainer.py

* test with bigger epoch and size

* add the default metrics class

* move build metrics code to a method

* merge add_task

* merge origin add_task

* add device initialization

* remove preprocessor arg for bool

* add task models

* move metric collect logic to metrics class

* pre-commit passed

* fix cr comments

* precommit passed

* add task models

* Merge remote-tracking branch 'origin/feat/add_task' into feat/backbone_head

* add comment

* change comment formats.

* fix comments

* fix ut bug

* fix comments

* add wrapper check

* fix comments

* pre commit passed

* fix cr comments

* solve a loop import problem

* fix ut bug

* fix ut errors

* change dummydataset to msdataset

* precommit passed

* merge add task

* backbone-head is build, model is not correctly loaded

* model load states matched

* result matched

* lint

* add veco/palm_v2 code

* merge master

* merge master success running

* add repr model name level

* Merge branch 'feat/veco_palm' into feat/finetune_sbert_veco

* model test for training

* add token-classification metric, add formal ut

* fix running bug

* finetune and pipeline are working with backbone-head

* add nli

* add missing code

* finetune and pipeline are working with backbone-head

* Merge branch 'feat/backbone_head' of http://gitlab.alibaba-inc.com/Ali-MaaS/MaaS-lib into feat/backbone_head

* add a test repo for pr

* remove merge conflicted file

* remove merge conflicted file 1

* lint check

* import error

* none type bug fix

* forward input unpacking or dict bug

* move head into models, add build_backbone with registry, no base method

* merge master

* feat: 1. add interleave dataset method 2. support multiple dataset in trainer.build_dataset 3. support 3 sub tasks in sequence_classification task

* unfinished

* update the task model structure in NLP field

* merge master

* update by comments

* keep the default model id as current on production

* unfinished

* unfinished

* veco can run

* Merge remote-tracking branch 'origin/master' into feat/backbone_head

* add taskmodel for module management

* remove forward_input_is_dict

* unfinished

* token classification started

* update base model structure

* move space to backbone

* remove 'type' in build_from_cfg method

* test update

* bug fix

* on tesing, mess code

* Merge branch 'feat/backbone_head' into feat/refactor_nlp_730

# Conflicts:
#	modelscope/metrics/builder.py
#	modelscope/models/__init__.py
#	modelscope/models/nlp/__init__.py
#	modelscope/preprocessors/nlp.py
#	modelscope/trainers/trainer.py
#	requirements/multi-modal.txt

* add missing merge

* add sofa source code

* refactor

* add veco task dataset

* add veco task dataset

* pre-commit passed

* fix bug of log

* add some features

* merge master

* bug fix

* refine nlp models

* fix the training error

* unfinished

* refactor pipeline

* Merge branch 'feat/backbone_head' into feat/refactor_nlp_730

# Conflicts:
#	modelscope/metrics/builder.py
#	modelscope/models/nlp/__init__.py
#	modelscope/models/nlp/backbones/structbert/modeling_sbert.py
#	modelscope/models/nlp/palm_v2/palm_for_text_generation.py
#	modelscope/preprocessors/base.py
#	modelscope/preprocessors/nlp.py
#	modelscope/trainers/trainer.py

* Merge commit 'ab04ceafc5453ce7daa9aa09e37a55f703072a10' into feat/refactor_nlp_730

# Conflicts:
#	modelscope/metainfo.py
#	modelscope/metrics/builder.py
#	modelscope/models/__init__.py
#	modelscope/models/base/base_torch_model.py
#	modelscope/models/nlp/__init__.py
#	modelscope/models/nlp/backbones/space/model/intent_unified_transformer.py
#	modelscope/models/nlp/backbones/space/model/model_base.py
#	modelscope/models/nlp/palm_v2/palm_for_text_generation.py
#	modelscope/models/nlp/sbert_for_sequence_classification.py
#	modelscope/models/nlp/sequence_classification.py
#	modelscope/models/nlp/space/__init__.py
#	modelscope/models/nlp/space_for_dialog_intent_prediction.py
#	modelscope/models/nlp/space_for_dialog_modeling.py
#	modelscope/models/nlp/space_for_dialog_state_tracking.py
#	modelscope/models/nlp/task_model.py
#	modelscope/pipelines/nlp/sentiment_classification_pipeline.py
#	modelscope/preprocessors/base.py
#	modelscope/preprocessors/nlp.py
#	modelscope/trainers/trainer.py

* revert changes

* unify sentence classification postprocess

* revert some changes, move some model files

* pipeline first case run through

* ws pipeline passed

* Merge branch 'feat/refactor_nlp_730' into feat/finetune_sbert_veco

* finetune

* revert code

* revert some code

* ws finetune started, only the accuracy is weird

* Merge branch 'feat/veco_taskdataset' into feat/finetune_sbert_veco

# Conflicts:
#	modelscope/task_datasets/veco_dataset.py
#	tests/taskdataset/test_veco_dataset.py

* veco+nli finetune started

* Merge branch 'master' into feat/finetune_sbert_veco

# Conflicts:
#	modelscope/models/nlp/sbert_for_sequence_classification.py
#	modelscope/models/nlp/sbert_for_token_classification.py
#	modelscope/models/nlp/sbert_for_zero_shot_classification.py
#	modelscope/models/nlp/space/space_for_dialog_intent_prediction.py
#	modelscope/models/nlp/space/space_for_dialog_modeling.py
#	modelscope/trainers/trainer.py

* add trainer for nlp

* trainer: dataset params passed into preprocessor

* test passed by nlptrainer

* fix some bugs

* fix some bugs

* add backbone/head subclass

* fix regression bugs

* fix bug in token-cls finetune

* support cfg modification

* fix bug

* fix bug

* update requirements

* add some comments and fix some t

* add some comments and revert an argument

* split to two test files

* revert code

* fix bug in preprocessor

(cherry picked from commit 7a648d096e)

* fix ut bug

* support sbert models

* unfinished

* Merge branch 'feat/finetune_sbert_veco' into sly_tmp_veco_finetune

# Conflicts:
#	tests/trainers/test_finetune_sequence_classification.py

* fix bug in veco

* fix bug

* fix bug

* correct running params

* remove useless files

* add palm finetuning with cnn_dailymail dataset

* copy space model from sofa

* Merge branch 'feat/finetune_sbert_veco' of gitlab.alibaba-inc.com:Ali-MaaS/MaaS-lib into feat/finetune_sbert_veco

* Merge branch 'master' into feat/finetune_sbert_veco

# Conflicts:
#	modelscope/metrics/__init__.py
#	modelscope/models/__init__.py
#	modelscope/models/nlp/__init__.py
#	modelscope/models/nlp/backbones/__init__.py
#	modelscope/models/nlp/backbones/structbert/modeling_sbert.py
#	modelscope/models/nlp/heads/__init__.py
#	modelscope/models/nlp/masked_language.py
#	modelscope/models/nlp/palm_v2/palm_for_text_generation.py
#	modelscope/models/nlp/sbert_for_nli.py
#	modelscope/models/nlp/sbert_for_sentence_similarity.py
#	modelscope/models/nlp/sbert_for_sentiment_classification.py
#	modelscope/models/nlp/sbert_for_sequence_classification.py
#	modelscope/models/nlp/sbert_for_token_classification.py
#	modelscope/models/nlp/sbert_for_zero_shot_classification.py
#	modelscope/models/nlp/sequence_classification.py
#	modelscope/models/nlp/space/space_for_dialog_intent_prediction.py
#	modelscope/models/nlp/space/space_for_dialog_modeling.py
#	modelscope/models/nlp/space/space_for_dialog_state_tracking.py
#	modelscope/models/nlp/structbert/adv_utils.py
#	modelscope/models/nlp/structbert/configuration_sbert.py
#	modelscope/models/nlp/task_models/task_model.py
#	modelscope/pipelines/__init__.py
#	modelscope/pipelines/nlp/__init__.py
#	modelscope/pipelines/nlp/fill_mask_pipeline.py
#	modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
#	modelscope/pipelines/nlp/nli_pipeline.py
#	modelscope/pipelines/nlp/sentence_similarity_pipeline.py
#	modelscope/pipelines/nlp/sentiment_classification_pipeline.py
#	modelscope/pipelines/nlp/text_generation_pipeline.py
#	modelscope/pipelines/nlp/word_segmentation_pipeline.py
#	modelscope/pipelines/nlp/zero_shot_classification_pipeline.py
#	modelscope/preprocessors/nlp.py
#	modelscope/task_datasets/__init__.py
#	modelscope/trainers/trainer.py
#	modelscope/trainers/utils/inference.py
#	modelscope/utils/file_utils.py
#	requirements/nlp.txt
#	tests/pipelines/test_nli.py
#	tests/pipelines/test_sentence_similarity.py
#	tests/pipelines/test_sentiment_classification.py

* fix imports

* mark backbone in their own modeling

* pre-commit check passed

* pre-commit passed, remove roberta model

* fix a bug in ast import

* skip all finetune uts

* fix bugs

* pre-commit passed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* fix ut bug

* fix bug

* fix ut bug

* fix bug

* fix bug

* fix bugs

* fix bug

* revert veco

* revert veco because of core dump

* fix palm bug

* revert veco

* revert mistaken code

* add a test print

* pre-commit check

* test exception

* add test code

* for test

* fix bug and test

* remove test code

* remove useless file

* 1. fix some bugs 2. add backbone ut

* Merge branch 'master' into feat/finetune_refactor_730

# Conflicts:
#	modelscope/metainfo.py
#	modelscope/metrics/sequence_classification_metric.py
#	modelscope/models/nlp/__init__.py
#	modelscope/models/nlp/task_models/task_model.py
#	modelscope/preprocessors/__init__.py
#	modelscope/preprocessors/nlp.py
#	modelscope/trainers/trainer.py
#	modelscope/trainers/utils/inference.py
#	modelscope/utils/file_utils.py
#	tests/trainers/test_trainer_with_nlp.py

* pre-commit passed

* revert files

* increase test level

* unregister models

* fix bugs

* fix cr comments

* fix bug in backbone-head

* add sbert backbone

* fix bug

* add test for token-cls-metric

* pre-commit passed

* fix ut comments

* revert normal tokenizer to fast tokenizer

* Merge branch 'master' into feat/finetune_refactor_730

# Conflicts:
#	modelscope/models/nlp/__init__.py
#	modelscope/models/nlp/backbones/__init__.py
#	modelscope/models/nlp/backbones/structbert/__init__.py
#	modelscope/models/nlp/masked_language.py
#	modelscope/models/nlp/palm_v2/palm_for_text_generation.py
#	modelscope/models/nlp/sbert_for_sequence_classification.py
#	modelscope/models/nlp/sbert_for_token_classification.py
#	modelscope/models/nlp/sbert_for_zero_shot_classification.py
#	modelscope/pipelines/nlp/text_generation_pipeline.py
#	modelscope/preprocessors/nlp.py
#	modelscope/trainers/trainer.py
#	modelscope/trainers/utils/inference.py

* fix merge bugs

* pre commit passed

* fix bug

* fix bug

* fix bug

* fix bug from master

* add print

* fix ut bug

* fix bug

* Merge branch 'master' into feat/finetune_refactor_730

* skip task model test
master · yuze.zyz · 3 years ago · commit 21fa71baf0
100 changed files with 7836 additions and 1846 deletions
  1. configs/nlp/sbert_sentence_similarity.json (+1, -1)
  2. modelscope/hub/utils/utils.py (+1, -1)
  3. modelscope/metainfo.py (+8, -2)
  4. modelscope/metrics/__init__.py (+2, -0)
  5. modelscope/metrics/base.py (+3, -0)
  6. modelscope/metrics/builder.py (+2, -0)
  7. modelscope/metrics/sequence_classification_metric.py (+4, -4)
  8. modelscope/metrics/token_classification_metric.py (+123, -0)
  9. modelscope/models/base/base_model.py (+10, -7)
  10. modelscope/models/base/base_torch_model.py (+8, -3)
  11. modelscope/models/nlp/__init__.py (+19, -29)
  12. modelscope/models/nlp/backbones/__init__.py (+0, -4)
  13. modelscope/models/nlp/backbones/space/__init__.py (+0, -2)
  14. modelscope/models/nlp/backbones/space/model/__init__.py (+0, -3)
  15. modelscope/models/nlp/backbones/structbert.py (+54, -0)
  16. modelscope/models/nlp/backbones/structbert/__init__.py (+0, -19)
  17. modelscope/models/nlp/backbones/structbert/modeling_sbert.py (+0, -815)
  18. modelscope/models/nlp/gpt3/__init__.py (+3, -1)
  19. modelscope/models/nlp/gpt3/configuration_gpt3.py (+0, -0)
  20. modelscope/models/nlp/gpt3/gpt3_for_text_generation.py (+1, -1)
  21. modelscope/models/nlp/gpt3/modeling_gpt3.py (+0, -0)
  22. modelscope/models/nlp/heads/__init__.py (+3, -1)
  23. modelscope/models/nlp/heads/sequence_classification_head.py (+1, -2)
  24. modelscope/models/nlp/heads/torch_pretrain_head.py (+26, -0)
  25. modelscope/models/nlp/masked_language.py (+100, -57)
  26. modelscope/models/nlp/palm_v2/__init__.py (+43, -0)
  27. modelscope/models/nlp/palm_v2/configuration_palm.py (+116, -0)
  28. modelscope/models/nlp/palm_v2/dureader_eval.py (+872, -0)
  29. modelscope/models/nlp/palm_v2/modeling_palm.py (+1332, -0)
  30. modelscope/models/nlp/palm_v2/palm_for_text_generation.py (+2, -2)
  31. modelscope/models/nlp/sbert_for_nli.py (+0, -23)
  32. modelscope/models/nlp/sbert_for_sentence_similarity.py (+0, -25)
  33. modelscope/models/nlp/sbert_for_sentiment_classification.py (+0, -22)
  34. modelscope/models/nlp/sbert_for_sequence_classification.py (+0, -82)
  35. modelscope/models/nlp/sbert_for_token_classification.py (+0, -64)
  36. modelscope/models/nlp/sbert_for_zero_shot_classification.py (+0, -50)
  37. modelscope/models/nlp/sequence_classification.py (+155, -66)
  38. modelscope/models/nlp/space/__init__.py (+28, -0)
  39. modelscope/models/nlp/space/model/__init__.py (+10, -0)
  40. modelscope/models/nlp/space/model/configuration_space.py (+32, -0)
  41. modelscope/models/nlp/space/model/gen_unified_transformer.py (+0, -0)
  42. modelscope/models/nlp/space/model/generator.py (+0, -0)
  43. modelscope/models/nlp/space/model/intent_unified_transformer.py (+0, -0)
  44. modelscope/models/nlp/space/model/model_base.py (+0, -0)
  45. modelscope/models/nlp/space/model/modeling_space.py (+268, -0)
  46. modelscope/models/nlp/space/model/tokenization_space.py (+29, -0)
  47. modelscope/models/nlp/space/model/unified_transformer.py (+3, -4)
  48. modelscope/models/nlp/space/modules/__init__.py (+0, -0)
  49. modelscope/models/nlp/space/modules/embedder.py (+0, -0)
  50. modelscope/models/nlp/space/modules/feedforward.py (+0, -0)
  51. modelscope/models/nlp/space/modules/functions.py (+0, -0)
  52. modelscope/models/nlp/space/modules/multihead_attention.py (+0, -0)
  53. modelscope/models/nlp/space/modules/transformer_block.py (+0, -0)
  54. modelscope/models/nlp/space/space_for_dialog_intent_prediction.py (+1, -1)
  55. modelscope/models/nlp/space/space_for_dialog_modeling.py (+1, -1)
  56. modelscope/models/nlp/space/space_for_dialog_state_tracking.py (+1, -1)
  57. modelscope/models/nlp/structbert/__init__.py (+45, -0)
  58. modelscope/models/nlp/structbert/adv_utils.py (+4, -2)
  59. modelscope/models/nlp/structbert/configuration_sbert.py (+7, -4)
  60. modelscope/models/nlp/structbert/modeling_sbert.py (+1964, -0)
  61. modelscope/models/nlp/structbert/tokenization_sbert.py (+516, -0)
  62. modelscope/models/nlp/structbert/tokenization_sbert_fast.py (+200, -0)
  63. modelscope/models/nlp/task_models/__init__.py (+0, -0)
  64. modelscope/models/nlp/task_models/sequence_classification.py (+86, -0)
  65. modelscope/models/nlp/task_models/task_model.py (+7, -4)
  66. modelscope/models/nlp/token_classification.py (+147, -0)
  67. modelscope/models/nlp/veco/__init__.py (+43, -0)
  68. modelscope/models/nlp/veco/configuration_veco.py (+33, -0)
  69. modelscope/models/nlp/veco/modeling_veco.py (+143, -0)
  70. modelscope/models/nlp/veco/tokenization_veco.py (+321, -0)
  71. modelscope/models/nlp/veco/tokenization_veco_fast.py (+213, -0)
  72. modelscope/msdatasets/ms_dataset.py (+7, -0)
  73. modelscope/outputs.py (+1, -0)
  74. modelscope/pipelines/nlp/__init__.py (+6, -7)
  75. modelscope/pipelines/nlp/fill_mask_pipeline.py (+10, -11)
  76. modelscope/pipelines/nlp/named_entity_recognition_pipeline.py (+5, -7)
  77. modelscope/pipelines/nlp/nli_pipeline.py (+0, -73)
  78. modelscope/pipelines/nlp/pair_sentence_classification_pipeline.py (+37, -0)
  79. modelscope/pipelines/nlp/sentence_similarity_pipeline.py (+0, -73)
  80. modelscope/pipelines/nlp/sentiment_classification_pipeline.py (+0, -74)
  81. modelscope/pipelines/nlp/sequence_classification_pipeline_base.py (+60, -0)
  82. modelscope/pipelines/nlp/single_sentence_classification_pipeline.py (+35, -0)
  83. modelscope/pipelines/nlp/text_generation_pipeline.py (+4, -4)
  84. modelscope/pipelines/nlp/translation_pipeline.py (+1, -3)
  85. modelscope/pipelines/nlp/word_segmentation_pipeline.py (+19, -17)
  86. modelscope/pipelines/nlp/zero_shot_classification_pipeline.py (+13, -14)
  87. modelscope/preprocessors/__init__.py (+7, -7)
  88. modelscope/preprocessors/base.py (+3, -1)
  89. modelscope/preprocessors/nlp.py (+302, -204)
  90. modelscope/preprocessors/space/dialog_state_tracking_preprocessor.py (+1, -1)
  91. modelscope/task_datasets/__init__.py (+2, -0)
  92. modelscope/task_datasets/base.py (+3, -3)
  93. modelscope/task_datasets/torch_base_dataset.py (+3, -3)
  94. modelscope/task_datasets/veco_dataset.py (+76, -0)
  95. modelscope/trainers/__init__.py (+1, -0)
  96. modelscope/trainers/hooks/evaluation_hook.py (+1, -0)
  97. modelscope/trainers/hooks/lr_scheduler_hook.py (+5, -3)
  98. modelscope/trainers/nlp_trainer.py (+192, -0)
  99. modelscope/trainers/trainer.py (+34, -24)
  100. modelscope/trainers/utils/inference.py (+17, -14)

configs/nlp/sbert_sentence_similarity.json (+1, -1)

@@ -2,7 +2,7 @@
"framework": "pytorch",
"task": "sentence-similarity",
"preprocessor": {
"type": "bert-seq-cls-tokenizer-finetune",
"type": "sen-sim-tokenizer",
"first_sequence": "sentence1",
"second_sequence": "sentence2"
},


modelscope/hub/utils/utils.py (+1, -1)

@@ -4,7 +4,7 @@ from modelscope.hub.constants import (DEFAULT_MODELSCOPE_DOMAIN,
DEFAULT_MODELSCOPE_GROUP,
MODEL_ID_SEPARATOR,
MODELSCOPE_URL_SCHEME)
- from modelscope.utils.utils import get_default_cache_dir
+ from modelscope.utils.file_utils import get_default_cache_dir


def model_id_to_group_owner_name(model_id):


modelscope/metainfo.py (+8, -2)

@@ -53,6 +53,10 @@ class TaskModels(object):
class Heads(object):
# nlp heads
text_classification = 'text-classification'
+ # mlm
+ bert_mlm = 'bert-mlm'
+ # roberta mlm
+ roberta_mlm = 'roberta-mlm'


class Pipelines(object):
@@ -137,7 +141,7 @@ class Trainers(object):
Holds the standard trainer name to use for identifying different trainer.
This should be used to register trainers.

- For a general Trainer, you can use easynlp-trainer/ofa-trainer/sofa-trainer.
+ For a general Trainer, you can use easynlp-trainer/ofa-trainer.
For a model specific Trainer, you can use ${ModelName}-${Task}-trainer.
"""

@@ -179,6 +183,8 @@ class Preprocessors(object):
sbert_token_cls_tokenizer = 'sbert-token-cls-tokenizer'
zero_shot_cls_tokenizer = 'zero-shot-cls-tokenizer'
text_error_correction = 'text-error-correction'
+ word_segment_text_to_label_preprocessor = 'word-segment-text-to-label-preprocessor'
+ fill_mask = 'fill-mask'

# audio preprocessor
linear_aec_fbank = 'linear-aec-fbank'
@@ -204,7 +210,7 @@ class Metrics(object):
# metric for image instance segmentation task
image_ins_seg_coco_metric = 'image-ins-seg-coco-metric'
# metrics for sequence classification task
- seq_cls_metric = 'seq_cls_metric'
+ seq_cls_metric = 'seq-cls-metric'
# metrics for token-classification task
token_cls_metric = 'token-cls-metric'
# metrics for text-generation task


modelscope/metrics/__init__.py (+2, -0)

@@ -13,6 +13,7 @@ if TYPE_CHECKING:
from .image_portrait_enhancement_metric import ImagePortraitEnhancementMetric
from .sequence_classification_metric import SequenceClassificationMetric
from .text_generation_metric import TextGenerationMetric
+ from .token_classification_metric import TokenClassificationMetric

else:
_import_structure = {
@@ -26,6 +27,7 @@ else:
['ImagePortraitEnhancementMetric'],
'sequence_classification_metric': ['SequenceClassificationMetric'],
'text_generation_metric': ['TextGenerationMetric'],
+ 'token_classification_metric': ['TokenClassificationMetric'],
}

import sys


modelscope/metrics/base.py (+3, -0)

@@ -10,6 +10,9 @@ class Metric(ABC):
complex metrics for a specific task with or without other Metric subclasses.
"""

+ def __init__(self, trainer=None, *args, **kwargs):
+     self.trainer = trainer

@abstractmethod
def add(self, outputs: Dict, inputs: Dict):
""" Append logits and labels within an eval loop.


modelscope/metrics/builder.py (+2, -0)

@@ -20,7 +20,9 @@ class MetricKeys(object):
task_default_metrics = {
Tasks.image_segmentation: [Metrics.image_ins_seg_coco_metric],
Tasks.sentence_similarity: [Metrics.seq_cls_metric],
Tasks.nli: [Metrics.seq_cls_metric],
Tasks.sentiment_classification: [Metrics.seq_cls_metric],
Tasks.token_classification: [Metrics.token_cls_metric],
Tasks.text_generation: [Metrics.text_gen_metric],
Tasks.image_denoising: [Metrics.image_denoise_metric],
Tasks.image_color_enhancement: [Metrics.image_color_enhance_metric],


modelscope/metrics/sequence_classification_metric.py (+4, -4)

@@ -17,14 +17,14 @@ class SequenceClassificationMetric(Metric):
"""The metric computation class for sequence classification classes.
"""

- label_name = 'labels'

- def __init__(self):
+ def __init__(self, *args, **kwargs):
+     super().__init__(*args, **kwargs)
self.preds = []
self.labels = []

def add(self, outputs: Dict, inputs: Dict):
- ground_truths = inputs[self.label_name]
+ label_name = OutputKeys.LABEL if OutputKeys.LABEL in inputs else OutputKeys.LABELS
+ ground_truths = inputs[label_name]
eval_results = outputs[OutputKeys.LOGITS]
self.preds.append(
torch_nested_numpify(torch_nested_detach(eval_results)))


modelscope/metrics/token_classification_metric.py (+123, -0)

@@ -0,0 +1,123 @@
import importlib
from typing import Dict, List, Optional, Union

import numpy as np

from modelscope.outputs import OutputKeys
from ..metainfo import Metrics
from ..utils.registry import default_group
from ..utils.tensor_utils import torch_nested_detach, torch_nested_numpify
from .base import Metric
from .builder import METRICS, MetricKeys


@METRICS.register_module(
group_key=default_group, module_name=Metrics.token_cls_metric)
class TokenClassificationMetric(Metric):
"""
The metric computation class for token-classification task.
Args:
return_entity_level_metrics (bool, *optional*):
Whether to return every label's detail metrics, default False.
"""

def add(self, outputs: Dict, inputs: Dict):
label_name = OutputKeys.LABEL if OutputKeys.LABEL in inputs else OutputKeys.LABELS
ground_truths = inputs[label_name]
eval_results = outputs[OutputKeys.LOGITS]
self.preds.append(
torch_nested_numpify(torch_nested_detach(eval_results)))
self.labels.append(
torch_nested_numpify(torch_nested_detach(ground_truths)))

def __init__(self, return_entity_level_metrics=False, *args, **kwargs):
super().__init__(*args, **kwargs)
self.return_entity_level_metrics = return_entity_level_metrics
self.preds = []
self.labels = []

def evaluate(self):
self.id2label = {
id: label
for label, id in self.trainer.label2id.items()
}
self.preds = np.concatenate(self.preds, axis=0)
self.labels = np.concatenate(self.labels, axis=0)
predictions = np.argmax(self.preds, axis=-1)

true_predictions = [[
self.id2label[p] for (p, lb) in zip(prediction, label)
if lb != -100
] for prediction, label in zip(predictions, self.labels)]
true_labels = [[
self.id2label[lb] for (p, lb) in zip(prediction, label)
if lb != -100
] for prediction, label in zip(predictions, self.labels)]

results = self._compute(
predictions=true_predictions, references=true_labels)
if self.return_entity_level_metrics:
final_results = {}
for key, value in results.items():
if isinstance(value, dict):
for n, v in value.items():
final_results[f'{key}_{n}'] = v
else:
final_results[key] = value
return final_results
else:
return {
MetricKeys.PRECISION: results[MetricKeys.PRECISION],
MetricKeys.RECALL: results[MetricKeys.RECALL],
MetricKeys.F1: results[MetricKeys.F1],
MetricKeys.ACCURACY: results[MetricKeys.ACCURACY],
}

@staticmethod
def _compute(
predictions,
references,
suffix: bool = False,
scheme: Optional[str] = None,
mode: Optional[str] = None,
sample_weight: Optional[List[int]] = None,
zero_division: Union[str, int] = 'warn',
):
from seqeval.metrics import accuracy_score, classification_report
if scheme is not None:
try:
scheme_module = importlib.import_module('seqeval.scheme')
scheme = getattr(scheme_module, scheme)
except AttributeError:
raise ValueError(
f'Scheme should be one of [IOB1, IOB2, IOE1, IOE2, IOBES, BILOU], got {scheme}'
)
report = classification_report(
y_true=references,
y_pred=predictions,
suffix=suffix,
output_dict=True,
scheme=scheme,
mode=mode,
sample_weight=sample_weight,
zero_division=zero_division,
)
report.pop('macro avg')
report.pop('weighted avg')
overall_score = report.pop('micro avg')

scores = {
type_name: {
MetricKeys.PRECISION: score['precision'],
MetricKeys.RECALL: score['recall'],
MetricKeys.F1: score['f1-score'],
'number': score['support'],
}
for type_name, score in report.items()
}
scores[MetricKeys.PRECISION] = overall_score['precision']
scores[MetricKeys.RECALL] = overall_score['recall']
scores[MetricKeys.F1] = overall_score['f1-score']
scores[MetricKeys.ACCURACY] = accuracy_score(
y_true=references, y_pred=predictions)
return scores
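For reference, the seqeval-based _compute helper above can be exercised on its own (a minimal sketch; the label sequences are invented and seqeval must be installed):

from modelscope.metrics.token_classification_metric import TokenClassificationMetric

references = [['B-PER', 'I-PER', 'O'], ['B-LOC', 'O']]  # made-up gold labels
predictions = [['B-PER', 'I-PER', 'O'], ['O', 'O']]     # made-up predictions

scores = TokenClassificationMetric._compute(
    predictions=predictions, references=references)
# overall precision/recall/f1/accuracy plus one sub-dict per entity type
print({k: v for k, v in scores.items() if not isinstance(v, dict)})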

modelscope/models/base/base_model.py (+10, -7)

@@ -10,6 +10,8 @@ from modelscope.hub.snapshot_download import snapshot_download
from modelscope.models.builder import build_model
from modelscope.utils.config import Config
from modelscope.utils.constant import DEFAULT_MODEL_REVISION, ModelFile
+ from modelscope.utils.file_utils import func_receive_dict_inputs
+ from modelscope.utils.hub import parse_label_mapping
from modelscope.utils.logger import get_logger

logger = get_logger()
@@ -69,6 +71,7 @@ class Model(ABC):
def from_pretrained(cls,
model_name_or_path: str,
revision: Optional[str] = DEFAULT_MODEL_REVISION,
+ cfg_dict: Config = None,
*model_args,
**kwargs):
""" Instantiate a model from local directory or remote model repo. Note
@@ -87,25 +90,25 @@ class Model(ABC):
)
local_model_dir = snapshot_download(model_name_or_path, revision)
logger.info(f'initialize model from {local_model_dir}')
- cfg = Config.from_file(
-     osp.join(local_model_dir, ModelFile.CONFIGURATION))
+ if cfg_dict is not None:
+     cfg = cfg_dict
+ else:
+     cfg = Config.from_file(
+         osp.join(local_model_dir, ModelFile.CONFIGURATION))
task_name = cfg.task
model_cfg = cfg.model
assert hasattr(
cfg, 'pipeline'), 'pipeline config is missing from config file.'
pipeline_cfg = cfg.pipeline
# TODO @wenmeng.zwm may should manually initialize model after model building

if hasattr(model_cfg, 'model_type') and not hasattr(model_cfg, 'type'):
model_cfg.type = model_cfg.model_type

model_cfg.model_dir = local_model_dir

for k, v in kwargs.items():
model_cfg[k] = v
model = build_model(
model_cfg, task_name=task_name, default_args=kwargs)

# dynamically add pipeline info to model for pipeline inference
- model.pipeline = pipeline_cfg
+ if hasattr(cfg, 'pipeline'):
+     model.pipeline = cfg.pipeline
return model
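The new cfg_dict argument above lets callers pass an already-built Config instead of reading configuration.json from the downloaded snapshot (a sketch; the model id and file path are placeholders):

from modelscope.models import Model
from modelscope.utils.config import Config

# default: the configuration.json inside the model snapshot is used
model = Model.from_pretrained('damo/nlp_structbert_sentence-similarity_chinese-base')

# override: supply a pre-built Config via cfg_dict
cfg = Config.from_file('path/to/custom_configuration.json')
model = Model.from_pretrained(
    'damo/nlp_structbert_sentence-similarity_chinese-base', cfg_dict=cfg)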

modelscope/models/base/base_torch_model.py (+8, -3)

@@ -5,6 +5,7 @@ from typing import Any, Dict, Optional, Union
import torch
from torch import nn

+ from modelscope.utils.file_utils import func_receive_dict_inputs
from modelscope.utils.logger import get_logger
from .base_model import Model

@@ -20,6 +21,13 @@ class TorchModel(Model, torch.nn.Module):
super().__init__(model_dir, *args, **kwargs)
torch.nn.Module.__init__(self)

+ def __call__(self, input: Dict[str,
+              torch.Tensor]) -> Dict[str, torch.Tensor]:
+     if func_receive_dict_inputs(self.forward):
+         return self.postprocess(self.forward(input))
+     else:
+         return self.postprocess(self.forward(**input))

def forward(self, inputs: Dict[str,
torch.Tensor]) -> Dict[str, torch.Tensor]:
raise NotImplementedError
@@ -50,6 +58,3 @@ class TorchModel(Model, torch.nn.Module):
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)

- def compute_loss(self, outputs: Dict[str, Any], labels):
-     raise NotImplementedError()
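The __call__ added above routes a single input dict either to forward(**input) or to forward(input), depending on whether the subclass declares a dict-style forward, and then applies postprocess. A hypothetical subclass with a keyword-style forward (a sketch, assuming TorchModel is exported from modelscope.models.base and the default postprocess is a pass-through):

import torch

from modelscope.models.base import TorchModel


class ToyClassifier(TorchModel):  # hypothetical, for illustration only

    def __init__(self):
        super().__init__(model_dir=None)
        self.linear = torch.nn.Linear(4, 2)

    # keyword-style forward: __call__ unpacks the input dict into **kwargs
    def forward(self, input_ids, **kwargs):
        return {'logits': self.linear(input_ids)}


model = ToyClassifier()
outputs = model({'input_ids': torch.randn(2, 4)})  # dispatched as forward(**input)
print(outputs['logits'].shape)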

modelscope/models/nlp/__init__.py (+19, -29)

@@ -4,32 +4,26 @@ from typing import TYPE_CHECKING
from modelscope.utils.import_utils import LazyImportModule

if TYPE_CHECKING:
from .backbones import (SbertModel, SpaceGenerator, SpaceModelBase,
GPT3Model)
from .backbones import SbertModel
from .heads import SequenceClassificationHead
from .bert_for_sequence_classification import BertForSequenceClassification
from .csanmt_for_translation import CsanmtForTranslation
from .masked_language import (StructBertForMaskedLM, VecoForMaskedLM,
BertForMaskedLM)
from .nncrf_for_named_entity_recognition import TransformerCRFForNamedEntityRecognition
from .palm_for_text_generation import PalmForTextGeneration
from .sbert_for_nli import SbertForNLI
from .sbert_for_sentence_similarity import SbertForSentenceSimilarity
from .sbert_for_sentiment_classification import SbertForSentimentClassification
from .sbert_for_token_classification import SbertForTokenClassification
from .sbert_for_zero_shot_classification import SbertForZeroShotClassification
from .sequence_classification import SequenceClassificationModel
from .space_for_dialog_intent_prediction import SpaceForDialogIntent
from .space_for_dialog_modeling import SpaceForDialogModeling
from .space_for_dialog_state_tracking import SpaceForDialogStateTracking
from .task_model import SingleBackboneTaskModelBase
from .palm_v2 import PalmForTextGeneration
from .token_classification import SbertForTokenClassification
from .sequence_classification import VecoForSequenceClassification, SbertForSequenceClassification
from .space import SpaceForDialogIntent
from .space import SpaceForDialogModeling
from .space import SpaceForDialogStateTracking
from .task_models.task_model import SingleBackboneTaskModelBase
from .bart_for_text_error_correction import BartForTextErrorCorrection
from .gpt3_for_text_generation import GPT3ForTextGeneration
from .gpt3 import GPT3ForTextGeneration

else:
_import_structure = {
'backbones':
['SbertModel', 'SpaceGenerator', 'SpaceModelBase', 'GPT3Model'],
'backbones': ['SbertModel'],
'heads': ['SequenceClassificationHead'],
'csanmt_for_translation': ['CsanmtForTranslation'],
'bert_for_sequence_classification': ['BertForSequenceClassification'],
@@ -37,21 +31,17 @@ else:
['StructBertForMaskedLM', 'VecoForMaskedLM', 'BertForMaskedLM'],
'nncrf_for_named_entity_recognition':
['TransformerCRFForNamedEntityRecognition'],
'palm_for_text_generation': ['PalmForTextGeneration'],
'sbert_for_nli': ['SbertForNLI'],
'sbert_for_sentence_similarity': ['SbertForSentenceSimilarity'],
'sbert_for_sentiment_classification':
['SbertForSentimentClassification'],
'sbert_for_token_classification': ['SbertForTokenClassification'],
'sbert_for_zero_shot_classification':
['SbertForZeroShotClassification'],
'sequence_classification': ['SequenceClassificationModel'],
'space_for_dialog_intent_prediction': ['SpaceForDialogIntent'],
'space_for_dialog_modeling': ['SpaceForDialogModeling'],
'space_for_dialog_state_tracking': ['SpaceForDialogStateTracking'],
'palm_v2': ['PalmForTextGeneration'],
'token_classification': ['SbertForTokenClassification'],
'sequence_classification':
['VecoForSequenceClassification', 'SbertForSequenceClassification'],
'space': [
'SpaceForDialogIntent', 'SpaceForDialogModeling',
'SpaceForDialogStateTracking'
],
'task_model': ['SingleBackboneTaskModelBase'],
'bart_for_text_error_correction': ['BartForTextErrorCorrection'],
'gpt3_for_text_generation': ['GPT3ForTextGeneration'],
'gpt3': ['GPT3ForTextGeneration'],
}

import sys


modelscope/models/nlp/backbones/__init__.py (+0, -4)

@@ -4,14 +4,10 @@ from typing import TYPE_CHECKING
from modelscope.utils.import_utils import LazyImportModule

if TYPE_CHECKING:
- from .space import SpaceGenerator, SpaceModelBase
from .structbert import SbertModel
- from .gpt3 import GPT3Model
else:
_import_structure = {
- 'space': ['SpaceGenerator', 'SpaceModelBase'],
'structbert': ['SbertModel'],
- 'gpt3': ['GPT3Model']
}

import sys


modelscope/models/nlp/backbones/space/__init__.py (+0, -2)

@@ -1,2 +0,0 @@
from .model.generator import Generator as SpaceGenerator
from .model.model_base import SpaceModelBase

modelscope/models/nlp/backbones/space/model/__init__.py (+0, -3)

@@ -1,3 +0,0 @@
from .gen_unified_transformer import GenUnifiedTransformer
from .intent_unified_transformer import IntentUnifiedTransformer
from .unified_transformer import UnifiedTransformer

modelscope/models/nlp/backbones/structbert.py (+54, -0)

@@ -0,0 +1,54 @@
from transformers import PreTrainedModel

from modelscope.metainfo import Models
from modelscope.models.base import TorchModel
from modelscope.models.builder import BACKBONES
from modelscope.models.nlp.structbert import SbertConfig
from modelscope.models.nlp.structbert import SbertModel as SbertModelTransform
from modelscope.utils.constant import Fields
from modelscope.utils.logger import get_logger

logger = get_logger(__name__)


@BACKBONES.register_module(Fields.nlp, module_name=Models.structbert)
class SbertModel(TorchModel, SbertModelTransform):

def __init__(self, model_dir=None, add_pooling_layer=True, **config):
"""
Args:
model_dir (str, optional): The model checkpoint directory. Defaults to None.
add_pooling_layer (bool, optional): to decide if pool the output from hidden layer. Defaults to True.
"""
config = SbertConfig(**config)
super().__init__(model_dir)
self.config = config
SbertModelTransform.__init__(self, config, add_pooling_layer)

def extract_sequence_outputs(self, outputs):
return outputs['last_hidden_state']

def extract_pooled_outputs(self, outputs):
return outputs['pooler_output']

def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
past_key_values=None,
use_cache=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
return SbertModelTransform.forward(
self, input_ids, attention_mask, token_type_ids, position_ids,
head_mask, inputs_embeds, encoder_hidden_states,
encoder_attention_mask, past_key_values, use_cache,
output_attentions, output_hidden_states, return_dict)
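With the registration above, the backbone can also be instantiated directly from config values and queried through the two extract_* helpers (a sketch with tiny made-up config values; in practice the registry builds it from the model's configuration.json):

import torch

from modelscope.models.nlp.backbones import SbertModel

# tiny, made-up config for illustration; real values come from configuration.json
backbone = SbertModel(
    model_dir=None, vocab_size=21128, hidden_size=128,
    num_hidden_layers=2, num_attention_heads=4, intermediate_size=256)

inputs = {
    'input_ids': torch.randint(0, 21128, (1, 16)),
    'attention_mask': torch.ones(1, 16, dtype=torch.long),
}
outputs = backbone(inputs)  # TorchModel.__call__ unpacks the dict into forward(**inputs)
sequence_output = backbone.extract_sequence_outputs(outputs)  # (1, 16, 128)
pooled_output = backbone.extract_pooled_outputs(outputs)      # (1, 128)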

modelscope/models/nlp/backbones/structbert/__init__.py (+0, -19)

@@ -1,19 +0,0 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from typing import TYPE_CHECKING

from modelscope.utils.import_utils import LazyImportModule

if TYPE_CHECKING:
from .modeling_sbert import SbertModel
else:
_import_structure = {'modeling_sbert': ['SbertModel']}

import sys

sys.modules[__name__] = LazyImportModule(
__name__,
globals()['__file__'],
_import_structure,
module_spec=__spec__,
extra_objects={},
)

modelscope/models/nlp/backbones/structbert/modeling_sbert.py (+0, -815)

@@ -1,815 +0,0 @@
import math
from dataclasses import dataclass
from typing import Optional, Tuple, Union

import torch
import torch.utils.checkpoint
from packaging import version
from torch import nn
from transformers import PreTrainedModel
from transformers.activations import ACT2FN
from transformers.modeling_outputs import (
BaseModelOutputWithPastAndCrossAttentions,
BaseModelOutputWithPoolingAndCrossAttentions, ModelOutput)
from transformers.modeling_utils import (apply_chunking_to_forward,
find_pruneable_heads_and_indices,
prune_linear_layer)

from modelscope.metainfo import Models
from modelscope.models.base import TorchModel
from modelscope.models.builder import BACKBONES
from modelscope.utils.constant import Fields
from modelscope.utils.logger import get_logger
from .configuration_sbert import SbertConfig

logger = get_logger(__name__)


@BACKBONES.register_module(Fields.nlp, module_name=Models.structbert)
class SbertModel(TorchModel, PreTrainedModel):
"""

The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
cross-attention is added between the self-attention layers, following the architecture described in `Attention is
all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.

To behave as an decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration
set to :obj:`True`. To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder`
argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an
input to the forward pass.
"""

def __init__(self, model_dir=None, add_pooling_layer=True, **config):
"""
Args:
model_dir (str, optional): The model checkpoint directory. Defaults to None.
add_pooling_layer (bool, optional): to decide if pool the output from hidden layer. Defaults to True.
"""
config = SbertConfig(**config)
super().__init__(model_dir)
self.config = config

self.embeddings = SbertEmbeddings(config)
self.encoder = SbertEncoder(config)

self.pooler = SbertPooler(config) if add_pooling_layer else None
self.init_weights()

def get_input_embeddings(self):
return self.embeddings.word_embeddings

def set_input_embeddings(self, value):
self.embeddings.word_embeddings = value

def _prune_heads(self, heads_to_prune):
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
class PreTrainedModel
"""
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)

def forward(self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
past_key_values=None,
use_cache=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
**kwargs):
r"""
encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`
, `optional`):
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
the model is configured as a decoder.
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:

- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers`
with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads,
sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
(those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
use_cache (:obj:`bool`, `optional`):
If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
decoding (see :obj:`past_key_values`).
"""

output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else
self.config.output_hidden_states)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict

if self.config.is_decoder:
use_cache = use_cache if use_cache is not None else self.config.use_cache
else:
use_cache = False

if input_ids is not None and inputs_embeds is not None:
raise ValueError(
'You cannot specify both input_ids and inputs_embeds at the same time'
)
elif input_ids is not None:
input_shape = input_ids.size()
elif inputs_embeds is not None:
input_shape = inputs_embeds.size()[:-1]
else:
raise ValueError(
'You have to specify either input_ids or inputs_embeds')

batch_size, seq_length = input_shape
device = input_ids.device if input_ids is not None else inputs_embeds.device

# past_key_values_length
past_key_values_length = past_key_values[0][0].shape[
2] if past_key_values is not None else 0

if attention_mask is None:
attention_mask = torch.ones(
((batch_size, seq_length + past_key_values_length)),
device=device)

if token_type_ids is None:
if hasattr(self.embeddings, 'token_type_ids'):
buffered_token_type_ids = self.embeddings.token_type_ids[:, :
seq_length]
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(
batch_size, seq_length)
token_type_ids = buffered_token_type_ids_expanded
else:
token_type_ids = torch.zeros(
input_shape, dtype=torch.long, device=device)

# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
# ourselves in which case we just need to make it broadcastable to all heads.
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(
attention_mask, input_shape, device)

# If a 2D or 3D attention mask is provided for the cross-attention
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
if self.config.is_decoder and encoder_hidden_states is not None:
encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size(
)
encoder_hidden_shape = (encoder_batch_size,
encoder_sequence_length)
if encoder_attention_mask is None:
encoder_attention_mask = torch.ones(
encoder_hidden_shape, device=device)
encoder_extended_attention_mask = self.invert_attention_mask(
encoder_attention_mask)
else:
encoder_extended_attention_mask = None

# Prepare head mask if needed
# 1.0 in head_mask indicate we keep the head
# attention_probs has shape bsz x n_heads x N x N
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
head_mask = self.get_head_mask(head_mask,
self.config.num_hidden_layers)

embedding_output, orignal_embeds = self.embeddings(
input_ids=input_ids,
position_ids=position_ids,
token_type_ids=token_type_ids,
inputs_embeds=inputs_embeds,
past_key_values_length=past_key_values_length,
return_inputs_embeds=True,
)
encoder_outputs = self.encoder(
embedding_output,
attention_mask=extended_attention_mask,
head_mask=head_mask,
encoder_hidden_states=encoder_hidden_states,
encoder_attention_mask=encoder_extended_attention_mask,
past_key_values=past_key_values,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = encoder_outputs[0]
pooled_output = self.pooler(
sequence_output) if self.pooler is not None else None

if not return_dict:
return (sequence_output,
pooled_output) + encoder_outputs[1:] + (orignal_embeds, )

return BaseModelOutputWithPoolingAndCrossAttentionsWithEmbedding(
last_hidden_state=sequence_output,
pooler_output=pooled_output,
past_key_values=encoder_outputs.past_key_values,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
cross_attentions=encoder_outputs.cross_attentions,
embedding_output=orignal_embeds)

def extract_sequence_outputs(self, outputs):
return outputs['last_hidden_state']

def extract_pooled_outputs(self, outputs):
return outputs['pooler_output']


class SbertEmbeddings(nn.Module):
"""Construct the embeddings from word, position and token_type embeddings."""

def __init__(self, config):
super().__init__()
self.word_embeddings = nn.Embedding(
config.vocab_size,
config.hidden_size,
padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings,
config.hidden_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size,
config.hidden_size)

# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
# any TensorFlow checkpoint file
self.LayerNorm = nn.LayerNorm(
config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
# position_ids (1, len position emb) is contiguous in memory and exported when serialized
self.position_embedding_type = getattr(config,
'position_embedding_type',
'absolute')
self.register_buffer(
'position_ids',
torch.arange(config.max_position_embeddings).expand((1, -1)))
if version.parse(torch.__version__) > version.parse('1.6.0'):
self.register_buffer(
'token_type_ids',
torch.zeros(
self.position_ids.size(),
dtype=torch.long,
device=self.position_ids.device),
persistent=False,
)

def forward(self,
input_ids=None,
token_type_ids=None,
position_ids=None,
inputs_embeds=None,
past_key_values_length=0,
return_inputs_embeds=False):
if input_ids is not None:
input_shape = input_ids.size()
else:
input_shape = inputs_embeds.size()[:-1]

seq_length = input_shape[1]

if position_ids is None:
position_ids = self.position_ids[:,
past_key_values_length:seq_length
+ past_key_values_length]

# Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs
# when its auto-generated, registered buffer helps users when tracing the model without passing token_type_ids
# issue #5664
if token_type_ids is None:
if hasattr(self, 'token_type_ids'):
buffered_token_type_ids = self.token_type_ids[:, :seq_length]
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(
input_shape[0], seq_length)
token_type_ids = buffered_token_type_ids_expanded
else:
token_type_ids = torch.zeros(
input_shape,
dtype=torch.long,
device=self.position_ids.device)

if inputs_embeds is None:
inputs_embeds = self.word_embeddings(input_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)

embeddings = inputs_embeds + token_type_embeddings
if self.position_embedding_type == 'absolute':
position_embeddings = self.position_embeddings(position_ids)
embeddings += position_embeddings
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
if not return_inputs_embeds:
return embeddings
else:
return embeddings, inputs_embeds


class SbertSelfAttention(nn.Module):

def __init__(self, config):
super().__init__()
if config.hidden_size % config.num_attention_heads != 0 and not hasattr(
config, 'embedding_size'):
raise ValueError(
f'The hidden size ({config.hidden_size}) is not a multiple of the number of attention '
f'heads ({config.num_attention_heads})')

self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size
/ config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size

self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)

self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
self.position_embedding_type = getattr(config,
'position_embedding_type',
'absolute')
if self.position_embedding_type == 'relative_key' or self.position_embedding_type == 'relative_key_query':
self.max_position_embeddings = config.max_position_embeddings
self.distance_embedding = nn.Embedding(
2 * config.max_position_embeddings - 1,
self.attention_head_size)

self.is_decoder = config.is_decoder

def transpose_for_scores(self, x):
new_x_shape = x.size()[:-1] + (self.num_attention_heads,
self.attention_head_size)
x = x.view(*new_x_shape)
return x.permute(0, 2, 1, 3)

def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
past_key_value=None,
output_attentions=False,
):
mixed_query_layer = self.query(hidden_states)

# If this is instantiated as a cross-attention module, the keys
# and values come from an encoder; the attention mask needs to be
# such that the encoder's padding tokens are not attended to.
is_cross_attention = encoder_hidden_states is not None

if is_cross_attention and past_key_value is not None:
# reuse k,v, cross_attentions
key_layer = past_key_value[0]
value_layer = past_key_value[1]
attention_mask = encoder_attention_mask
elif is_cross_attention:
key_layer = self.transpose_for_scores(
self.key(encoder_hidden_states))
value_layer = self.transpose_for_scores(
self.value(encoder_hidden_states))
attention_mask = encoder_attention_mask
elif past_key_value is not None:
key_layer = self.transpose_for_scores(self.key(hidden_states))
value_layer = self.transpose_for_scores(self.value(hidden_states))
key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
else:
key_layer = self.transpose_for_scores(self.key(hidden_states))
value_layer = self.transpose_for_scores(self.value(hidden_states))

query_layer = self.transpose_for_scores(mixed_query_layer)

if self.is_decoder:
# if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.
# Further calls to cross_attention layer can then reuse all cross-attention
# key/value_states (first "if" case)
# if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of
# all previous decoder key/value_states. Further calls to uni-directional self-attention
# can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
# if encoder bi-directional self-attention `past_key_value` is always `None`
past_key_value = (key_layer, value_layer)

# Take the dot product between "query" and "key" to get the raw attention scores.
attention_scores = torch.matmul(query_layer,
key_layer.transpose(-1, -2))

if self.position_embedding_type == 'relative_key' or self.position_embedding_type == 'relative_key_query':
seq_length = hidden_states.size()[1]
position_ids_l = torch.arange(
seq_length, dtype=torch.long,
device=hidden_states.device).view(-1, 1)
position_ids_r = torch.arange(
seq_length, dtype=torch.long,
device=hidden_states.device).view(1, -1)
distance = position_ids_l - position_ids_r
positional_embedding = self.distance_embedding(
distance + self.max_position_embeddings - 1)
positional_embedding = positional_embedding.to(
dtype=query_layer.dtype) # fp16 compatibility

if self.position_embedding_type == 'relative_key':
relative_position_scores = torch.einsum(
'bhld,lrd->bhlr', query_layer, positional_embedding)
attention_scores = attention_scores + relative_position_scores
elif self.position_embedding_type == 'relative_key_query':
relative_position_scores_query = torch.einsum(
'bhld,lrd->bhlr', query_layer, positional_embedding)
relative_position_scores_key = torch.einsum(
'bhrd,lrd->bhlr', key_layer, positional_embedding)
attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key

attention_scores = attention_scores / math.sqrt(
self.attention_head_size)
if attention_mask is not None:
# Apply the attention mask is (precomputed for all layers in SbertModel forward() function)
attention_scores = attention_scores + attention_mask

# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)

# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)

# Mask heads if we want to
if head_mask is not None:
attention_probs = attention_probs * head_mask

context_layer = torch.matmul(attention_probs, value_layer)

context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (
self.all_head_size, )
context_layer = context_layer.view(*new_context_layer_shape)

outputs = (context_layer,
attention_probs) if output_attentions else (context_layer, )

if self.is_decoder:
outputs = outputs + (past_key_value, )
return outputs


class SbertSelfOutput(nn.Module):

def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(
config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)

def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states


class SbertAttention(nn.Module):

def __init__(self, config):
super().__init__()
self.self = SbertSelfAttention(config)
self.output = SbertSelfOutput(config)
self.pruned_heads = set()

def prune_heads(self, heads):
if len(heads) == 0:
return
heads, index = find_pruneable_heads_and_indices(
heads, self.self.num_attention_heads,
self.self.attention_head_size, self.pruned_heads)

# Prune linear layers
self.self.query = prune_linear_layer(self.self.query, index)
self.self.key = prune_linear_layer(self.self.key, index)
self.self.value = prune_linear_layer(self.self.value, index)
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)

# Update hyper params and store pruned heads
self.self.num_attention_heads = self.self.num_attention_heads - len(
heads)
self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
self.pruned_heads = self.pruned_heads.union(heads)

def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
past_key_value=None,
output_attentions=False,
):
self_outputs = self.self(
hidden_states,
attention_mask,
head_mask,
encoder_hidden_states,
encoder_attention_mask,
past_key_value,
output_attentions,
)
attention_output = self.output(self_outputs[0], hidden_states)
outputs = (attention_output,
) + self_outputs[1:] # add attentions if we output them
return outputs


class SbertIntermediate(nn.Module):

def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
if isinstance(config.hidden_act, str):
self.intermediate_act_fn = ACT2FN[config.hidden_act]
else:
self.intermediate_act_fn = config.hidden_act

def forward(self, hidden_states):
hidden_states = self.dense(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states)
return hidden_states


class SbertOutput(nn.Module):

def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(
config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)

def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states


class SbertLayer(nn.Module):

def __init__(self, config):
super().__init__()
self.chunk_size_feed_forward = config.chunk_size_feed_forward
self.seq_len_dim = 1
self.attention = SbertAttention(config)
self.is_decoder = config.is_decoder
self.add_cross_attention = config.add_cross_attention
if self.add_cross_attention:
if not self.is_decoder:
raise ValueError(
f'{self} should be used as a decoder model if cross attention is added'
)
self.crossattention = SbertAttention(config)
self.intermediate = SbertIntermediate(config)
self.output = SbertOutput(config)

def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
past_key_value=None,
output_attentions=False,
):
# decoder uni-directional self-attention cached key/values tuple is at positions 1,2
self_attn_past_key_value = past_key_value[:
2] if past_key_value is not None else None
self_attention_outputs = self.attention(
hidden_states,
attention_mask,
head_mask,
output_attentions=output_attentions,
past_key_value=self_attn_past_key_value,
)
attention_output = self_attention_outputs[0]

# if decoder, the last output is tuple of self-attn cache
if self.is_decoder:
outputs = self_attention_outputs[1:-1]
present_key_value = self_attention_outputs[-1]
else:
outputs = self_attention_outputs[
1:] # add self attentions if we output attention weights

cross_attn_present_key_value = None
if self.is_decoder and encoder_hidden_states is not None:
if not hasattr(self, 'crossattention'):
raise ValueError(
f'If `encoder_hidden_states` are passed, {self} has to be instantiated '
f'with cross-attention layers by setting `config.add_cross_attention=True`'
)

# cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple
cross_attn_past_key_value = past_key_value[
-2:] if past_key_value is not None else None
cross_attention_outputs = self.crossattention(
attention_output,
attention_mask,
head_mask,
encoder_hidden_states,
encoder_attention_mask,
cross_attn_past_key_value,
output_attentions,
)
attention_output = cross_attention_outputs[0]
outputs = outputs + cross_attention_outputs[
1:-1] # add cross attentions if we output attention weights

# add cross-attn cache to positions 3,4 of present_key_value tuple
cross_attn_present_key_value = cross_attention_outputs[-1]
present_key_value = present_key_value + cross_attn_present_key_value

layer_output = apply_chunking_to_forward(self.feed_forward_chunk,
self.chunk_size_feed_forward,
self.seq_len_dim,
attention_output)
outputs = (layer_output, ) + outputs

# if decoder, return the attn key/values as the last output
if self.is_decoder:
outputs = outputs + (present_key_value, )

return outputs

def feed_forward_chunk(self, attention_output):
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output, attention_output)
return layer_output
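
The feed-forward block of SbertLayer is wrapped with apply_chunking_to_forward, which applies feed_forward_chunk to slices of the sequence dimension (of size chunk_size_feed_forward) and concatenates the results, trading a little compute for lower peak memory. A minimal sketch of the same helper on a plain linear layer, assuming the usual transformers utility (its import path varies across transformers versions):

import torch
from torch import nn

try:
    from transformers.pytorch_utils import apply_chunking_to_forward
except ImportError:  # older transformers releases keep it in modeling_utils
    from transformers.modeling_utils import apply_chunking_to_forward

ff = nn.Linear(8, 8)

def feed_forward_chunk(hidden_states):
    # called once per slice of the sequence dimension
    return ff(hidden_states)

x = torch.randn(2, 16, 8)  # (batch, seq_len, hidden)
out = apply_chunking_to_forward(feed_forward_chunk, 4, 1, x)  # chunk_size=4, chunk_dim=1
assert out.shape == (2, 16, 8)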


class SbertEncoder(nn.Module):

def __init__(self, config):
super().__init__()
self.config = config
self.layer = nn.ModuleList(
[SbertLayer(config) for _ in range(config.num_hidden_layers)])
self.gradient_checkpointing = False

def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
past_key_values=None,
use_cache=None,
output_attentions=False,
output_hidden_states=False,
return_dict=True,
):
all_hidden_states = () if output_hidden_states else None
all_self_attentions = () if output_attentions else None
all_cross_attentions = (
) if output_attentions and self.config.add_cross_attention else None

next_decoder_cache = () if use_cache else None
for i, layer_module in enumerate(self.layer):
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states, )

layer_head_mask = head_mask[i] if head_mask is not None else None
past_key_value = past_key_values[
i] if past_key_values is not None else None

if self.gradient_checkpointing and self.training:

if use_cache:
logger.warning(
'`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...'
)
use_cache = False

def create_custom_forward(module):

def custom_forward(*inputs):
return module(*inputs, past_key_value,
output_attentions)

return custom_forward

layer_outputs = torch.utils.checkpoint.checkpoint(
create_custom_forward(layer_module),
hidden_states,
attention_mask,
layer_head_mask,
encoder_hidden_states,
encoder_attention_mask,
)
else:
layer_outputs = layer_module(
hidden_states,
attention_mask,
layer_head_mask,
encoder_hidden_states,
encoder_attention_mask,
past_key_value,
output_attentions,
)

hidden_states = layer_outputs[0]
if use_cache:
next_decoder_cache += (layer_outputs[-1], )
if output_attentions:
all_self_attentions = all_self_attentions + (
layer_outputs[1], )
if self.config.add_cross_attention:
all_cross_attentions = all_cross_attentions + (
layer_outputs[2], )

if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states, )

if not return_dict:
return tuple(v for v in [
hidden_states,
next_decoder_cache,
all_hidden_states,
all_self_attentions,
all_cross_attentions,
] if v is not None)
return BaseModelOutputWithPastAndCrossAttentions(
last_hidden_state=hidden_states,
past_key_values=next_decoder_cache,
hidden_states=all_hidden_states,
attentions=all_self_attentions,
cross_attentions=all_cross_attentions,
)


class SbertPooler(nn.Module):

def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.activation = nn.Tanh()

def forward(self, hidden_states):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token.
first_token_tensor = hidden_states[:, 0]
pooled_output = self.dense(first_token_tensor)
pooled_output = self.activation(pooled_output)
return pooled_output
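
The pooler therefore ignores every position except the first token. The same operation in isolation, with illustrative shapes:

import torch
from torch import nn

hidden_states = torch.randn(2, 16, 768)                  # (batch, seq_len, hidden_size)
dense, activation = nn.Linear(768, 768), nn.Tanh()
pooled_output = activation(dense(hidden_states[:, 0]))   # first ([CLS]-style) token only
assert pooled_output.shape == (2, 768)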


@dataclass
class SbertForPreTrainingOutput(ModelOutput):
"""
Output type of :class:`SbertForPreTraining`.

Args:
loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
Total loss as the sum of the masked language modeling loss and the next sequence prediction
(classification) loss.
prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when
``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.

Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when
``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.

Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
"""

loss: Optional[torch.FloatTensor] = None
prediction_logits: torch.FloatTensor = None
seq_relationship_logits: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None


@dataclass
class BaseModelOutputWithPoolingAndCrossAttentionsWithEmbedding(
BaseModelOutputWithPoolingAndCrossAttentions):
embedding_output: torch.FloatTensor = None
logits: Optional[Union[tuple, torch.FloatTensor]] = None
kwargs: dict = None

modelscope/models/nlp/backbones/gpt3/__init__.py → modelscope/models/nlp/gpt3/__init__.py

@@ -6,10 +6,12 @@ from modelscope.utils.import_utils import LazyImportModule
if TYPE_CHECKING:
from .configuration_gpt3 import GPT3Config
from .modeling_gpt3 import GPT3Model
from .gpt3_for_text_generation import GPT3ForTextGeneration
else:
_import_structure = {
'configuration_gpt3': ['GPT3Config'],
'modeling_gpt3': ['GPT3Model']
'modeling_gpt3': ['GPT3Model'],
'gpt3_for_text_generation': ['GPT3ForTextGeneration'],
}

import sys

modelscope/models/nlp/backbones/gpt3/configuration_gpt3.py → modelscope/models/nlp/gpt3/configuration_gpt3.py


modelscope/models/nlp/gpt3_for_text_generation.py → modelscope/models/nlp/gpt3/gpt3_for_text_generation.py

@@ -20,7 +20,7 @@ class GPT3ForTextGeneration(TorchModel):
"""
super().__init__(model_dir, *args, **kwargs)

from modelscope.models.nlp import GPT3Model
from modelscope.models.nlp.gpt3 import GPT3Model
from transformers import BertTokenizer

self.model = GPT3Model.from_pretrained(model_dir)

modelscope/models/nlp/backbones/gpt3/modeling_gpt3.py → modelscope/models/nlp/gpt3/modeling_gpt3.py


modelscope/models/nlp/heads/__init__.py (+3 -1)

@@ -5,9 +5,11 @@ from modelscope.utils.import_utils import LazyImportModule

if TYPE_CHECKING:
from .sequence_classification_head import SequenceClassificationHead
from .torch_pretrain_head import BertMLMHead, RobertaMLMHead
else:
_import_structure = {
'sequence_classification_head': ['SequenceClassificationHead']
'sequence_classification_head': ['SequenceClassificationHead'],
'torch_pretrain_head': ['BertMLMHead', 'RobertaMLMHead'],
}

import sys


modelscope/models/nlp/heads/sequence_classification_head.py (+1 -2)

@@ -1,5 +1,4 @@
import importlib
from typing import Dict, List, Optional, Union
from typing import Dict

import torch
import torch.nn.functional as F


modelscope/models/nlp/heads/torch_pretrain_head.py (+26 -0)

@@ -0,0 +1,26 @@
from typing import Dict

import torch
from transformers.models.bert.modeling_bert import BertOnlyMLMHead
from transformers.models.roberta.modeling_roberta import RobertaLMHead

from modelscope.metainfo import Heads
from modelscope.models.base import TorchHead
from modelscope.models.builder import HEADS
from modelscope.utils.constant import Tasks


@HEADS.register_module(Tasks.fill_mask, module_name=Heads.bert_mlm)
class BertMLMHead(BertOnlyMLMHead, TorchHead):

def compute_loss(self, outputs: Dict[str, torch.Tensor],
labels) -> Dict[str, torch.Tensor]:
raise NotImplementedError()


@HEADS.register_module(Tasks.fill_mask, module_name=Heads.roberta_mlm)
class RobertaMLMHead(RobertaLMHead, TorchHead):

def compute_loss(self, outputs: Dict[str, torch.Tensor],
labels) -> Dict[str, torch.Tensor]:
raise NotImplementedError()
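
Both heads are thin registry wrappers around the corresponding transformers modules, so the forward pass is the upstream one: encoder hidden states in, vocabulary logits out. A minimal sketch of what the wrapped BertOnlyMLMHead computes (the config values here are illustrative only):

import torch
from transformers import BertConfig
from transformers.models.bert.modeling_bert import BertOnlyMLMHead

config = BertConfig(hidden_size=768, vocab_size=21128)
head = BertOnlyMLMHead(config)
sequence_output = torch.randn(2, 16, config.hidden_size)    # encoder output
prediction_scores = head(sequence_output)                   # (batch, seq_len, vocab_size)
assert prediction_scores.shape == (2, 16, config.vocab_size)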

modelscope/models/nlp/masked_language.py (+100 -57)

@@ -1,72 +1,115 @@
from typing import Dict
from typing import Any, Dict, Optional, Union

import numpy as np
from transformers import BertForMaskedLM as BertForMaskedLMTransformer

from modelscope.metainfo import Models
from modelscope.models import TorchModel
from modelscope.models.base import Tensor
from modelscope.models.base import TorchModel
from modelscope.models.builder import MODELS
from modelscope.models.nlp.structbert import SbertForMaskedLM
from modelscope.models.nlp.veco import \
VecoForMaskedLM as VecoForMaskedLMTransformer
from modelscope.outputs import OutputKeys
from modelscope.utils.constant import Tasks

__all__ = ['BertForMaskedLM', 'StructBertForMaskedLM', 'VecoForMaskedLM']


class MaskedLanguageModelBase(TorchModel):

def __init__(self, model_dir: str, *args, **kwargs):
super().__init__(model_dir, *args, **kwargs)
self.model = self.build_model()

def build_model(self):
raise NotImplementedError()

def train(self):
return self.model.train()

def eval(self):
return self.model.eval()

@property
def config(self):
if hasattr(self.model, 'config'):
return self.model.config
return None

def forward(self, input: Dict[str, Tensor]) -> Dict[str, np.ndarray]:
"""return the result by the model

Args:
input (Dict[str, Any]): the preprocessed data

Returns:
Dict[str, np.ndarray]: results
"""
rst = self.model(
input_ids=input['input_ids'],
attention_mask=input['attention_mask'],
token_type_ids=input['token_type_ids'])
return {'logits': rst['logits'], 'input_ids': input['input_ids']}


@MODELS.register_module(Tasks.fill_mask, module_name=Models.structbert)
class StructBertForMaskedLM(MaskedLanguageModelBase):

def build_model(self):
from sofa import SbertForMaskedLM
return SbertForMaskedLM.from_pretrained(self.model_dir)


@MODELS.register_module(Tasks.fill_mask, module_name=Models.veco)
class VecoForMaskedLM(MaskedLanguageModelBase):

def build_model(self):
from sofa import VecoForMaskedLM
return VecoForMaskedLM.from_pretrained(self.model_dir)
class StructBertForMaskedLM(TorchModel, SbertForMaskedLM):

def __init__(self, config, model_dir):
super(TorchModel, self).__init__(model_dir)
SbertForMaskedLM.__init__(self, config)

def forward(self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
labels=None):
output = SbertForMaskedLM.forward(
self,
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
labels=labels)
output[OutputKeys.INPUT_IDS] = input_ids
return output

@classmethod
def _instantiate(cls, **kwargs):
model_dir = kwargs.get('model_dir')
return super(SbertForMaskedLM, StructBertForMaskedLM).from_pretrained(
pretrained_model_name_or_path=model_dir, model_dir=model_dir)


@MODELS.register_module(Tasks.fill_mask, module_name=Models.bert)
class BertForMaskedLM(MaskedLanguageModelBase):
class BertForMaskedLM(TorchModel, BertForMaskedLMTransformer):

def __init__(self, config, model_dir):
super(TorchModel, self).__init__(model_dir)
BertForMaskedLMTransformer.__init__(self, config)

def forward(self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
labels=None):
output = BertForMaskedLMTransformer.forward(
self,
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
labels=labels)
output[OutputKeys.INPUT_IDS] = input_ids
return output

@classmethod
def _instantiate(cls, **kwargs):
model_dir = kwargs.get('model_dir')
return super(BertForMaskedLMTransformer,
BertForMaskedLM).from_pretrained(
pretrained_model_name_or_path=model_dir,
model_dir=model_dir)

def build_model(self):
from transformers import BertForMaskedLM
return BertForMaskedLM.from_pretrained(self.model_dir)

@MODELS.register_module(Tasks.fill_mask, module_name=Models.veco)
class VecoForMaskedLM(TorchModel, VecoForMaskedLMTransformer):

def __init__(self, config, model_dir):
super(TorchModel, self).__init__(model_dir)
VecoForMaskedLMTransformer.__init__(self, config)

def forward(self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
labels=None):
output = VecoForMaskedLMTransformer.forward(
self,
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
labels=labels)
output[OutputKeys.INPUT_IDS] = input_ids
return output

@classmethod
def _instantiate(cls, **kwargs):
model_dir = kwargs.get('model_dir')
return super(VecoForMaskedLMTransformer,
VecoForMaskedLM).from_pretrained(
pretrained_model_name_or_path=model_dir,
model_dir=model_dir)

modelscope/models/nlp/palm_v2/__init__.py (+43 -0)

@@ -0,0 +1,43 @@
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from modelscope.utils.import_utils import LazyImportModule

if TYPE_CHECKING:
from .configuration_palm import PalmConfig
from .modeling_palm import (
AbsSummarizer,
PalmForConditionalGeneration,
Translator,
)
from .palm_for_text_generation import PalmForTextGeneration
else:
_import_structure = {
'configuration_palm': ['PalmConfig'],
'modeling_palm':
['AbsSummarizer', 'PalmForConditionalGeneration', 'Translator'],
'palm_for_text_generation': ['PalmForTextGeneration'],
}

import sys

sys.modules[__name__] = LazyImportModule(
__name__,
globals()['__file__'],
_import_structure,
module_spec=__spec__,
extra_objects={},
)

modelscope/models/nlp/palm_v2/configuration_palm.py (+116 -0)

@@ -0,0 +1,116 @@
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PALM model configuration """

from transformers.configuration_utils import PretrainedConfig

from modelscope.utils import logger as logging

logger = logging.get_logger(__name__)


class PalmConfig(PretrainedConfig):
r"""
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.


Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.BertModel` or
:class:`~transformers.TFBertModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.BertModel` or
:class:`~transformers.TFBertModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layernorm_epsilon (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
dec_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer decoder.
attn_separate (:obj:`bool`, `optional`, defaults to false):
Whether or not to separate the q, k, v of attention.

Examples::

>>> from modelscope.models.nlp.palm_v2 import PalmForConditionalGeneration, PalmConfig
>>> configuration = PalmConfig()

>>> # Initializing a model from the configuration
>>> model = PalmForConditionalGeneration(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = 'palm'

def __init__(self,
encoder='roberta',
encoder_pth='roberta-base',
max_pos=512,
share_emb=False,
dec_layers=12,
dec_hidden_size=768,
dec_heads=8,
dec_ff_size=3072,
dec_dropout=0.2,
use_bert_emb=True,
label_smoothing=0.1,
alpha=0.95,
beam_size=5,
min_length=40,
max_length=130,
sample_topk=False,
block_trigram=False,
**kwargs):
super().__init__(**kwargs)
self.encoder = encoder
self.encoder_pth = encoder_pth
self.max_pos = max_pos
self.share_emb = share_emb
self.dec_layers = dec_layers
self.dec_hidden_size = dec_hidden_size
self.dec_heads = dec_heads
self.dec_ff_size = dec_ff_size
self.dec_dropout = dec_dropout
self.use_bert_emb = use_bert_emb
self.label_smoothing = label_smoothing
# Translator
self.alpha = alpha
self.beam_size = beam_size
self.min_length = min_length
self.max_length = max_length
self.sample_topk = sample_topk
self.block_trigram = block_trigram
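
Besides the doctest in the class docstring, the constructor shows that decoding behaviour (beam size, length limits, trigram blocking) lives on the same config object as the architecture hyperparameters. A small sketch with a few of them overridden (the values are illustrative only):

from modelscope.models.nlp.palm_v2 import PalmConfig

config = PalmConfig(encoder='roberta', dec_layers=6, dec_heads=8, beam_size=3, max_length=100)
print(config.dec_layers, config.beam_size, config.block_trigram)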

modelscope/models/nlp/palm_v2/dureader_eval.py (+872 -0)

@@ -0,0 +1,872 @@
# ==============================================================================
# Copyright 2017 Baidu.com, Inc. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""
This module computes evaluation metrics for DuReader dataset.
"""

import argparse
import copy
import math
import re
import sys
import zipfile
from collections import Counter, defaultdict

import json
import numpy as np
from rouge import Rouge

EMPTY = ''
YESNO_LABELS = set(['Yes', 'No', 'Depends'])


def my_lcs(string, sub):
"""
Calculates longest common subsequence for a pair of tokenized strings
:param string : list of str : tokens from a string split using whitespace
:param sub : list of str : shorter string, also split using whitespace
:returns: length (int): length of the longest common subsequence between the two strings

Note: my_lcs only gives length of the longest common subsequence, not the actual LCS
"""
if (len(string) < len(sub)):
sub, string = string, sub

lengths = [[0 for i in range(0,
len(sub) + 1)]
for j in range(0,
len(string) + 1)]

for j in range(1, len(sub) + 1):
for i in range(1, len(string) + 1):
if (string[i - 1] == sub[j - 1]):
lengths[i][j] = lengths[i - 1][j - 1] + 1
else:
lengths[i][j] = max(lengths[i - 1][j], lengths[i][j - 1])

return lengths[len(string)][len(sub)]
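
For whitespace-tokenized inputs the function returns only the LCS length, e.g.:

print(my_lcs('the cat sat on the mat'.split(), 'the cat on mat'.split()))  # 4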


class Bleu:

def __init__(self, n=4):
# by default, compute BLEU score up to 4-grams
self._n = n
self._hypo_for_image = {}
self.ref_for_image = {}

def compute_score(self, gts, res):
assert (list(gts.keys()) == list(res.keys()))
imgIds = list(gts.keys())

bleu_scorer = BleuScorer(n=self._n)
for id in imgIds:
hypo = res[id]
ref = gts[id]

# Sanity check.
assert (type(hypo) is list)
assert (len(hypo) == 1)
assert (type(ref) is list)
assert (len(ref) >= 1)

bleu_scorer += (hypo[0], ref)

score, scores = bleu_scorer.compute_score(option='closest', verbose=1)
return score, scores

def method(self):
return 'Bleu'


def precook(s, n=4, out=False):
"""Takes a string as input and returns an object that can be given to
either cook_refs or cook_test. This is optional: cook_refs and cook_test
can take string arguments as well."""
words = s.split()
counts = defaultdict(int)
for k in range(1, n + 1):
for i in range(len(words) - k + 1):
ngram = tuple(words[i:i + k])
counts[ngram] += 1
return (len(words), counts)
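
precook returns the token count together with a counter over all n-grams up to order n, which is what the scorer below consumes, e.g.:

length, counts = precook('the cat the cat', n=2)
print(length)                   # 4
print(counts[('the',)])         # 2 unigram occurrences
print(counts[('the', 'cat')])   # 2 bigram occurrences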


def cook_refs(refs, eff=None, n=4): # lhuang: oracle will call with "average"
'''Takes a list of reference sentences for a single segment
and returns an object that encapsulates everything that BLEU
needs to know about them.'''

reflen = []
maxcounts = {}
for ref in refs:
rl, counts = precook(ref, n)
reflen.append(rl)
for (ngram, count) in counts.items():
maxcounts[ngram] = max(maxcounts.get(ngram, 0), count)

# Calculate effective reference sentence length.
if eff == 'shortest':
reflen = min(reflen)
elif eff == 'average':
reflen = float(sum(reflen)) / len(reflen)

# lhuang: N.B.: leave reflen computation to the very end!!

# lhuang: N.B.: in case of "closest", keep a list of reflens!! (bad design)

return reflen, maxcounts


def cook_test(test, xxx_todo_changeme, eff=None, n=4):
'''Takes a test sentence and returns an object that
encapsulates everything that BLEU needs to know about it.'''
(reflen, refmaxcounts) = xxx_todo_changeme
testlen, counts = precook(test, n, True)

result = {}

# Calculate effective reference sentence length.

if eff == 'closest':
result['reflen'] = min((abs(ref - testlen), ref) for ref in reflen)[1]
else: # i.e., "average" or "shortest" or None
result['reflen'] = reflen

result['testlen'] = testlen

result['guess'] = [max(0, testlen - k + 1) for k in range(1, n + 1)]

result['correct'] = [0] * n
for (ngram, count) in counts.items():
result['correct'][len(ngram) - 1] += min(
refmaxcounts.get(ngram, 0), count)

return result


class BleuScorer(object):
"""Bleu scorer.
"""

__slots__ = 'n', 'crefs', 'ctest', '_score', '_ratio', '_testlen', '_reflen', 'special_reflen'

# special_reflen is used in oracle (proportional effective ref len for a node).

def copy(self):
''' copy the refs.'''
new = BleuScorer(n=self.n)
new.ctest = copy.copy(self.ctest)
new.crefs = copy.copy(self.crefs)
new._score = None
return new

def __init__(self, test=None, refs=None, n=4, special_reflen=None):
''' singular instance '''

self.n = n
self.crefs = []
self.ctest = []
self.cook_append(test, refs)
self.special_reflen = special_reflen

def cook_append(self, test, refs):
'''called by constructor and __iadd__ to avoid creating new instances.'''

if refs is not None:
self.crefs.append(cook_refs(refs))
if test is not None:
cooked_test = cook_test(test, self.crefs[-1])
self.ctest.append(cooked_test) # N.B.: -1
else:
self.ctest.append(
None) # lens of crefs and ctest have to match

self._score = None # need to recompute

def ratio(self, option=None):
self.compute_score(option=option)
return self._ratio

def score_ratio(self, option=None):
'''return (bleu, len_ratio) pair'''
return (self.fscore(option=option), self.ratio(option=option))

def score_ratio_str(self, option=None):
return '%.4f (%.2f)' % self.score_ratio(option)

def reflen(self, option=None):
self.compute_score(option=option)
return self._reflen

def testlen(self, option=None):
self.compute_score(option=option)
return self._testlen

def retest(self, new_test):
if type(new_test) is str:
new_test = [new_test]
assert len(new_test) == len(self.crefs), new_test
self.ctest = []
for t, rs in zip(new_test, self.crefs):
self.ctest.append(cook_test(t, rs))
self._score = None

return self

def rescore(self, new_test):
''' replace test(s) with new test(s), and returns the new score.'''

return self.retest(new_test).compute_score()

def size(self):
assert len(self.crefs) == len(
self.ctest), 'refs/test mismatch! %d<>%d' % (len(
self.crefs), len(self.ctest))
return len(self.crefs)

def __iadd__(self, other):
'''add an instance (e.g., from another sentence).'''

if type(other) is tuple:
# avoid creating new BleuScorer instances
self.cook_append(other[0], other[1])
else:
assert self.compatible(other), 'incompatible BLEUs.'
self.ctest.extend(other.ctest)
self.crefs.extend(other.crefs)
self._score = None # need to recompute

return self

def compatible(self, other):
return isinstance(other, BleuScorer) and self.n == other.n

def single_reflen(self, option='average'):
return self._single_reflen(self.crefs[0][0], option)

def _single_reflen(self, reflens, option=None, testlen=None):

if option == 'shortest':
reflen = min(reflens)
elif option == 'average':
reflen = float(sum(reflens)) / len(reflens)
elif option == 'closest':
reflen = min((abs(ref - testlen), ref) for ref in reflens)[1]
else:
assert False, 'unsupported reflen option %s' % option

return reflen

def recompute_score(self, option=None, verbose=0):
self._score = None
return self.compute_score(option, verbose)

def compute_score(self, option=None, verbose=0):
n = self.n
small = 1e-9
tiny = 1e-15 # so that if guess is 0 still return 0
bleu_list = [[] for _ in range(n)]

if self._score is not None:
return self._score

if option is None:
option = 'average' if len(self.crefs) == 1 else 'closest'

self._testlen = 0
self._reflen = 0
totalcomps = {
'testlen': 0,
'reflen': 0,
'guess': [0] * n,
'correct': [0] * n
}

# for each sentence
for comps in self.ctest:
testlen = comps['testlen']
self._testlen += testlen

if self.special_reflen is None: # need computation
reflen = self._single_reflen(comps['reflen'], option, testlen)
else:
reflen = self.special_reflen

self._reflen += reflen

for key in ['guess', 'correct']:
for k in range(n):
totalcomps[key][k] += comps[key][k]

# append per image bleu score
bleu = 1.
for k in range(n):
bleu *= (float(comps['correct'][k]) + tiny) / (
float(comps['guess'][k]) + small)
bleu_list[k].append(bleu**(1. / (k + 1)))
ratio = (testlen + tiny) / (reflen + small
) # N.B.: avoid zero division
if ratio < 1:
for k in range(n):
bleu_list[k][-1] *= math.exp(1 - 1 / ratio)

if verbose > 1:
print(comps, reflen)

totalcomps['reflen'] = self._reflen
totalcomps['testlen'] = self._testlen

bleus = []
bleu = 1.
for k in range(n):
bleu *= float(totalcomps['correct'][k] + tiny) / (
totalcomps['guess'][k] + small)
bleus.append(bleu**(1. / (k + 1)))
ratio = (self._testlen + tiny) / (self._reflen + small
) # N.B.: avoid zero division
if ratio < 1:
for k in range(n):
bleus[k] *= math.exp(1 - 1 / ratio)

if verbose > 0:
print(totalcomps)
print('ratio:', ratio)

self._score = bleus
return self._score, bleu_list
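
Putting the pieces together, the Bleu wrapper expects both dictionaries to map the same ids to single-element hypothesis lists and non-empty reference lists; the sentences below are made up for illustration:

gts = {'q1': ['the cat sat on the mat'], 'q2': ['hello world']}    # references
res = {'q1': ['the cat sat on mat'], 'q2': ['hello there world']}  # predictions
score, per_order_scores = Bleu(4).compute_score(gts, res)
print(score)  # corpus-level [Bleu-1, Bleu-2, Bleu-3, Bleu-4]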


def normalize(s):
"""
Normalize strings to space joined chars.

Args:
s: a list of strings.

Returns:
A list of normalized strings.
"""
if not s:
return s
normalized = []
for ss in s:
tokens = [c for c in list(ss) if len(c.strip()) != 0]
normalized.append(' '.join(tokens))
return normalized
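
normalize turns each answer into space-joined characters so that the BLEU/Rouge n-grams effectively operate at character level for Chinese text, e.g.:

print(normalize(['今天 天气不错']))  # ['今 天 天 气 不 错']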


def data_check(obj, task):
"""
Check data.

Raises:
Raises AssertionError when data is not legal.
"""
assert 'question_id' in obj, "Missing 'question_id' field."
assert 'question_type' in obj, \
"Missing 'question_type' field. question_id: {}".format(obj['question_type'])

assert 'yesno_answers' in obj, \
"Missing 'yesno_answers' field. question_id: {}".format(obj['question_id'])
assert isinstance(obj['yesno_answers'], list), \
r"""'yesno_answers' field must be a list, if the 'question_type' is not
'YES_NO', then this field should be an empty list.
question_id: {}""".format(obj['question_id'])

assert 'entity_answers' in obj, \
"Missing 'entity_answers' field. question_id: {}".format(obj['question_id'])
assert isinstance(
obj['entity_answers'],
list) and len(obj['entity_answers']) > 0, r"""'entity_answers' field
must be a list and have at least one element, which can be an empty list.
question_id: {}""".format(obj['question_id'])


def read_file(file_name, task, is_ref=False):
"""
Read predict answers or reference answers from file.

Args:
file_name: the name of the file containing predict result or reference
result.

Returns:
A dictionary mapping question_id to the result information. The result
information itself is also a dictionary which has four keys:
- question_type: type of the query.
- yesno_answers: A list of yesno answers corresponding to 'answers'.
- answers: A list of predicted answers.
- entity_answers: A list, each element is also a list containing the entities
tagged out from the corresponding answer string.
"""

def _open(file_name, mode, zip_obj=None):
if zip_obj is not None:
return zip_obj.open(file_name, mode)
return open(file_name, mode)

results = {}
keys = ['answers', 'yesno_answers', 'entity_answers', 'question_type']
if is_ref:
keys += ['source']

zf = zipfile.ZipFile(file_name,
'r') if file_name.endswith('.zip') else None
file_list = [file_name] if zf is None else zf.namelist()

for fn in file_list:
for line in _open(fn, 'r', zip_obj=zf):
try:
obj = json.loads(line.strip())
except ValueError:
raise ValueError('Every line of data should be legal json')
data_check(obj, task)
qid = obj['question_id']
assert qid not in results, 'Duplicate question_id: {}'.format(qid)
results[qid] = {}
for k in keys:
results[qid][k] = obj[k]
return results


def compute_bleu_rouge(pred_dict, ref_dict, bleu_order=4):
"""
Compute bleu and rouge scores.
"""
assert set(pred_dict.keys()) == set(ref_dict.keys()), \
'missing keys: {}'.format(set(ref_dict.keys()) - set(pred_dict.keys()))
scores = {}
bleu_scores, _ = Bleu(bleu_order).compute_score(ref_dict, pred_dict)
for i, bleu_score in enumerate(bleu_scores):
scores['Bleu-%d' % (i + 1)] = bleu_score
# rouge_score, _ = Rouge().compute_score(ref_dict, pred_dict)
rouge_score = Rouge().get_scores(
list(map(lambda x: x[0], pred_dict.values())),
list(map(lambda x: x[0], ref_dict.values())))
rouge_score = sum([d['rouge-l']['f']
for d in rouge_score]) / len(rouge_score)
scores['Rouge-L'] = rouge_score
return scores


def local_prf(pred_list, ref_list):
"""
Compute local precision recall and f1-score,
given only one prediction list and one reference list
"""
common = Counter(pred_list) & Counter(ref_list)
num_same = sum(common.values())
if num_same == 0:
return 0, 0, 0
p = 1.0 * num_same / len(pred_list)
r = 1.0 * num_same / len(ref_list)
f1 = (2 * p * r) / (p + r)
return p, r, f1
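
local_prf compares the two token lists via their multiset intersection, e.g.:

p, r, f1 = local_prf(['北', '京', '大', '学'], ['北', '京', '学'])
print(p, r, f1)  # 0.75 1.0 0.857...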


def compute_prf(pred_dict, ref_dict):
"""
Compute precision recall and f1-score.
"""
# pred_question_ids = set(pred_dict.keys())
ref_question_ids = set(ref_dict.keys())
correct_preds, total_correct, total_preds = 0, 0, 0
for question_id in ref_question_ids:
pred_entity_list = pred_dict.get(question_id, [[]])
assert len(pred_entity_list) == 1, \
'the number of entity list for question_id {} is not 1.'.format(question_id)
pred_entity_list = pred_entity_list[0]
all_ref_entity_lists = ref_dict[question_id]
best_local_f1 = 0
best_ref_entity_list = None
for ref_entity_list in all_ref_entity_lists:
local_f1 = local_prf(pred_entity_list, ref_entity_list)[2]
if local_f1 > best_local_f1:
best_ref_entity_list = ref_entity_list
best_local_f1 = local_f1
if best_ref_entity_list is None:
if len(all_ref_entity_lists) > 0:
best_ref_entity_list = sorted(
all_ref_entity_lists, key=lambda x: len(x))[0]
else:
best_ref_entity_list = []
gold_entities = set(best_ref_entity_list)
pred_entities = set(pred_entity_list)
correct_preds += len(gold_entities & pred_entities)
total_preds += len(pred_entities)
total_correct += len(gold_entities)
p = float(correct_preds) / total_preds if correct_preds > 0 else 0
r = float(correct_preds) / total_correct if correct_preds > 0 else 0
f1 = 2 * p * r / (p + r) if correct_preds > 0 else 0
return {'Precision': p, 'Recall': r, 'F1': f1}


def prepare_prf(pred_dict, ref_dict):
"""
Prepares data for calculation of prf scores.
"""
preds = {k: v['entity_answers'] for k, v in pred_dict.items()}
refs = {k: v['entity_answers'] for k, v in ref_dict.items()}
return preds, refs


def filter_dict(result_dict, key_tag):
"""
Filter a subset of the result_dict, keeping only the keys that end with 'key_tag'.
"""
filtered = {}
for k, v in result_dict.items():
if k.endswith(key_tag):
filtered[k] = v
return filtered


def get_metrics(pred_result, ref_result, task, source):
"""
Computes metrics.
"""
metrics = {}

ref_result_filtered = {}
pred_result_filtered = {}
if source == 'both':
ref_result_filtered = ref_result
pred_result_filtered = pred_result
else:
for question_id, info in ref_result.items():
if info['source'] == source:
ref_result_filtered[question_id] = info
if question_id in pred_result:
pred_result_filtered[question_id] = pred_result[
question_id]

if task == 'main' or task == 'all' \
or task == 'description':
pred_dict, ref_dict = prepare_bleu(pred_result_filtered,
ref_result_filtered, task)
metrics = compute_bleu_rouge(pred_dict, ref_dict)
elif task == 'yesno':
pred_dict, ref_dict = prepare_bleu(pred_result_filtered,
ref_result_filtered, task)
keys = ['Yes', 'No', 'Depends']
preds = [filter_dict(pred_dict, k) for k in keys]
refs = [filter_dict(ref_dict, k) for k in keys]

metrics = compute_bleu_rouge(pred_dict, ref_dict)

for k, pred, ref in zip(keys, preds, refs):
m = compute_bleu_rouge(pred, ref)
k_metric = [(k + '|' + key, v) for key, v in m.items()]
metrics.update(k_metric)

elif task == 'entity':
pred_dict, ref_dict = prepare_prf(pred_result_filtered,
ref_result_filtered)
pred_dict_bleu, ref_dict_bleu = prepare_bleu(pred_result_filtered,
ref_result_filtered, task)
metrics = compute_prf(pred_dict, ref_dict)
metrics.update(compute_bleu_rouge(pred_dict_bleu, ref_dict_bleu))
else:
raise ValueError('Illegal task name: {}'.format(task))

return metrics


def prepare_bleu(pred_result, ref_result, task):
"""
Prepares data for calculation of bleu and rouge scores.
"""
pred_list, ref_list = [], []
qids = ref_result.keys()
for qid in qids:
if task == 'main':
pred, ref = get_main_result(qid, pred_result, ref_result)
elif task == 'yesno':
pred, ref = get_yesno_result(qid, pred_result, ref_result)
elif task == 'all':
pred, ref = get_all_result(qid, pred_result, ref_result)
elif task == 'entity':
pred, ref = get_entity_result(qid, pred_result, ref_result)
elif task == 'description':
pred, ref = get_desc_result(qid, pred_result, ref_result)
else:
raise ValueError('Illegal task name: {}'.format(task))
if pred and ref:
pred_list += pred
ref_list += ref
pred_dict = dict(pred_list)
ref_dict = dict(ref_list)
for qid, ans in ref_dict.items():
ref_dict[qid] = normalize(ref_dict[qid])
pred_dict[qid] = normalize(pred_dict.get(qid, [EMPTY]))
if not ans or ans == [EMPTY]:
del ref_dict[qid]
del pred_dict[qid]

for k, v in pred_dict.items():
assert len(v) == 1, \
'There should be only one predict answer. question_id: {}'.format(k)
return pred_dict, ref_dict


def get_main_result(qid, pred_result, ref_result):
"""
Prepare answers for task 'main'.

Args:
qid: question_id.
pred_result: A dict including all question_id's result information read
from args.pred_file.
ref_result: A dict including all question_id's result information read
from args.ref_file.
Returns:
Two lists, the first one contains predict result, the second
one contains reference result of the same question_id. Each list has
elements of tuple (question_id, answers), 'answers' is a list of strings.
"""
ref_ans = ref_result[qid]['answers']
if not ref_ans:
ref_ans = [EMPTY]
pred_ans = pred_result.get(qid, {}).get('answers', [])[:1]
if not pred_ans:
pred_ans = [EMPTY]

return [(qid, pred_ans)], [(qid, ref_ans)]


def get_entity_result(qid, pred_result, ref_result):
"""
Prepare answers for task 'entity'.

Args:
qid: question_id.
pred_result: A dict including all question_id's result information read
from args.pred_file.
ref_result: A dict including all question_id's result information read
from args.ref_file.
Returns:
Two lists, the first one contains predict result, the second
one contains reference result of the same question_id. Each list has
elements of tuple (question_id, answers), 'answers' is a list of strings.
"""
if ref_result[qid]['question_type'] != 'ENTITY':
return None, None
return get_main_result(qid, pred_result, ref_result)


def get_desc_result(qid, pred_result, ref_result):
"""
Prepare answers for task 'description'.

Args:
qid: question_id.
pred_result: A dict including all question_id's result information read
from args.pred_file.
ref_result: A dict including all question_id's result information read
from args.ref_file.
Returns:
Two lists, the first one contains predict result, the second
one contains reference result of the same question_id. Each list has
elements of tuple (question_id, answers), 'answers' is a list of strings.
"""
if ref_result[qid]['question_type'] != 'DESCRIPTION':
return None, None
return get_main_result(qid, pred_result, ref_result)


def get_yesno_result(qid, pred_result, ref_result):
"""
Prepare answers for task 'yesno'.

Args:
qid: question_id.
pred_result: A dict including all question_id's result information read
from args.pred_file.
ref_result: A dict including all question_id's result information read
from args.ref_file.
Returns:
Two lists, the first one contains predict result, the second
one contains reference result of the same question_id. Each list has
elements of tuple (question_id, answers), 'answers' is a list of strings.
"""

def _uniq(li, is_ref):
uniq_li = []
left = []
keys = set()
for k, v in li:
if k not in keys:
uniq_li.append((k, v))
keys.add(k)
else:
left.append((k, v))

if is_ref:
dict_li = dict(uniq_li)
for k, v in left:
dict_li[k] += v
uniq_li = [(k, v) for k, v in dict_li.items()]
return uniq_li

def _expand_result(uniq_li):
expanded = uniq_li[:]
keys = set([x[0] for x in uniq_li])
for k in YESNO_LABELS - keys:
expanded.append((k, [EMPTY]))
return expanded

def _get_yesno_ans(qid, result_dict, is_ref=False):
if qid not in result_dict:
return [(str(qid) + '_' + k, v) for k, v in _expand_result([])]
yesno_answers = result_dict[qid]['yesno_answers']
answers = result_dict[qid]['answers']
lbl_ans = _uniq([(k, [v]) for k, v in zip(yesno_answers, answers)],
is_ref)
ret = [(str(qid) + '_' + k, v) for k, v in _expand_result(lbl_ans)]
return ret

if ref_result[qid]['question_type'] != 'YES_NO':
return None, None

ref_ans = _get_yesno_ans(qid, ref_result, is_ref=True)
pred_ans = _get_yesno_ans(qid, pred_result)
return pred_ans, ref_ans


def get_all_result(qid, pred_result, ref_result):
"""
Prepare answers for task 'all'.

Args:
qid: question_id.
pred_result: A dict including all question_id's result information read
from args.pred_file.
ref_result: A dict including all question_id's result information read
from args.ref_file.
Returns:
Two lists, the first one contains predict result, the second
one contains reference result of the same question_id. Each list has
elements of tuple (question_id, answers), 'answers' is a list of strings.
"""
if ref_result[qid]['question_type'] == 'YES_NO':
return get_yesno_result(qid, pred_result, ref_result)
return get_main_result(qid, pred_result, ref_result)


def format_metrics(metrics, task, err_msg):
"""
Format metrics. The returned 'errorMsg' field reports any error that occurred during evaluation.

Args:
metrics: A dict object contains metrics for different tasks.
task: Task name.
err_msg: Exception raised during evaluation.
Returns:
Formatted result.
"""
result = {}
sources = ['both', 'search', 'zhidao']
if err_msg is not None:
return {'errorMsg': str(err_msg), 'errorCode': 1, 'data': []}
data = []
if task != 'all' and task != 'main':
sources = ['both']

if task == 'entity':
metric_names = ['Bleu-4', 'Rouge-L']
metric_names_prf = ['F1', 'Precision', 'Recall']
for name in metric_names + metric_names_prf:
for src in sources:
obj = {
'name': name,
'value': round(metrics[src].get(name, 0) * 100, 2),
'type': src,
}
data.append(obj)
elif task == 'yesno':
metric_names = ['Bleu-4', 'Rouge-L']
details = ['Yes', 'No', 'Depends']
src = sources[0]
for name in metric_names:
obj = {
'name': name,
'value': round(metrics[src].get(name, 0) * 100, 2),
'type': 'All',
}
data.append(obj)
for d in details:
obj = {
'name': name,
'value': round(metrics[src].get(d + '|' + name, 0) * 100,
2),
'type': d
}
data.append(obj)
else:
metric_names = ['Bleu-4', 'Rouge-L']
for name in metric_names:
for src in sources:
obj = {
'name': name,
'value': round(metrics[src].get(name, 0) * 100, 2),
'type': src
}
data.append(obj)

result['data'] = data
result['errorCode'] = 0
result['errorMsg'] = 'success'

return result


def main(args):
"""
Do evaluation.
"""
err = None
metrics = {}
try:
pred_result = read_file(args.pred_file, args.task)
ref_result = read_file(args.ref_file, args.task, is_ref=True)
sources = ['both', 'search', 'zhidao']
if args.task not in set(['main', 'all']):
sources = sources[:1]
for source in sources:
metrics[source] = get_metrics(pred_result, ref_result, args.task,
source)
except ValueError as ve:
err = ve
except AssertionError as ae:
err = ae

print(
json.dumps(
format_metrics(metrics, args.task, err),
ensure_ascii=False).encode('utf8'))


if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('pred_file', help='predict file')
parser.add_argument('ref_file', help='reference file')
parser.add_argument(
'task', help='task name: Main|Yes_No|All|Entity|Description')

args = parser.parse_args()
args.task = args.task.lower().replace('_', '')
main(args)
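
Given the argparse setup above, the script can be run directly; the task argument is lowercased and stripped of underscores before dispatch. The file names below are placeholders:

python modelscope/models/nlp/palm_v2/dureader_eval.py pred.json ref.json Yes_No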

modelscope/models/nlp/palm_v2/modeling_palm.py (+1332 -0) (file diff suppressed because it is too large)


modelscope/models/nlp/palm_for_text_generation.py → modelscope/models/nlp/palm_v2/palm_for_text_generation.py

@@ -22,8 +22,8 @@ class PalmForTextGeneration(TorchModel):
"""
super().__init__(model_dir, *args, **kwargs)

from sofa.models.palm_v2 import (PalmForConditionalGeneration,
Translator)
from modelscope.models.nlp.palm_v2 import (
PalmForConditionalGeneration, Translator)
self.model = PalmForConditionalGeneration.from_pretrained(model_dir)
self.tokenizer = self.model.tokenizer
self.generator = Translator(self.model)

modelscope/models/nlp/sbert_for_nli.py (+0 -23)

@@ -1,23 +0,0 @@
from modelscope.metainfo import Models
from modelscope.models.builder import MODELS
from modelscope.utils.constant import Tasks
from .sbert_for_sequence_classification import \
SbertForSequenceClassificationBase

__all__ = ['SbertForNLI']


@MODELS.register_module(Tasks.nli, module_name=Models.structbert)
class SbertForNLI(SbertForSequenceClassificationBase):

def __init__(self, model_dir: str, *args, **kwargs):
"""initialize the text generation model from the `model_dir` path.

Args:
model_dir (str): the model path.
model_cls (Optional[Any], optional): model loader, if None, use the
default loader to load model weights, by default None.
"""
super().__init__(
model_dir, *args, model_args={'num_labels': 3}, **kwargs)
assert self.model.config.num_labels == 3

modelscope/models/nlp/sbert_for_sentence_similarity.py (+0 -25)

@@ -1,25 +0,0 @@
from modelscope.metainfo import Models
from modelscope.models.builder import MODELS
from modelscope.utils.constant import Tasks
from .sbert_for_sequence_classification import \
SbertForSequenceClassificationBase

__all__ = ['SbertForSentenceSimilarity']


@MODELS.register_module(
Tasks.sentence_similarity, module_name=Models.structbert)
class SbertForSentenceSimilarity(SbertForSequenceClassificationBase):

def __init__(self, model_dir: str, *args, **kwargs):
"""initialize the sentence similarity model from the `model_dir` path.

Args:
model_dir (str): the model path.
model_cls (Optional[Any], optional): model loader, if None, use the
default loader to load model weights, by default None.
"""
super().__init__(
model_dir, *args, model_args={'num_labels': 2}, **kwargs)
self.model_dir = model_dir
assert self.model.config.num_labels == 2

modelscope/models/nlp/sbert_for_sentiment_classification.py (+0 -22)

@@ -1,22 +0,0 @@
from modelscope.metainfo import Models
from modelscope.models.builder import MODELS
from modelscope.utils.constant import Tasks
from .sbert_for_sequence_classification import \
SbertForSequenceClassificationBase

__all__ = ['SbertForSentimentClassification']


@MODELS.register_module(
Tasks.sentiment_classification, module_name=Models.structbert)
class SbertForSentimentClassification(SbertForSequenceClassificationBase):

def __init__(self, model_dir: str, *args, **kwargs):
"""initialize the text generation model from the `model_dir` path.

Args:
model_dir (str): the model path.
"""
super().__init__(
model_dir, *args, model_args={'num_labels': 2}, **kwargs)
assert self.model.config.num_labels == 2

modelscope/models/nlp/sbert_for_sequence_classification.py (+0 -82)

@@ -1,82 +0,0 @@
import os
from typing import Any, Dict

import json
import numpy as np
import torch
from sofa.models.sbert.modeling_sbert import SbertModel, SbertPreTrainedModel
from torch import nn

from modelscope.models import TorchModel


class SbertTextClassfier(SbertPreTrainedModel):

def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.config = config
self.encoder = SbertModel(config, add_pooling_layer=True)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, config.num_labels)

def forward(self,
input_ids=None,
token_type_ids=None,
labels=None,
**kwargs):
outputs = self.encoder(
input_ids,
token_type_ids=token_type_ids,
return_dict=None,
)
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
if labels is not None:
loss_fct = nn.CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
return {'logits': logits, 'loss': loss}
return {'logits': logits}

def build(**kwags):
return SbertTextClassfier.from_pretrained(model_dir, **model_args)


class SbertForSequenceClassificationBase(TorchModel):

def __init__(self, model_dir: str, model_args=None, *args, **kwargs):
super().__init__(model_dir, *args, **kwargs)
if model_args is None:
model_args = {}
self.model = SbertTextClassfier.from_pretrained(
model_dir, **model_args)
self.id2label = {}
self.label_path = os.path.join(self.model_dir, 'label_mapping.json')
if os.path.exists(self.label_path):
with open(self.label_path) as f:
self.label_mapping = json.load(f)
self.id2label = {
idx: name
for name, idx in self.label_mapping.items()
}

def train(self):
return self.model.train()

def eval(self):
return self.model.eval()

def forward(self, input: Dict[str, Any]) -> Dict[str, np.ndarray]:
input_ids = torch.tensor(input['input_ids'], dtype=torch.long)
token_type_ids = torch.tensor(
input['token_type_ids'], dtype=torch.long)
return self.model.forward(input_ids, token_type_ids)

def postprocess(self, input, **kwargs):
logits = input['logits']
probs = logits.softmax(-1).cpu().numpy()
pred = logits.argmax(-1).cpu().numpy()
logits = logits.cpu().numpy()
res = {'predictions': pred, 'probabilities': probs, 'logits': logits}
return res

modelscope/models/nlp/sbert_for_token_classification.py (+0 -64)

@@ -1,64 +0,0 @@
from typing import Any, Dict, Union

import numpy as np
import torch

from modelscope.metainfo import Models
from modelscope.models import TorchModel
from modelscope.models.base import Tensor
from modelscope.models.builder import MODELS
from modelscope.utils.constant import Tasks

__all__ = ['SbertForTokenClassification']


@MODELS.register_module(Tasks.word_segmentation, module_name=Models.structbert)
class SbertForTokenClassification(TorchModel):

def __init__(self, model_dir: str, *args, **kwargs):
"""initialize the word segmentation model from the `model_dir` path.

Args:
model_dir (str): the model path.
model_cls (Optional[Any], optional): model loader, if None, use the
default loader to load model weights, by default None.
"""
super().__init__(model_dir, *args, **kwargs)
self.model_dir = model_dir
import sofa
self.model = sofa.SbertForTokenClassification.from_pretrained(
self.model_dir)
self.config = sofa.SbertConfig.from_pretrained(self.model_dir)

def train(self):
return self.model.train()

def eval(self):
return self.model.eval()

def forward(self, input: Dict[str,
Any]) -> Dict[str, Union[str, np.ndarray]]:
"""return the result by the model

Args:
input (Dict[str, Any]): the preprocessed data

Returns:
Dict[str, Union[str,np.ndarray]]: results
Example:
{
'predictions': array([1,4]), # lable 0-negative 1-positive
'logits': array([[-0.53860897, 1.5029076 ]], dtype=float32) # true value
'text': str(今天),
}
"""
input_ids = torch.tensor(input['input_ids']).unsqueeze(0)
return {**self.model(input_ids), 'text': input['text']}

def postprocess(self, input: Dict[str, Tensor],
**kwargs) -> Dict[str, Tensor]:
logits = input['logits']
pred = torch.argmax(logits[0], dim=-1)
pred = pred.cpu().numpy()
rst = {'predictions': pred, 'logits': logits, 'text': input['text']}
return rst

modelscope/models/nlp/sbert_for_zero_shot_classification.py (+0 -50)

@@ -1,50 +0,0 @@
from typing import Any, Dict

import numpy as np

from modelscope.metainfo import Models
from modelscope.models import TorchModel
from modelscope.models.builder import MODELS
from modelscope.utils.constant import Tasks

__all__ = ['SbertForZeroShotClassification']


@MODELS.register_module(
Tasks.zero_shot_classification, module_name=Models.structbert)
class SbertForZeroShotClassification(TorchModel):

def __init__(self, model_dir: str, *args, **kwargs):
"""initialize the zero shot classification model from the `model_dir` path.

Args:
model_dir (str): the model path.
"""

super().__init__(model_dir, *args, **kwargs)
from sofa import SbertForSequenceClassification
self.model = SbertForSequenceClassification.from_pretrained(model_dir)

def train(self):
return self.model.train()

def eval(self):
return self.model.eval()

def forward(self, input: Dict[str, Any]) -> Dict[str, np.ndarray]:
"""return the result by the model

Args:
input (Dict[str, Any]): the preprocessed data

Returns:
Dict[str, np.ndarray]: results
Example:
{
'logits': array([[-0.53860897, 1.5029076 ]], dtype=float32) # true value
}
"""
outputs = self.model(**input)
logits = outputs['logits'].cpu().numpy()
res = {'logits': logits}
return res

modelscope/models/nlp/sequence_classification.py (+155 -66)

@@ -1,85 +1,174 @@
import os
from typing import Any, Dict
from abc import abstractmethod

import json
import numpy as np
from torch import nn

from modelscope.metainfo import TaskModels
from modelscope.metainfo import Models
from modelscope.models.base import TorchModel
from modelscope.models.builder import MODELS
from modelscope.models.nlp.structbert import SbertPreTrainedModel
from modelscope.models.nlp.veco import \
VecoForSequenceClassification as VecoForSequenceClassificationTransform
from modelscope.outputs import OutputKeys
from modelscope.utils.constant import Tasks
from .task_model import SingleBackboneTaskModelBase
from modelscope.utils.hub import parse_label_mapping
from modelscope.utils.tensor_utils import (torch_nested_detach,
torch_nested_numpify)

__all__ = ['SequenceClassificationModel']
__all__ = ['SbertForSequenceClassification', 'VecoForSequenceClassification']


@MODELS.register_module(
Tasks.sentiment_classification, module_name=TaskModels.text_classification)
@MODELS.register_module(
Tasks.text_classification, module_name=TaskModels.text_classification)
class SequenceClassificationModel(SingleBackboneTaskModelBase):
class SequenceClassificationBase(TorchModel):
base_model_prefix: str = 'bert'

def __init__(self, config, model_dir):
super().__init__(model_dir)
self.num_labels = config.num_labels
self.config = config
setattr(self, self.base_model_prefix, self.build_base_model())
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, config.num_labels)

def __init__(self, model_dir: str, *args, **kwargs):
"""initialize the sequence classification model from the `model_dir` path.
@abstractmethod
def build_base_model(self):
"""Build the backbone model.

Args:
model_dir (str): the model path.
Returns: the backbone instance.
"""
super().__init__(model_dir, *args, **kwargs)
if 'base_model_prefix' in kwargs:
self._base_model_prefix = kwargs['base_model_prefix']

backbone_cfg = self.cfg.backbone
head_cfg = self.cfg.head

# get the num_labels from label_mapping.json
self.id2label = {}
self.label_path = os.path.join(model_dir, 'label_mapping.json')
if os.path.exists(self.label_path):
with open(self.label_path) as f:
self.label_mapping = json.load(f)
self.id2label = {
idx: name
for name, idx in self.label_mapping.items()
}
head_cfg['num_labels'] = len(self.label_mapping)

self.build_backbone(backbone_cfg)
self.build_head(head_cfg)

def forward(self, input: Dict[str, Any]) -> Dict[str, np.ndarray]:
outputs = super().forward(input)
sequence_output, pooled_output = self.extract_backbone_outputs(outputs)
outputs = self.head.forward(pooled_output)
if 'labels' in input:
loss = self.compute_loss(outputs, input['labels'])
outputs.update(loss)
return outputs

def extract_logits(self, outputs):
return outputs[OutputKeys.LOGITS].cpu().detach()

def extract_backbone_outputs(self, outputs):
sequence_output = None
pooled_output = None
if hasattr(self.backbone, 'extract_sequence_outputs'):
sequence_output = self.backbone.extract_sequence_outputs(outputs)
if hasattr(self.backbone, 'extract_pooled_outputs'):
pooled_output = self.backbone.extract_pooled_outputs(outputs)
return sequence_output, pooled_output

def compute_loss(self, outputs, labels):
loss = self.head.compute_loss(outputs, labels)
return loss
pass

@property
def base_model(self):
return getattr(self, self.base_model_prefix)

def forward(self, **kwargs):
labels = None
if OutputKeys.LABEL in kwargs:
labels = kwargs.pop(OutputKeys.LABEL)
elif OutputKeys.LABELS in kwargs:
labels = kwargs.pop(OutputKeys.LABELS)

outputs = self.base_model.forward(**kwargs)

# backbone model should return pooled_output as its second output
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
if labels is not None:
loss_fct = nn.CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
return {OutputKeys.LOGITS: logits, OutputKeys.LOSS: loss}
return {OutputKeys.LOGITS: logits}

def postprocess(self, input, **kwargs):
logits = self.extract_logits(input)
probs = logits.softmax(-1).numpy()
pred = logits.argmax(-1).numpy()
logits = logits.numpy()
logits = input[OutputKeys.LOGITS]
probs = torch_nested_numpify(torch_nested_detach(logits.softmax(-1)))
pred = torch_nested_numpify(torch_nested_detach(logits.argmax(-1)))
logits = torch_nested_numpify(torch_nested_detach(logits))
res = {
OutputKeys.PREDICTIONS: pred,
OutputKeys.PROBABILITIES: probs,
OutputKeys.LOGITS: logits
}
return res


@MODELS.register_module(
Tasks.sentence_similarity, module_name=Models.structbert)
@MODELS.register_module(
Tasks.sentiment_classification, module_name=Models.structbert)
@MODELS.register_module(Tasks.nli, module_name=Models.structbert)
@MODELS.register_module(
Tasks.zero_shot_classification, module_name=Models.structbert)
class SbertForSequenceClassification(SequenceClassificationBase,
SbertPreTrainedModel):
base_model_prefix: str = 'bert'
supports_gradient_checkpointing = True
_keys_to_ignore_on_load_missing = [r'position_ids']

def __init__(self, config, model_dir):
if hasattr(config, 'base_model_prefix'):
SbertForSequenceClassification.base_model_prefix = config.base_model_prefix
super().__init__(config, model_dir)

def build_base_model(self):
from .structbert import SbertModel
return SbertModel(self.config, add_pooling_layer=True)

def forward(self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
labels=None,
**kwargs):
return super().forward(
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
labels=labels)

@classmethod
def _instantiate(cls, **kwargs):
model_dir = kwargs.get('model_dir')
num_labels = kwargs.get('num_labels')
if num_labels is None:
label2id = parse_label_mapping(model_dir)
if label2id is not None and len(label2id) > 0:
num_labels = len(label2id)

model_args = {} if num_labels is None else {'num_labels': num_labels}
return super(SbertPreTrainedModel,
SbertForSequenceClassification).from_pretrained(
pretrained_model_name_or_path=kwargs.get('model_dir'),
model_dir=kwargs.get('model_dir'),
**model_args)


@MODELS.register_module(Tasks.sentence_similarity, module_name=Models.veco)
@MODELS.register_module(
Tasks.sentiment_classification, module_name=Models.veco)
@MODELS.register_module(Tasks.nli, module_name=Models.veco)
class VecoForSequenceClassification(TorchModel,
VecoForSequenceClassificationTransform):

def __init__(self, config, model_dir):
super().__init__(model_dir)
VecoForSequenceClassificationTransform.__init__(self, config)

def forward(self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
labels=None,
output_attentions=None,
output_hidden_states=None,
**kwargs):
return VecoForSequenceClassificationTransform.forward(
self,
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
labels=labels)

@classmethod
def _instantiate(cls, **kwargs):
model_dir = kwargs.get('model_dir')
num_labels = kwargs.get('num_labels')
if num_labels is None:
label2id = parse_label_mapping(model_dir)
if label2id is not None and len(label2id) > 0:
num_labels = len(label2id)

model_args = {} if num_labels is None else {'num_labels': num_labels}
return super(VecoForSequenceClassificationTransform,
VecoForSequenceClassification).from_pretrained(
pretrained_model_name_or_path=kwargs.get('model_dir'),
model_dir=kwargs.get('model_dir'),
**model_args)
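
Both `_instantiate` overrides above share the same `num_labels` resolution; a condensed sketch of that flow follows. The helper name and the model directory are illustrative; only `parse_label_mapping` comes from the code above.

# Hedged sketch of the num_labels resolution used by both _instantiate methods.
# resolve_num_labels and the directory path are illustrative, not part of the PR.
def resolve_num_labels(model_dir, num_labels=None):
    if num_labels is None:
        label2id = parse_label_mapping(model_dir)   # e.g. {'negative': 0, 'positive': 1}
        if label2id:
            num_labels = len(label2id)
    return {} if num_labels is None else {'num_labels': num_labels}

model = SbertForSequenceClassification._instantiate(
    model_dir='/path/to/structbert_sentiment',   # placeholder path
    num_labels=None)                             # falls back to label_mapping.json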

+ 28
- 0
modelscope/models/nlp/space/__init__.py View File

@@ -0,0 +1,28 @@
from typing import TYPE_CHECKING

from modelscope.utils.import_utils import LazyImportModule

if TYPE_CHECKING:
from .model import SpaceGenerator
from .model import SpaceModelBase, SpaceTokenizer, SpaceConfig
from .space_for_dialog_intent_prediction import SpaceForDialogIntent
from .space_for_dialog_modeling import SpaceForDialogModeling
from .space_for_dialog_state_tracking import SpaceForDialogStateTracking
else:
_import_structure = {
'model':
['SpaceGenerator', 'SpaceModelBase', 'SpaceTokenizer', 'SpaceConfig'],
'space_for_dialog_intent_prediction': ['SpaceForDialogIntent'],
'space_for_dialog_modeling': ['SpaceForDialogModeling'],
'space_for_dialog_state_tracking': ['SpaceForDialogStateTracking'],
}

import sys

sys.modules[__name__] = LazyImportModule(
__name__,
globals()['__file__'],
_import_structure,
module_spec=__spec__,
extra_objects={},
)
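
The `LazyImportModule` wiring above keeps the package import cheap and defers loading of the torch-heavy submodules until first attribute access; a small sketch of the intended effect (behavior assumed from the registration, not verified here).

# Hedged sketch: importing the package does not import the submodules yet;
# the first attribute access triggers the real import.
import modelscope.models.nlp.space as space   # cheap

SpaceTokenizer = space.SpaceTokenizer         # loads .model.tokenization_space on demand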

+ 10
- 0
modelscope/models/nlp/space/model/__init__.py View File

@@ -0,0 +1,10 @@
from .configuration_space import SpaceConfig
from .gen_unified_transformer import GenUnifiedTransformer
from .generator import Generator as SpaceGenerator
from .intent_unified_transformer import IntentUnifiedTransformer
from .model_base import SpaceModelBase
from .modeling_space import (SpaceForDST, SpaceForMaskedLM,
SpaceForPreTraining, SpaceModel)
from .tokenization_space import (BasicTokenizer, SpaceTokenizer,
WordpieceTokenizer)
from .unified_transformer import UnifiedTransformer

+ 32
- 0
modelscope/models/nlp/space/model/configuration_space.py View File

@@ -0,0 +1,32 @@
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# Copyright 2018 The Google AI Language Team Authors.
# Copyright 2020 The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Space configuration, mainly copied from :class:`~transformers.configuration_xlm_roberta` """

from modelscope.models.nlp.structbert import SbertConfig
from modelscope.utils import logger as logging

logger = logging.get_logger(__name__)


class SpaceConfig(SbertConfig):
"""
This class overrides [`SbertConfig`]. Please check the superclass for the appropriate
documentation alongside usage examples.
"""

model_type = 'space'

modelscope/models/nlp/backbones/space/model/gen_unified_transformer.py → modelscope/models/nlp/space/model/gen_unified_transformer.py View File


modelscope/models/nlp/backbones/space/model/generator.py → modelscope/models/nlp/space/model/generator.py View File


modelscope/models/nlp/backbones/space/model/intent_unified_transformer.py → modelscope/models/nlp/space/model/intent_unified_transformer.py View File


modelscope/models/nlp/backbones/space/model/model_base.py → modelscope/models/nlp/space/model/model_base.py View File


+ 268
- 0
modelscope/models/nlp/space/model/modeling_space.py View File

@@ -0,0 +1,268 @@
# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch Space model. mainly copied from :module:`~transformers.modeling_xlm_roberta`"""

import torch
from torch import nn
from torch.nn import CrossEntropyLoss
from transformers.file_utils import add_start_docstrings

from modelscope.models.nlp.structbert.modeling_sbert import (
SbertForMaskedLM, SbertModel, SbertPreTrainedModel)
from .configuration_space import SpaceConfig

SPACE_START_DOCSTRING = r"""

This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads, etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to
general usage and behavior.

Parameters:
config ([`SpaceConfig`]): Model configuration class with all the parameters of the
model. Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model
weights.
"""


@add_start_docstrings(
'The bare Space Model transformer outputting raw hidden-states without any specific head on top. '
'It is identical to the Bert Model from Transformers',
SPACE_START_DOCSTRING,
)
class SpaceModel(SbertModel):
"""
This class overrides [`SbertModel`]. Please check the superclass for the appropriate
documentation alongside usage examples.
"""

config_class = SpaceConfig


@add_start_docstrings(
"""
Space Model transformer with dialog state tracking heads on top (an inform projection
layer, a dialog state layer and a set of per-slot heads that use history information from
previous dialog turns), e.g. for MultiWOZ 2.2 tasks.
""",
SPACE_START_DOCSTRING,
)
class SpaceForDST(SbertPreTrainedModel):

def __init__(self, config):
super(SpaceForDST, self).__init__(config)
self.slot_list = config.dst_slot_list
self.class_types = config.dst_class_types
self.class_labels = config.dst_class_labels
self.token_loss_for_nonpointable = config.dst_token_loss_for_nonpointable
self.refer_loss_for_nonpointable = config.dst_refer_loss_for_nonpointable
self.class_aux_feats_inform = config.dst_class_aux_feats_inform
self.class_aux_feats_ds = config.dst_class_aux_feats_ds
self.class_loss_ratio = config.dst_class_loss_ratio

# Only use refer loss if refer class is present in dataset.
if 'refer' in self.class_types:
self.refer_index = self.class_types.index('refer')
else:
self.refer_index = -1

self.bert = SpaceModel(config)
self.dropout = nn.Dropout(config.dst_dropout_rate)
self.dropout_heads = nn.Dropout(config.dst_heads_dropout_rate)

if self.class_aux_feats_inform:
self.add_module(
'inform_projection',
nn.Linear(len(self.slot_list), len(self.slot_list)))
if self.class_aux_feats_ds:
self.add_module(
'ds_projection',
nn.Linear(len(self.slot_list), len(self.slot_list)))

aux_dims = len(self.slot_list) * (
self.class_aux_feats_inform + self.class_aux_feats_ds
) # second term is 0, 1 or 2

for slot in self.slot_list:
self.add_module(
'class_' + slot,
nn.Linear(config.hidden_size + aux_dims, self.class_labels))
self.add_module('token_' + slot, nn.Linear(config.hidden_size, 2))
self.add_module(
'refer_' + slot,
nn.Linear(config.hidden_size + aux_dims,
len(self.slot_list) + 1))

self.init_weights()

def forward(self,
input_ids,
input_mask=None,
segment_ids=None,
position_ids=None,
head_mask=None,
start_pos=None,
end_pos=None,
inform_slot_id=None,
refer_id=None,
class_label_id=None,
diag_state=None):
outputs = self.bert(
input_ids,
attention_mask=input_mask,
token_type_ids=segment_ids,
position_ids=position_ids,
head_mask=head_mask)

sequence_output = outputs[0]
pooled_output = outputs[1]

sequence_output = self.dropout(sequence_output)
pooled_output = self.dropout(pooled_output)

# TODO: establish proper format in labels already?
if inform_slot_id is not None:
inform_labels = torch.stack(list(inform_slot_id.values()),
1).float()
if diag_state is not None:
diag_state_labels = torch.clamp(
torch.stack(list(diag_state.values()), 1).float(), 0.0, 1.0)

total_loss = 0
per_slot_per_example_loss = {}
per_slot_class_logits = {}
per_slot_start_logits = {}
per_slot_end_logits = {}
per_slot_refer_logits = {}
for slot in self.slot_list:
if self.class_aux_feats_inform and self.class_aux_feats_ds:
pooled_output_aux = torch.cat(
(pooled_output, self.inform_projection(inform_labels),
self.ds_projection(diag_state_labels)), 1)
elif self.class_aux_feats_inform:
pooled_output_aux = torch.cat(
(pooled_output, self.inform_projection(inform_labels)), 1)
elif self.class_aux_feats_ds:
pooled_output_aux = torch.cat(
(pooled_output, self.ds_projection(diag_state_labels)), 1)
else:
pooled_output_aux = pooled_output
class_logits = self.dropout_heads(
getattr(self, 'class_' + slot)(pooled_output_aux))

token_logits = self.dropout_heads(
getattr(self, 'token_' + slot)(sequence_output))
start_logits, end_logits = token_logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1)
end_logits = end_logits.squeeze(-1)

refer_logits = self.dropout_heads(
getattr(self, 'refer_' + slot)(pooled_output_aux))

per_slot_class_logits[slot] = class_logits
per_slot_start_logits[slot] = start_logits
per_slot_end_logits[slot] = end_logits
per_slot_refer_logits[slot] = refer_logits

# If there are no labels, don't compute loss
if class_label_id is not None and start_pos is not None and end_pos is not None and refer_id is not None:
# If we are on multi-GPU, the labels may carry an extra dimension; squeeze it
if len(start_pos[slot].size()) > 1:
start_pos[slot] = start_pos[slot].squeeze(-1)
if len(end_pos[slot].size()) > 1:
end_pos[slot] = end_pos[slot].squeeze(-1)
# Sometimes the start/end positions are outside our model inputs; we ignore these terms
ignored_index = start_logits.size(1) # This is a single index
start_pos[slot].clamp_(0, ignored_index)
end_pos[slot].clamp_(0, ignored_index)

class_loss_fct = CrossEntropyLoss(reduction='none')
token_loss_fct = CrossEntropyLoss(
reduction='none', ignore_index=ignored_index)
refer_loss_fct = CrossEntropyLoss(reduction='none')

start_loss = token_loss_fct(start_logits, start_pos[slot])
end_loss = token_loss_fct(end_logits, end_pos[slot])
token_loss = (start_loss + end_loss) / 2.0

token_is_pointable = (start_pos[slot] > 0).float()
if not self.token_loss_for_nonpointable:
token_loss *= token_is_pointable

refer_loss = refer_loss_fct(refer_logits, refer_id[slot])
token_is_referrable = torch.eq(class_label_id[slot],
self.refer_index).float()
if not self.refer_loss_for_nonpointable:
refer_loss *= token_is_referrable

class_loss = class_loss_fct(class_logits, class_label_id[slot])

if self.refer_index > -1:
per_example_loss = (self.class_loss_ratio) * class_loss + (
(1 - self.class_loss_ratio) / 2) * token_loss + (
(1 - self.class_loss_ratio) / 2) * refer_loss
else:
per_example_loss = self.class_loss_ratio * class_loss + (
1 - self.class_loss_ratio) * token_loss

total_loss += per_example_loss.sum()
per_slot_per_example_loss[slot] = per_example_loss

# add hidden states and attention if they are here
outputs = (total_loss, ) + (
per_slot_per_example_loss,
per_slot_class_logits,
per_slot_start_logits,
per_slot_end_logits,
per_slot_refer_logits,
) + outputs[2:]

return outputs


@add_start_docstrings(
'The Space Model with a `language modeling` head on top',
SPACE_START_DOCSTRING,
)
class SpaceForMaskedLM(SbertForMaskedLM):
"""
This class overrides [`SbertForMaskedLM`]. Please check the superclass for the
appropriate documentation alongside usage examples.
"""

config_class = SpaceConfig


@add_start_docstrings(
"""
Space Model with only one head on top as done during the pretraining: a `masked language modeling` head.
""",
SPACE_START_DOCSTRING,
)
class SpaceForPreTraining(SbertPreTrainedModel):

def __init__(self, model_name_or_path: str):
super(SpaceForPreTraining, self).__init__()
self.bert_model = SpaceForMaskedLM.from_pretrained(model_name_or_path)

def forward(self, input_ids: torch.tensor, mlm_labels: torch.tensor):
outputs = self.bert_model(input_ids, masked_lm_labels=mlm_labels)
return outputs[0]
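
`SpaceForDST.forward` above returns a flat tuple whose leading elements are the aggregate loss and the per-slot dictionaries. A hedged consumption sketch; `batch` is a hypothetical dict produced by the DST preprocessor with the keyword names used above.

# Hedged sketch of consuming SpaceForDST outputs; `batch` is a placeholder.
outputs = model(
    input_ids=batch['input_ids'],
    input_mask=batch['input_mask'],
    segment_ids=batch['segment_ids'],
    start_pos=batch['start_pos'],
    end_pos=batch['end_pos'],
    inform_slot_id=batch['inform_slot_id'],
    refer_id=batch['refer_id'],
    class_label_id=batch['class_label_id'],
    diag_state=batch['diag_state'])

(total_loss, per_slot_loss, class_logits,
 start_logits, end_logits, refer_logits) = outputs[:6]
total_loss.backward()   # standard training step from here on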

+ 29
- 0
modelscope/models/nlp/space/model/tokenization_space.py View File

@@ -0,0 +1,29 @@
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
"""Tokenization classes for Space. mainly copied from :module:`~transformers.tokenization_xlm_roberta`"""

from modelscope.models.nlp.structbert import (BasicTokenizer, SbertTokenizer,
WordpieceTokenizer)
from modelscope.utils import logger as logging

logger = logging.get_logger(__name__)


class SpaceTokenizer(SbertTokenizer):
"""
This class overrides [`SbertTokenizer`]. Please check the superclass for the appropriate
documentation alongside usage examples.
"""

modelscope/models/nlp/backbones/space/model/unified_transformer.py → modelscope/models/nlp/space/model/unified_transformer.py View File

@@ -5,10 +5,9 @@ import torch
import torch.nn as nn
import torch.nn.functional as F

from modelscope.models.nlp.backbones.space.model.model_base import \
SpaceModelBase
from modelscope.models.nlp.backbones.space.modules.embedder import Embedder
from modelscope.models.nlp.backbones.space.modules.transformer_block import \
from modelscope.models.nlp.space.model.model_base import SpaceModelBase
from modelscope.models.nlp.space.modules.embedder import Embedder
from modelscope.models.nlp.space.modules.transformer_block import \
TransformerBlock



modelscope/models/nlp/backbones/space/modules/__init__.py → modelscope/models/nlp/space/modules/__init__.py View File


modelscope/models/nlp/backbones/space/modules/embedder.py → modelscope/models/nlp/space/modules/embedder.py View File


modelscope/models/nlp/backbones/space/modules/feedforward.py → modelscope/models/nlp/space/modules/feedforward.py View File


modelscope/models/nlp/backbones/space/modules/functions.py → modelscope/models/nlp/space/modules/functions.py View File


modelscope/models/nlp/backbones/space/modules/multihead_attention.py → modelscope/models/nlp/space/modules/multihead_attention.py View File


modelscope/models/nlp/backbones/space/modules/transformer_block.py → modelscope/models/nlp/space/modules/transformer_block.py View File


modelscope/models/nlp/space_for_dialog_intent_prediction.py → modelscope/models/nlp/space/space_for_dialog_intent_prediction.py View File

@@ -7,7 +7,7 @@ from modelscope.metainfo import Models
from modelscope.models import TorchModel
from modelscope.models.base import Tensor
from modelscope.models.builder import MODELS
from modelscope.models.nlp.backbones import SpaceGenerator, SpaceModelBase
from modelscope.models.nlp.space import SpaceGenerator, SpaceModelBase
from modelscope.preprocessors.space import IntentBPETextField
from modelscope.utils.config import Config
from modelscope.utils.constant import ModelFile, Tasks

modelscope/models/nlp/space_for_dialog_modeling.py → modelscope/models/nlp/space/space_for_dialog_modeling.py View File

@@ -7,7 +7,7 @@ from modelscope.metainfo import Models
from modelscope.models import TorchModel
from modelscope.models.base import Tensor
from modelscope.models.builder import MODELS
from modelscope.models.nlp.backbones import SpaceGenerator, SpaceModelBase
from modelscope.models.nlp.space import SpaceGenerator, SpaceModelBase
from modelscope.preprocessors.space import MultiWOZBPETextField
from modelscope.utils.config import Config
from modelscope.utils.constant import ModelFile, Tasks

modelscope/models/nlp/space_for_dialog_state_tracking.py → modelscope/models/nlp/space/space_for_dialog_state_tracking.py View File

@@ -21,7 +21,7 @@ class SpaceForDialogStateTracking(TorchModel):

super().__init__(model_dir, *args, **kwargs)

from sofa.models.space import SpaceConfig, SpaceForDST
from modelscope.models.nlp.space.model import SpaceForDST, SpaceConfig
self.model_dir = model_dir

self.config = SpaceConfig.from_pretrained(self.model_dir)

+ 45
- 0
modelscope/models/nlp/structbert/__init__.py View File

@@ -0,0 +1,45 @@
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import TYPE_CHECKING

from modelscope.utils.import_utils import LazyImportModule

if TYPE_CHECKING:
from .configuration_sbert import SbertConfig
from .modeling_sbert import (SbertForMaskedLM, SbertModel,
SbertPreTrainedModel)
from .tokenization_sbert import (BasicTokenizer, SbertTokenizer,
WordpieceTokenizer)
from .tokenization_sbert_fast import SbertTokenizerFast
else:
_import_structure = {
'configuration_sbert': ['SbertConfig'],
'modeling_sbert':
['SbertForMaskedLM', 'SbertModel', 'SbertPreTrainedModel'],
'tokenization_sbert':
['BasicTokenizer', 'SbertTokenizer', 'WordpieceTokenizer'],
'tokenization_sbert_fast': ['SbertTokenizerFast'],
}

import sys

sys.modules[__name__] = LazyImportModule(
__name__,
globals()['__file__'],
_import_structure,
module_spec=__spec__,
extra_objects={},
)

modelscope/models/nlp/backbones/structbert/adv_utils.py → modelscope/models/nlp/structbert/adv_utils.py View File

@@ -59,7 +59,8 @@ def compute_adv_loss(embedding,
"""
Calculate the adv loss of the model.
:param embedding: Original sentence embedding
:param model: The model or the forward function(including decoder/classifier), accept kwargs as input, output logits
:param model: The model, or the forward function(including decoder/classifier),
accept kwargs as input, output logits
:param ori_logits: The original logits output from the model function
:param ori_loss: The original loss
:param adv_grad_factor: This factor will be multiplied by the KL loss grad and then the result will be added to
@@ -119,7 +120,8 @@ def compute_adv_loss_pair(embedding,
"""
Calculate the adv loss of the model. This function is used in the pair logits scenario.
:param embedding: Original sentence embedding
:param model: The model or the forward function(including decoder/classifier), accept kwargs as input, output logits
:param model: The model, or the forward function(including decoder/classifier),
accept kwargs as input, output logits
:param start_logits: The original start logits output from the model function
:param end_logits: The original end logits output from the model function
:param ori_loss: The original loss

modelscope/models/nlp/backbones/structbert/configuration_sbert.py → modelscope/models/nlp/structbert/configuration_sbert.py View File

@@ -24,11 +24,12 @@ logger = logging.get_logger(__name__)

class SbertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~sofa.models.SbertModel`.
This is the configuration class to store the configuration
of a :class:`~modelscope.models.nlp.structbert.SbertModel`.
It is used to instantiate a SBERT model according to the specified arguments.

Configuration objects inherit from :class:`~sofa.utils.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~sofa.utils.PretrainedConfig` for more information.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.


Args:
@@ -99,11 +100,13 @@ class SbertConfig(PretrainedConfig):
type_vocab_size=2,
initializer_range=0.02,
layer_norm_eps=1e-12,
pad_token_id=0,
position_embedding_type='absolute',
use_cache=True,
classifier_dropout=None,
**kwargs):
super().__init__(**kwargs)
super().__init__(pad_token_id=pad_token_id, **kwargs)

self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
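
The second hunk makes `pad_token_id` an explicit argument and forwards it to the transformers `PretrainedConfig` base class; a small sketch of the resulting behavior (argument values are illustrative).

# Hedged sketch: pad_token_id now has an explicit default and is forwarded to
# the transformers PretrainedConfig base class.
from modelscope.models.nlp.structbert import SbertConfig

cfg = SbertConfig(vocab_size=21128, hidden_size=768, pad_token_id=0)
assert cfg.pad_token_id == 0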

+ 1964
- 0
modelscope/models/nlp/structbert/modeling_sbert.py View File (diff suppressed because it is too large)


+ 516
- 0
modelscope/models/nlp/structbert/tokenization_sbert.py View File

@@ -0,0 +1,516 @@
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes for Sbert. mainly copied from :module:`~transformers.tokenization_bert`"""

import collections
import os
import unicodedata
from typing import List, Optional, Tuple

from transformers.tokenization_utils import (PreTrainedTokenizer, _is_control,
_is_punctuation, _is_whitespace)

from modelscope.utils.logger import get_logger

logger = get_logger(__name__)

VOCAB_FILES_NAMES = {'vocab_file': 'vocab.txt'}

PRETRAINED_VOCAB_FILES_MAP = {'vocab_file': {}}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
'chinese_sbert-large-std-512': 512,
'english_sbert-large-std-512': 512,
}

PRETRAINED_INIT_CONFIGURATION = {
'english_sbert-large-std-512': {
'do_lower_case': True
},
}


def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
with open(vocab_file, 'r', encoding='utf-8') as reader:
tokens = reader.readlines()
for index, token in enumerate(tokens):
token = token.rstrip('\n')
vocab[token] = index
return vocab


def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a piece of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens


class SbertTokenizer(PreTrainedTokenizer):
r"""
Construct a SBERT tokenizer. Based on WordPiece.

This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.

Args:
vocab_file (:obj:`str`):
File containing the vocabulary.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to lowercase the input when tokenizing.
do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to do basic tokenization before WordPiece.
never_split (:obj:`Iterable`, `optional`):
Collection of tokens which will never be split during tokenization. Only has an effect when
:obj:`do_basic_tokenize=True`
unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
The token used for padding, for example when batching sequences of different lengths.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to tokenize Chinese characters.

This should likely be deactivated for Japanese (see this `issue
<https://github.com/huggingface/transformers/issues/328>`__).
strip_accents: (:obj:`bool`, `optional`):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for :obj:`lowercase` (as in the original BERT).
"""

vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

def __init__(self,
vocab_file,
do_lower_case=True,
do_basic_tokenize=True,
never_split=None,
unk_token='[UNK]',
sep_token='[SEP]',
pad_token='[PAD]',
cls_token='[CLS]',
mask_token='[MASK]',
tokenize_chinese_chars=True,
strip_accents=None,
**kwargs):
super().__init__(
do_lower_case=do_lower_case,
do_basic_tokenize=do_basic_tokenize,
never_split=never_split,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
**kwargs,
)

if not os.path.isfile(vocab_file):
raise ValueError(
f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "
'model use `tokenizer = SbertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`'
)
self.vocab = load_vocab(vocab_file)
self.ids_to_tokens = collections.OrderedDict([
(ids, tok) for tok, ids in self.vocab.items()
])
self.do_basic_tokenize = do_basic_tokenize
if do_basic_tokenize:
self.basic_tokenizer = BasicTokenizer(
do_lower_case=do_lower_case,
never_split=never_split,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
)
self.wordpiece_tokenizer = WordpieceTokenizer(
vocab=self.vocab, unk_token=self.unk_token)

@property
def do_lower_case(self):
return self.basic_tokenizer.do_lower_case

@property
def vocab_size(self):
return len(self.vocab)

def get_vocab(self):
return dict(self.vocab, **self.added_tokens_encoder)

def _tokenize(self, text):
split_tokens = []
if self.do_basic_tokenize:
for token in self.basic_tokenizer.tokenize(
text, never_split=self.all_special_tokens):

# If the token is part of the never_split set
if token in self.basic_tokenizer.never_split:
split_tokens.append(token)
else:
split_tokens += self.wordpiece_tokenizer.tokenize(token)
else:
split_tokens = self.wordpiece_tokenizer.tokenize(text)
return split_tokens

def _convert_token_to_id(self, token):
"""Converts a token (str) in an id using the vocab."""
return self.vocab.get(token, self.vocab.get(self.unk_token))

def _convert_id_to_token(self, index):
"""Converts an index (integer) in a token (str) using the vocab."""
return self.ids_to_tokens.get(index, self.unk_token)

def convert_tokens_to_string(self, tokens):
"""Converts a sequence of tokens (string) in a single string."""
out_string = ' '.join(tokens).replace(' ##', '').strip()
return out_string

def build_inputs_with_special_tokens(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
adding special tokens. A SBERT sequence has the following format:

- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``

Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.

Returns:
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
if token_ids_1 is None:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
cls = [self.cls_token_id]
sep = [self.sep_token_id]
return cls + token_ids_0 + sep + token_ids_1 + sep

def get_special_tokens_mask(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None,
already_has_special_tokens: bool = False) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.

Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the token list is already formatted with special tokens for the model.

Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""

if already_has_special_tokens:
return super().get_special_tokens_mask(
token_ids_0=token_ids_0,
token_ids_1=token_ids_1,
already_has_special_tokens=True)

if token_ids_1 is not None:
return [1] + ([0] * len(token_ids_0)) + [1] + (
[0] * len(token_ids_1)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1]

def create_token_type_ids_from_sequences(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A SBERT sequence
pair mask has the following format:

::

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |

If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).

Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.

Returns:
:obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1
+ sep) * [1]

def save_vocabulary(self,
save_directory: str,
filename_prefix: Optional[str] = None) -> Tuple[str]:
index = 0
if os.path.isdir(save_directory):
vocab_file = os.path.join(
save_directory,
(filename_prefix + '-' if filename_prefix else '')
+ VOCAB_FILES_NAMES['vocab_file'])
else:
vocab_file = (filename_prefix
+ '-' if filename_prefix else '') + save_directory
with open(vocab_file, 'w', encoding='utf-8') as writer:
for token, token_index in sorted(
self.vocab.items(), key=lambda kv: kv[1]):
if index != token_index:
logger.warning(
f'Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive.'
' Please check that the vocabulary is not corrupted!')
index = token_index
writer.write(token + '\n')
index += 1
return (vocab_file, )


class BasicTokenizer(object):
"""
Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.).

Args:
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to lowercase the input when tokenizing.
never_split (:obj:`Iterable`, `optional`):
Collection of tokens which will never be split during tokenization. Only has an effect when
:obj:`do_basic_tokenize=True`
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to tokenize Chinese characters.

This should likely be deactivated for Japanese (see this `issue
<https://github.com/huggingface/transformers/issues/328>`__).
strip_accents: (:obj:`bool`, `optional`):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for :obj:`lowercase` (as in the original BERT).
"""

def __init__(self,
do_lower_case=True,
never_split=None,
tokenize_chinese_chars=True,
strip_accents=None):
if never_split is None:
never_split = []
self.do_lower_case = do_lower_case
self.never_split = set(never_split)
self.tokenize_chinese_chars = tokenize_chinese_chars
self.strip_accents = strip_accents

def tokenize(self, text, never_split=None):
"""
Basic tokenization of a piece of text. Splits on whitespace only; for sub-word tokenization, see
WordpieceTokenizer.

Args:
**never_split**: (`optional`) list of str
Kept for backward compatibility purposes. Now implemented directly at the base class level (see
:func:`PreTrainedTokenizer.tokenize`). List of tokens not to split.
"""
# union() returns a new set by concatenating the two sets.
never_split = self.never_split.union(
set(never_split)) if never_split else self.never_split
text = self._clean_text(text)

# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
if self.tokenize_chinese_chars:
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if token not in never_split:
if self.do_lower_case:
token = token.lower()
if self.strip_accents is not False:
token = self._run_strip_accents(token)
elif self.strip_accents:
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token, never_split))

output_tokens = whitespace_tokenize(' '.join(split_tokens))
return output_tokens

def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize('NFD', text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == 'Mn':
continue
output.append(char)
return ''.join(output)

def _run_split_on_punc(self, text, never_split=None):
"""Splits punctuation on a piece of text."""
if never_split is not None and text in never_split:
return [text]
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1

return [''.join(x) for x in output]

def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(' ')
output.append(char)
output.append(' ')
else:
output.append(char)
return ''.join(output)

def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like all of the other languages.
if ((0x4E00 <= cp <= 0x9FFF) or (0x3400 <= cp <= 0x4DBF)
or (0x20000 <= cp <= 0x2A6DF) or (0x2A700 <= cp <= 0x2B73F)
or (0x2B740 <= cp <= 0x2B81F) or (0x2B820 <= cp <= 0x2CEAF)
or (0xF900 <= cp <= 0xFAFF) or (0x2F800 <= cp <= 0x2FA1F)):
return True

return False

def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xFFFD or _is_control(char):
continue
if _is_whitespace(char):
output.append(' ')
else:
output.append(char)
return ''.join(output)


class WordpieceTokenizer(object):
"""Runs WordPiece tokenization."""

def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word

def tokenize(self, text):
"""
Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform
tokenization using the given vocabulary.

For example, :obj:`input = "unaffable"` will return as output :obj:`["un", "##aff", "##able"]`.

Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer`.

Returns:
A list of wordpiece tokens.
"""

output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue

is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = ''.join(chars[start:end])
if start > 0:
substr = '##' + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end

if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens

+ 200
- 0
modelscope/models/nlp/structbert/tokenization_sbert_fast.py View File

@@ -0,0 +1,200 @@
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fast Tokenization classes for Sbert. mainly copied from :module:`~transformers.tokenization_bert_fast`"""

from typing import List, Optional, Tuple

import json
import transformers
from tokenizers import normalizers
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast

from modelscope.utils.logger import get_logger
from .tokenization_sbert import SbertTokenizer

logger = get_logger(__name__)

VOCAB_FILES_NAMES = {
'vocab_file': 'vocab.txt',
'tokenizer_file': 'tokenizer.json'
}

PRETRAINED_VOCAB_FILES_MAP = {
'vocab_file': {},
'tokenizer_file': {},
}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
'chinese_sbert-large-std-512': 512,
'english_sbert-large-std-512': 512,
}

PRETRAINED_INIT_CONFIGURATION = {
'english_sbert-large-std-512': {
'do_lower_case': True
},
}

transformers.SLOW_TO_FAST_CONVERTERS[
'SbertTokenizer'] = transformers.SLOW_TO_FAST_CONVERTERS['BertTokenizer']


class SbertTokenizerFast(PreTrainedTokenizerFast):
r"""
Construct a "fast" SBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on WordPiece.

This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.

Args:
vocab_file (:obj:`str`):
File containing the vocabulary.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to lowercase the input when tokenizing.
unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
The token used for padding, for example when batching sequences of different lengths.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
clean_text (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to clean the text before tokenization by removing any control characters and replacing all
whitespaces by the classic one.
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see `this
issue <https://github.com/huggingface/transformers/issues/328>`__).
strip_accents: (:obj:`bool`, `optional`):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for :obj:`lowercase` (as in the original BERT).
wordpieces_prefix: (:obj:`str`, `optional`, defaults to :obj:`"##"`):
The prefix for subwords.
"""

vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
slow_tokenizer_class = SbertTokenizer

def __init__(self,
vocab_file=None,
tokenizer_file=None,
do_lower_case=True,
unk_token='[UNK]',
sep_token='[SEP]',
pad_token='[PAD]',
cls_token='[CLS]',
mask_token='[MASK]',
tokenize_chinese_chars=True,
strip_accents=None,
**kwargs):
super().__init__(
vocab_file,
tokenizer_file=tokenizer_file,
do_lower_case=do_lower_case,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
**kwargs,
)

pre_tok_state = json.loads(
self.backend_tokenizer.normalizer.__getstate__())
if (pre_tok_state.get('lowercase', do_lower_case) != do_lower_case
or pre_tok_state.get('strip_accents',
strip_accents) != strip_accents):
pre_tok_class = getattr(normalizers, pre_tok_state.pop('type'))
pre_tok_state['lowercase'] = do_lower_case
pre_tok_state['strip_accents'] = strip_accents
self.backend_tokenizer.normalizer = pre_tok_class(**pre_tok_state)

self.do_lower_case = do_lower_case

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
"""
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
adding special tokens. A SBERT sequence has the following format:

- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``

Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.

Returns:
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
output = [self.cls_token_id] + token_ids_0 + [self.sep_token_id]

if token_ids_1:
output += token_ids_1 + [self.sep_token_id]

return output

def create_token_type_ids_from_sequences(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A SBERT sequence
pair mask has the following format:

::

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |

If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).

Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.

Returns:
:obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1
+ sep) * [1]

def save_vocabulary(self,
save_directory: str,
filename_prefix: Optional[str] = None) -> Tuple[str]:
files = self._tokenizer.model.save(
save_directory, name=filename_prefix)
return tuple(files)
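
Registering `SbertTokenizer` in `transformers.SLOW_TO_FAST_CONVERTERS` (reusing the BERT converter) is what lets the fast tokenizer be built from a slow vocabulary; a hedged usage sketch with a placeholder model directory.

# Hedged sketch: the fast tokenizer loads from the same directory; the BERT
# converter registered above handles the vocab.txt -> tokenizer.json conversion.
fast_tok = SbertTokenizerFast.from_pretrained('/path/to/sbert_model')  # placeholder

enc = fast_tok('first sentence', 'second sentence', return_token_type_ids=True)
# token_type_ids follow the [CLS] A [SEP] B [SEP] scheme documented above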

+ 0
- 0
modelscope/models/nlp/task_models/__init__.py View File


+ 86
- 0
modelscope/models/nlp/task_models/sequence_classification.py View File

@@ -0,0 +1,86 @@
import os
from typing import Any, Dict

import json
import numpy as np

from modelscope.metainfo import TaskModels
from modelscope.models.builder import MODELS
from modelscope.models.nlp.task_models.task_model import \
SingleBackboneTaskModelBase
from modelscope.outputs import OutputKeys
from modelscope.utils.constant import Tasks

__all__ = ['SequenceClassificationModel']


@MODELS.register_module(
Tasks.sentiment_classification, module_name=TaskModels.text_classification)
@MODELS.register_module(
Tasks.text_classification, module_name=TaskModels.text_classification)
class SequenceClassificationModel(SingleBackboneTaskModelBase):

def __init__(self, model_dir: str, *args, **kwargs):
"""initialize the sequence classification model from the `model_dir` path.

Args:
model_dir (str): the model path.
"""
super().__init__(model_dir, *args, **kwargs)
if 'base_model_prefix' in kwargs:
self._base_model_prefix = kwargs['base_model_prefix']

backbone_cfg = self.cfg.backbone
head_cfg = self.cfg.head

# get the num_labels from label_mapping.json
self.id2label = {}
self.label_path = os.path.join(model_dir, 'label_mapping.json')
if os.path.exists(self.label_path):
with open(self.label_path) as f:
self.label_mapping = json.load(f)
self.id2label = {
idx: name
for name, idx in self.label_mapping.items()
}
head_cfg['num_labels'] = len(self.label_mapping)

self.build_backbone(backbone_cfg)
self.build_head(head_cfg)

def forward(self, input: Dict[str, Any]) -> Dict[str, np.ndarray]:
outputs = super().forward(input)
sequence_output, pooled_output = self.extract_backbone_outputs(outputs)
outputs = self.head.forward(pooled_output)
if 'labels' in input:
loss = self.compute_loss(outputs, input['labels'])
outputs.update(loss)
return outputs

def extract_logits(self, outputs):
return outputs[OutputKeys.LOGITS].cpu().detach()

def extract_backbone_outputs(self, outputs):
sequence_output = None
pooled_output = None
if hasattr(self.backbone, 'extract_sequence_outputs'):
sequence_output = self.backbone.extract_sequence_outputs(outputs)
if hasattr(self.backbone, 'extract_pooled_outputs'):
pooled_output = self.backbone.extract_pooled_outputs(outputs)
return sequence_output, pooled_output

def compute_loss(self, outputs, labels):
loss = self.head.compute_loss(outputs, labels)
return loss

def postprocess(self, input, **kwargs):
logits = self.extract_logits(input)
probs = logits.softmax(-1).numpy()
pred = logits.argmax(-1).numpy()
logits = logits.numpy()
res = {
OutputKeys.PREDICTIONS: pred,
OutputKeys.PROBABILITIES: probs,
OutputKeys.LOGITS: logits
}
return res
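
The task model above is driven by the `backbone` and `head` sections of the model configuration (presumably the model's `configuration.json`) plus an optional `label_mapping.json`; a hedged sketch of that wiring. The inner config keys are assumptions; only `backbone`, `head` and `label_mapping.json` are referenced by the code.

# Hedged sketch of what the task model consumes; the inner keys are illustrative.
#
# configuration.json (hypothetical excerpt):
#   "backbone": {"type": "structbert", ...}
#   "head":     {"type": "text-classification", ...}
# label_mapping.json (hypothetical): {"negative": 0, "positive": 1}
#   -> head_cfg['num_labels'] = 2

model = SequenceClassificationModel('/path/to/task_model')   # placeholder path
# preprocessed_inputs: dict produced by the task preprocessor (placeholder)
outputs = model.forward(preprocessed_inputs)   # backbone forward -> head forward (+ loss if 'labels')
results = model.postprocess(outputs)           # adds predictions / probabilities / logits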

modelscope/models/nlp/task_model.py → modelscope/models/nlp/task_models/task_model.py View File

@@ -11,8 +11,8 @@ from modelscope.models.base import TorchModel
from modelscope.models.builder import build_backbone, build_head
from modelscope.utils.config import ConfigDict
from modelscope.utils.constant import Fields, Tasks
from modelscope.utils.file_utils import func_receive_dict_inputs
from modelscope.utils.logger import get_logger
from modelscope.utils.utils import if_func_receive_dict_inputs

logger = get_logger(__name__)

@@ -424,12 +424,15 @@ class SingleBackboneTaskModelBase(BaseTaskModel):

def forward(self, input: Dict[str, Any]) -> Dict[str, Any]:
"""default forward method is the backbone-only forward"""
if if_func_receive_dict_inputs(self.backbone.forward):
if func_receive_dict_inputs(self.backbone.forward):
outputs = self.backbone.forward(input)
else:
outputs = self.backbone.forward(**input)
return outputs

def compute_loss(self, outputs: Dict[str, Any], labels):
raise NotImplementedError()


class EncoderDecoderTaskModelBase(BaseTaskModel):
"""
@@ -472,13 +475,13 @@ class EncoderDecoderTaskModelBase(BaseTaskModel):
return getattr(self, self._decoder_prefix)

def forward(self, input: Dict[str, Any]) -> Dict[str, Any]:
if if_func_receive_dict_inputs(self.encoder_.forward):
if func_receive_dict_inputs(self.encoder_.forward):
encoder_outputs = self.encoder_.forward(input)
else:
encoder_outputs = self.encoder_.forward(**input)
decoder_inputs = self.project_decoder_inputs_and_mediate(
input, encoder_outputs)
if if_func_receive_dict_inputs(self.decoder_.forward):
if func_receive_dict_inputs(self.decoder_.forward):
outputs = self.decoder_.forward(decoder_inputs)
else:
outputs = self.decoder_.forward(**decoder_inputs)

+ 147
- 0
modelscope/models/nlp/token_classification.py View File

@@ -0,0 +1,147 @@
from abc import abstractmethod
from typing import Dict

import numpy as np
import torch
from torch import nn

from modelscope.metainfo import Models
from modelscope.models.base import TorchModel
from modelscope.models.builder import MODELS
from modelscope.outputs import OutputKeys
from modelscope.utils.constant import Tasks
from modelscope.utils.hub import parse_label_mapping
from modelscope.utils.tensor_utils import (torch_nested_detach,
torch_nested_numpify)
from .structbert import SbertPreTrainedModel

__all__ = ['SbertForTokenClassification']


class TokenClassification(TorchModel):

base_model_prefix: str = 'bert'

def __init__(self, config, model_dir):
super().__init__(model_dir)
self.num_labels = config.num_labels
self.config = config
setattr(self, self.base_model_prefix, self.build_base_model())
classifier_dropout = (
config.classifier_dropout if config.classifier_dropout is not None
else config.hidden_dropout_prob)
self.dropout = nn.Dropout(classifier_dropout)
self.classifier = nn.Linear(config.hidden_size, config.num_labels)

@abstractmethod
def build_base_model(self):
"""Build the backbone model.

Returns: the backbone instance.
"""
pass

@property
def base_model(self):
return getattr(self, self.base_model_prefix)

def compute_loss(self, logits, labels, **kwargs):
"""Compute loss.

For example, if the backbone is a pretrained transformer model, an 'attention_mask' parameter will be passed to skip
padding tokens.

Args:
logits: The logits from the classifier
labels: The labels
**kwargs: Other input params.

Returns: Loss.

"""
pass

def forward(self, **kwargs):
labels = None
if OutputKeys.LABEL in kwargs:
labels = kwargs.pop(OutputKeys.LABEL)
elif OutputKeys.LABELS in kwargs:
labels = kwargs.pop(OutputKeys.LABELS)

outputs = self.base_model(**kwargs)
# base model should return the sequence_output as its first output
sequence_output = outputs[0]
sequence_output = self.dropout(sequence_output)
logits = self.classifier(sequence_output)
if labels is not None:
loss = self.compute_loss(logits, labels, **kwargs)
return {OutputKeys.LOGITS: logits, OutputKeys.LOSS: loss}
return {OutputKeys.LOGITS: logits}

def postprocess(self, input: Dict[str, np.ndarray],
**kwargs) -> Dict[str, np.ndarray]:
logits = input[OutputKeys.LOGITS]
pred = torch.argmax(logits[0], dim=-1)
pred = torch_nested_numpify(torch_nested_detach(pred))
logits = torch_nested_numpify(torch_nested_detach(logits))
rst = {OutputKeys.PREDICTIONS: pred, OutputKeys.LOGITS: logits}
return rst


@MODELS.register_module(Tasks.word_segmentation, module_name=Models.structbert)
@MODELS.register_module(
Tasks.token_classification, module_name=Models.structbert)
class SbertForTokenClassification(TokenClassification, SbertPreTrainedModel):

supports_gradient_checkpointing = True
_keys_to_ignore_on_load_unexpected = [r'pooler']

def __init__(self, config, model_dir):
if hasattr(config, 'base_model_prefix'):
SbertForTokenClassification.base_model_prefix = config.base_model_prefix
super().__init__(config, model_dir)

def build_base_model(self):
from .structbert import SbertModel
return SbertModel(self.config, add_pooling_layer=False)

def forward(self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
labels=None,
**kwargs):
return super().forward(
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
labels=labels)

def compute_loss(self, logits, labels, attention_mask=None, **kwargs):
loss_fct = nn.CrossEntropyLoss()
# Only keep active parts of the loss
if attention_mask is not None:
active_loss = attention_mask.view(-1) == 1
active_logits = logits.view(-1, self.num_labels)
active_labels = torch.where(
active_loss, labels.view(-1),
torch.tensor(loss_fct.ignore_index).type_as(labels))
return loss_fct(active_logits, active_labels)
else:
return loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

@classmethod
def _instantiate(cls, **kwargs):
model_dir = kwargs.get('model_dir')
num_labels = kwargs.get('num_labels')
if num_labels is None:
label2id = parse_label_mapping(model_dir)
if label2id is not None and len(label2id) > 0:
num_labels = len(label2id)

model_args = {} if num_labels is None else {'num_labels': num_labels}
return super(SbertPreTrainedModel,
SbertForTokenClassification).from_pretrained(
pretrained_model_name_or_path=kwargs.get('model_dir'),
model_dir=kwargs.get('model_dir'),
**model_args)
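
A minimal usage sketch for the new token-classification head. `my_model_dir` is a placeholder for a local StructBERT checkpoint directory, `num_labels=7` and the input ids are dummy values, and the import path simply follows the new file's location (the class may also be exposed through the lazy nlp package).

import torch

from modelscope.models.nlp.token_classification import SbertForTokenClassification

model = SbertForTokenClassification._instantiate(
    model_dir='my_model_dir', num_labels=7)
model.eval()

dummy_inputs = {
    'input_ids': torch.tensor([[101, 2769, 4263, 102]]),
    'attention_mask': torch.tensor([[1, 1, 1, 1]]),
    'token_type_ids': torch.tensor([[0, 0, 0, 0]]),
}
with torch.no_grad():
    outputs = model.forward(**dummy_inputs)   # {'logits': ...}, no loss without labels
results = model.postprocess(outputs)          # adds 'predictions' via per-token argmax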

+43 -0 modelscope/models/nlp/veco/__init__.py

@@ -0,0 +1,43 @@
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import TYPE_CHECKING

from modelscope.utils.import_utils import LazyImportModule

if TYPE_CHECKING:
from .configuration_veco import VecoConfig
from .modeling_veco import (VecoForMaskedLM, VecoForSequenceClassification,
VecoModel)
from .tokenization_veco import VecoTokenizer
from .tokenization_veco_fast import VecoTokenizerFast
else:
_import_structure = {
'configuration_veco': ['VecoConfig'],
'modeling_veco':
['VecoForMaskedLM', 'VecoForSequenceClassification', 'VecoModel'],
'tokenization_veco': ['VecoTokenizer'],
'tokenization_veco_fast': ['VecoTokenizerFast'],
}

import sys

sys.modules[__name__] = LazyImportModule(
__name__,
globals()['__file__'],
_import_structure,
module_spec=__spec__,
extra_objects={},
)
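
With the lazy module registered, importing the veco package stays cheap; the transformers and sentencepiece dependencies are only pulled in when one of the declared names is first accessed. A small illustration of the intended behavior:

from modelscope.models.nlp import veco

# Nothing from the veco submodules has been imported at this point.
config_cls = veco.VecoConfig            # loads configuration_veco on first access
tokenizer_cls = veco.VecoTokenizerFast  # loads tokenization_veco_fast on first access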

+33 -0 modelscope/models/nlp/veco/configuration_veco.py

@@ -0,0 +1,33 @@
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# Copyright 2018 The Google AI Language Team Authors.
# Copyright 2020 The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Veco configuration, mainly copied from :class:`~transformers.configuration_xlm_roberta` """

from transformers import RobertaConfig

from modelscope.utils import logger as logging

logger = logging.get_logger(__name__)


class VecoConfig(RobertaConfig):
"""
This class overrides [`RobertaConfig`]. Please check the superclass for the appropriate
documentation alongside usage examples.
"""

model_type = 'veco'

+143 -0 modelscope/models/nlp/veco/modeling_veco.py

@@ -0,0 +1,143 @@
# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch Veco model. mainly copied from :module:`~transformers.modeling_xlm_roberta`"""

from transformers import (RobertaForMaskedLM, RobertaForMultipleChoice,
RobertaForQuestionAnswering,
RobertaForSequenceClassification,
RobertaForTokenClassification, RobertaModel)
from transformers.file_utils import add_start_docstrings

from modelscope.metainfo import Models
from modelscope.models.builder import BACKBONES
from modelscope.utils import logger as logging
from modelscope.utils.constant import Fields
from .configuration_veco import VecoConfig

logger = logging.get_logger(__name__)

VECO_PRETRAINED_MODEL_ARCHIVE_LIST = []

VECO_START_DOCSTRING = r"""

This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads, etc.).

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
general usage and behavior.

Parameters:
config ([`VecoConfig`]): Model configuration class with all the parameters of the
model. Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model
weights.
"""


@add_start_docstrings(
'The bare Veco Model transformer outputting raw hidden-states without any specific head on top.',
VECO_START_DOCSTRING,
)
class VecoModel(RobertaModel):
"""
This class overrides [`RobertaModel`]. Please check the superclass for the appropriate
documentation alongside usage examples.
"""

config_class = VecoConfig


@add_start_docstrings(
"""
Veco Model transformer with a sequence classification/regression head on top (a linear layer on top of the
pooled output) e.g. for GLUE tasks.
""",
VECO_START_DOCSTRING,
)
class VecoForSequenceClassification(RobertaForSequenceClassification):
"""
This class overrides [`RobertaForSequenceClassification`]. Please check the superclass for the
appropriate documentation alongside usage examples.
"""

config_class = VecoConfig


@add_start_docstrings(
"""
Veco Model transformer with a masked language modeling head on top (a linear layer on top of the
hidden-states output).
""",
VECO_START_DOCSTRING,
)
class VecoForMaskedLM(RobertaForMaskedLM):
"""
This class overrides [`RobertaForMaskedLM`]. Please check the superclass for the
appropriate documentation alongside usage examples.
"""

config_class = VecoConfig


@add_start_docstrings(
"""
Veco Model with a multiple choice classification head on top (a linear layer on top of the pooled output and
a softmax) e.g. for RocStories/SWAG tasks.
""",
VECO_START_DOCSTRING,
)
class VecoForMultipleChoice(RobertaForMultipleChoice):
"""
This class overrides [`RobertaForMultipleChoice`]. Please check the superclass for the
appropriate documentation alongside usage examples.
"""

config_class = VecoConfig


@add_start_docstrings(
"""
Veco Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g.
for Named-Entity-Recognition (NER) tasks.
""",
VECO_START_DOCSTRING,
)
class VecoForTokenClassification(RobertaForTokenClassification):
"""
This class overrides [`RobertaForTokenClassification`]. Please check the superclass for the
appropriate documentation alongside usage examples.
"""

config_class = VecoConfig


@add_start_docstrings(
"""
Veco Model with a span classification head on top for extractive question-answering tasks like SQuAD (a
linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`).
""",
VECO_START_DOCSTRING,
)
class VecoForQuestionAnswering(RobertaForQuestionAnswering):
"""
This class overrides [`RobertaForQuestionAnswering`]. Please check the superclass for the
appropriate documentation alongside usage examples.
"""

config_class = VecoConfig
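
Since every class above only overrides `config_class`, the models behave exactly like their Roberta counterparts. A small sketch that builds a randomly initialized classifier from a tiny config; the sizes are toy values for illustration, and real checkpoints would instead be loaded with `from_pretrained(model_dir)`.

from modelscope.models.nlp.veco import VecoConfig, VecoForSequenceClassification

config = VecoConfig(
    vocab_size=1000,          # toy sizes, illustration only
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=3,
)
model = VecoForSequenceClassification(config)
print(model.config.model_type)  # 'veco'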

+321 -0 modelscope/models/nlp/veco/tokenization_veco.py

@@ -0,0 +1,321 @@
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
"""Tokenization classes for Veco. mainly copied from :module:`~transformers.tokenization_xlm_roberta`"""

import os
from shutil import copyfile
from typing import Any, Dict, List, Optional, Tuple

import sentencepiece as spm
from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer

from modelscope.utils import logger as logging

logger = logging.get_logger(__name__)

SPIECE_UNDERLINE = '▁'

VOCAB_FILES_NAMES = {'vocab_file': 'sentencepiece.bpe.model'}

PRETRAINED_VOCAB_FILES_MAP = {'vocab_file': {}}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}


class VecoTokenizer(PreTrainedTokenizer):
"""
Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`]. Based on
[SentencePiece](https://github.com/google/sentencepiece).

This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.

Args:
vocab_file (`str`):
Path to the vocabulary file.
bos_token (`str`, *optional*, defaults to `"<s>"`):
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.

<Tip>

When building a sequence using special tokens, this is not the token that is used for the beginning of
sequence. The token used is the `cls_token`.

</Tip>

eos_token (`str`, *optional*, defaults to `"</s>"`):
The end of sequence token.

<Tip>

When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the `sep_token`.

</Tip>

sep_token (`str`, *optional*, defaults to `"</s>"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
cls_token (`str`, *optional*, defaults to `"<s>"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
mask_token (`str`, *optional*, defaults to `"<mask>"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`):
Additional special tokens used by the tokenizer.
sp_model_kwargs (`dict`, *optional*):
Will be passed to the `SentencePieceProcessor.__init__()` method.
The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python)
can be used, among other things, to set:

- `enable_sampling`: Enable subword regularization.
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.

- `nbest_size = {0,1}`: No sampling is performed.
- `nbest_size > 1`: samples from the nbest_size results.
- `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
using forward-filtering-and-backward-sampling algorithm.

- `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
BPE-dropout.

Attributes:
sp_model (`SentencePieceProcessor`):
The *SentencePiece* processor that is used for every conversion (string, tokens and IDs).
"""

vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ['input_ids', 'attention_mask']

def __init__(self,
vocab_file,
bos_token='<s>',
eos_token='</s>',
sep_token='</s>',
cls_token='<s>',
unk_token='<unk>',
pad_token='<pad>',
mask_token='<mask>',
sp_model_kwargs: Optional[Dict[str, Any]] = None,
**kwargs) -> None:
# Mask token behaves like a normal word, i.e. it includes the space before it
mask_token = AddedToken(
mask_token, lstrip=True, rstrip=False) if isinstance(
mask_token, str) else mask_token

self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
sp_model_kwargs=self.sp_model_kwargs,
**kwargs,
)

self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(str(vocab_file))
self.vocab_file = vocab_file

# Original fairseq vocab and spm vocab must be "aligned":
# Vocab | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
# fairseq | '<s>' | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's' | '▁de' | '-'
# spm | '<unk>' | '<s>' | '</s>' | ',' | '.' | '▁' | 's' | '▁de' | '-' | '▁a'

# Mimic fairseq token-to-id alignment for the first 4 tokens
self.fairseq_tokens_to_ids = {
'<s>': 0,
'<pad>': 1,
'</s>': 2,
'<unk>': 3
}

# The first "real" token "," has position 4 in the original fairseq vocab and position 3 in the spm vocab
self.fairseq_offset = 1

self.fairseq_tokens_to_ids['<mask>'] = len(
self.sp_model) + self.fairseq_offset
self.fairseq_ids_to_tokens = {
v: k
for k, v in self.fairseq_tokens_to_ids.items()
}

def __getstate__(self):
state = self.__dict__.copy()
state['sp_model'] = None
state['sp_model_proto'] = self.sp_model.serialized_model_proto()
return state

def __setstate__(self, d):
self.__dict__ = d

# for backward compatibility
if not hasattr(self, 'sp_model_kwargs'):
self.sp_model_kwargs = {}

self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.LoadFromSerializedProto(self.sp_model_proto)

def build_inputs_with_special_tokens(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. A Veco sequence has the following format:

- single sequence: `<s> X </s>`
- pair of sequences: `<s> A </s></s> B </s>`

Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.

Returns:
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""

if token_ids_1 is None:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
cls = [self.cls_token_id]
sep = [self.sep_token_id]
return cls + token_ids_0 + sep + sep + token_ids_1 + sep

def get_special_tokens_mask(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None,
already_has_special_tokens: bool = False) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer `prepare_for_model` method.

Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not the token list is already formatted with special tokens for the model.

Returns:
`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""

if already_has_special_tokens:
return super().get_special_tokens_mask(
token_ids_0=token_ids_0,
token_ids_1=token_ids_1,
already_has_special_tokens=True)

if token_ids_1 is None:
return [1] + ([0] * len(token_ids_0)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1, 1] + (
[0] * len(token_ids_1)) + [1]

def create_token_type_ids_from_sequences(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. Veco does
not make use of token type ids, therefore a list of zeros is returned.

Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.

Returns:
`List[int]`: List of zeros.

"""

sep = [self.sep_token_id]
cls = [self.cls_token_id]

if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]

@property
def vocab_size(self):
return len(
self.sp_model) + self.fairseq_offset + 1 # Add the <mask> token

def get_vocab(self):
vocab = {
self.convert_ids_to_tokens(i): i
for i in range(self.vocab_size)
}
vocab.update(self.added_tokens_encoder)
return vocab

def _tokenize(self, text: str) -> List[str]:
return self.sp_model.encode(text, out_type=str)

def _convert_token_to_id(self, token):
"""Converts a token (str) in an id using the vocab."""
if token in self.fairseq_tokens_to_ids:
return self.fairseq_tokens_to_ids[token]
spm_id = self.sp_model.PieceToId(token)

# Need to return unknown token if the SP model returned 0
return spm_id + self.fairseq_offset if spm_id else self.unk_token_id

def _convert_id_to_token(self, index):
"""Converts an index (integer) in a token (str) using the vocab."""
if index in self.fairseq_ids_to_tokens:
return self.fairseq_ids_to_tokens[index]
return self.sp_model.IdToPiece(index - self.fairseq_offset)

def convert_tokens_to_string(self, tokens):
"""Converts a sequence of tokens (strings for sub-words) in a single string."""
out_string = ''.join(tokens).replace(SPIECE_UNDERLINE, ' ').strip()
return out_string

def save_vocabulary(self,
save_directory: str,
filename_prefix: Optional[str] = None) -> Tuple[str]:
if not os.path.isdir(save_directory):
logger.error(
f'Vocabulary path ({save_directory}) should be a directory')
return
out_vocab_file = os.path.join(
save_directory, (filename_prefix + '-' if filename_prefix else '')
+ VOCAB_FILES_NAMES['vocab_file'])

if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
copyfile(self.vocab_file, out_vocab_file)

return (out_vocab_file, )
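
A standalone illustration of the special-token layout documented in build_inputs_with_special_tokens, using the fairseq ids from the alignment table in __init__ (`<s>` = 0, `</s>` = 2); the content ids are made up.

def pair_layout(cls_id, sep_id, ids_a, ids_b=None):
    # Same layout as VecoTokenizer.build_inputs_with_special_tokens:
    #   single sequence: <s> A </s>
    #   sentence pair:   <s> A </s></s> B </s>
    if ids_b is None:
        return [cls_id] + ids_a + [sep_id]
    return [cls_id] + ids_a + [sep_id, sep_id] + ids_b + [sep_id]


assert pair_layout(0, 2, [10, 11]) == [0, 10, 11, 2]
assert pair_layout(0, 2, [10, 11], [20, 21]) == [0, 10, 11, 2, 2, 20, 21, 2]

For ordinary pieces, _convert_token_to_id adds the fairseq_offset of 1 to the sentencepiece id, so a piece with spm id 3 maps to id 4, matching the alignment table in __init__.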

+213 -0 modelscope/models/nlp/veco/tokenization_veco_fast.py

@@ -0,0 +1,213 @@
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
"""Fast Tokenization classes for Veco. mainly copied from :module:`~transformers.tokenization_xlm_roberta_fast`"""

import os
from shutil import copyfile
from typing import List, Optional, Tuple

import transformers
from transformers.file_utils import is_sentencepiece_available
from transformers.tokenization_utils import AddedToken
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast

from modelscope.utils import logger as logging

if is_sentencepiece_available():
from .tokenization_veco import VecoTokenizer
else:
VecoTokenizer = None

logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {
'vocab_file': 'sentencepiece.bpe.model',
'tokenizer_file': 'tokenizer.json'
}

PRETRAINED_VOCAB_FILES_MAP = {
'vocab_file': {},
'tokenizer_file': {},
}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}

transformers.SLOW_TO_FAST_CONVERTERS[
'VecoTokenizer'] = transformers.SLOW_TO_FAST_CONVERTERS[
'XLMRobertaTokenizer']


class VecoTokenizerFast(PreTrainedTokenizerFast):
"""
Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`].
Based on [BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models).

This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.

Args:
vocab_file (`str`):
Path to the vocabulary file.
bos_token (`str`, *optional*, defaults to `"<s>"`):
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.

<Tip>

When building a sequence using special tokens, this is not the token that is used for the beginning of
sequence. The token used is the `cls_token`.

</Tip>

eos_token (`str`, *optional*, defaults to `"</s>"`):
The end of sequence token.

<Tip>

When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the `sep_token`.

</Tip>

sep_token (`str`, *optional*, defaults to `"</s>"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
cls_token (`str`, *optional*, defaults to `"<s>"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
mask_token (`str`, *optional*, defaults to `"<mask>"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`):
Additional special tokens used by the tokenizer.
"""

vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ['input_ids', 'attention_mask']
slow_tokenizer_class = VecoTokenizer

def __init__(self,
vocab_file=None,
tokenizer_file=None,
bos_token='<s>',
eos_token='</s>',
sep_token='</s>',
cls_token='<s>',
unk_token='<unk>',
pad_token='<pad>',
mask_token='<mask>',
**kwargs):
# Mask token behaves like a normal word, i.e. it includes the space before it
mask_token = AddedToken(
mask_token, lstrip=True, rstrip=False) if isinstance(
mask_token, str) else mask_token

super().__init__(
vocab_file,
tokenizer_file=tokenizer_file,
bos_token=bos_token,
eos_token=eos_token,
sep_token=sep_token,
cls_token=cls_token,
unk_token=unk_token,
pad_token=pad_token,
mask_token=mask_token,
**kwargs,
)

self.vocab_file = vocab_file
self.can_save_slow_tokenizer = False if not self.vocab_file else True

def build_inputs_with_special_tokens(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. A Veco sequence has the following format:

- single sequence: `<s> X </s>`
- pair of sequences: `<s> A </s></s> B </s>`

Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.

Returns:
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""

if token_ids_1 is None:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
cls = [self.cls_token_id]
sep = [self.sep_token_id]
return cls + token_ids_0 + sep + sep + token_ids_1 + sep

def create_token_type_ids_from_sequences(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. Veco does
not make use of token type ids, therefore a list of zeros is returned.

Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.

Returns:
`List[int]`: List of zeros.

"""

sep = [self.sep_token_id]
cls = [self.cls_token_id]

if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]

def save_vocabulary(self,
save_directory: str,
filename_prefix: Optional[str] = None) -> Tuple[str]:
if not self.can_save_slow_tokenizer:
raise ValueError(
'Your fast tokenizer does not have the necessary information to save the vocabulary for a slow '
'tokenizer.')

if not os.path.isdir(save_directory):
logger.error(
f'Vocabulary path ({save_directory}) should be a directory.')
return
out_vocab_file = os.path.join(
save_directory, (filename_prefix + '-' if filename_prefix else '')
+ VOCAB_FILES_NAMES['vocab_file'])

if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
copyfile(self.vocab_file, out_vocab_file)

return (out_vocab_file, )

+7 -0 modelscope/msdatasets/ms_dataset.py

@@ -517,3 +517,10 @@ class MsDataset:
def to_hf_dataset(self) -> Dataset:
self._hf_ds.reset_format()
return self._hf_ds

@staticmethod
def interleave_datasets(datasets: List[Any],
probabilities: Optional[List[float]] = None,
seed: Optional[int] = None):
from datasets import interleave_datasets
return interleave_datasets(datasets, probabilities, seed)
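
A usage sketch for the new static helper, building two toy Hugging Face datasets in memory; in a real run these could be per-language splits converted via to_hf_dataset(). The 70/30 mixing probabilities and the seed are arbitrary example values.

from datasets import Dataset

from modelscope.msdatasets.ms_dataset import MsDataset

ds_en = Dataset.from_dict({'text': ['a', 'b', 'c']})
ds_zh = Dataset.from_dict({'text': ['甲', '乙', '丙']})

mixed = MsDataset.interleave_datasets([ds_en, ds_zh],
                                      probabilities=[0.7, 0.3],
                                      seed=42)
print(mixed[0])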

+1 -0 modelscope/outputs.py

@@ -9,6 +9,7 @@ class OutputKeys(object):
SCORES = 'scores'
LABEL = 'label'
LABELS = 'labels'
INPUT_IDS = 'input_ids'
LABEL_POS = 'label_pos'
POSES = 'poses'
CAPTION = 'caption'


+6 -7 modelscope/pipelines/nlp/__init__.py

@@ -9,9 +9,8 @@ if TYPE_CHECKING:
from .dialog_state_tracking_pipeline import DialogStateTrackingPipeline
from .fill_mask_pipeline import FillMaskPipeline
from .named_entity_recognition_pipeline import NamedEntityRecognitionPipeline
from .nli_pipeline import NLIPipeline
from .sentence_similarity_pipeline import SentenceSimilarityPipeline
from .sentiment_classification_pipeline import SentimentClassificationPipeline
from .pair_sentence_classification_pipeline import PairSentenceClassificationPipeline
from .single_sentence_classification_pipeline import SingleSentenceClassificationPipeline
from .sequence_classification_pipeline import SequenceClassificationPipeline
from .text_generation_pipeline import TextGenerationPipeline
from .translation_pipeline import TranslationPipeline
@@ -28,10 +27,10 @@ else:
'dialog_modeling_pipeline': ['DialogModelingPipeline'],
'dialog_state_tracking_pipeline': ['DialogStateTrackingPipeline'],
'fill_mask_pipeline': ['FillMaskPipeline'],
'nli_pipeline': ['NLIPipeline'],
'sentence_similarity_pipeline': ['SentenceSimilarityPipeline'],
'sentiment_classification_pipeline':
['SentimentClassificationPipeline'],
'single_sentence_classification_pipeline':
['SingleSentenceClassificationPipeline'],
'pair_sentence_classification_pipeline':
['PairSentenceClassificationPipeline'],
'sequence_classification_pipeline': ['SequenceClassificationPipeline'],
'text_generation_pipeline': ['TextGenerationPipeline'],
'word_segmentation_pipeline': ['WordSegmentationPipeline'],


+10 -11 modelscope/pipelines/nlp/fill_mask_pipeline.py

@@ -5,11 +5,10 @@ import torch

from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.models.nlp.masked_language import MaskedLanguageModelBase
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import FillMaskPreprocessor
from modelscope.preprocessors import FillMaskPreprocessor, Preprocessor
from modelscope.utils.config import Config
from modelscope.utils.constant import ModelFile, Tasks

@@ -21,18 +20,18 @@ _type_map = {'veco': 'roberta', 'sbert': 'bert'}
class FillMaskPipeline(Pipeline):

def __init__(self,
model: Union[MaskedLanguageModelBase, str],
preprocessor: Optional[FillMaskPreprocessor] = None,
first_sequence='sentense',
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
first_sequence='sentence',
**kwargs):
"""use `model` and `preprocessor` to create a nlp fill mask pipeline for prediction

Args:
model (MaskedLanguageModelBase): a model instance
preprocessor (FillMaskPreprocessor): a preprocessor instance
model (Model): a model instance
preprocessor (Preprocessor): a preprocessor instance
"""
fill_mask_model = model if isinstance(
model, MaskedLanguageModelBase) else Model.from_pretrained(model)
model, Model) else Model.from_pretrained(model)

if preprocessor is None:
preprocessor = FillMaskPreprocessor(
@@ -73,7 +72,7 @@ class FillMaskPipeline(Pipeline):
def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:
with torch.no_grad():
return super().forward(inputs, **forward_params)
return self.model(inputs, **forward_params)

def postprocess(self, inputs: Dict[str, Tensor]) -> Dict[str, Tensor]:
"""process the prediction results
@@ -85,8 +84,8 @@ class FillMaskPipeline(Pipeline):
Dict[str, str]: the prediction results
"""
import numpy as np
logits = inputs['logits'].detach().cpu().numpy()
input_ids = inputs['input_ids'].detach().cpu().numpy()
logits = inputs[OutputKeys.LOGITS].detach().cpu().numpy()
input_ids = inputs[OutputKeys.INPUT_IDS].detach().cpu().numpy()
pred_ids = np.argmax(logits, axis=-1)
model_type = self.model.config.model_type
process_type = model_type if model_type in self.mask_id else _type_map[
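
With the pipeline now accepting any Model, a fill-mask run looks roughly like the sketch below; 'damo/some-fill-mask-model' is a placeholder model id, and the mask token depends on the underlying tokenizer (e.g. [MASK] for StructBERT-style vocabularies, <mask> for Veco).

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

fill_mask = pipeline(Tasks.fill_mask, model='damo/some-fill-mask-model')
print(fill_mask('Everything is in its right [MASK].'))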


+5 -7 modelscope/pipelines/nlp/named_entity_recognition_pipeline.py

@@ -4,11 +4,10 @@ import torch

from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.models.nlp import TransformerCRFForNamedEntityRecognition
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import NERPreprocessor
from modelscope.preprocessors import NERPreprocessor, Preprocessor
from modelscope.utils.constant import Tasks

__all__ = ['NamedEntityRecognitionPipeline']
@@ -20,13 +19,12 @@ __all__ = ['NamedEntityRecognitionPipeline']
class NamedEntityRecognitionPipeline(Pipeline):

def __init__(self,
model: Union[TransformerCRFForNamedEntityRecognition, str],
preprocessor: Optional[NERPreprocessor] = None,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
**kwargs):

model = model if isinstance(model,
TransformerCRFForNamedEntityRecognition
) else Model.from_pretrained(model)
Model) else Model.from_pretrained(model)
if preprocessor is None:
preprocessor = NERPreprocessor(model.model_dir)
model.eval()


+0 -73 modelscope/pipelines/nlp/nli_pipeline.py

@@ -1,73 +0,0 @@
import uuid
from typing import Any, Dict, Union

import numpy as np
import torch

from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.models.nlp import SbertForNLI
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import NLIPreprocessor
from modelscope.utils.constant import Tasks

__all__ = ['NLIPipeline']


@PIPELINES.register_module(Tasks.nli, module_name=Pipelines.nli)
class NLIPipeline(Pipeline):

def __init__(self,
model: Union[SbertForNLI, str],
preprocessor: NLIPreprocessor = None,
first_sequence='first_sequence',
second_sequence='second_sequence',
**kwargs):
"""use `model` and `preprocessor` to create a nlp text classification pipeline for prediction

Args:
model (SbertForNLI): a model instance
preprocessor (NLIPreprocessor): a preprocessor instance
"""
assert isinstance(model, str) or isinstance(model, SbertForNLI), \
'model must be a single str or SbertForNLI'
model = model if isinstance(
model, SbertForNLI) else Model.from_pretrained(model)
if preprocessor is None:
preprocessor = NLIPreprocessor(
model.model_dir,
first_sequence=first_sequence,
second_sequence=second_sequence)
model.eval()
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
assert len(model.id2label) > 0

def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:
with torch.no_grad():
return super().forward(inputs, **forward_params)

def postprocess(self,
inputs: Dict[str, Any],
topk: int = 5) -> Dict[str, str]:
"""process the prediction results

Args:
inputs (Dict[str, Any]): _description_

Returns:
Dict[str, str]: the prediction results
"""

probs = inputs['probabilities'][0]
num_classes = probs.shape[0]
topk = min(topk, num_classes)
top_indices = np.argpartition(probs, -topk)[-topk:]
cls_ids = top_indices[np.argsort(probs[top_indices])]
probs = probs[cls_ids].tolist()

cls_names = [self.model.id2label[cid] for cid in cls_ids]

return {OutputKeys.SCORES: probs, OutputKeys.LABELS: cls_names}

+37 -0 modelscope/pipelines/nlp/pair_sentence_classification_pipeline.py

@@ -0,0 +1,37 @@
from typing import Union

from modelscope.models.base import Model
from ...metainfo import Pipelines
from ...preprocessors import (PairSentenceClassificationPreprocessor,
Preprocessor)
from ...utils.constant import Tasks
from ..builder import PIPELINES
from .sequence_classification_pipeline_base import \
SequenceClassificationPipelineBase

__all__ = ['PairSentenceClassificationPipeline']


@PIPELINES.register_module(Tasks.nli, module_name=Pipelines.nli)
@PIPELINES.register_module(
Tasks.sentence_similarity, module_name=Pipelines.sentence_similarity)
class PairSentenceClassificationPipeline(SequenceClassificationPipelineBase):

def __init__(self,
model: Union[Model, str],
preprocessor: Preprocessor = None,
first_sequence='first_sequence',
second_sequence='second_sequence',
**kwargs):
"""use `model` and `preprocessor` to create a nlp pair sentence classification pipeline for prediction

Args:
model (Model): a model instance
preprocessor (Preprocessor): a preprocessor instance
"""
if preprocessor is None:
preprocessor = PairSentenceClassificationPreprocessor(
model.model_dir if isinstance(model, Model) else model,
first_sequence=first_sequence,
second_sequence=second_sequence)
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
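
The same class now backs both the sentence-similarity and nli registrations, so a usage sketch only differs in the task name and checkpoint. The model id below is a placeholder, and the sentence pair is passed as a tuple, matching the pair preprocessor's input format.

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

similarity = pipeline(Tasks.sentence_similarity,
                      model='damo/some-sbert-similarity-model')
print(similarity(('The weather is nice today.', 'It is sunny outside.')))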

+0 -73 modelscope/pipelines/nlp/sentence_similarity_pipeline.py

@@ -1,73 +0,0 @@
from typing import Any, Dict, Union

import numpy as np
import torch

from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.models.nlp import SbertForSentenceSimilarity
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Input, Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import SentenceSimilarityPreprocessor
from modelscope.utils.constant import Tasks

__all__ = ['SentenceSimilarityPipeline']


@PIPELINES.register_module(
Tasks.sentence_similarity, module_name=Pipelines.sentence_similarity)
class SentenceSimilarityPipeline(Pipeline):

def __init__(self,
model: Union[Model, str],
preprocessor: SentenceSimilarityPreprocessor = None,
first_sequence='first_sequence',
second_sequence='second_sequence',
**kwargs):
"""use `model` and `preprocessor` to create a nlp sentence similarity pipeline for prediction

Args:
model (SbertForSentenceSimilarity): a model instance
preprocessor (SentenceSimilarityPreprocessor): a preprocessor instance
"""
assert isinstance(model, str) or isinstance(model, SbertForSentenceSimilarity), \
'model must be a single str or SbertForSentenceSimilarity'
sc_model = model if isinstance(
model,
SbertForSentenceSimilarity) else Model.from_pretrained(model)
if preprocessor is None:
preprocessor = SentenceSimilarityPreprocessor(
sc_model.model_dir,
first_sequence=first_sequence,
second_sequence=second_sequence)
sc_model.eval()
super().__init__(model=sc_model, preprocessor=preprocessor, **kwargs)

assert hasattr(self.model, 'id2label'), \
'id2label map should be initalizaed in init function.'

def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:
with torch.no_grad():
return super().forward(inputs, **forward_params)

def postprocess(self, inputs: Dict[str, Any],
**postprocess_params) -> Dict[str, str]:
"""process the prediction results

Args:
inputs (Dict[str, Any]): _description_

Returns:
Dict[str, str]: the prediction results
"""

probs = inputs['probabilities'][0]
num_classes = probs.shape[0]
top_indices = np.argpartition(probs, -num_classes)[-num_classes:]
cls_ids = top_indices[np.argsort(-probs[top_indices], axis=-1)]
probs = probs[cls_ids].tolist()
cls_names = [self.model.id2label[cid] for cid in cls_ids]
b = 0
return {OutputKeys.SCORES: probs[b], OutputKeys.LABELS: cls_names[b]}

+0 -74 modelscope/pipelines/nlp/sentiment_classification_pipeline.py

@@ -1,74 +0,0 @@
from typing import Any, Dict, Union

import numpy as np
import torch

from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.models.nlp import SequenceClassificationModel
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import SentimentClassificationPreprocessor
from modelscope.utils.constant import Tasks

__all__ = ['SentimentClassificationPipeline']


@PIPELINES.register_module(
Tasks.sentiment_classification,
module_name=Pipelines.sentiment_classification)
class SentimentClassificationPipeline(Pipeline):

def __init__(self,
model: Union[SequenceClassificationModel, str],
preprocessor: SentimentClassificationPreprocessor = None,
first_sequence='first_sequence',
second_sequence='second_sequence',
**kwargs):
"""use `model` and `preprocessor` to create a nlp text classification pipeline for prediction

Args:
model (SequenceClassificationModel): a model instance
preprocessor (SentimentClassificationPreprocessor): a preprocessor instance
"""
assert isinstance(model, str) or isinstance(model, SequenceClassificationModel), \
'model must be a single str or SentimentClassification'
model = model if isinstance(
model,
SequenceClassificationModel) else Model.from_pretrained(model)
if preprocessor is None:
preprocessor = SentimentClassificationPreprocessor(
model.model_dir,
first_sequence=first_sequence,
second_sequence=second_sequence)
model.eval()
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
assert len(model.id2label) > 0

def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:
with torch.no_grad():
return super().forward(inputs, **forward_params)

def postprocess(self,
inputs: Dict[str, Any],
topk: int = 5) -> Dict[str, str]:
"""process the prediction results

Args:
inputs (Dict[str, Any]): _description_

Returns:
Dict[str, str]: the prediction results
"""

probs = inputs['probabilities'][0]
num_classes = probs.shape[0]
topk = min(topk, num_classes)
top_indices = np.argpartition(probs, -topk)[-topk:]
cls_ids = top_indices[np.argsort(probs[top_indices])]
probs = probs[cls_ids].tolist()

cls_names = [self.model.id2label[cid] for cid in cls_ids]
return {OutputKeys.SCORES: probs, OutputKeys.LABELS: cls_names}

+60 -0 modelscope/pipelines/nlp/sequence_classification_pipeline_base.py

@@ -0,0 +1,60 @@
from typing import Any, Dict, Union

import numpy as np
import torch

from modelscope.models.base import Model
from modelscope.outputs import OutputKeys
from ...preprocessors import Preprocessor
from ..base import Pipeline


class SequenceClassificationPipelineBase(Pipeline):

def __init__(self, model: Union[Model, str], preprocessor: Preprocessor,
**kwargs):
"""use `model` and `preprocessor` to create a nlp text classification pipeline for prediction

Args:
model (str or Model): a model instance
preprocessor (Preprocessor): a preprocessor instance
"""
assert isinstance(model, str) or isinstance(model, Model), \
'model must be a single str or Model'
model = model if isinstance(model,
Model) else Model.from_pretrained(model)
assert preprocessor is not None
model.eval()
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
self.id2label = kwargs.get('id2label')
if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
self.id2label = self.preprocessor.id2label
assert self.id2label is not None, 'Cannot convert id to the original label, please pass in the mapping ' \
'as a parameter or make sure the preprocessor has the attribute.'

def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:
with torch.no_grad():
return self.model(inputs, **forward_params)

def postprocess(self,
inputs: Dict[str, Any],
topk: int = 5) -> Dict[str, str]:
"""process the prediction results

Args:
inputs (Dict[str, Any]): the model outputs, containing a 'probabilities' entry
topk (int): The topk probs to take
Returns:
Dict[str, str]: the prediction results
"""

probs = inputs[OutputKeys.PROBABILITIES][0]
num_classes = probs.shape[0]
topk = min(topk, num_classes)
top_indices = np.argpartition(probs, -topk)[-topk:]
cls_ids = top_indices[np.argsort(probs[top_indices])]
probs = probs[cls_ids].tolist()

cls_names = [self.id2label[cid] for cid in cls_ids]
return {OutputKeys.SCORES: probs, OutputKeys.LABELS: cls_names}
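
A standalone walk-through of the top-k selection in postprocess above, with made-up probabilities and labels; note that the argsort leaves the returned scores in ascending order.

import numpy as np

probs = np.array([0.05, 0.7, 0.15, 0.1])
id2label = {0: 'neutral', 1: 'positive', 2: 'negative', 3: 'other'}
topk = 3

top_indices = np.argpartition(probs, -topk)[-topk:]    # unsorted top-k indices
cls_ids = top_indices[np.argsort(probs[top_indices])]  # sorted ascending by prob
print([id2label[i] for i in cls_ids])   # ['other', 'negative', 'positive']
print(probs[cls_ids].tolist())          # [0.1, 0.15, 0.7]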

+35 -0 modelscope/pipelines/nlp/single_sentence_classification_pipeline.py

@@ -0,0 +1,35 @@
from typing import Union

from ...metainfo import Pipelines
from ...models import Model
from ...preprocessors import (Preprocessor,
SingleSentenceClassificationPreprocessor)
from ...utils.constant import Tasks
from ..builder import PIPELINES
from .sequence_classification_pipeline_base import \
SequenceClassificationPipelineBase

__all__ = ['SingleSentenceClassificationPipeline']


@PIPELINES.register_module(
Tasks.sentiment_classification,
module_name=Pipelines.sentiment_classification)
class SingleSentenceClassificationPipeline(SequenceClassificationPipelineBase):

def __init__(self,
model: Union[Model, str],
preprocessor: Preprocessor = None,
first_sequence='first_sequence',
**kwargs):
"""use `model` and `preprocessor` to create a nlp single sentence classification pipeline for prediction

Args:
model (Model): a model instance
preprocessor (Preprocessor): a preprocessor instance
"""
if preprocessor is None:
preprocessor = SingleSentenceClassificationPreprocessor(
model.model_dir if isinstance(model, Model) else model,
first_sequence=first_sequence)
super().__init__(model=model, preprocessor=preprocessor, **kwargs)

+4 -4 modelscope/pipelines/nlp/text_generation_pipeline.py

@@ -3,7 +3,7 @@ from typing import Any, Dict, Optional, Union
import torch

from modelscope.metainfo import Pipelines
from modelscope.models.base import TorchModel
from modelscope.models.base import Model
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import TextGenerationPreprocessor
@@ -17,7 +17,7 @@ __all__ = ['TextGenerationPipeline']
class TextGenerationPipeline(Pipeline):

def __init__(self,
model: Union[TorchModel, str],
model: Union[Model, str],
preprocessor: Optional[TextGenerationPreprocessor] = None,
**kwargs):
"""use `model` and `preprocessor` to create a nlp text generation pipeline for prediction
@@ -26,8 +26,8 @@ class TextGenerationPipeline(Pipeline):
model (PalmForTextGeneration): a model instance
preprocessor (TextGenerationPreprocessor): a preprocessor instance
"""
model = model if isinstance(
model, TorchModel) else TorchModel.from_pretrained(model)
model = model if isinstance(model,
Model) else Model.from_pretrained(model)
if preprocessor is None:
preprocessor = TextGenerationPreprocessor(
model.model_dir,


+1 -3 modelscope/pipelines/nlp/translation_pipeline.py

@@ -4,11 +4,9 @@ from typing import Any, Dict
import numpy as np
import tensorflow as tf

from modelscope.hub.snapshot_download import snapshot_download
from modelscope.metainfo import Pipelines
from modelscope.models.nlp import CsanmtForTranslation
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.utils.constant import ModelFile, Tasks
from modelscope.utils.logger import get_logger


+19 -17 modelscope/pipelines/nlp/word_segmentation_pipeline.py

@@ -4,11 +4,11 @@ import torch

from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.models.nlp import SbertForTokenClassification
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import TokenClassificationPreprocessor
from modelscope.preprocessors import (Preprocessor,
TokenClassificationPreprocessor)
from modelscope.utils.constant import Tasks

__all__ = ['WordSegmentationPipeline']
@@ -18,33 +18,35 @@ __all__ = ['WordSegmentationPipeline']
Tasks.word_segmentation, module_name=Pipelines.word_segmentation)
class WordSegmentationPipeline(Pipeline):

def __init__(
self,
model: Union[SbertForTokenClassification, str],
preprocessor: Optional[TokenClassificationPreprocessor] = None,
**kwargs):
def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
**kwargs):
"""use `model` and `preprocessor` to create a nlp word segmentation pipeline for prediction

Args:
model (StructBertForTokenClassification): a model instance
preprocessor (TokenClassificationPreprocessor): a preprocessor instance
model (Model): a model instance
preprocessor (Preprocessor): a preprocessor instance
"""
model = model if isinstance(
model,
SbertForTokenClassification) else Model.from_pretrained(model)
model = model if isinstance(model,
Model) else Model.from_pretrained(model)
if preprocessor is None:
preprocessor = TokenClassificationPreprocessor(model.model_dir)
model.eval()
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
self.tokenizer = preprocessor.tokenizer
self.config = model.config
assert len(self.config.id2label) > 0
self.id2label = self.config.id2label
self.id2label = kwargs.get('id2label')
if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
self.id2label = self.preprocessor.id2label
assert self.id2label is not None, 'Cannot convert id to the original label, please pass in the mapping ' \
'as a parameter or make sure the preprocessor has the attribute.'

def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:
text = inputs.pop(OutputKeys.TEXT)
with torch.no_grad():
return super().forward(inputs, **forward_params)
return {
**self.model(inputs, **forward_params), OutputKeys.TEXT: text
}

def postprocess(self, inputs: Dict[str, Any],
**postprocess_params) -> Dict[str, str]:


+13 -14 modelscope/pipelines/nlp/zero_shot_classification_pipeline.py

@@ -5,11 +5,11 @@ from scipy.special import softmax

from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.models.nlp import SbertForZeroShotClassification
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import ZeroShotClassificationPreprocessor
from modelscope.preprocessors import (Preprocessor,
ZeroShotClassificationPreprocessor)
from modelscope.utils.constant import Tasks

__all__ = ['ZeroShotClassificationPipeline']
@@ -21,19 +21,18 @@ __all__ = ['ZeroShotClassificationPipeline']
class ZeroShotClassificationPipeline(Pipeline):

def __init__(self,
model: Union[SbertForZeroShotClassification, str],
preprocessor: ZeroShotClassificationPreprocessor = None,
model: Union[Model, str],
preprocessor: Preprocessor = None,
**kwargs):
"""use `model` and `preprocessor` to create a nlp text classification pipeline for prediction
"""use `model` and `preprocessor` to create a nlp zero-shot text classification pipeline for prediction
Args:
model (SbertForZeroShotClassification): a model instance
preprocessor (SentimentClassificationPreprocessor): a preprocessor instance
model (Model): a model instance
preprocessor (Preprocessor): a preprocessor instance
"""
assert isinstance(model, str) or isinstance(model, SbertForZeroShotClassification), \
'model must be a single str or SbertForZeroShotClassification'
model = model if isinstance(
model,
SbertForZeroShotClassification) else Model.from_pretrained(model)
assert isinstance(model, str) or isinstance(model, Model), \
'model must be a single str or Model'
model = model if isinstance(model,
Model) else Model.from_pretrained(model)
self.entailment_id = 0
self.contradiction_id = 2
if preprocessor is None:
@@ -58,7 +57,7 @@ class ZeroShotClassificationPipeline(Pipeline):
def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:
with torch.no_grad():
return super().forward(inputs, **forward_params)
return self.model(inputs, **forward_params)

def postprocess(self,
inputs: Dict[str, Any],
@@ -70,7 +69,7 @@ class ZeroShotClassificationPipeline(Pipeline):
Returns:
Dict[str, Any]: the prediction results
"""
logits = inputs['logits']
logits = inputs[OutputKeys.LOGITS]
if multi_label or len(candidate_labels) == 1:
logits = logits[..., [self.contradiction_id, self.entailment_id]]
scores = softmax(logits, axis=-1)[..., 1]
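
A standalone illustration of the zero-shot scoring above: for every candidate label only the contradiction and entailment logits are kept, and the entailment probability becomes the score (the multi-label branch). The logits below are made up, with the same entailment_id=0 / contradiction_id=2 convention as the pipeline.

import numpy as np
from scipy.special import softmax

logits = np.array([
    [2.1, 0.3, -1.0],   # candidate 1: [entailment, neutral, contradiction]
    [-0.5, 0.1, 1.7],   # candidate 2
])
entailment_id, contradiction_id = 0, 2

pair_logits = logits[..., [contradiction_id, entailment_id]]
scores = softmax(pair_logits, axis=-1)[..., 1]
print(scores.round(3))   # entailment probability per candidate label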


+7 -7 modelscope/preprocessors/__init__.py

@@ -18,11 +18,11 @@ if TYPE_CHECKING:
MPlugVisualQuestionAnsweringPreprocessor)
from .nlp import (Tokenize, SequenceClassificationPreprocessor,
TextGenerationPreprocessor,
TokenClassificationPreprocessor, NLIPreprocessor,
SentimentClassificationPreprocessor,
SentenceSimilarityPreprocessor, FillMaskPreprocessor,
ZeroShotClassificationPreprocessor, NERPreprocessor,
TextErrorCorrectionPreprocessor)
TokenClassificationPreprocessor,
SingleSentenceClassificationPreprocessor,
PairSentenceClassificationPreprocessor,
FillMaskPreprocessor, ZeroShotClassificationPreprocessor,
NERPreprocessor, TextErrorCorrectionPreprocessor)
from .space import (DialogIntentPredictionPreprocessor,
DialogModelingPreprocessor,
DialogStateTrackingPreprocessor)
@@ -46,8 +46,8 @@ else:
'nlp': [
'Tokenize', 'SequenceClassificationPreprocessor',
'TextGenerationPreprocessor', 'TokenClassificationPreprocessor',
'NLIPreprocessor', 'SentimentClassificationPreprocessor',
'SentenceSimilarityPreprocessor', 'FillMaskPreprocessor',
'SingleSentenceClassificationPreprocessor',
'PairSentenceClassificationPreprocessor', 'FillMaskPreprocessor',
'ZeroShotClassificationPreprocessor', 'NERPreprocessor',
'TextErrorCorrectionPreprocessor'
],


+3 -1 modelscope/preprocessors/base.py

@@ -1,5 +1,5 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
import os
from abc import ABC, abstractmethod
from typing import Any, Dict

@@ -10,6 +10,8 @@ class Preprocessor(ABC):

def __init__(self, *args, **kwargs):
self._mode = ModeKeys.INFERENCE
self.device = int(
os.environ['LOCAL_RANK']) if 'LOCAL_RANK' in os.environ else None
pass

@abstractmethod
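
The base Preprocessor now records a device index from the LOCAL_RANK environment variable, which distributed launchers such as torchrun set per worker. A minimal sketch of how such a value is typically consumed when moving preprocessed tensors onto the right GPU; the names below are illustrative, not part of the library.

import os

import torch

local_rank = int(os.environ['LOCAL_RANK']) if 'LOCAL_RANK' in os.environ else None
device = (torch.device(f'cuda:{local_rank}')
          if local_rank is not None else torch.device('cpu'))

batch = {'input_ids': torch.tensor([[101, 102]])}
batch = {k: v.to(device) for k, v in batch.items()}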


+302 -204 modelscope/preprocessors/nlp.py

@@ -2,14 +2,14 @@

import os.path as osp
import uuid
from typing import Any, Dict, Optional, Union
from typing import Any, Dict, Iterable, Optional, Tuple, Union

from transformers import AutoTokenizer

from modelscope.metainfo import Preprocessors
from modelscope.models import Model
from modelscope.metainfo import Models, Preprocessors
from modelscope.outputs import OutputKeys
from modelscope.utils.constant import Fields, InputFields, ModeKeys
from modelscope.utils.hub import parse_label_mapping
from modelscope.utils.hub import get_model_type, parse_label_mapping
from modelscope.utils.type_assert import type_assert
from .base import Preprocessor
from .builder import PREPROCESSORS
@@ -17,8 +17,8 @@ from .builder import PREPROCESSORS
__all__ = [
'Tokenize', 'SequenceClassificationPreprocessor',
'TextGenerationPreprocessor', 'TokenClassificationPreprocessor',
'NLIPreprocessor', 'SentimentClassificationPreprocessor',
'FillMaskPreprocessor', 'SentenceSimilarityPreprocessor',
'PairSentenceClassificationPreprocessor',
'SingleSentenceClassificationPreprocessor', 'FillMaskPreprocessor',
'ZeroShotClassificationPreprocessor', 'NERPreprocessor',
'TextErrorCorrectionPreprocessor'
]
@@ -38,99 +38,6 @@ class Tokenize(Preprocessor):
return data


class NLPPreprocessorBase(Preprocessor):

def __init__(self, model_dir: str, *args, **kwargs):
"""preprocess the data via the vocab.txt from the `model_dir` path

Args:
model_dir (str): model path
"""

super().__init__(*args, **kwargs)
self.model_dir: str = model_dir
self.first_sequence: str = kwargs.pop('first_sequence',
'first_sequence')
self.second_sequence = kwargs.pop('second_sequence', 'second_sequence')
self.tokenize_kwargs = kwargs
self.tokenizer = self.build_tokenizer(model_dir)
self.label2id = parse_label_mapping(self.model_dir)

def build_tokenizer(self, model_dir):
from sofa import SbertTokenizer
return SbertTokenizer.from_pretrained(model_dir)

@type_assert(object, object)
def __call__(self, data: Union[str, tuple, Dict]) -> Dict[str, Any]:
"""process the raw input data

Args:
data (tuple): [sentence1, sentence2]
sentence1 (str): a sentence
Example:
'you are so handsome.'
sentence2 (str): a sentence
Example:
'you are so beautiful.'
Returns:
Dict[str, Any]: the preprocessed data
"""

text_a, text_b = None, None
if isinstance(data, str):
text_a = data
elif isinstance(data, tuple):
assert len(data) == 2
text_a, text_b = data
elif isinstance(data, dict):
text_a = data.get(self.first_sequence)
text_b = data.get(self.second_sequence, None)

rst = self.tokenizer(text_a, text_b, **self.tokenize_kwargs)
if self._mode == ModeKeys.TRAIN:
rst = {k: v.squeeze() for k, v in rst.items()}
if self.label2id is not None and 'label' in data:
rst['label'] = self.label2id[str(data['label'])]
return rst


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.nli_tokenizer)
class NLIPreprocessor(NLPPreprocessorBase):

def __init__(self, model_dir: str, *args, **kwargs):
kwargs['truncation'] = True
kwargs['padding'] = False
kwargs['return_tensors'] = 'pt'
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
super().__init__(model_dir, *args, **kwargs)


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.sen_cls_tokenizer)
class SentimentClassificationPreprocessor(NLPPreprocessorBase):

def __init__(self, model_dir: str, *args, **kwargs):
kwargs['truncation'] = True
kwargs['padding'] = 'max_length'
kwargs['return_tensors'] = 'pt'
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
super().__init__(model_dir, *args, **kwargs)


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.sen_sim_tokenizer)
class SentenceSimilarityPreprocessor(NLPPreprocessorBase):

def __init__(self, model_dir: str, *args, **kwargs):
kwargs['truncation'] = True
kwargs['padding'] = False if 'padding' not in kwargs else kwargs[
'padding']
kwargs['return_tensors'] = 'pt'
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
super().__init__(model_dir, *args, **kwargs)


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.bert_seq_cls_tokenizer)
class SequenceClassificationPreprocessor(Preprocessor):
@@ -197,32 +104,193 @@ class SequenceClassificationPreprocessor(Preprocessor):
return rst


class NLPTokenizerPreprocessorBase(Preprocessor):

def __init__(self, model_dir: str, pair: bool, mode: str, **kwargs):
"""preprocess the data via the vocab.txt from the `model_dir` path

Args:
model_dir (str): model path
"""

super().__init__(**kwargs)
self.model_dir: str = model_dir
self.first_sequence: str = kwargs.pop('first_sequence',
'first_sequence')
self.second_sequence = kwargs.pop('second_sequence', 'second_sequence')
self.pair = pair
self._mode = mode
self.label = kwargs.pop('label', OutputKeys.LABEL)
self.label2id = None
if 'label2id' in kwargs:
self.label2id = kwargs.pop('label2id')
if self.label2id is None:
self.label2id = parse_label_mapping(self.model_dir)

self.tokenize_kwargs = kwargs
self.tokenizer = self.build_tokenizer(model_dir)

@property
def id2label(self):
if self.label2id is not None:
return {id: label for label, id in self.label2id.items()}
return None

def build_tokenizer(self, model_dir):
model_type = get_model_type(model_dir)
if model_type in (Models.structbert, Models.gpt3, Models.palm):
from modelscope.models.nlp.structbert import SbertTokenizerFast
return SbertTokenizerFast.from_pretrained(model_dir)
elif model_type == Models.veco:
from modelscope.models.nlp.veco import VecoTokenizerFast
return VecoTokenizerFast.from_pretrained(model_dir)
else:
return AutoTokenizer.from_pretrained(model_dir)

def __call__(self, data: Union[str, Tuple, Dict]) -> Dict[str, Any]:
"""process the raw input data

Args:
data (tuple): [sentence1, sentence2]
sentence1 (str): a sentence
Example:
'you are so handsome.'
sentence2 (str): a sentence
Example:
'you are so beautiful.'
Returns:
Dict[str, Any]: the preprocessed data
"""

text_a, text_b, labels = self.parse_text_and_label(data)
output = self.tokenizer(
text_a,
text_b,
return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
**self.tokenize_kwargs)
self.labels_to_id(labels, output)
return output

def parse_text_and_label(self, data):
text_a, text_b, labels = None, None, None
if isinstance(data, str):
text_a = data
elif isinstance(data, tuple) or isinstance(data, list):
if len(data) == 3:
text_a, text_b, labels = data
elif len(data) == 2:
if self.pair:
text_a, text_b = data
else:
text_a, labels = data
elif isinstance(data, dict):
text_a = data.get(self.first_sequence)
text_b = data.get(self.second_sequence)
labels = data.get(self.label)

return text_a, text_b, labels

def labels_to_id(self, labels, output):

def label_can_be_mapped(label):
return isinstance(label, str) or isinstance(label, int)

if labels is not None:
if isinstance(labels, Iterable) and all([label_can_be_mapped(label) for label in labels]) \
and self.label2id is not None:
output[OutputKeys.LABEL] = [
self.label2id[str(label)] for label in labels
]
elif label_can_be_mapped(labels) and self.label2id is not None:
output[OutputKeys.LABEL] = self.label2id[str(labels)]
else:
output[OutputKeys.LABEL] = labels
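
A hedged usage sketch of the new tokenizer base class through one of the subclasses registered below; the model directory is a placeholder, and the dict keys follow the default first_sequence/second_sequence/label names:

from modelscope.preprocessors import PairSentenceClassificationPreprocessor
from modelscope.utils.constant import ModeKeys

# '<local-structbert-dir>' is a placeholder for a model directory holding the
# tokenizer files and, optionally, a label mapping.
preprocessor = PairSentenceClassificationPreprocessor(
    '<local-structbert-dir>', mode=ModeKeys.TRAIN)

# str, tuple/list and dict inputs are all accepted by parse_text_and_label;
# in train mode the label value is mapped through label2id and stored under
# OutputKeys.LABEL by labels_to_id.
features = preprocessor({
    'first_sequence': 'you are so handsome.',
    'second_sequence': 'you are so beautiful.',
    'label': '1',
})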


@PREPROCESSORS.register_module(
Fields.nlp, module_name='bert-seq-cls-tokenizer-finetune')
class SentenceSimilarityFinetunePreprocessor(SentenceSimilarityPreprocessor):
"""Sentence similarity preprocessor in the finetune scenario

Mainly added the label mapping procedure.
"""

def __init__(self, model_dir: str, *args, **kwargs):
kwargs['padding'] = 'max_length'
super().__init__(model_dir, *args, **kwargs)

@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.nli_tokenizer)
@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.sen_sim_tokenizer)
class PairSentenceClassificationPreprocessor(NLPTokenizerPreprocessorBase):

def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
kwargs['truncation'] = kwargs.get('truncation', True)
kwargs['padding'] = kwargs.get(
'padding', False if mode == 'inference' else 'max_length')
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
super().__init__(model_dir, pair=True, mode=mode, **kwargs)

@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.sen_cls_tokenizer)
class SingleSentenceClassificationPreprocessor(NLPTokenizerPreprocessorBase):

def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
kwargs['truncation'] = kwargs.get('truncation', True)
kwargs['padding'] = kwargs.get(
'padding', False if mode == 'inference' else 'max_length')
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
super().__init__(model_dir, pair=False, mode=mode, **kwargs)


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.zero_shot_cls_tokenizer)
class ZeroShotClassificationPreprocessor(NLPTokenizerPreprocessorBase):

def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
"""preprocess the data via the vocab.txt from the `model_dir` path

Args:
model_dir (str): model path
"""
self.sequence_length = kwargs.pop('sequence_length', 512)
super().__init__(model_dir, pair=False, mode=mode, **kwargs)

def __call__(self, data: Union[str, Dict], hypothesis_template: str,
candidate_labels: list) -> Dict[str, Any]:
"""process the raw input data

Args:
data (str or dict): a sentence
Example:
'you are so handsome.'

Returns:
Dict[str, Any]: the preprocessed data
"""
if isinstance(data, dict):
data = data.get(self.first_sequence)

pairs = [[data, hypothesis_template.format(label)]
for label in candidate_labels]

features = self.tokenizer(
pairs,
padding=True,
truncation=True,
max_length=self.sequence_length,
truncation_strategy='only_first',
return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None)
return features
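
For the zero-shot preprocessor, each candidate label is combined with the input sentence through the hypothesis template before tokenization. A hedged usage sketch with a placeholder model directory:

from modelscope.preprocessors import ZeroShotClassificationPreprocessor

preprocessor = ZeroShotClassificationPreprocessor('<local-structbert-dir>')
features = preprocessor(
    'you are so handsome.',
    hypothesis_template='This sentence is {}.',
    candidate_labels=['a compliment', 'an insult'])
# features holds a batch of two sentence pairs, padded to a common length.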


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.text_gen_tokenizer)
class TextGenerationPreprocessor(NLPPreprocessorBase):
class TextGenerationPreprocessor(NLPTokenizerPreprocessorBase):

def __init__(self, model_dir: str, tokenizer=None, *args, **kwargs):
def __init__(self,
model_dir: str,
tokenizer=None,
mode=ModeKeys.INFERENCE,
**kwargs):
self.tokenizer = self.build_tokenizer(
model_dir) if tokenizer is None else tokenizer
kwargs['truncation'] = True
kwargs['padding'] = True
kwargs['return_tensors'] = 'pt'
kwargs['return_token_type_ids'] = False
kwargs['truncation'] = kwargs.get('truncation', True)
kwargs['padding'] = kwargs.get('padding', True)
kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
False)
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
super().__init__(model_dir, *args, **kwargs)
super().__init__(model_dir, pair=False, mode=mode, **kwargs)

@staticmethod
def get_roberta_tokenizer_dir(model_dir: str) -> Optional[str]:
@@ -240,19 +308,13 @@ class TextGenerationPreprocessor(NLPPreprocessorBase):
roberta_tokenizer_dir, do_lower_case=False)
return super().build_tokenizer(model_dir)


@PREPROCESSORS.register_module(
Fields.nlp, module_name='palm-text-gen-tokenizer-finetune')
class TextGenerationFinetunePreprocessor(TextGenerationPreprocessor):

@type_assert(object, dict)
def __call__(self, data: dict) -> Dict[str, Any]:
def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
if self._mode == 'inference':
return super().__call__(data)
src_txt = data['src_txt']
tgt_txt = data['tgt_txt']
src_rst = super().__call__(src_txt)
tgt_rst = super().__call__(tgt_txt)
src_rst = {k: v.squeeze() for k, v in src_rst.items()}
tgt_rst = {k: v.squeeze() for k, v in tgt_rst.items()}

return {
'src': src_rst['input_ids'],
@@ -261,87 +323,69 @@ class TextGenerationFinetunePreprocessor(TextGenerationPreprocessor):
}


@PREPROCESSORS.register_module(Fields.nlp)
class FillMaskPreprocessor(NLPPreprocessorBase):
@PREPROCESSORS.register_module(Fields.nlp, module_name=Preprocessors.fill_mask)
class FillMaskPreprocessor(NLPTokenizerPreprocessorBase):

def __init__(self, model_dir: str, *args, **kwargs):
kwargs['truncation'] = True
kwargs['padding'] = 'max_length'
kwargs['return_tensors'] = 'pt'
def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
kwargs['truncation'] = kwargs.get('truncation', True)
kwargs['padding'] = kwargs.get('padding', 'max_length')
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
kwargs['return_token_type_ids'] = True
super().__init__(model_dir, *args, **kwargs)

def build_tokenizer(self, model_dir):
from modelscope.utils.hub import get_model_type
model_type = get_model_type(model_dir)
if model_type in ['sbert', 'structbert', 'bert']:
from sofa import SbertTokenizer
return SbertTokenizer.from_pretrained(model_dir, use_fast=False)
elif model_type == 'veco':
from sofa import VecoTokenizer
return VecoTokenizer.from_pretrained(model_dir, use_fast=False)
else:
# TODO Only support veco & sbert
raise RuntimeError(f'Unsupported model type: {model_type}')
kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
True)
super().__init__(model_dir, pair=False, mode=mode, **kwargs)


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.token_cls_tokenizer)
class TokenClassificationPreprocessor(NLPPreprocessorBase):

def __init__(self, model_dir: str, *args, **kwargs):
super().__init__(model_dir, *args, **kwargs)

@type_assert(object, str)
def __call__(self, data: Union[str, Dict]) -> Dict[str, Any]:
"""process the raw input data
Fields.nlp,
module_name=Preprocessors.word_segment_text_to_label_preprocessor)
class WordSegmentationBlankSetToLabelPreprocessor(Preprocessor):

Args:
data (str): a sentence
Example:
'you are so handsome.'

Returns:
Dict[str, Any]: the preprocessed data
"""

# preprocess the data for the model input
if isinstance(data, dict):
data = data[self.first_sequence]
text = data.replace(' ', '').strip()
tokens = []
for token in text:
token = self.tokenizer.tokenize(token)
tokens.extend(token)
input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
input_ids = self.tokenizer.build_inputs_with_special_tokens(input_ids)
attention_mask = [1] * len(input_ids)
token_type_ids = [0] * len(input_ids)
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.first_sequence: str = kwargs.pop('first_sequence',
'first_sequence')
self.label = kwargs.pop('label', OutputKeys.LABELS)

def __call__(self, data: str) -> Union[Dict[str, Any], Tuple]:
data = data.split(' ')
data = list(filter(lambda x: len(x) > 0, data))

def produce_train_sample(words):
chars = []
labels = []
for word in words:
chars.extend(list(word))
if len(word) == 1:
labels.append('S-CWS')
else:
labels.extend(['B-CWS'] + ['I-CWS'] * (len(word) - 2)
+ ['E-CWS'])
assert len(chars) == len(labels)
return chars, labels

chars, labels = produce_train_sample(data)
return {
'text': text,
'input_ids': input_ids,
'attention_mask': attention_mask,
'token_type_ids': token_type_ids
self.first_sequence: chars,
self.label: labels,
}


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.zero_shot_cls_tokenizer)
class ZeroShotClassificationPreprocessor(NLPPreprocessorBase):

def __init__(self, model_dir: str, *args, **kwargs):
"""preprocess the data via the vocab.txt from the `model_dir` path
Fields.nlp, module_name=Preprocessors.token_cls_tokenizer)
class TokenClassificationPreprocessor(NLPTokenizerPreprocessorBase):

Args:
model_dir (str): model path
"""
self.sequence_length = kwargs.pop('sequence_length', 512)
super().__init__(model_dir, *args, **kwargs)
def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
kwargs['truncation'] = kwargs.get('truncation', True)
kwargs['padding'] = kwargs.get(
'padding', False if mode == ModeKeys.INFERENCE else 'max_length')
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
kwargs['is_split_into_words'] = kwargs.pop(
'is_split_into_words',
False if mode == ModeKeys.INFERENCE else True)
self.label_all_tokens = kwargs.pop('label_all_tokens', False)
super().__init__(model_dir, pair=False, mode=mode, **kwargs)

@type_assert(object, str)
def __call__(self, data, hypothesis_template: str,
candidate_labels: list) -> Dict[str, Any]:
def __call__(self, data: Union[str, Dict]) -> Dict[str, Any]:
"""process the raw input data

Args:
@@ -352,20 +396,74 @@ class ZeroShotClassificationPreprocessor(NLPPreprocessorBase):
Returns:
Dict[str, Any]: the preprocessed data
"""
if isinstance(data, dict):
data = data.get(self.first_sequence)

pairs = [[data, hypothesis_template.format(label)]
for label in candidate_labels]

features = self.tokenizer(
pairs,
padding=True,
truncation=True,
max_length=self.sequence_length,
return_tensors='pt',
truncation_strategy='only_first')
return features
# preprocess the data for the model input
# if isinstance(data, dict):
# data = data[self.first_sequence]
# text = data.replace(' ', '').strip()
# tokens = []
# for token in text:
# token = self.tokenizer.tokenize(token)
# tokens.extend(token)
# input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
# input_ids = self.tokenizer.build_inputs_with_special_tokens(input_ids)
# attention_mask = [1] * len(input_ids)
# token_type_ids = [0] * len(input_ids)

# new code to deal with labels
# tokenized_inputs = self.tokenizer(data, truncation=True, is_split_into_words=True)

text_a = None
labels_list = None
if isinstance(data, str):
text_a = data
elif isinstance(data, dict):
text_a = data.get(self.first_sequence)
labels_list = data.get(self.label)
tokenized_inputs = self.tokenizer(
text_a,
return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
**self.tokenize_kwargs)

if labels_list is not None:
assert self.label2id is not None
# Map that sends B-Xxx label to its I-Xxx counterpart
b_to_i_label = []
label_enumerate_values = [
k for k, v in sorted(
self.label2id.items(), key=lambda item: item[1])
]
for idx, label in enumerate(label_enumerate_values):
if label.startswith('B-') and label.replace(
'B-', 'I-') in label_enumerate_values:
b_to_i_label.append(
label_enumerate_values.index(
label.replace('B-', 'I-')))
else:
b_to_i_label.append(idx)

label_row = [self.label2id[lb] for lb in labels_list]
word_ids = tokenized_inputs.word_ids()
previous_word_idx = None
label_ids = []
for word_idx in word_ids:
if word_idx is None:
label_ids.append(-100)
elif word_idx != previous_word_idx:
label_ids.append(label_row[word_idx])
else:
if self.label_all_tokens:
label_ids.append(b_to_i_label[label_row[word_idx]])
else:
label_ids.append(-100)
previous_word_idx = word_idx
labels = label_ids
tokenized_inputs['labels'] = labels
# new code end

if self._mode == ModeKeys.INFERENCE:
tokenized_inputs[OutputKeys.TEXT] = text_a
return tokenized_inputs
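
The alignment loop above spreads word-level labels onto sub-token positions and masks everything else with -100 so it is ignored by the loss. A toy illustration with label_all_tokens=False (the word_ids sequence is made up):

# label ids for three words, and a word_ids() result from a fast tokenizer
label_row = [0, 1, 2]
word_ids = [None, 0, 0, 1, 2, 2, None]
label_ids, previous = [], None
for word_idx in word_ids:
    if word_idx is None:
        label_ids.append(-100)               # special tokens
    elif word_idx != previous:
        label_ids.append(label_row[word_idx])
    else:
        label_ids.append(-100)               # later sub-tokens of the same word
    previous = word_idx
print(label_ids)                             # [-100, 0, -100, 1, 2, -100, -100]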


@PREPROCESSORS.register_module(


+ 1
- 1
modelscope/preprocessors/space/dialog_state_tracking_preprocessor.py View File

@@ -24,7 +24,7 @@ class DialogStateTrackingPreprocessor(Preprocessor):
"""
super().__init__(*args, **kwargs)

from sofa.models.space import SpaceConfig, SpaceTokenizer
from modelscope.models.nlp.space import SpaceConfig, SpaceTokenizer
self.model_dir: str = model_dir
self.config = SpaceConfig.from_pretrained(self.model_dir)
self.tokenizer = SpaceTokenizer.from_pretrained(self.model_dir)


+ 2
- 0
modelscope/task_datasets/__init__.py View File

@@ -7,12 +7,14 @@ if TYPE_CHECKING:
from .base import TaskDataset
from .builder import TASK_DATASETS, build_task_dataset
from .torch_base_dataset import TorchTaskDataset
from .veco_dataset import VecoDataset

else:
_import_structure = {
'base': ['TaskDataset'],
'builder': ['TASK_DATASETS', 'build_task_dataset'],
'torch_base_dataset': ['TorchTaskDataset'],
'veco_dataset': ['VecoDataset'],
}
import sys



+ 3
- 3
modelscope/task_datasets/base.py View File

@@ -1,6 +1,6 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from abc import ABC, abstractmethod
from typing import Any, List, Tuple
from typing import Any, List, Tuple, Union


class TaskDataset(ABC):
@@ -8,7 +8,7 @@ class TaskDataset(ABC):
"""

def __init__(self,
datasets: Tuple[Any, List[Any]],
datasets: Union[Any, List[Any]],
mode,
preprocessor=None,
**kwargs):
@@ -18,7 +18,7 @@ class TaskDataset(ABC):
self._inner_dataset = self.prepare_dataset(datasets)

@abstractmethod
def prepare_dataset(self, datasets: Tuple[Any, List[Any]]) -> Any:
def prepare_dataset(self, datasets: Union[Any, List[Any]]) -> Any:
"""Prepare a dataset.

Users can process the input datasets from a whole-dataset perspective.


+ 3
- 3
modelscope/task_datasets/torch_base_dataset.py View File

@@ -1,5 +1,5 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from typing import Any, List, Tuple
from typing import Any, List, Tuple, Union

from torch.utils.data import ConcatDataset, Dataset

@@ -14,7 +14,7 @@ class TorchTaskDataset(TaskDataset, Dataset):
"""

def __init__(self,
datasets: Tuple[Any, List[Any]],
datasets: Union[Any, List[Any]],
mode,
preprocessor=None,
**kwargs):
@@ -26,7 +26,7 @@ class TorchTaskDataset(TaskDataset, Dataset):
def __len__(self):
return len(self._inner_dataset)

def prepare_dataset(self, datasets: Tuple[Any, List[Any]]) -> Any:
def prepare_dataset(self, datasets: Union[Any, List[Any]]) -> Any:
"""Prepare a dataset.

Users can process the input datasets from a whole-dataset perspective.


+ 76
- 0
modelscope/task_datasets/veco_dataset.py View File

@@ -0,0 +1,76 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from typing import Any, List, Union

import numpy as np
from datasets import Dataset, IterableDataset, concatenate_datasets

from modelscope.metainfo import Models
from modelscope.utils.constant import Tasks
from .builder import TASK_DATASETS
from .torch_base_dataset import TorchTaskDataset


@TASK_DATASETS.register_module(module_name=Models.veco, group_key=Tasks.nli)
class VecoDataset(TorchTaskDataset):

def __init__(self,
datasets: Union[Any, List[Any]],
mode,
preprocessor=None,
**kwargs):
self.seed = kwargs.get('seed', 42)
self.permutation = None
self.datasets = None
super().__init__(datasets, mode, preprocessor, **kwargs)

def switch_dataset(self, idx):
"""Switch dataset in evaluation.

Veco evaluates its datasets one by one.

Args:
idx: The index of the dataset
"""
if self.mode == 'train':
raise ValueError(
'Switching datasets is only supported in the evaluation loop')
if idx >= len(self.datasets):
raise ValueError(
'Index is larger than the number of datasets.')
self._inner_dataset = self.datasets[idx]

def __getitem__(self, item):
if self.permutation is not None:
item = self.permutation[item]
return super().__getitem__(item)

def prepare_dataset(self, datasets: Union[Any, List[Any]]) -> Any:
"""Compose all the datasets.

If the mode is 'train', all datasets will be mixed together; if the mode is 'eval',
the datasets will be kept separate and the first one will be returned.

Args:
datasets: The datasets to be composed.

Returns: The final dataset.
"""
if not isinstance(datasets, (list, tuple)):
datasets = [datasets]
if self.mode == 'train':
if len(datasets) == 1:
return datasets[0]
elif all([
isinstance(dataset, (Dataset, IterableDataset))
for dataset in datasets
]):
dataset = concatenate_datasets(list(datasets))
return dataset.shuffle(seed=self.seed)
else:
generator = np.random.default_rng(self.seed)
_len = sum([len(dataset) for dataset in datasets])
self.permutation = generator.permutation(_len)
return super().prepare_dataset(datasets)
else:
self.datasets = datasets
return self.datasets[0]
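
A hedged sketch of how VecoDataset treats multiple splits: in train mode Hugging Face datasets are concatenated and shuffled, in eval mode they are kept separate and switched one by one. The toy datasets stand in for real NLI subsets:

from datasets import Dataset
from modelscope.task_datasets import VecoDataset
from modelscope.utils.constant import ModeKeys

ds_en = Dataset.from_dict({'premise': ['a'], 'hypothesis': ['b'], 'label': [0]})
ds_fr = Dataset.from_dict({'premise': ['c'], 'hypothesis': ['d'], 'label': [1]})

train_ds = VecoDataset([ds_en, ds_fr], mode=ModeKeys.TRAIN)  # concatenated + shuffled
eval_ds = VecoDataset([ds_en, ds_fr], mode=ModeKeys.EVAL)    # starts on the first split
eval_ds.switch_dataset(1)                                    # move to the second split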

+ 1
- 0
modelscope/trainers/__init__.py View File

@@ -4,4 +4,5 @@ from .cv import (ImageInstanceSegmentationTrainer,
ImagePortraitEnhancementTrainer)
from .multi_modal import CLIPTrainer
from .nlp import SequenceClassificationTrainer
from .nlp_trainer import NlpEpochBasedTrainer, VecoTrainer
from .trainer import EpochBasedTrainer

+ 1
- 0
modelscope/trainers/hooks/evaluation_hook.py View File

@@ -32,6 +32,7 @@ class EvaluationHook(Hook):
def do_evaluate(self, trainer):
"""Evaluate the results."""
eval_res = trainer.evaluate()
trainer.data_loader = trainer.train_dataloader
for name, val in eval_res.items():
trainer.log_buffer.output[name] = val



+ 5
- 3
modelscope/trainers/hooks/lr_scheduler_hook.py View File

@@ -21,9 +21,6 @@ class LrSchedulerHook(Hook):
def __init__(self, by_epoch=True, warmup=None) -> None:
super().__init__()
self.by_epoch = by_epoch
if not self.by_epoch:
raise ValueError('We only support ``by_epoch=True`` now!')

self.warmup = warmup
self.warmup_lr_scheduler = None

@@ -49,6 +46,11 @@ class LrSchedulerHook(Hook):
return lr

def before_train_iter(self, trainer):
if not self.by_epoch:
if self.warmup_lr_scheduler is not None:
self.warmup_lr_scheduler.step()
else:
trainer.lr_scheduler.step()
trainer.log_buffer.output[LogKeys.LR] = self._get_log_lr(trainer)

def before_train_epoch(self, trainer):
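
With by_epoch=False the hook now steps the (warmup) scheduler on every iteration instead of once per epoch. A generic PyTorch sketch of iteration-level stepping with a linear warmup; the optimizer and schedule are illustrative, not taken from this diff:

import torch

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / 100))  # linear warmup

for step in range(300):
    optimizer.step()
    scheduler.step()  # stepped every training iteration, not every epoch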


+ 192
- 0
modelscope/trainers/nlp_trainer.py View File

@@ -0,0 +1,192 @@
import os
from typing import Callable, Dict, Optional, Tuple, Union

import torch
from torch import nn
from torch.utils.data import Dataset

from modelscope.hub.snapshot_download import snapshot_download
from modelscope.metrics.builder import build_metric
from modelscope.models.base import Model, TorchModel
from modelscope.msdatasets import MsDataset
from modelscope.preprocessors import Preprocessor, build_preprocessor
from modelscope.utils.config import Config, ConfigDict
from modelscope.utils.constant import (DEFAULT_MODEL_REVISION, ModeKeys,
ModelFile, Tasks)
from .base import TRAINERS
from .trainer import EpochBasedTrainer


@TRAINERS.register_module(module_name='NlpEpochBasedTrainer')
class NlpEpochBasedTrainer(EpochBasedTrainer):

def __init__(
self,
model: Optional[Union[TorchModel, nn.Module, str]] = None,
cfg_file: Optional[str] = None,
cfg_modify_fn: Optional[Callable] = None,
arg_parse_fn: Optional[Callable] = None,
data_collator: Optional[Callable] = None,
train_dataset: Optional[Union[MsDataset, Dataset]] = None,
eval_dataset: Optional[Union[MsDataset, Dataset]] = None,
preprocessor: Optional[Preprocessor] = None,
optimizers: Tuple[torch.optim.Optimizer,
torch.optim.lr_scheduler._LRScheduler] = (None,
None),
model_revision: Optional[str] = DEFAULT_MODEL_REVISION,
**kwargs):
"""Add code to adapt with nlp models.

Args:
cfg_modify_fn: A function used to modify the cfg read from the configuration file.
"""

if isinstance(model, str):
if os.path.exists(model):
model_dir = model if os.path.isdir(model) else os.path.dirname(
model)
else:
model_dir = snapshot_download(model, revision=model_revision)
cfg_file = os.path.join(model_dir, ModelFile.CONFIGURATION)
else:
assert cfg_file is not None, 'Config file should not be None if model is an nn.Module class'
model_dir = os.path.dirname(cfg_file)

self.cfg_modify_fn = cfg_modify_fn
self.cfg = self.rebuild_config(Config.from_file(cfg_file))
try:
labels = self.cfg.dataset.train.labels
except AttributeError:
labels = None

self.label2id = None
self.num_labels = None
if labels is not None and len(labels) > 0:
self.label2id = {label: idx for idx, label in enumerate(labels)}
self.id2label = {idx: label for idx, label in enumerate(labels)}
self.num_labels = len(labels)

def build_dataset_keys(cfg):
if cfg is not None:
input_keys = {
'first_sequence': getattr(cfg, 'first_sequence', None),
'second_sequence': getattr(cfg, 'second_sequence', None),
'label': getattr(cfg, 'label', None),
}
else:
input_keys = {}

return {k: v for k, v in input_keys.items() if v is not None}

self.train_keys = build_dataset_keys(
self.cfg.dataset.train if hasattr(self.cfg, 'dataset')
and hasattr(self.cfg.dataset, 'train') else None)
# TODO eval may have special keys, which are not supported yet,
# because there is only one preprocessor in the trainer and it only supports one group of keys.
self.eval_keys = self.train_keys

super().__init__(
model=model_dir,
cfg_file=cfg_file,
arg_parse_fn=arg_parse_fn,
data_collator=data_collator,
preprocessor=preprocessor,
optimizers=optimizers,
model_revision=model_revision,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
**kwargs)

def rebuild_config(self, cfg: Config):
if self.cfg_modify_fn is not None:
return self.cfg_modify_fn(cfg)
return cfg

def build_model(self) -> Union[nn.Module, TorchModel]:
""" Instantiate a pytorch model and return.

By default, we will create a model using config from configuration file. You can
override this method in a subclass.

"""
model_args = {} if self.num_labels is None else {
'num_labels': self.num_labels
}
model = Model.from_pretrained(
self.model_dir, cfg_dict=self.cfg, **model_args)
if not isinstance(model, nn.Module) and hasattr(model, 'model'):
return model.model
elif isinstance(model, nn.Module):
return model

def build_preprocessor(self) -> Preprocessor:
"""Build the preprocessor.

Users can override this method to implement custom logic.

Returns: The preprocessor instance.

"""
model_args = {} if self.label2id is None else {
'label2id': self.label2id
}
cfg = ConfigDict({
**getattr(self.cfg, 'preprocessor'),
'model_dir':
self.model_dir,
**model_args,
'mode':
ModeKeys.TRAIN,
**self.train_keys,
})
return build_preprocessor(cfg, Tasks.find_field_by_task(self.cfg.task))
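
A hedged setup sketch for the new trainer. The model id, label names and the config fields touched in cfg_modify_fn are placeholders, and the fn assumes the model's configuration already contains a dataset.train section:

from modelscope.trainers import NlpEpochBasedTrainer

def cfg_modify_fn(cfg):
    # Illustrative only: fill in the fields read by __init__ above so that
    # label2id/num_labels can be built and passed to preprocessor and model.
    cfg.dataset.train.labels = ['0', '1']
    cfg.dataset.train.first_sequence = 'sentence1'
    cfg.dataset.train.second_sequence = 'sentence2'
    cfg.dataset.train.label = 'label'
    return cfg

trainer = NlpEpochBasedTrainer(
    model='<structbert-model-id-or-local-dir>',  # placeholder
    cfg_modify_fn=cfg_modify_fn,
    train_dataset=None,  # replace with an MsDataset or torch Dataset
    eval_dataset=None)
# trainer.train() would then run the finetune loop configured above.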


@TRAINERS.register_module(module_name='VecoTrainer')
class VecoTrainer(NlpEpochBasedTrainer):

def evaluate(self, checkpoint_path=None):
"""Veco evaluates the datasets one by one.

"""
from modelscope.task_datasets import VecoDataset
self.model.eval()
self._mode = ModeKeys.EVAL
metric_values = {}

if self.eval_dataset is None:
val_data = self.cfg.dataset.val
self.eval_dataset = self.build_dataset(
val_data, mode=ModeKeys.EVAL)

idx = 0
dataset_cnt = 1
if isinstance(self.eval_dataset, VecoDataset):
self.eval_dataset.switch_dataset(idx)
dataset_cnt = len(self.eval_dataset.datasets)

while True:
self.eval_dataloader = self._build_dataloader_with_dataset(
self.eval_dataset, **self.cfg.evaluation.get('dataloader', {}))
self.data_loader = self.eval_dataloader

metric_classes = [
build_metric(metric, default_args={'trainer': self})
for metric in self.metrics
]
self.evaluation_loop(self.eval_dataloader, checkpoint_path,
metric_classes)

for m_idx, metric_cls in enumerate(metric_classes):
if f'eval_dataset[{idx}]' not in metric_values:
metric_values[f'eval_dataset[{idx}]'] = {}
metric_values[f'eval_dataset[{idx}]'][
self.metrics[m_idx]] = metric_cls.evaluate()

idx += 1
if idx < dataset_cnt:
self.eval_dataset.switch_dataset(idx)
else:
break

return metric_values
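
A hedged usage sketch: VecoTrainer.evaluate() walks the validation splits one by one and returns one metric dict per split, keyed eval_dataset[idx]. The model id is a placeholder, and the model's configuration is assumed to define dataset.val:

from modelscope.trainers import VecoTrainer

trainer = VecoTrainer(model='<veco-model-id-or-local-dir>')
results = trainer.evaluate()
for split_name, metrics in results.items():
    print(split_name, metrics)  # e.g. eval_dataset[0] -> per-metric results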

+ 34
- 24
modelscope/trainers/trainer.py View File

@@ -22,7 +22,8 @@ from modelscope.models.base import Model, TorchModel
from modelscope.msdatasets.ms_dataset import MsDataset
from modelscope.preprocessors import build_preprocessor
from modelscope.preprocessors.base import Preprocessor
from modelscope.task_datasets import TorchTaskDataset, build_task_dataset
from modelscope.task_datasets.builder import build_task_dataset
from modelscope.task_datasets.torch_base_dataset import TorchTaskDataset
from modelscope.trainers.hooks.builder import HOOKS
from modelscope.trainers.hooks.priority import Priority, get_priority
from modelscope.trainers.lrscheduler.builder import build_lr_scheduler
@@ -30,12 +31,12 @@ from modelscope.trainers.optimizer.builder import build_optimizer
from modelscope.utils.config import Config, ConfigDict
from modelscope.utils.constant import (DEFAULT_MODEL_REVISION, Hubs, ModeKeys,
ModelFile, Tasks, TrainerStages)
from modelscope.utils.file_utils import func_receive_dict_inputs
from modelscope.utils.logger import get_logger
from modelscope.utils.registry import build_from_cfg
from modelscope.utils.tensor_utils import torch_default_data_collator
from modelscope.utils.torch_utils import (broadcast, create_device,
get_dist_info, init_dist)
from modelscope.utils.utils import if_func_receive_dict_inputs
from .base import BaseTrainer
from .builder import TRAINERS
from .default_config import DEFAULT_CONFIG
@@ -87,6 +88,7 @@ class EpochBasedTrainer(BaseTrainer):
None),
model_revision: Optional[str] = DEFAULT_MODEL_REVISION,
**kwargs):

if isinstance(model, str):
if os.path.exists(model):
self.model_dir = model if os.path.isdir(
@@ -108,9 +110,9 @@ class EpochBasedTrainer(BaseTrainer):
self.model = model

super().__init__(cfg_file, arg_parse_fn)

# add default config
self.cfg.merge_from_dict(self._get_default_config(), force=False)
self.cfg = self.rebuild_config(self.cfg)

if 'work_dir' in kwargs:
self.work_dir = kwargs['work_dir']
@@ -130,9 +132,9 @@ class EpochBasedTrainer(BaseTrainer):
self.device = create_device(device_name == 'cpu')

self.train_dataset = self.to_task_dataset(
train_dataset, mode='train', preprocessor=self.preprocessor)
train_dataset, mode=ModeKeys.TRAIN, preprocessor=self.preprocessor)
self.eval_dataset = self.to_task_dataset(
eval_dataset, mode='eval', preprocessor=self.preprocessor)
eval_dataset, mode=ModeKeys.EVAL, preprocessor=self.preprocessor)

self.data_collator = data_collator if data_collator is not None else torch_default_data_collator
self.metrics = self.get_metrics()
@@ -168,6 +170,14 @@ class EpochBasedTrainer(BaseTrainer):
if not is_parallel(self.model) and self._dist:
self.model = self.to_parallel(self.model)

def rebuild_config(self, cfg: Config):
"""A method used to rebuild the config, any subclass can override this method.

Returns: The rebuilt config

"""
return cfg

@property
def mode(self):
return self._mode
@@ -203,7 +213,7 @@ class EpochBasedTrainer(BaseTrainer):
return self._max_epochs * len(self.data_loader)

def to_task_dataset(self,
datasets: Tuple[Dataset, List[Dataset]],
datasets: Union[Dataset, List[Dataset]],
mode: str,
preprocessor: Optional[Preprocessor] = None):
"""Build the task specific dataset processor for this trainer.
@@ -229,17 +239,13 @@ class EpochBasedTrainer(BaseTrainer):
cfg = ConfigDict(
type=self.cfg.task, mode=mode, datasets=datasets)
return build_task_dataset(cfg, self.cfg.task)
elif isinstance(datasets,
Dataset) or (isinstance(datasets, List)
and isinstance(datasets[0], Dataset)):
else:
cfg = ConfigDict(
type=self.cfg.model.type, mode=mode, datasets=datasets)
type=self.cfg.model.type,
mode=mode,
datasets=datasets,
preprocessor=preprocessor)
return build_task_dataset(cfg, self.cfg.task)
else:
raise ValueError(
f'invalid datasets type: {type(datasets)}, '
f'expected `MsDataset`, `torch.utils.data.Dataset` or list of them.'
)
except Exception:
if isinstance(datasets, (List, Tuple)) or preprocessor is not None:
return TorchTaskDataset(
@@ -262,8 +268,11 @@ class EpochBasedTrainer(BaseTrainer):
# TODO @wenmeng.zwm @jiangnana.jnn add support for different preprocessor
# when they are different ones in training and evaluation
cfg = ConfigDict({
**getattr(self.cfg, 'preprocessor'), 'model_dir':
self.model_dir
**getattr(self.cfg, 'preprocessor'),
'model_dir':
self.model_dir,
'mode':
ModeKeys.TRAIN,
})
return build_preprocessor(cfg, Tasks.find_field_by_task(self.cfg.task))

@@ -324,6 +333,8 @@ class EpochBasedTrainer(BaseTrainer):
**self.cfg.evaluation.get('dataloader', {}))
self.data_loader = self.eval_dataloader
metric_classes = [build_metric(metric) for metric in self.metrics]
for m in metric_classes:
m.trainer = self
metric_values = self.evaluation_loop(self.eval_dataloader,
checkpoint_path, metric_classes)

@@ -338,10 +349,9 @@ class EpochBasedTrainer(BaseTrainer):
""" Instantiate a pytorch model and return.

By default, we will create a model using config from configuration file. You can
subclass and override this method in a subclass.
override this method in a subclass.

"""
# TODO temp implementation, waiting for @zhangzhicheng
model = Model.from_pretrained(self.model_dir)
if not isinstance(model, nn.Module) and hasattr(model, 'model'):
return model.model
@@ -412,9 +422,8 @@ class EpochBasedTrainer(BaseTrainer):
self._mode = ModeKeys.TRAIN
inputs = self.collate_fn(inputs)
# call model forward but not __call__ to skip postprocess
if isinstance(
inputs,
Mapping) and not if_func_receive_dict_inputs(model.forward):
if isinstance(inputs,
Mapping) and not func_receive_dict_inputs(model.forward):
train_outputs = model.forward(**inputs)
else:
train_outputs = model.forward(inputs)
@@ -495,7 +504,7 @@ class EpochBasedTrainer(BaseTrainer):
if self.eval_dataset is None:
val_data = self.cfg.dataset.val
self.eval_dataset = self.build_dataset(
val_data, mode=ModeKeys.TRAIN)
val_data, mode=ModeKeys.EVAL)

batch_size = self.cfg.evaluation.batch_size
workers = self.cfg.evaluation.workers
@@ -523,7 +532,8 @@ class EpochBasedTrainer(BaseTrainer):
)
torch_dataset = dataset.to_torch_dataset(
preprocessors=self.preprocessor, )
return torch_dataset
dataset = self.to_task_dataset(torch_dataset, mode)
return dataset

def create_optimizer_and_scheduler(self):
""" Create optimizer and lr scheduler


+ 17
- 14
modelscope/trainers/utils/inference.py View File

@@ -10,9 +10,9 @@ import torch
from torch import distributed as dist
from tqdm import tqdm

from modelscope.utils.file_utils import func_receive_dict_inputs
from modelscope.utils.torch_utils import (broadcast, get_dist_info, is_master,
make_tmp_dir)
from modelscope.utils.utils import if_func_receive_dict_inputs


def single_gpu_test(model,
@@ -37,18 +37,19 @@ def single_gpu_test(model,
if data_collate_fn is not None:
data = data_collate_fn(data)
with torch.no_grad():
if isinstance(data,
Mapping) and not if_func_receive_dict_inputs(
model.forward):

result = model(**data)
if isinstance(data, Mapping) and not func_receive_dict_inputs(
model.forward):
result = model.forward(**data)
else:
result = model(data)
result = model.forward(data)
if metric_classes is not None:
for metric_cls in metric_classes:
metric_cls.add(result, data)

batch_size = len(result)
if isinstance(data, dict):
batch_size = len(next(iter(data.values())))
else:
batch_size = len(data)
for _ in range(batch_size):
pbar.update()

@@ -101,16 +102,18 @@ def multi_gpu_test(model,
data = data_collate_fn(data)
data_list.append(data)
with torch.no_grad():
if isinstance(data,
Mapping) and not if_func_receive_dict_inputs(
model.forward):
result = model(**data)
if isinstance(data, Mapping) and not func_receive_dict_inputs(
model.forward):
result = model.forward(**data)
else:
result = model(data)
result = model.forward(data)
results.append(result)

if rank == 0:
batch_size = len(result)
if isinstance(data, dict):
batch_size = len(next(iter(data.values())))
else:
batch_size = len(data)
batch_size_all = batch_size * world_size
count += batch_size_all
if count > len(dataset):
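
The batch-size change above matters for dict batches, where len(data) would only count the keys; the size is now read from the leading dimension of the first tensor. A small illustration:

import torch

batch = {'input_ids': torch.zeros(8, 128), 'attention_mask': torch.ones(8, 128)}
print(len(batch))                       # 2, just the number of keys
print(len(next(iter(batch.values()))))  # 8, the actual batch size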


Some files were not shown because too many files changed in this diff
