1. Add sbert, veco, palm, and space source code.
2. Support sbert sequence-classification and token-classification finetuning.
3. Support veco sequence-classification finetuning.
4. Support palm NLG finetuning.
5. Add unit tests for the finetunes.
6. Add veco's task-dataset processor.
7. Add a common trainer for NLP and a specific trainer for veco.
8. Merge duplicated code across models, preprocessors, and pipelines.

Evaluation results: https://sheet.alibaba-inc.com/#/sheet/f7fdcc7f22bd5105 (sheet: Maas)
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9574105
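The finetune entry point for items 2-4 and 7 is the trainer registry. Below is a minimal sketch of kicking off one of these finetunes; the trainer name `nlp-base-trainer`, the dataset id, and the model id are illustrative assumptions, not values taken from this change list:

```python
# Hedged sketch: assumes the common NLP trainer from item 7 is registered
# as 'nlp-base-trainer' and that the dataset/model ids below exist.
from modelscope.msdatasets import MsDataset
from modelscope.trainers import build_trainer

train_dataset = MsDataset.load('afqmc_small', split='train')      # hypothetical dataset id
eval_dataset = MsDataset.load('afqmc_small', split='validation')  # hypothetical dataset id

kwargs = dict(
    model='damo/nlp_structbert_sentence-similarity_chinese-base',  # hypothetical model id
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    work_dir='/tmp/finetune_work_dir')

trainer = build_trainer('nlp-base-trainer', default_args=kwargs)
trainer.train()
print(trainer.evaluate())
```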
* add basic class of hook&metrics
* pre-commit passed
* change some comments
* pre-commit passed
* 1. remove accuracy's groups 2. remove useless hooks 3. simplify priorities
* pre-commit passed
* fix a comment
* Merge branch 'master' into finetune_hooks_metrics
# Conflicts:
# modelscope/metainfo.py
* pre-commit passed
* add basic class of hook&metrics
* pre-commit passed
* change some comments
* pre-commit passed
* 1. remove accuracy's groups 2. remove useless hooks 3. simplify priorities
* pre-commit passed
* fix a comment
* Merge branch 'feat/finetune' of gitlab.alibaba-inc.com:Ali-MaaS/MaaS-lib into feat/finetune
* mv hooks related to modelscope/trainers/hooks
* mv priority back
* add torch model base and test
* update hooks, trainer, import_util
* add torch epoch-based trainer and dist utils
* add hooks
* fix warmup
* format code style, fix warmup, and add warmup unittest
* fix impls
* pre-commit check passed
* update hook and add EpochBasedTrainer
* add trainer unittest
* Merge branch 'feat/add_hooks' into feat/add_task
# Conflicts:
# modelscope/models/base_torch.py
# modelscope/trainers/hooks/hook.py
# modelscope/trainers/trainer.py
* update unittest name
* rewrite taskdataset to trainer
* fix trainer and add unittest
* add unittest
* code: run to forward
* run through... but ugly code
* arrange some cls
* fix some errs
* revert some mistakes
* init check in
* Merge branch 'feat/add_hooks' into feat/add_task
# Conflicts:
# modelscope/trainers/trainer.py
* test with bigger epoch and size
* add the default metrics class
* move build metrics code to a method
* merge add_task
* merge origin add_task
* add device initialization
* remove preprocessor arg for bool
* add task models
* move metric collect logic to metrics class
* pre-commit passed
* fix cr comments
* precommit passed
* add task models
* Merge remote-tracking branch 'origin/feat/add_task' into feat/backbone_head
* add comment
* change comment formats.
* fix comments
* fix ut bug
* fix comments
* add wrapper check
* fix comments
* pre-commit passed
* fix cr comments
* solve a loop import problem
* fix ut bug
* fix ut errors
* change dummydataset to msdataset
* precommit passed
* merge add task
* backbone-head is built, model is not correctly loaded
* model load states matched
* result matched
* lint
* add veco/palm_v2 code
* merge master
* merge master success running
* add repr model name level
* Merge branch 'feat/veco_palm' into feat/finetune_sbert_veco
* model test for training
* add token-classification metric add formal ut
* fix running bug
* finetune and pipeline are working with backbone-head
* add nli
* add missing code
* finetune and pipeline are working with backbone-head
* Merge branch 'feat/backbone_head' of http://gitlab.alibaba-inc.com/Ali-MaaS/MaaS-lib into feat/backbone_head
* add a test repo for pr
* remove merge conflicted file
* remove merge conflicted file 1
* lint check
* import error
* none type bug fix
* forward input unpacking or dict bug
* move head into models, add build_backbone with registry, no base method
* merge master
* feat: 1. add interleave dataset method 2. support multiple datasets in trainer.build_dataset 3. support 3 sub-tasks in the sequence_classification task
* unfinished
* update the task model structure in NLP field
* merge master
* update by comments
* keep the default model id as current on production
* unfinished
* unfinished
* veco can run
* Merge remote-tracking branch 'origin/master' into feat/backbone_head
* add taskmodel for module management
* remove forward_input_is_dict
* unfinished
* token classification started
* update base model structure
* move space to backbone
* remove 'type' in build_from_cfg method
* test update
* bug fix
* on testing, messy code
* Merge branch 'feat/backbone_head' into feat/refactor_nlp_730
# Conflicts:
# modelscope/metrics/builder.py
# modelscope/models/__init__.py
# modelscope/models/nlp/__init__.py
# modelscope/preprocessors/nlp.py
# modelscope/trainers/trainer.py
# requirements/multi-modal.txt
* add missing merge
* add sofa source code
* refactor
* add veco task dataset
* add veco task dataset
* pre-commit passed
* fix bug of log
* add some features
* merge master
* bug fix
* refine nlp models
* fix the training error
* unfinished
* refactor pipeline
* Merge branch 'feat/backbone_head' into feat/refactor_nlp_730
# Conflicts:
# modelscope/metrics/builder.py
# modelscope/models/nlp/__init__.py
# modelscope/models/nlp/backbones/structbert/modeling_sbert.py
# modelscope/models/nlp/palm_v2/palm_for_text_generation.py
# modelscope/preprocessors/base.py
# modelscope/preprocessors/nlp.py
# modelscope/trainers/trainer.py
* Merge commit 'ab04ceafc5453ce7daa9aa09e37a55f703072a10' into feat/refactor_nlp_730
# Conflicts:
# modelscope/metainfo.py
# modelscope/metrics/builder.py
# modelscope/models/__init__.py
# modelscope/models/base/base_torch_model.py
# modelscope/models/nlp/__init__.py
# modelscope/models/nlp/backbones/space/model/intent_unified_transformer.py
# modelscope/models/nlp/backbones/space/model/model_base.py
# modelscope/models/nlp/palm_v2/palm_for_text_generation.py
# modelscope/models/nlp/sbert_for_sequence_classification.py
# modelscope/models/nlp/sequence_classification.py
# modelscope/models/nlp/space/__init__.py
# modelscope/models/nlp/space_for_dialog_intent_prediction.py
# modelscope/models/nlp/space_for_dialog_modeling.py
# modelscope/models/nlp/space_for_dialog_state_tracking.py
# modelscope/models/nlp/task_model.py
# modelscope/pipelines/nlp/sentiment_classification_pipeline.py
# modelscope/preprocessors/base.py
# modelscope/preprocessors/nlp.py
# modelscope/trainers/trainer.py
* revert changes
* unify sentence classification postprocess
* revert some changes, move some model files
* pipeline first case run through
* ws pipeline passed
* Merge branch 'feat/refactor_nlp_730' into feat/finetune_sbert_veco
* finetune
* revert code
* revert some code
* ws finetune started, only the accuracy is weird
* Merge branch 'feat/veco_taskdataset' into feat/finetune_sbert_veco
# Conflicts:
# modelscope/task_datasets/veco_dataset.py
# tests/taskdataset/test_veco_dataset.py
* veco+nli finetune started
* Merge branch 'master' into feat/finetune_sbert_veco
# Conflicts:
# modelscope/models/nlp/sbert_for_sequence_classification.py
# modelscope/models/nlp/sbert_for_token_classification.py
# modelscope/models/nlp/sbert_for_zero_shot_classification.py
# modelscope/models/nlp/space/space_for_dialog_intent_prediction.py
# modelscope/models/nlp/space/space_for_dialog_modeling.py
# modelscope/trainers/trainer.py
* add trainer for nlp
* trainer: dataset params passed into preprocessor
* test passed by nlptrainer
* fix some bugs
* fix some bugs
* add backbone/head subclass
* fix regression bugs
* fix bug in token-cls finetune
* support cfg modification
* fix bug
* fix bug
* update requirements
* add some comments and fix some t
* add some comments and revert an argument
* split to two test files
* revert code
* fix bug in preprocessor (cherry picked from commit 7a648d096e)
* fix ut bug
* support sbert models
* unfinished
* Merge branch 'feat/finetune_sbert_veco' into sly_tmp_veco_finetune
# Conflicts:
# tests/trainers/test_finetune_sequence_classification.py
* fix bug in veco
* fix bug
* fix bug
* correct running params
* remove useless files
* add palm finetuning with cnn_dailymail dataset
* copy space model from sofa
* Merge branch 'feat/finetune_sbert_veco' of gitlab.alibaba-inc.com:Ali-MaaS/MaaS-lib into feat/finetune_sbert_veco
* Merge branch 'master' into feat/finetune_sbert_veco
# Conflicts:
# modelscope/metrics/__init__.py
# modelscope/models/__init__.py
# modelscope/models/nlp/__init__.py
# modelscope/models/nlp/backbones/__init__.py
# modelscope/models/nlp/backbones/structbert/modeling_sbert.py
# modelscope/models/nlp/heads/__init__.py
# modelscope/models/nlp/masked_language.py
# modelscope/models/nlp/palm_v2/palm_for_text_generation.py
# modelscope/models/nlp/sbert_for_nli.py
# modelscope/models/nlp/sbert_for_sentence_similarity.py
# modelscope/models/nlp/sbert_for_sentiment_classification.py
# modelscope/models/nlp/sbert_for_sequence_classification.py
# modelscope/models/nlp/sbert_for_token_classification.py
# modelscope/models/nlp/sbert_for_zero_shot_classification.py
# modelscope/models/nlp/sequence_classification.py
# modelscope/models/nlp/space/space_for_dialog_intent_prediction.py
# modelscope/models/nlp/space/space_for_dialog_modeling.py
# modelscope/models/nlp/space/space_for_dialog_state_tracking.py
# modelscope/models/nlp/structbert/adv_utils.py
# modelscope/models/nlp/structbert/configuration_sbert.py
# modelscope/models/nlp/task_models/task_model.py
# modelscope/pipelines/__init__.py
# modelscope/pipelines/nlp/__init__.py
# modelscope/pipelines/nlp/fill_mask_pipeline.py
# modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
# modelscope/pipelines/nlp/nli_pipeline.py
# modelscope/pipelines/nlp/sentence_similarity_pipeline.py
# modelscope/pipelines/nlp/sentiment_classification_pipeline.py
# modelscope/pipelines/nlp/text_generation_pipeline.py
# modelscope/pipelines/nlp/word_segmentation_pipeline.py
# modelscope/pipelines/nlp/zero_shot_classification_pipeline.py
# modelscope/preprocessors/nlp.py
# modelscope/task_datasets/__init__.py
# modelscope/trainers/trainer.py
# modelscope/trainers/utils/inference.py
# modelscope/utils/file_utils.py
# requirements/nlp.txt
# tests/pipelines/test_nli.py
# tests/pipelines/test_sentence_similarity.py
# tests/pipelines/test_sentiment_classification.py
* fix imports
* mark backbone in their own modeling
* pre-commit check passed
* pre-commit passed, remove roberta model
* fix a bug in ast import
* skip all finetune uts
* fix bugs
* pre-commit passed
* bug fixed
* bug fixed
* bug fixed
* bug fixed
* fix ut bug
* fix bug
* fix ut bug
* fix bug
* fix bug
* fix bugs
* fix bug
* revert veco
* revert veco because of core dump
* fix palm bug
* revert veco
* revert mistaken code
* add a test print
* pre-commit check
* test exception
* add test code
* for test
* fix bug and test
* remove test code
* remove useless file
* 1. fix some bugs 2. add backbone ut
* Merge branch 'master' into feat/finetune_refactor_730
# Conflicts:
# modelscope/metainfo.py
# modelscope/metrics/sequence_classification_metric.py
# modelscope/models/nlp/__init__.py
# modelscope/models/nlp/task_models/task_model.py
# modelscope/preprocessors/__init__.py
# modelscope/preprocessors/nlp.py
# modelscope/trainers/trainer.py
# modelscope/trainers/utils/inference.py
# modelscope/utils/file_utils.py
# tests/trainers/test_trainer_with_nlp.py
* pre-commit passed
* revert files
* increase test level
* unregister models
* fix bugs
* fix cr comments
* fix bug in backbone-head
* add sbert backbone
* fix bug
* add test for token-cls-metric
* pre-commit passed
* fix ut comments
* revert normal tokenizer to fast tokenizer
* Merge branch 'master' into feat/finetune_refactor_730
# Conflicts:
# modelscope/models/nlp/__init__.py
# modelscope/models/nlp/backbones/__init__.py
# modelscope/models/nlp/backbones/structbert/__init__.py
# modelscope/models/nlp/masked_language.py
# modelscope/models/nlp/palm_v2/palm_for_text_generation.py
# modelscope/models/nlp/sbert_for_sequence_classification.py
# modelscope/models/nlp/sbert_for_token_classification.py
# modelscope/models/nlp/sbert_for_zero_shot_classification.py
# modelscope/pipelines/nlp/text_generation_pipeline.py
# modelscope/preprocessors/nlp.py
# modelscope/trainers/trainer.py
# modelscope/trainers/utils/inference.py
* fix merge bugs
* pre-commit passed
* fix bug
* fix bug
* fix bug
* fix bug from master
* add print
* fix ut bug
* fix bug
* Merge branch 'master' into feat/finetune_refactor_730
* skip task model test
@@ -2,7 +2,7 @@
     "framework": "pytorch",
     "task": "sentence-similarity",
     "preprocessor": {
-        "type": "bert-seq-cls-tokenizer-finetune",
+        "type": "sen-sim-tokenizer",
         "first_sequence": "sentence1",
         "second_sequence": "sentence2"
     },
@@ -4,7 +4,7 @@ from modelscope.hub.constants import (DEFAULT_MODELSCOPE_DOMAIN,
                                       DEFAULT_MODELSCOPE_GROUP,
                                       MODEL_ID_SEPARATOR,
                                       MODELSCOPE_URL_SCHEME)
-from modelscope.utils.utils import get_default_cache_dir
+from modelscope.utils.file_utils import get_default_cache_dir


 def model_id_to_group_owner_name(model_id):
@@ -53,6 +53,10 @@ class TaskModels(object):
 class Heads(object):
     # nlp heads
     text_classification = 'text-classification'
+    # mlm
+    bert_mlm = 'bert-mlm'
+    # roberta mlm
+    roberta_mlm = 'roberta-mlm'


 class Pipelines(object):
@@ -137,7 +141,7 @@ class Trainers(object):
     Holds the standard trainer name to use for identifying different trainer.
     This should be used to register trainers.

-    For a general Trainer, you can use easynlp-trainer/ofa-trainer/sofa-trainer.
+    For a general Trainer, you can use easynlp-trainer/ofa-trainer.
     For a model specific Trainer, you can use ${ModelName}-${Task}-trainer.
     """
@@ -179,6 +183,8 @@ class Preprocessors(object):
     sbert_token_cls_tokenizer = 'sbert-token-cls-tokenizer'
     zero_shot_cls_tokenizer = 'zero-shot-cls-tokenizer'
     text_error_correction = 'text-error-correction'
+    word_segment_text_to_label_preprocessor = 'word-segment-text-to-label-preprocessor'
+    fill_mask = 'fill-mask'

     # audio preprocessor
     linear_aec_fbank = 'linear-aec-fbank'
@@ -204,7 +210,7 @@ class Metrics(object):
     # metric for image instance segmentation task
     image_ins_seg_coco_metric = 'image-ins-seg-coco-metric'
     # metrics for sequence classification task
-    seq_cls_metric = 'seq_cls_metric'
+    seq_cls_metric = 'seq-cls-metric'
     # metrics for token-classification task
     token_cls_metric = 'token-cls-metric'
     # metrics for text-generation task
@@ -13,6 +13,7 @@ if TYPE_CHECKING:
     from .image_portrait_enhancement_metric import ImagePortraitEnhancementMetric
     from .sequence_classification_metric import SequenceClassificationMetric
     from .text_generation_metric import TextGenerationMetric
+    from .token_classification_metric import TokenClassificationMetric

 else:
     _import_structure = {
@@ -26,6 +27,7 @@ else:
         ['ImagePortraitEnhancementMetric'],
         'sequence_classification_metric': ['SequenceClassificationMetric'],
         'text_generation_metric': ['TextGenerationMetric'],
+        'token_classification_metric': ['TokenClassificationMetric'],
     }

     import sys
@@ -10,6 +10,9 @@ class Metric(ABC):
     complex metrics for a specific task with or without other Metric subclasses.
     """

+    def __init__(self, trainer=None, *args, **kwargs):
+        self.trainer = trainer
+
     @abstractmethod
     def add(self, outputs: Dict, inputs: Dict):
        """ Append logits and labels within an eval loop.
@@ -20,7 +20,9 @@ class MetricKeys(object):
 task_default_metrics = {
     Tasks.image_segmentation: [Metrics.image_ins_seg_coco_metric],
     Tasks.sentence_similarity: [Metrics.seq_cls_metric],
+    Tasks.nli: [Metrics.seq_cls_metric],
+    Tasks.sentiment_classification: [Metrics.seq_cls_metric],
     Tasks.token_classification: [Metrics.token_cls_metric],
     Tasks.text_generation: [Metrics.text_gen_metric],
     Tasks.image_denoising: [Metrics.image_denoise_metric],
     Tasks.image_color_enhancement: [Metrics.image_color_enhance_metric],
@@ -17,14 +17,14 @@ class SequenceClassificationMetric(Metric):
     """The metric computation class for sequence classification classes.
     """

-    label_name = 'labels'
-
-    def __init__(self):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
         self.preds = []
         self.labels = []

     def add(self, outputs: Dict, inputs: Dict):
-        ground_truths = inputs[self.label_name]
+        label_name = OutputKeys.LABEL if OutputKeys.LABEL in inputs else OutputKeys.LABELS
+        ground_truths = inputs[label_name]
         eval_results = outputs[OutputKeys.LOGITS]
         self.preds.append(
             torch_nested_numpify(torch_nested_detach(eval_results)))
@@ -0,0 +1,123 @@
import importlib
from typing import Dict, List, Optional, Union

import numpy as np

from modelscope.outputs import OutputKeys
from ..metainfo import Metrics
from ..utils.registry import default_group
from ..utils.tensor_utils import torch_nested_detach, torch_nested_numpify
from .base import Metric
from .builder import METRICS, MetricKeys


@METRICS.register_module(
    group_key=default_group, module_name=Metrics.token_cls_metric)
class TokenClassificationMetric(Metric):
    """
    The metric computation class for token-classification task.

    Args:
        return_entity_level_metrics (bool, *optional*):
            Whether to return every label's detail metrics, default False.
    """

    def add(self, outputs: Dict, inputs: Dict):
        label_name = OutputKeys.LABEL if OutputKeys.LABEL in inputs else OutputKeys.LABELS
        ground_truths = inputs[label_name]
        eval_results = outputs[OutputKeys.LOGITS]
        self.preds.append(
            torch_nested_numpify(torch_nested_detach(eval_results)))
        self.labels.append(
            torch_nested_numpify(torch_nested_detach(ground_truths)))

    def __init__(self, return_entity_level_metrics=False, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.return_entity_level_metrics = return_entity_level_metrics
        self.preds = []
        self.labels = []

    def evaluate(self):
        self.id2label = {
            id: label
            for label, id in self.trainer.label2id.items()
        }
        self.preds = np.concatenate(self.preds, axis=0)
        self.labels = np.concatenate(self.labels, axis=0)
        predictions = np.argmax(self.preds, axis=-1)
        true_predictions = [[
            self.id2label[p] for (p, lb) in zip(prediction, label)
            if lb != -100
        ] for prediction, label in zip(predictions, self.labels)]
        true_labels = [[
            self.id2label[lb] for (p, lb) in zip(prediction, label)
            if lb != -100
        ] for prediction, label in zip(predictions, self.labels)]

        results = self._compute(
            predictions=true_predictions, references=true_labels)
        if self.return_entity_level_metrics:
            final_results = {}
            for key, value in results.items():
                if isinstance(value, dict):
                    for n, v in value.items():
                        final_results[f'{key}_{n}'] = v
                else:
                    final_results[key] = value
            return final_results
        else:
            return {
                MetricKeys.PRECISION: results[MetricKeys.PRECISION],
                MetricKeys.RECALL: results[MetricKeys.RECALL],
                MetricKeys.F1: results[MetricKeys.F1],
                MetricKeys.ACCURACY: results[MetricKeys.ACCURACY],
            }

    @staticmethod
    def _compute(
        predictions,
        references,
        suffix: bool = False,
        scheme: Optional[str] = None,
        mode: Optional[str] = None,
        sample_weight: Optional[List[int]] = None,
        zero_division: Union[str, int] = 'warn',
    ):
        from seqeval.metrics import accuracy_score, classification_report
        if scheme is not None:
            try:
                scheme_module = importlib.import_module('seqeval.scheme')
                scheme = getattr(scheme_module, scheme)
            except AttributeError:
                raise ValueError(
                    f'Scheme should be one of [IOB1, IOB2, IOE1, IOE2, IOBES, BILOU], got {scheme}'
                )
        report = classification_report(
            y_true=references,
            y_pred=predictions,
            suffix=suffix,
            output_dict=True,
            scheme=scheme,
            mode=mode,
            sample_weight=sample_weight,
            zero_division=zero_division,
        )
        report.pop('macro avg')
        report.pop('weighted avg')
        overall_score = report.pop('micro avg')

        scores = {
            type_name: {
                MetricKeys.PRECISION: score['precision'],
                MetricKeys.RECALL: score['recall'],
                MetricKeys.F1: score['f1-score'],
                'number': score['support'],
            }
            for type_name, score in report.items()
        }
        scores[MetricKeys.PRECISION] = overall_score['precision']
        scores[MetricKeys.RECALL] = overall_score['recall']
        scores[MetricKeys.F1] = overall_score['f1-score']
        scores[MetricKeys.ACCURACY] = accuracy_score(
            y_true=references, y_pred=predictions)
        return scores
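For reference, a self-contained sketch of the add/evaluate cycle above. The trainer stub only supplies the `label2id` mapping that `evaluate` reads; all tensors and labels are made up, and seqeval must be installed (it is imported lazily inside `_compute`):

```python
import torch


class TrainerStub:
    # evaluate() reads trainer.label2id to build its id2label mapping.
    label2id = {'O': 0, 'B-LOC': 1, 'I-LOC': 2}


metric = TokenClassificationMetric(trainer=TrainerStub())
logits = torch.randn(2, 5, 3)                  # (batch, seq_len, num_labels)
labels = torch.tensor([[0, 1, 2, -100, -100],  # -100 marks ignored positions
                       [0, 0, 1, 2, -100]])
metric.add({'logits': logits}, {'labels': labels})
print(metric.evaluate())  # precision / recall / f1 / accuracy
```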
@@ -10,6 +10,8 @@ from modelscope.hub.snapshot_download import snapshot_download
 from modelscope.models.builder import build_model
 from modelscope.utils.config import Config
 from modelscope.utils.constant import DEFAULT_MODEL_REVISION, ModelFile
+from modelscope.utils.file_utils import func_receive_dict_inputs
+from modelscope.utils.hub import parse_label_mapping
 from modelscope.utils.logger import get_logger

 logger = get_logger()
@@ -69,6 +71,7 @@ class Model(ABC):
     def from_pretrained(cls,
                         model_name_or_path: str,
                         revision: Optional[str] = DEFAULT_MODEL_REVISION,
+                        cfg_dict: Config = None,
                         *model_args,
                         **kwargs):
         """ Instantiate a model from local directory or remote model repo. Note
@@ -87,25 +90,25 @@
             )
             local_model_dir = snapshot_download(model_name_or_path, revision)
         logger.info(f'initialize model from {local_model_dir}')
-        cfg = Config.from_file(
-            osp.join(local_model_dir, ModelFile.CONFIGURATION))
+        if cfg_dict is not None:
+            cfg = cfg_dict
+        else:
+            cfg = Config.from_file(
+                osp.join(local_model_dir, ModelFile.CONFIGURATION))
         task_name = cfg.task
         model_cfg = cfg.model
-        assert hasattr(
-            cfg, 'pipeline'), 'pipeline config is missing from config file.'
-        pipeline_cfg = cfg.pipeline
         # TODO @wenmeng.zwm may should manually initialize model after model building
         if hasattr(model_cfg, 'model_type') and not hasattr(model_cfg, 'type'):
            model_cfg.type = model_cfg.model_type
         model_cfg.model_dir = local_model_dir
         for k, v in kwargs.items():
             model_cfg[k] = v
         model = build_model(
             model_cfg, task_name=task_name, default_args=kwargs)

         # dynamically add pipeline info to model for pipeline inference
-        model.pipeline = pipeline_cfg
+        if hasattr(cfg, 'pipeline'):
+            model.pipeline = cfg.pipeline
         return model
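The new `cfg_dict` argument lets callers patch the configuration before the model is built, instead of always reading the on-disk file. A minimal sketch; the model id and config field are illustrative:

```python
from modelscope.models import Model
from modelscope.utils.config import Config

# Load the shipped configuration, tweak it, then build with the override.
cfg = Config.from_file('/path/to/local/configuration.json')
cfg.model.num_labels = 3  # e.g. repurpose the head as a 3-way classifier

model = Model.from_pretrained(
    'damo/nlp_structbert_sentence-similarity_chinese-base',  # hypothetical id
    cfg_dict=cfg)
```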
@@ -5,6 +5,7 @@ from typing import Any, Dict, Optional, Union
 import torch
 from torch import nn

+from modelscope.utils.file_utils import func_receive_dict_inputs
 from modelscope.utils.logger import get_logger
 from .base_model import Model
@@ -20,6 +21,13 @@ class TorchModel(Model, torch.nn.Module):
         super().__init__(model_dir, *args, **kwargs)
         torch.nn.Module.__init__(self)

+    def __call__(self, input: Dict[str,
+                                   torch.Tensor]) -> Dict[str, torch.Tensor]:
+        if func_receive_dict_inputs(self.forward):
+            return self.postprocess(self.forward(input))
+        else:
+            return self.postprocess(self.forward(**input))
+
     def forward(self, inputs: Dict[str,
                                    torch.Tensor]) -> Dict[str, torch.Tensor]:
         raise NotImplementedError
@@ -50,6 +58,3 @@
         elif isinstance(module, nn.LayerNorm):
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
-
-    def compute_loss(self, outputs: Dict[str, Any], labels):
-        raise NotImplementedError()
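The `__call__` override routes a single input dict either whole or unpacked into `forward`, depending on the forward signature. The real check lives in `modelscope.utils.file_utils.func_receive_dict_inputs`; the reimplementation below is only an assumption sketched to illustrate the idea, not the library's actual code:

```python
import inspect


def receives_dict_inputs(forward) -> bool:
    """Guess whether a forward expects one dict argument (illustrative)."""
    params = [
        p for p in inspect.signature(forward).parameters.values()
        if p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
    ]
    # A lone positional parameter named 'input'/'inputs' suggests the model
    # wants the whole dict; several named parameters suggest **unpacking.
    return len(params) == 1 and params[0].name in ('input', 'inputs')
```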
@@ -4,32 +4,26 @@ from typing import TYPE_CHECKING
 from modelscope.utils.import_utils import LazyImportModule

 if TYPE_CHECKING:
-    from .backbones import (SbertModel, SpaceGenerator, SpaceModelBase,
-                            GPT3Model)
+    from .backbones import SbertModel
     from .heads import SequenceClassificationHead
     from .bert_for_sequence_classification import BertForSequenceClassification
     from .csanmt_for_translation import CsanmtForTranslation
     from .masked_language import (StructBertForMaskedLM, VecoForMaskedLM,
                                   BertForMaskedLM)
     from .nncrf_for_named_entity_recognition import TransformerCRFForNamedEntityRecognition
-    from .palm_for_text_generation import PalmForTextGeneration
-    from .sbert_for_nli import SbertForNLI
-    from .sbert_for_sentence_similarity import SbertForSentenceSimilarity
-    from .sbert_for_sentiment_classification import SbertForSentimentClassification
-    from .sbert_for_token_classification import SbertForTokenClassification
-    from .sbert_for_zero_shot_classification import SbertForZeroShotClassification
-    from .sequence_classification import SequenceClassificationModel
-    from .space_for_dialog_intent_prediction import SpaceForDialogIntent
-    from .space_for_dialog_modeling import SpaceForDialogModeling
-    from .space_for_dialog_state_tracking import SpaceForDialogStateTracking
-    from .task_model import SingleBackboneTaskModelBase
+    from .palm_v2 import PalmForTextGeneration
+    from .token_classification import SbertForTokenClassification
+    from .sequence_classification import VecoForSequenceClassification, SbertForSequenceClassification
+    from .space import SpaceForDialogIntent
+    from .space import SpaceForDialogModeling
+    from .space import SpaceForDialogStateTracking
+    from .task_models.task_model import SingleBackboneTaskModelBase
     from .bart_for_text_error_correction import BartForTextErrorCorrection
-    from .gpt3_for_text_generation import GPT3ForTextGeneration
+    from .gpt3 import GPT3ForTextGeneration

 else:
     _import_structure = {
-        'backbones':
-        ['SbertModel', 'SpaceGenerator', 'SpaceModelBase', 'GPT3Model'],
+        'backbones': ['SbertModel'],
         'heads': ['SequenceClassificationHead'],
         'csanmt_for_translation': ['CsanmtForTranslation'],
         'bert_for_sequence_classification': ['BertForSequenceClassification'],
@@ -37,21 +31,17 @@ else:
         ['StructBertForMaskedLM', 'VecoForMaskedLM', 'BertForMaskedLM'],
         'nncrf_for_named_entity_recognition':
         ['TransformerCRFForNamedEntityRecognition'],
-        'palm_for_text_generation': ['PalmForTextGeneration'],
-        'sbert_for_nli': ['SbertForNLI'],
-        'sbert_for_sentence_similarity': ['SbertForSentenceSimilarity'],
-        'sbert_for_sentiment_classification':
-        ['SbertForSentimentClassification'],
-        'sbert_for_token_classification': ['SbertForTokenClassification'],
-        'sbert_for_zero_shot_classification':
-        ['SbertForZeroShotClassification'],
-        'sequence_classification': ['SequenceClassificationModel'],
-        'space_for_dialog_intent_prediction': ['SpaceForDialogIntent'],
-        'space_for_dialog_modeling': ['SpaceForDialogModeling'],
-        'space_for_dialog_state_tracking': ['SpaceForDialogStateTracking'],
+        'palm_v2': ['PalmForTextGeneration'],
+        'token_classification': ['SbertForTokenClassification'],
+        'sequence_classification':
+        ['VecoForSequenceClassification', 'SbertForSequenceClassification'],
+        'space': [
+            'SpaceForDialogIntent', 'SpaceForDialogModeling',
+            'SpaceForDialogStateTracking'
+        ],
         'task_model': ['SingleBackboneTaskModelBase'],
         'bart_for_text_error_correction': ['BartForTextErrorCorrection'],
-        'gpt3_for_text_generation': ['GPT3ForTextGeneration'],
+        'gpt3': ['GPT3ForTextGeneration'],
     }

     import sys
@@ -4,14 +4,10 @@ from typing import TYPE_CHECKING
 from modelscope.utils.import_utils import LazyImportModule

 if TYPE_CHECKING:
-    from .space import SpaceGenerator, SpaceModelBase
     from .structbert import SbertModel
-    from .gpt3 import GPT3Model
 else:
     _import_structure = {
-        'space': ['SpaceGenerator', 'SpaceModelBase'],
         'structbert': ['SbertModel'],
-        'gpt3': ['GPT3Model']
     }

     import sys
@@ -1,2 +0,0 @@
from .model.generator import Generator as SpaceGenerator
from .model.model_base import SpaceModelBase
@@ -1,3 +0,0 @@
from .gen_unified_transformer import GenUnifiedTransformer
from .intent_unified_transformer import IntentUnifiedTransformer
from .unified_transformer import UnifiedTransformer
@@ -0,0 +1,54 @@
from transformers import PreTrainedModel

from modelscope.metainfo import Models
from modelscope.models.base import TorchModel
from modelscope.models.builder import BACKBONES
from modelscope.models.nlp.structbert import SbertConfig
from modelscope.models.nlp.structbert import SbertModel as SbertModelTransform
from modelscope.utils.constant import Fields
from modelscope.utils.logger import get_logger

logger = get_logger(__name__)


@BACKBONES.register_module(Fields.nlp, module_name=Models.structbert)
class SbertModel(TorchModel, SbertModelTransform):

    def __init__(self, model_dir=None, add_pooling_layer=True, **config):
        """
        Args:
            model_dir (str, optional): The model checkpoint directory. Defaults to None.
            add_pooling_layer (bool, optional): to decide if pool the output from hidden layer. Defaults to True.
        """
        config = SbertConfig(**config)
        super().__init__(model_dir)
        self.config = config
        SbertModelTransform.__init__(self, config, add_pooling_layer)

    def extract_sequence_outputs(self, outputs):
        return outputs['last_hidden_state']

    def extract_pooled_outputs(self, outputs):
        return outputs['pooler_output']

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_values=None,
        use_cache=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        return SbertModelTransform.forward(
            self, input_ids, attention_mask, token_type_ids, position_ids,
            head_mask, inputs_embeds, encoder_hidden_states,
            encoder_attention_mask, past_key_values, use_cache,
            output_attentions, output_hidden_states, return_dict)
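A short usage sketch of the wrapper above: build it from config keyword arguments, call it with a single input dict (which `TorchModel.__call__` dispatches into `forward`), then pull tensors out with the extract helpers. Shapes are illustrative and this assumes `SbertConfig` supplies workable defaults when no kwargs are given:

```python
import torch

model = SbertModel(model_dir=None, add_pooling_layer=True)  # default config
model.eval()

inputs = {
    'input_ids': torch.randint(0, 100, (1, 16)),
    'attention_mask': torch.ones(1, 16, dtype=torch.long),
}
with torch.no_grad():
    outputs = model(inputs)

pooled = model.extract_pooled_outputs(outputs)      # (1, hidden_size)
sequence = model.extract_sequence_outputs(outputs)  # (1, 16, hidden_size)
```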
@@ -1,19 +0,0 @@ | |||
# Copyright (c) Alibaba, Inc. and its affiliates. | |||
from typing import TYPE_CHECKING | |||
from modelscope.utils.import_utils import LazyImportModule | |||
if TYPE_CHECKING: | |||
from .modeling_sbert import SbertModel | |||
else: | |||
_import_structure = {'modeling_sbert': ['SbertModel']} | |||
import sys | |||
sys.modules[__name__] = LazyImportModule( | |||
__name__, | |||
globals()['__file__'], | |||
_import_structure, | |||
module_spec=__spec__, | |||
extra_objects={}, | |||
) |
@@ -1,815 +0,0 @@ | |||
import math | |||
from dataclasses import dataclass | |||
from typing import Optional, Tuple, Union | |||
import torch | |||
import torch.utils.checkpoint | |||
from packaging import version | |||
from torch import nn | |||
from transformers import PreTrainedModel | |||
from transformers.activations import ACT2FN | |||
from transformers.modeling_outputs import ( | |||
BaseModelOutputWithPastAndCrossAttentions, | |||
BaseModelOutputWithPoolingAndCrossAttentions, ModelOutput) | |||
from transformers.modeling_utils import (apply_chunking_to_forward, | |||
find_pruneable_heads_and_indices, | |||
prune_linear_layer) | |||
from modelscope.metainfo import Models | |||
from modelscope.models.base import TorchModel | |||
from modelscope.models.builder import BACKBONES | |||
from modelscope.utils.constant import Fields | |||
from modelscope.utils.logger import get_logger | |||
from .configuration_sbert import SbertConfig | |||
logger = get_logger(__name__) | |||
@BACKBONES.register_module(Fields.nlp, module_name=Models.structbert) | |||
class SbertModel(TorchModel, PreTrainedModel): | |||
""" | |||
The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of | |||
cross-attention is added between the self-attention layers, following the architecture described in `Attention is | |||
all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, | |||
Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. | |||
To behave as an decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration | |||
set to :obj:`True`. To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder` | |||
argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an | |||
input to the forward pass. | |||
""" | |||
def __init__(self, model_dir=None, add_pooling_layer=True, **config): | |||
""" | |||
Args: | |||
model_dir (str, optional): The model checkpoint directory. Defaults to None. | |||
add_pooling_layer (bool, optional): to decide if pool the output from hidden layer. Defaults to True. | |||
""" | |||
config = SbertConfig(**config) | |||
super().__init__(model_dir) | |||
self.config = config | |||
self.embeddings = SbertEmbeddings(config) | |||
self.encoder = SbertEncoder(config) | |||
self.pooler = SbertPooler(config) if add_pooling_layer else None | |||
self.init_weights() | |||
def get_input_embeddings(self): | |||
return self.embeddings.word_embeddings | |||
def set_input_embeddings(self, value): | |||
self.embeddings.word_embeddings = value | |||
def _prune_heads(self, heads_to_prune): | |||
""" | |||
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base | |||
class PreTrainedModel | |||
""" | |||
for layer, heads in heads_to_prune.items(): | |||
self.encoder.layer[layer].attention.prune_heads(heads) | |||
def forward(self, | |||
input_ids=None, | |||
attention_mask=None, | |||
token_type_ids=None, | |||
position_ids=None, | |||
head_mask=None, | |||
inputs_embeds=None, | |||
encoder_hidden_states=None, | |||
encoder_attention_mask=None, | |||
past_key_values=None, | |||
use_cache=None, | |||
output_attentions=None, | |||
output_hidden_states=None, | |||
return_dict=None, | |||
**kwargs): | |||
r""" | |||
encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)` | |||
, `optional`): | |||
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if | |||
the model is configured as a decoder. | |||
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): | |||
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in | |||
the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``: | |||
- 1 for tokens that are **not masked**, | |||
- 0 for tokens that are **masked**. | |||
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` | |||
with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, | |||
sequence_length - 1, embed_size_per_head)`): | |||
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. | |||
If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids` | |||
(those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)` | |||
instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`. | |||
use_cache (:obj:`bool`, `optional`): | |||
If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up | |||
decoding (see :obj:`past_key_values`). | |||
""" | |||
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions | |||
output_hidden_states = ( | |||
output_hidden_states if output_hidden_states is not None else | |||
self.config.output_hidden_states) | |||
return_dict = return_dict if return_dict is not None else self.config.use_return_dict | |||
if self.config.is_decoder: | |||
use_cache = use_cache if use_cache is not None else self.config.use_cache | |||
else: | |||
use_cache = False | |||
if input_ids is not None and inputs_embeds is not None: | |||
raise ValueError( | |||
'You cannot specify both input_ids and inputs_embeds at the same time' | |||
) | |||
elif input_ids is not None: | |||
input_shape = input_ids.size() | |||
elif inputs_embeds is not None: | |||
input_shape = inputs_embeds.size()[:-1] | |||
else: | |||
raise ValueError( | |||
'You have to specify either input_ids or inputs_embeds') | |||
batch_size, seq_length = input_shape | |||
device = input_ids.device if input_ids is not None else inputs_embeds.device | |||
# past_key_values_length | |||
past_key_values_length = past_key_values[0][0].shape[ | |||
2] if past_key_values is not None else 0 | |||
if attention_mask is None: | |||
attention_mask = torch.ones( | |||
((batch_size, seq_length + past_key_values_length)), | |||
device=device) | |||
if token_type_ids is None: | |||
if hasattr(self.embeddings, 'token_type_ids'): | |||
buffered_token_type_ids = self.embeddings.token_type_ids[:, : | |||
seq_length] | |||
buffered_token_type_ids_expanded = buffered_token_type_ids.expand( | |||
batch_size, seq_length) | |||
token_type_ids = buffered_token_type_ids_expanded | |||
else: | |||
token_type_ids = torch.zeros( | |||
input_shape, dtype=torch.long, device=device) | |||
# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] | |||
# ourselves in which case we just need to make it broadcastable to all heads. | |||
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask( | |||
attention_mask, input_shape, device) | |||
# If a 2D or 3D attention mask is provided for the cross-attention | |||
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length] | |||
if self.config.is_decoder and encoder_hidden_states is not None: | |||
encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size( | |||
) | |||
encoder_hidden_shape = (encoder_batch_size, | |||
encoder_sequence_length) | |||
if encoder_attention_mask is None: | |||
encoder_attention_mask = torch.ones( | |||
encoder_hidden_shape, device=device) | |||
encoder_extended_attention_mask = self.invert_attention_mask( | |||
encoder_attention_mask) | |||
else: | |||
encoder_extended_attention_mask = None | |||
# Prepare head mask if needed | |||
# 1.0 in head_mask indicate we keep the head | |||
# attention_probs has shape bsz x n_heads x N x N | |||
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] | |||
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] | |||
head_mask = self.get_head_mask(head_mask, | |||
self.config.num_hidden_layers) | |||
embedding_output, orignal_embeds = self.embeddings( | |||
input_ids=input_ids, | |||
position_ids=position_ids, | |||
token_type_ids=token_type_ids, | |||
inputs_embeds=inputs_embeds, | |||
past_key_values_length=past_key_values_length, | |||
return_inputs_embeds=True, | |||
) | |||
encoder_outputs = self.encoder( | |||
embedding_output, | |||
attention_mask=extended_attention_mask, | |||
head_mask=head_mask, | |||
encoder_hidden_states=encoder_hidden_states, | |||
encoder_attention_mask=encoder_extended_attention_mask, | |||
past_key_values=past_key_values, | |||
use_cache=use_cache, | |||
output_attentions=output_attentions, | |||
output_hidden_states=output_hidden_states, | |||
return_dict=return_dict, | |||
) | |||
sequence_output = encoder_outputs[0] | |||
pooled_output = self.pooler( | |||
sequence_output) if self.pooler is not None else None | |||
if not return_dict: | |||
return (sequence_output, | |||
pooled_output) + encoder_outputs[1:] + (orignal_embeds, ) | |||
return BaseModelOutputWithPoolingAndCrossAttentionsWithEmbedding( | |||
last_hidden_state=sequence_output, | |||
pooler_output=pooled_output, | |||
past_key_values=encoder_outputs.past_key_values, | |||
hidden_states=encoder_outputs.hidden_states, | |||
attentions=encoder_outputs.attentions, | |||
cross_attentions=encoder_outputs.cross_attentions, | |||
embedding_output=orignal_embeds) | |||
def extract_sequence_outputs(self, outputs): | |||
return outputs['last_hidden_state'] | |||
def extract_pooled_outputs(self, outputs): | |||
return outputs['pooler_output'] | |||
class SbertEmbeddings(nn.Module): | |||
"""Construct the embeddings from word, position and token_type embeddings.""" | |||
def __init__(self, config): | |||
super().__init__() | |||
self.word_embeddings = nn.Embedding( | |||
config.vocab_size, | |||
config.hidden_size, | |||
padding_idx=config.pad_token_id) | |||
self.position_embeddings = nn.Embedding(config.max_position_embeddings, | |||
config.hidden_size) | |||
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, | |||
config.hidden_size) | |||
# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load | |||
# any TensorFlow checkpoint file | |||
self.LayerNorm = nn.LayerNorm( | |||
config.hidden_size, eps=config.layer_norm_eps) | |||
self.dropout = nn.Dropout(config.hidden_dropout_prob) | |||
# position_ids (1, len position emb) is contiguous in memory and exported when serialized | |||
self.position_embedding_type = getattr(config, | |||
'position_embedding_type', | |||
'absolute') | |||
self.register_buffer( | |||
'position_ids', | |||
torch.arange(config.max_position_embeddings).expand((1, -1))) | |||
if version.parse(torch.__version__) > version.parse('1.6.0'): | |||
self.register_buffer( | |||
'token_type_ids', | |||
torch.zeros( | |||
self.position_ids.size(), | |||
dtype=torch.long, | |||
device=self.position_ids.device), | |||
persistent=False, | |||
) | |||
def forward(self, | |||
input_ids=None, | |||
token_type_ids=None, | |||
position_ids=None, | |||
inputs_embeds=None, | |||
past_key_values_length=0, | |||
return_inputs_embeds=False): | |||
if input_ids is not None: | |||
input_shape = input_ids.size() | |||
else: | |||
input_shape = inputs_embeds.size()[:-1] | |||
seq_length = input_shape[1] | |||
if position_ids is None: | |||
position_ids = self.position_ids[:, | |||
past_key_values_length:seq_length | |||
+ past_key_values_length] | |||
# Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs | |||
# when its auto-generated, registered buffer helps users when tracing the model without passing token_type_ids | |||
# issue #5664 | |||
if token_type_ids is None: | |||
if hasattr(self, 'token_type_ids'): | |||
buffered_token_type_ids = self.token_type_ids[:, :seq_length] | |||
buffered_token_type_ids_expanded = buffered_token_type_ids.expand( | |||
input_shape[0], seq_length) | |||
token_type_ids = buffered_token_type_ids_expanded | |||
else: | |||
token_type_ids = torch.zeros( | |||
input_shape, | |||
dtype=torch.long, | |||
device=self.position_ids.device) | |||
if inputs_embeds is None: | |||
inputs_embeds = self.word_embeddings(input_ids) | |||
token_type_embeddings = self.token_type_embeddings(token_type_ids) | |||
embeddings = inputs_embeds + token_type_embeddings | |||
if self.position_embedding_type == 'absolute': | |||
position_embeddings = self.position_embeddings(position_ids) | |||
embeddings += position_embeddings | |||
embeddings = self.LayerNorm(embeddings) | |||
embeddings = self.dropout(embeddings) | |||
if not return_inputs_embeds: | |||
return embeddings | |||
else: | |||
return embeddings, inputs_embeds | |||
class SbertSelfAttention(nn.Module): | |||
def __init__(self, config): | |||
super().__init__() | |||
if config.hidden_size % config.num_attention_heads != 0 and not hasattr( | |||
config, 'embedding_size'): | |||
raise ValueError( | |||
f'The hidden size ({config.hidden_size}) is not a multiple of the number of attention ' | |||
f'heads ({config.num_attention_heads})') | |||
self.num_attention_heads = config.num_attention_heads | |||
self.attention_head_size = int(config.hidden_size | |||
/ config.num_attention_heads) | |||
self.all_head_size = self.num_attention_heads * self.attention_head_size | |||
self.query = nn.Linear(config.hidden_size, self.all_head_size) | |||
self.key = nn.Linear(config.hidden_size, self.all_head_size) | |||
self.value = nn.Linear(config.hidden_size, self.all_head_size) | |||
self.dropout = nn.Dropout(config.attention_probs_dropout_prob) | |||
self.position_embedding_type = getattr(config, | |||
'position_embedding_type', | |||
'absolute') | |||
if self.position_embedding_type == 'relative_key' or self.position_embedding_type == 'relative_key_query': | |||
self.max_position_embeddings = config.max_position_embeddings | |||
self.distance_embedding = nn.Embedding( | |||
2 * config.max_position_embeddings - 1, | |||
self.attention_head_size) | |||
self.is_decoder = config.is_decoder | |||
def transpose_for_scores(self, x): | |||
new_x_shape = x.size()[:-1] + (self.num_attention_heads, | |||
self.attention_head_size) | |||
x = x.view(*new_x_shape) | |||
return x.permute(0, 2, 1, 3) | |||
def forward( | |||
self, | |||
hidden_states, | |||
attention_mask=None, | |||
head_mask=None, | |||
encoder_hidden_states=None, | |||
encoder_attention_mask=None, | |||
past_key_value=None, | |||
output_attentions=False, | |||
): | |||
mixed_query_layer = self.query(hidden_states) | |||
# If this is instantiated as a cross-attention module, the keys | |||
# and values come from an encoder; the attention mask needs to be | |||
# such that the encoder's padding tokens are not attended to. | |||
is_cross_attention = encoder_hidden_states is not None | |||
if is_cross_attention and past_key_value is not None: | |||
# reuse k,v, cross_attentions | |||
key_layer = past_key_value[0] | |||
value_layer = past_key_value[1] | |||
attention_mask = encoder_attention_mask | |||
elif is_cross_attention: | |||
key_layer = self.transpose_for_scores( | |||
self.key(encoder_hidden_states)) | |||
value_layer = self.transpose_for_scores( | |||
self.value(encoder_hidden_states)) | |||
attention_mask = encoder_attention_mask | |||
elif past_key_value is not None: | |||
key_layer = self.transpose_for_scores(self.key(hidden_states)) | |||
value_layer = self.transpose_for_scores(self.value(hidden_states)) | |||
key_layer = torch.cat([past_key_value[0], key_layer], dim=2) | |||
value_layer = torch.cat([past_key_value[1], value_layer], dim=2) | |||
else: | |||
key_layer = self.transpose_for_scores(self.key(hidden_states)) | |||
value_layer = self.transpose_for_scores(self.value(hidden_states)) | |||
query_layer = self.transpose_for_scores(mixed_query_layer) | |||
if self.is_decoder: | |||
# if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states. | |||
# Further calls to cross_attention layer can then reuse all cross-attention | |||
# key/value_states (first "if" case) | |||
# if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of | |||
# all previous decoder key/value_states. Further calls to uni-directional self-attention | |||
# can concat previous decoder key/value_states to current projected key/value_states (third "elif" case) | |||
# if encoder bi-directional self-attention `past_key_value` is always `None` | |||
past_key_value = (key_layer, value_layer) | |||
# Take the dot product between "query" and "key" to get the raw attention scores. | |||
attention_scores = torch.matmul(query_layer, | |||
key_layer.transpose(-1, -2)) | |||
if self.position_embedding_type == 'relative_key' or self.position_embedding_type == 'relative_key_query': | |||
seq_length = hidden_states.size()[1] | |||
position_ids_l = torch.arange( | |||
seq_length, dtype=torch.long, | |||
device=hidden_states.device).view(-1, 1) | |||
position_ids_r = torch.arange( | |||
seq_length, dtype=torch.long, | |||
device=hidden_states.device).view(1, -1) | |||
distance = position_ids_l - position_ids_r | |||
positional_embedding = self.distance_embedding( | |||
distance + self.max_position_embeddings - 1) | |||
positional_embedding = positional_embedding.to( | |||
dtype=query_layer.dtype) # fp16 compatibility | |||
if self.position_embedding_type == 'relative_key': | |||
relative_position_scores = torch.einsum( | |||
'bhld,lrd->bhlr', query_layer, positional_embedding) | |||
attention_scores = attention_scores + relative_position_scores | |||
elif self.position_embedding_type == 'relative_key_query': | |||
relative_position_scores_query = torch.einsum( | |||
'bhld,lrd->bhlr', query_layer, positional_embedding) | |||
relative_position_scores_key = torch.einsum( | |||
'bhrd,lrd->bhlr', key_layer, positional_embedding) | |||
attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key | |||
attention_scores = attention_scores / math.sqrt( | |||
self.attention_head_size) | |||
if attention_mask is not None: | |||
# Apply the attention mask is (precomputed for all layers in SbertModel forward() function) | |||
attention_scores = attention_scores + attention_mask | |||
# Normalize the attention scores to probabilities. | |||
attention_probs = nn.Softmax(dim=-1)(attention_scores) | |||
# This is actually dropping out entire tokens to attend to, which might | |||
# seem a bit unusual, but is taken from the original Transformer paper. | |||
attention_probs = self.dropout(attention_probs) | |||
# Mask heads if we want to | |||
if head_mask is not None: | |||
attention_probs = attention_probs * head_mask | |||
context_layer = torch.matmul(attention_probs, value_layer) | |||
context_layer = context_layer.permute(0, 2, 1, 3).contiguous() | |||
new_context_layer_shape = context_layer.size()[:-2] + ( | |||
self.all_head_size, ) | |||
context_layer = context_layer.view(*new_context_layer_shape) | |||
outputs = (context_layer, | |||
attention_probs) if output_attentions else (context_layer, ) | |||
if self.is_decoder: | |||
outputs = outputs + (past_key_value, ) | |||
return outputs | |||
class SbertSelfOutput(nn.Module): | |||
def __init__(self, config): | |||
super().__init__() | |||
self.dense = nn.Linear(config.hidden_size, config.hidden_size) | |||
self.LayerNorm = nn.LayerNorm( | |||
config.hidden_size, eps=config.layer_norm_eps) | |||
self.dropout = nn.Dropout(config.hidden_dropout_prob) | |||
def forward(self, hidden_states, input_tensor): | |||
hidden_states = self.dense(hidden_states) | |||
hidden_states = self.dropout(hidden_states) | |||
hidden_states = self.LayerNorm(hidden_states + input_tensor) | |||
return hidden_states | |||
class SbertAttention(nn.Module): | |||
def __init__(self, config): | |||
super().__init__() | |||
self.self = SbertSelfAttention(config) | |||
self.output = SbertSelfOutput(config) | |||
self.pruned_heads = set() | |||
def prune_heads(self, heads): | |||
if len(heads) == 0: | |||
return | |||
heads, index = find_pruneable_heads_and_indices( | |||
heads, self.self.num_attention_heads, | |||
self.self.attention_head_size, self.pruned_heads) | |||
# Prune linear layers | |||
self.self.query = prune_linear_layer(self.self.query, index) | |||
self.self.key = prune_linear_layer(self.self.key, index) | |||
self.self.value = prune_linear_layer(self.self.value, index) | |||
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1) | |||
# Update hyper params and store pruned heads | |||
self.self.num_attention_heads = self.self.num_attention_heads - len( | |||
heads) | |||
self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads | |||
self.pruned_heads = self.pruned_heads.union(heads) | |||
def forward( | |||
self, | |||
hidden_states, | |||
attention_mask=None, | |||
head_mask=None, | |||
encoder_hidden_states=None, | |||
encoder_attention_mask=None, | |||
past_key_value=None, | |||
output_attentions=False, | |||
): | |||
self_outputs = self.self( | |||
hidden_states, | |||
attention_mask, | |||
head_mask, | |||
encoder_hidden_states, | |||
encoder_attention_mask, | |||
past_key_value, | |||
output_attentions, | |||
) | |||
attention_output = self.output(self_outputs[0], hidden_states) | |||
outputs = (attention_output, | |||
) + self_outputs[1:] # add attentions if we output them | |||
return outputs | |||
class SbertIntermediate(nn.Module): | |||
def __init__(self, config): | |||
super().__init__() | |||
self.dense = nn.Linear(config.hidden_size, config.intermediate_size) | |||
if isinstance(config.hidden_act, str): | |||
self.intermediate_act_fn = ACT2FN[config.hidden_act] | |||
else: | |||
self.intermediate_act_fn = config.hidden_act | |||
def forward(self, hidden_states): | |||
hidden_states = self.dense(hidden_states) | |||
hidden_states = self.intermediate_act_fn(hidden_states) | |||
return hidden_states | |||
class SbertOutput(nn.Module): | |||
def __init__(self, config): | |||
super().__init__() | |||
self.dense = nn.Linear(config.intermediate_size, config.hidden_size) | |||
self.LayerNorm = nn.LayerNorm( | |||
config.hidden_size, eps=config.layer_norm_eps) | |||
self.dropout = nn.Dropout(config.hidden_dropout_prob) | |||
def forward(self, hidden_states, input_tensor): | |||
hidden_states = self.dense(hidden_states) | |||
hidden_states = self.dropout(hidden_states) | |||
hidden_states = self.LayerNorm(hidden_states + input_tensor) | |||
return hidden_states | |||
class SbertLayer(nn.Module): | |||
def __init__(self, config): | |||
super().__init__() | |||
self.chunk_size_feed_forward = config.chunk_size_feed_forward | |||
self.seq_len_dim = 1 | |||
self.attention = SbertAttention(config) | |||
self.is_decoder = config.is_decoder | |||
self.add_cross_attention = config.add_cross_attention | |||
if self.add_cross_attention: | |||
if not self.is_decoder: | |||
raise ValueError( | |||
f'{self} should be used as a decoder model if cross attention is added' | |||
) | |||
self.crossattention = SbertAttention(config) | |||
self.intermediate = SbertIntermediate(config) | |||
self.output = SbertOutput(config) | |||
def forward( | |||
self, | |||
hidden_states, | |||
attention_mask=None, | |||
head_mask=None, | |||
encoder_hidden_states=None, | |||
encoder_attention_mask=None, | |||
past_key_value=None, | |||
output_attentions=False, | |||
): | |||
# decoder uni-directional self-attention cached key/values tuple is at positions 1,2 | |||
self_attn_past_key_value = past_key_value[: | |||
2] if past_key_value is not None else None | |||
self_attention_outputs = self.attention( | |||
hidden_states, | |||
attention_mask, | |||
head_mask, | |||
output_attentions=output_attentions, | |||
past_key_value=self_attn_past_key_value, | |||
) | |||
attention_output = self_attention_outputs[0] | |||
# if decoder, the last output is tuple of self-attn cache | |||
if self.is_decoder: | |||
outputs = self_attention_outputs[1:-1] | |||
present_key_value = self_attention_outputs[-1] | |||
else: | |||
outputs = self_attention_outputs[ | |||
1:] # add self attentions if we output attention weights | |||
cross_attn_present_key_value = None | |||
if self.is_decoder and encoder_hidden_states is not None: | |||
if not hasattr(self, 'crossattention'): | |||
raise ValueError( | |||
f'If `encoder_hidden_states` are passed, {self} has to be instantiated' | |||
f'with cross-attention layers by setting `config.add_cross_attention=True`' | |||
) | |||
# cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple | |||
cross_attn_past_key_value = past_key_value[ | |||
-2:] if past_key_value is not None else None | |||
cross_attention_outputs = self.crossattention( | |||
attention_output, | |||
attention_mask, | |||
head_mask, | |||
encoder_hidden_states, | |||
encoder_attention_mask, | |||
cross_attn_past_key_value, | |||
output_attentions, | |||
) | |||
attention_output = cross_attention_outputs[0] | |||
outputs = outputs + cross_attention_outputs[ | |||
1:-1] # add cross attentions if we output attention weights | |||
# add cross-attn cache to positions 3,4 of present_key_value tuple | |||
cross_attn_present_key_value = cross_attention_outputs[-1] | |||
present_key_value = present_key_value + cross_attn_present_key_value | |||
layer_output = apply_chunking_to_forward(self.feed_forward_chunk, | |||
self.chunk_size_feed_forward, | |||
self.seq_len_dim, | |||
attention_output) | |||
outputs = (layer_output, ) + outputs | |||
# if decoder, return the attn key/values as the last output | |||
if self.is_decoder: | |||
outputs = outputs + (present_key_value, ) | |||
return outputs | |||
def feed_forward_chunk(self, attention_output): | |||
intermediate_output = self.intermediate(attention_output) | |||
layer_output = self.output(intermediate_output, attention_output) | |||
return layer_output | |||
class SbertEncoder(nn.Module): | |||
def __init__(self, config): | |||
super().__init__() | |||
self.config = config | |||
self.layer = nn.ModuleList( | |||
[SbertLayer(config) for _ in range(config.num_hidden_layers)]) | |||
self.gradient_checkpointing = False | |||
def forward( | |||
self, | |||
hidden_states, | |||
attention_mask=None, | |||
head_mask=None, | |||
encoder_hidden_states=None, | |||
encoder_attention_mask=None, | |||
past_key_values=None, | |||
use_cache=None, | |||
output_attentions=False, | |||
output_hidden_states=False, | |||
return_dict=True, | |||
): | |||
all_hidden_states = () if output_hidden_states else None | |||
all_self_attentions = () if output_attentions else None | |||
all_cross_attentions = ( | |||
) if output_attentions and self.config.add_cross_attention else None | |||
next_decoder_cache = () if use_cache else None | |||
for i, layer_module in enumerate(self.layer): | |||
if output_hidden_states: | |||
all_hidden_states = all_hidden_states + (hidden_states, ) | |||
layer_head_mask = head_mask[i] if head_mask is not None else None | |||
past_key_value = past_key_values[ | |||
i] if past_key_values is not None else None | |||
if self.gradient_checkpointing and self.training: | |||
if use_cache: | |||
logger.warning( | |||
'`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...' | |||
) | |||
use_cache = False | |||
def create_custom_forward(module): | |||
def custom_forward(*inputs): | |||
return module(*inputs, past_key_value, | |||
output_attentions) | |||
return custom_forward | |||
layer_outputs = torch.utils.checkpoint.checkpoint( | |||
create_custom_forward(layer_module), | |||
hidden_states, | |||
attention_mask, | |||
layer_head_mask, | |||
encoder_hidden_states, | |||
encoder_attention_mask, | |||
) | |||
else: | |||
layer_outputs = layer_module( | |||
hidden_states, | |||
attention_mask, | |||
layer_head_mask, | |||
encoder_hidden_states, | |||
encoder_attention_mask, | |||
past_key_value, | |||
output_attentions, | |||
) | |||
hidden_states = layer_outputs[0] | |||
if use_cache: | |||
next_decoder_cache += (layer_outputs[-1], ) | |||
if output_attentions: | |||
all_self_attentions = all_self_attentions + ( | |||
layer_outputs[1], ) | |||
if self.config.add_cross_attention: | |||
all_cross_attentions = all_cross_attentions + ( | |||
layer_outputs[2], ) | |||
if output_hidden_states: | |||
all_hidden_states = all_hidden_states + (hidden_states, ) | |||
if not return_dict: | |||
return tuple(v for v in [ | |||
hidden_states, | |||
next_decoder_cache, | |||
all_hidden_states, | |||
all_self_attentions, | |||
all_cross_attentions, | |||
] if v is not None) | |||
return BaseModelOutputWithPastAndCrossAttentions( | |||
last_hidden_state=hidden_states, | |||
past_key_values=next_decoder_cache, | |||
hidden_states=all_hidden_states, | |||
attentions=all_self_attentions, | |||
cross_attentions=all_cross_attentions, | |||
) | |||
class SbertPooler(nn.Module): | |||
def __init__(self, config): | |||
super().__init__() | |||
self.dense = nn.Linear(config.hidden_size, config.hidden_size) | |||
self.activation = nn.Tanh() | |||
def forward(self, hidden_states): | |||
# We "pool" the model by simply taking the hidden state corresponding | |||
# to the first token. | |||
first_token_tensor = hidden_states[:, 0] | |||
pooled_output = self.dense(first_token_tensor) | |||
pooled_output = self.activation(pooled_output) | |||
return pooled_output | |||
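A quick shape check of the pooler (a minimal sketch, not part of the diff; SimpleNamespace stands in for a real SbertConfig):

import torch
from types import SimpleNamespace

# Stand-in config exposing only the field SbertPooler reads.
cfg = SimpleNamespace(hidden_size=8)
pooler = SbertPooler(cfg)
hidden_states = torch.randn(2, 5, 8)  # (batch, seq_len, hidden)
pooled = pooler(hidden_states)        # tanh(dense(h_[CLS]))
assert pooled.shape == (2, 8)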
@dataclass | |||
class SbertForPreTrainingOutput(ModelOutput): | |||
""" | |||
    Output type of :class:`~structbert.utils.SbertForPreTraining`.
Args: | |||
loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`): | |||
Total loss as the sum of the masked language modeling loss and the next sequence prediction | |||
(classification) loss. | |||
prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`): | |||
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). | |||
seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`): | |||
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation | |||
before SoftMax). | |||
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when | |||
``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``): | |||
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) | |||
of shape :obj:`(batch_size, sequence_length, hidden_size)`. | |||
Hidden-states of the model at the output of each layer plus the initial embedding outputs. | |||
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when | |||
``output_attentions=True`` is passed or when ``config.output_attentions=True``): | |||
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, | |||
sequence_length, sequence_length)`. | |||
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention | |||
heads. | |||
""" | |||
loss: Optional[torch.FloatTensor] = None | |||
prediction_logits: torch.FloatTensor = None | |||
seq_relationship_logits: torch.FloatTensor = None | |||
hidden_states: Optional[Tuple[torch.FloatTensor]] = None | |||
attentions: Optional[Tuple[torch.FloatTensor]] = None | |||
@dataclass | |||
class BaseModelOutputWithPoolingAndCrossAttentionsWithEmbedding( | |||
BaseModelOutputWithPoolingAndCrossAttentions): | |||
embedding_output: torch.FloatTensor = None | |||
logits: Optional[Union[tuple, torch.FloatTensor]] = None | |||
kwargs: dict = None |
@@ -6,10 +6,12 @@ from modelscope.utils.import_utils import LazyImportModule
 if TYPE_CHECKING:
     from .configuration_gpt3 import GPT3Config
     from .modeling_gpt3 import GPT3Model
+    from .gpt3_for_text_generation import GPT3ForTextGeneration
 else:
     _import_structure = {
         'configuration_gpt3': ['GPT3Config'],
-        'modeling_gpt3': ['GPT3Model']
+        'modeling_gpt3': ['GPT3Model'],
+        'gpt3_for_text_generation': ['GPT3ForTextGeneration'],
     }
     import sys
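For reference, the point of the added lazy entry is that the public import path below works without paying the import cost up front; the real module is only loaded when the name is first resolved (a sketch, assuming the package is importable):

# Resolving the attribute triggers the actual import of
# gpt3_for_text_generation through LazyImportModule.
from modelscope.models.nlp.gpt3 import GPT3ForTextGeneration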
@@ -20,7 +20,7 @@ class GPT3ForTextGeneration(TorchModel):
         """
         super().__init__(model_dir, *args, **kwargs)
-        from modelscope.models.nlp import GPT3Model
+        from modelscope.models.nlp.gpt3 import GPT3Model
         from transformers import BertTokenizer
         self.model = GPT3Model.from_pretrained(model_dir)
@@ -5,9 +5,11 @@ from modelscope.utils.import_utils import LazyImportModule
 if TYPE_CHECKING:
     from .sequence_classification_head import SequenceClassificationHead
+    from .torch_pretrain_head import BertMLMHead, RobertaMLMHead
 else:
     _import_structure = {
-        'sequence_classification_head': ['SequenceClassificationHead']
+        'sequence_classification_head': ['SequenceClassificationHead'],
+        'torch_pretrain_head': ['BertMLMHead', 'RobertaMLMHead'],
     }
     import sys
@@ -1,5 +1,4 @@
-import importlib
-from typing import Dict, List, Optional, Union
+from typing import Dict
 import torch
 import torch.nn.functional as F
@@ -0,0 +1,26 @@ | |||
from typing import Dict | |||
import torch | |||
from transformers.models.bert.modeling_bert import BertOnlyMLMHead | |||
from transformers.models.roberta.modeling_roberta import RobertaLMHead | |||
from modelscope.metainfo import Heads | |||
from modelscope.models.base import TorchHead | |||
from modelscope.models.builder import HEADS | |||
from modelscope.utils.constant import Tasks | |||
@HEADS.register_module(Tasks.fill_mask, module_name=Heads.bert_mlm) | |||
class BertMLMHead(BertOnlyMLMHead, TorchHead): | |||
def compute_loss(self, outputs: Dict[str, torch.Tensor], | |||
labels) -> Dict[str, torch.Tensor]: | |||
raise NotImplementedError() | |||
@HEADS.register_module(Tasks.fill_mask, module_name=Heads.roberta_mlm) | |||
class RobertaMLMHead(RobertaLMHead, TorchHead): | |||
def compute_loss(self, outputs: Dict[str, torch.Tensor], | |||
labels) -> Dict[str, torch.Tensor]: | |||
raise NotImplementedError() |
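A hedged sketch of how a further pretrain head could follow the same registration pattern; the module name string 'my-mlm' is hypothetical and would normally be a constant on Heads in metainfo:

@HEADS.register_module(Tasks.fill_mask, module_name='my-mlm')  # hypothetical name
class MyMLMHead(BertOnlyMLMHead, TorchHead):

    def compute_loss(self, outputs: Dict[str, torch.Tensor],
                     labels) -> Dict[str, torch.Tensor]:
        # Deferred, mirroring the two heads above.
        raise NotImplementedError()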
@@ -1,72 +1,115 @@
-from typing import Dict
+from typing import Any, Dict, Optional, Union
 import numpy as np
+from transformers import BertForMaskedLM as BertForMaskedLMTransformer
 from modelscope.metainfo import Models
-from modelscope.models import TorchModel
-from modelscope.models.base import Tensor
+from modelscope.models.base import TorchModel
 from modelscope.models.builder import MODELS
+from modelscope.models.nlp.structbert import SbertForMaskedLM
+from modelscope.models.nlp.veco import \
+    VecoForMaskedLM as VecoForMaskedLMTransformer
+from modelscope.outputs import OutputKeys
 from modelscope.utils.constant import Tasks
 __all__ = ['BertForMaskedLM', 'StructBertForMaskedLM', 'VecoForMaskedLM']
-class MaskedLanguageModelBase(TorchModel):
-    def __init__(self, model_dir: str, *args, **kwargs):
-        super().__init__(model_dir, *args, **kwargs)
-        self.model = self.build_model()
-    def build_model(self):
-        raise NotImplementedError()
-    def train(self):
-        return self.model.train()
-    def eval(self):
-        return self.model.eval()
-    @property
-    def config(self):
-        if hasattr(self.model, 'config'):
-            return self.model.config
-        return None
-    def forward(self, input: Dict[str, Tensor]) -> Dict[str, np.ndarray]:
-        """return the result by the model
-        Args:
-            input (Dict[str, Any]): the preprocessed data
-        Returns:
-            Dict[str, np.ndarray]: results
-        """
-        rst = self.model(
-            input_ids=input['input_ids'],
-            attention_mask=input['attention_mask'],
-            token_type_ids=input['token_type_ids'])
-        return {'logits': rst['logits'], 'input_ids': input['input_ids']}
 @MODELS.register_module(Tasks.fill_mask, module_name=Models.structbert)
-class StructBertForMaskedLM(MaskedLanguageModelBase):
-    def build_model(self):
-        from sofa import SbertForMaskedLM
-        return SbertForMaskedLM.from_pretrained(self.model_dir)
-@MODELS.register_module(Tasks.fill_mask, module_name=Models.veco)
-class VecoForMaskedLM(MaskedLanguageModelBase):
-    def build_model(self):
-        from sofa import VecoForMaskedLM
-        return VecoForMaskedLM.from_pretrained(self.model_dir)
+class StructBertForMaskedLM(TorchModel, SbertForMaskedLM):
+    def __init__(self, config, model_dir):
+        super(TorchModel, self).__init__(model_dir)
+        SbertForMaskedLM.__init__(self, config)
+    def forward(self,
+                input_ids=None,
+                attention_mask=None,
+                token_type_ids=None,
+                position_ids=None,
+                head_mask=None,
+                labels=None):
+        output = SbertForMaskedLM.forward(
+            self,
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            labels=labels)
+        output[OutputKeys.INPUT_IDS] = input_ids
+        return output
+    @classmethod
+    def _instantiate(cls, **kwargs):
+        model_dir = kwargs.get('model_dir')
+        return super(SbertForMaskedLM, StructBertForMaskedLM).from_pretrained(
+            pretrained_model_name_or_path=model_dir, model_dir=model_dir)
 @MODELS.register_module(Tasks.fill_mask, module_name=Models.bert)
-class BertForMaskedLM(MaskedLanguageModelBase):
+class BertForMaskedLM(TorchModel, BertForMaskedLMTransformer):
+    def __init__(self, config, model_dir):
+        super(TorchModel, self).__init__(model_dir)
+        BertForMaskedLMTransformer.__init__(self, config)
+    def forward(self,
+                input_ids=None,
+                attention_mask=None,
+                token_type_ids=None,
+                position_ids=None,
+                head_mask=None,
+                labels=None):
+        output = BertForMaskedLMTransformer.forward(
+            self,
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            labels=labels)
+        output[OutputKeys.INPUT_IDS] = input_ids
+        return output
+    @classmethod
+    def _instantiate(cls, **kwargs):
+        model_dir = kwargs.get('model_dir')
+        return super(BertForMaskedLMTransformer,
+                     BertForMaskedLM).from_pretrained(
+                         pretrained_model_name_or_path=model_dir,
+                         model_dir=model_dir)
-    def build_model(self):
-        from transformers import BertForMaskedLM
-        return BertForMaskedLM.from_pretrained(self.model_dir)
+@MODELS.register_module(Tasks.fill_mask, module_name=Models.veco)
+class VecoForMaskedLM(TorchModel, VecoForMaskedLMTransformer):
+    def __init__(self, config, model_dir):
+        super(TorchModel, self).__init__(model_dir)
+        VecoForMaskedLMTransformer.__init__(self, config)
+    def forward(self,
+                input_ids=None,
+                attention_mask=None,
+                token_type_ids=None,
+                position_ids=None,
+                head_mask=None,
+                labels=None):
+        output = VecoForMaskedLMTransformer.forward(
+            self,
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            labels=labels)
+        output[OutputKeys.INPUT_IDS] = input_ids
+        return output
+    @classmethod
+    def _instantiate(cls, **kwargs):
+        model_dir = kwargs.get('model_dir')
+        return super(VecoForMaskedLMTransformer,
+                     VecoForMaskedLM).from_pretrained(
+                         pretrained_model_name_or_path=model_dir,
+                         model_dir=model_dir)
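A hedged usage sketch of the rewritten classes ('/path/to/structbert' is a placeholder checkpoint directory; untested here): _instantiate routes through from_pretrained, and forward keeps the transformers signature while echoing the input ids for postprocessing.

import torch

model = StructBertForMaskedLM._instantiate(model_dir='/path/to/structbert')
ids = torch.ones(1, 8, dtype=torch.long)
out = model(
    input_ids=ids,
    attention_mask=torch.ones_like(ids),
    token_type_ids=torch.zeros_like(ids))
# Besides the usual MLM fields, `out` carries OutputKeys.INPUT_IDS.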
@@ -0,0 +1,43 @@ | |||
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors. | |||
# All rights reserved. | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
from typing import TYPE_CHECKING | |||
from modelscope.utils.import_utils import LazyImportModule | |||
if TYPE_CHECKING: | |||
from .configuration_palm import PalmConfig | |||
from .modeling_palm import ( | |||
AbsSummarizer, | |||
PalmForConditionalGeneration, | |||
Translator, | |||
) | |||
from .palm_for_text_generation import PalmForTextGeneration | |||
else: | |||
_import_structure = { | |||
'configuration_palm': ['PalmConfig'], | |||
'modeling_palm': | |||
['AbsSummarizer', 'PalmForConditionalGeneration', 'Translator'], | |||
'palm_for_text_generation': ['PalmForTextGeneration'], | |||
} | |||
import sys | |||
sys.modules[__name__] = LazyImportModule( | |||
__name__, | |||
globals()['__file__'], | |||
_import_structure, | |||
module_spec=__spec__, | |||
extra_objects={}, | |||
) |
@@ -0,0 +1,116 @@ | |||
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors. | |||
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. | |||
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
""" PALM model configuration """ | |||
from transformers.configuration_utils import PretrainedConfig | |||
from modelscope.utils import logger as logging | |||
logger = logging.get_logger(__name__) | |||
class PalmConfig(PretrainedConfig): | |||
r""" | |||
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model | |||
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. | |||
    Args:
        encoder (:obj:`str`, `optional`, defaults to :obj:`"roberta"`):
            Type of the pretrained encoder backbone, e.g. :obj:`"roberta"` or :obj:`"bert"`.
        encoder_pth (:obj:`str`, `optional`, defaults to :obj:`"roberta-base"`):
            Name or path of the pretrained encoder checkpoint to load.
        max_pos (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length the encoder is used with.
        share_emb (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to share the embedding weights between the encoder and the decoder.
        dec_layers (:obj:`int`, `optional`, defaults to 12):
            Number of hidden layers in the Transformer decoder.
        dec_hidden_size (:obj:`int`, `optional`, defaults to 768):
            Dimensionality of the decoder layers.
        dec_heads (:obj:`int`, `optional`, defaults to 8):
            Number of attention heads for each attention layer in the decoder.
        dec_ff_size (:obj:`int`, `optional`, defaults to 3072):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the decoder.
        dec_dropout (:obj:`float`, `optional`, defaults to 0.2):
            The dropout probability used in the decoder.
        use_bert_emb (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether to initialize the decoder embeddings from the encoder's embeddings.
        label_smoothing (:obj:`float`, `optional`, defaults to 0.1):
            Label smoothing factor applied to the generation loss.
        alpha (:obj:`float`, `optional`, defaults to 0.95):
            Length penalty used by the beam-search Translator.
        beam_size (:obj:`int`, `optional`, defaults to 5):
            Beam width used at generation time.
        min_length (:obj:`int`, `optional`, defaults to 40):
            Minimum length of a generated sequence.
        max_length (:obj:`int`, `optional`, defaults to 130):
            Maximum length of a generated sequence.
        sample_topk (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to sample from the top-k tokens instead of plain beam search.
        block_trigram (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to block repeated trigrams during generation.
Examples:: | |||
>>> from modelscope.models.nlp.palm_v2 import PalmForConditionalGeneration, PalmConfig | |||
>>> configuration = PalmConfig() | |||
>>> # Initializing a model from the configuration | |||
>>> model = PalmForConditionalGeneration(configuration) | |||
>>> # Accessing the model configuration | |||
>>> configuration = model.config | |||
""" | |||
model_type = 'palm' | |||
def __init__(self, | |||
encoder='roberta', | |||
encoder_pth='roberta-base', | |||
max_pos=512, | |||
share_emb=False, | |||
dec_layers=12, | |||
dec_hidden_size=768, | |||
dec_heads=8, | |||
dec_ff_size=3072, | |||
dec_dropout=0.2, | |||
use_bert_emb=True, | |||
label_smoothing=0.1, | |||
alpha=0.95, | |||
beam_size=5, | |||
min_length=40, | |||
max_length=130, | |||
sample_topk=False, | |||
block_trigram=False, | |||
**kwargs): | |||
super().__init__(**kwargs) | |||
self.encoder = encoder | |||
self.encoder_pth = encoder_pth | |||
self.max_pos = max_pos | |||
self.share_emb = share_emb | |||
self.dec_layers = dec_layers | |||
self.dec_hidden_size = dec_hidden_size | |||
self.dec_heads = dec_heads | |||
self.dec_ff_size = dec_ff_size | |||
self.dec_dropout = dec_dropout | |||
self.use_bert_emb = use_bert_emb | |||
self.label_smoothing = label_smoothing | |||
# Translator | |||
self.alpha = alpha | |||
self.beam_size = beam_size | |||
self.min_length = min_length | |||
self.max_length = max_length | |||
self.sample_topk = sample_topk | |||
self.block_trigram = block_trigram |
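Since the generation-time knobs live on the config, overriding them is plain keyword arguments:

configuration = PalmConfig(beam_size=3, min_length=20, max_length=80)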
@@ -0,0 +1,872 @@ | |||
# ============================================================================== | |||
# Copyright 2017 Baidu.com, Inc. All Rights Reserved | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================== | |||
""" | |||
This module computes evaluation metrics for DuReader dataset. | |||
""" | |||
import argparse | |||
import copy | |||
import math | |||
import re | |||
import sys | |||
import zipfile | |||
from collections import Counter, defaultdict | |||
import json | |||
import numpy as np | |||
from rouge import Rouge | |||
EMPTY = '' | |||
YESNO_LABELS = set(['Yes', 'No', 'Depends']) | |||
def my_lcs(string, sub): | |||
""" | |||
Calculates longest common subsequence for a pair of tokenized strings | |||
:param string : list of str : tokens from a string split using whitespace | |||
:param sub : list of str : shorter string, also split using whitespace | |||
    :returns: length (int): length of the longest common subsequence between the two strings
Note: my_lcs only gives length of the longest common subsequence, not the actual LCS | |||
""" | |||
if (len(string) < len(sub)): | |||
sub, string = string, sub | |||
lengths = [[0 for i in range(0, | |||
len(sub) + 1)] | |||
for j in range(0, | |||
len(string) + 1)] | |||
for j in range(1, len(sub) + 1): | |||
for i in range(1, len(string) + 1): | |||
if (string[i - 1] == sub[j - 1]): | |||
lengths[i][j] = lengths[i - 1][j - 1] + 1 | |||
else: | |||
lengths[i][j] = max(lengths[i - 1][j], lengths[i][j - 1]) | |||
return lengths[len(string)][len(sub)] | |||
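For example (tokens are whitespace-split, and only the LCS length is returned):

assert my_lcs('the cat sat'.split(), 'cat sat'.split()) == 2
assert my_lcs('a b c d'.split(), 'b d'.split()) == 2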
class Bleu: | |||
def __init__(self, n=4): | |||
        # default: compute BLEU score up to 4-grams
self._n = n | |||
self._hypo_for_image = {} | |||
self.ref_for_image = {} | |||
def compute_score(self, gts, res): | |||
assert (list(gts.keys()) == list(res.keys())) | |||
imgIds = list(gts.keys()) | |||
bleu_scorer = BleuScorer(n=self._n) | |||
for id in imgIds: | |||
hypo = res[id] | |||
ref = gts[id] | |||
# Sanity check. | |||
assert (type(hypo) is list) | |||
assert (len(hypo) == 1) | |||
assert (type(ref) is list) | |||
assert (len(ref) >= 1) | |||
bleu_scorer += (hypo[0], ref) | |||
score, scores = bleu_scorer.compute_score(option='closest', verbose=1) | |||
return score, scores | |||
def method(self): | |||
return 'Bleu' | |||
def precook(s, n=4, out=False): | |||
"""Takes a string as input and returns an object that can be given to | |||
either cook_refs or cook_test. This is optional: cook_refs and cook_test | |||
can take string arguments as well.""" | |||
words = s.split() | |||
counts = defaultdict(int) | |||
for k in range(1, n + 1): | |||
for i in range(len(words) - k + 1): | |||
ngram = tuple(words[i:i + k]) | |||
counts[ngram] += 1 | |||
return (len(words), counts) | |||
def cook_refs(refs, eff=None, n=4): # lhuang: oracle will call with "average" | |||
'''Takes a list of reference sentences for a single segment | |||
and returns an object that encapsulates everything that BLEU | |||
needs to know about them.''' | |||
reflen = [] | |||
maxcounts = {} | |||
for ref in refs: | |||
rl, counts = precook(ref, n) | |||
reflen.append(rl) | |||
for (ngram, count) in counts.items(): | |||
maxcounts[ngram] = max(maxcounts.get(ngram, 0), count) | |||
# Calculate effective reference sentence length. | |||
if eff == 'shortest': | |||
reflen = min(reflen) | |||
elif eff == 'average': | |||
reflen = float(sum(reflen)) / len(reflen) | |||
    # lhuang: N.B.: leave reflen computation to the very end!!
# lhuang: N.B.: in case of "closest", keep a list of reflens!! (bad design) | |||
return reflen, maxcounts | |||
def cook_test(test, xxx_todo_changeme, eff=None, n=4): | |||
'''Takes a test sentence and returns an object that | |||
encapsulates everything that BLEU needs to know about it.''' | |||
(reflen, refmaxcounts) = xxx_todo_changeme | |||
testlen, counts = precook(test, n, True) | |||
result = {} | |||
# Calculate effective reference sentence length. | |||
if eff == 'closest': | |||
result['reflen'] = min((abs(ref - testlen), ref) for ref in reflen)[1] | |||
else: # i.e., "average" or "shortest" or None | |||
result['reflen'] = reflen | |||
result['testlen'] = testlen | |||
result['guess'] = [max(0, testlen - k + 1) for k in range(1, n + 1)] | |||
result['correct'] = [0] * n | |||
for (ngram, count) in counts.items(): | |||
result['correct'][len(ngram) - 1] += min( | |||
refmaxcounts.get(ngram, 0), count) | |||
return result | |||
class BleuScorer(object): | |||
"""Bleu scorer. | |||
""" | |||
__slots__ = 'n', 'crefs', 'ctest', '_score', '_ratio', '_testlen', '_reflen', 'special_reflen' | |||
# special_reflen is used in oracle (proportional effective ref len for a node). | |||
def copy(self): | |||
''' copy the refs.''' | |||
new = BleuScorer(n=self.n) | |||
new.ctest = copy.copy(self.ctest) | |||
new.crefs = copy.copy(self.crefs) | |||
new._score = None | |||
return new | |||
def __init__(self, test=None, refs=None, n=4, special_reflen=None): | |||
''' singular instance ''' | |||
self.n = n | |||
self.crefs = [] | |||
self.ctest = [] | |||
self.cook_append(test, refs) | |||
self.special_reflen = special_reflen | |||
def cook_append(self, test, refs): | |||
'''called by constructor and __iadd__ to avoid creating new instances.''' | |||
if refs is not None: | |||
self.crefs.append(cook_refs(refs)) | |||
if test is not None: | |||
cooked_test = cook_test(test, self.crefs[-1]) | |||
self.ctest.append(cooked_test) # N.B.: -1 | |||
else: | |||
self.ctest.append( | |||
None) # lens of crefs and ctest have to match | |||
self._score = None # need to recompute | |||
def ratio(self, option=None): | |||
self.compute_score(option=option) | |||
return self._ratio | |||
def score_ratio(self, option=None): | |||
'''return (bleu, len_ratio) pair''' | |||
return (self.fscore(option=option), self.ratio(option=option)) | |||
def score_ratio_str(self, option=None): | |||
return '%.4f (%.2f)' % self.score_ratio(option) | |||
def reflen(self, option=None): | |||
self.compute_score(option=option) | |||
return self._reflen | |||
def testlen(self, option=None): | |||
self.compute_score(option=option) | |||
return self._testlen | |||
def retest(self, new_test): | |||
if type(new_test) is str: | |||
new_test = [new_test] | |||
assert len(new_test) == len(self.crefs), new_test | |||
self.ctest = [] | |||
for t, rs in zip(new_test, self.crefs): | |||
self.ctest.append(cook_test(t, rs)) | |||
self._score = None | |||
return self | |||
def rescore(self, new_test): | |||
''' replace test(s) with new test(s), and returns the new score.''' | |||
return self.retest(new_test).compute_score() | |||
def size(self): | |||
assert len(self.crefs) == len( | |||
self.ctest), 'refs/test mismatch! %d<>%d' % (len( | |||
self.crefs), len(self.ctest)) | |||
return len(self.crefs) | |||
def __iadd__(self, other): | |||
'''add an instance (e.g., from another sentence).''' | |||
if type(other) is tuple: | |||
# avoid creating new BleuScorer instances | |||
self.cook_append(other[0], other[1]) | |||
else: | |||
assert self.compatible(other), 'incompatible BLEUs.' | |||
self.ctest.extend(other.ctest) | |||
self.crefs.extend(other.crefs) | |||
self._score = None # need to recompute | |||
return self | |||
def compatible(self, other): | |||
return isinstance(other, BleuScorer) and self.n == other.n | |||
def single_reflen(self, option='average'): | |||
return self._single_reflen(self.crefs[0][0], option) | |||
def _single_reflen(self, reflens, option=None, testlen=None): | |||
if option == 'shortest': | |||
reflen = min(reflens) | |||
elif option == 'average': | |||
reflen = float(sum(reflens)) / len(reflens) | |||
elif option == 'closest': | |||
reflen = min((abs(ref - testlen), ref) for ref in reflens)[1] | |||
else: | |||
assert False, 'unsupported reflen option %s' % option | |||
return reflen | |||
def recompute_score(self, option=None, verbose=0): | |||
self._score = None | |||
return self.compute_score(option, verbose) | |||
def compute_score(self, option=None, verbose=0): | |||
n = self.n | |||
small = 1e-9 | |||
tiny = 1e-15 # so that if guess is 0 still return 0 | |||
bleu_list = [[] for _ in range(n)] | |||
if self._score is not None: | |||
return self._score | |||
if option is None: | |||
option = 'average' if len(self.crefs) == 1 else 'closest' | |||
self._testlen = 0 | |||
self._reflen = 0 | |||
totalcomps = { | |||
'testlen': 0, | |||
'reflen': 0, | |||
'guess': [0] * n, | |||
'correct': [0] * n | |||
} | |||
# for each sentence | |||
for comps in self.ctest: | |||
testlen = comps['testlen'] | |||
self._testlen += testlen | |||
if self.special_reflen is None: # need computation | |||
reflen = self._single_reflen(comps['reflen'], option, testlen) | |||
else: | |||
reflen = self.special_reflen | |||
self._reflen += reflen | |||
for key in ['guess', 'correct']: | |||
for k in range(n): | |||
totalcomps[key][k] += comps[key][k] | |||
# append per image bleu score | |||
bleu = 1. | |||
for k in range(n): | |||
bleu *= (float(comps['correct'][k]) + tiny) / ( | |||
float(comps['guess'][k]) + small) | |||
bleu_list[k].append(bleu**(1. / (k + 1))) | |||
ratio = (testlen + tiny) / (reflen + small | |||
) # N.B.: avoid zero division | |||
if ratio < 1: | |||
for k in range(n): | |||
bleu_list[k][-1] *= math.exp(1 - 1 / ratio) | |||
if verbose > 1: | |||
print(comps, reflen) | |||
totalcomps['reflen'] = self._reflen | |||
totalcomps['testlen'] = self._testlen | |||
bleus = [] | |||
bleu = 1. | |||
for k in range(n): | |||
bleu *= float(totalcomps['correct'][k] + tiny) / ( | |||
totalcomps['guess'][k] + small) | |||
bleus.append(bleu**(1. / (k + 1))) | |||
ratio = (self._testlen + tiny) / (self._reflen + small | |||
) # N.B.: avoid zero division | |||
if ratio < 1: | |||
for k in range(n): | |||
bleus[k] *= math.exp(1 - 1 / ratio) | |||
if verbose > 0: | |||
print(totalcomps) | |||
print('ratio:', ratio) | |||
self._score = bleus | |||
return self._score, bleu_list | |||
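A toy run of the scorer (keys are example ids; each hypothesis list must hold exactly one string, each reference list at least one):

gts = {'q1': ['the cat sat on the mat', 'a cat was on the mat']}
res = {'q1': ['the cat sat on the mat']}
score, per_example = Bleu(4).compute_score(gts, res)
# score is [Bleu-1, Bleu-2, Bleu-3, Bleu-4] aggregated over the corpus.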
def normalize(s): | |||
""" | |||
Normalize strings to space joined chars. | |||
Args: | |||
s: a list of strings. | |||
Returns: | |||
A list of normalized strings. | |||
""" | |||
if not s: | |||
return s | |||
normalized = [] | |||
for ss in s: | |||
tokens = [c for c in list(ss) if len(c.strip()) != 0] | |||
normalized.append(' '.join(tokens)) | |||
return normalized | |||
def data_check(obj, task): | |||
""" | |||
Check data. | |||
Raises: | |||
Raises AssertionError when data is not legal. | |||
""" | |||
assert 'question_id' in obj, "Missing 'question_id' field." | |||
assert 'question_type' in obj, \ | |||
"Missing 'question_type' field. question_id: {}".format(obj['question_type']) | |||
assert 'yesno_answers' in obj, \ | |||
"Missing 'yesno_answers' field. question_id: {}".format(obj['question_id']) | |||
assert isinstance(obj['yesno_answers'], list), \ | |||
r"""'yesno_answers' field must be a list, if the 'question_type' is not | |||
'YES_NO', then this field should be an empty list. | |||
question_id: {}""".format(obj['question_id']) | |||
assert 'entity_answers' in obj, \ | |||
"Missing 'entity_answers' field. question_id: {}".format(obj['question_id']) | |||
assert isinstance( | |||
obj['entity_answers'], | |||
list) and len(obj['entity_answers']) > 0, r"""'entity_answers' field | |||
        must be a list with at least one element, which can be an empty list.
question_id: {}""".format(obj['question_id']) | |||
def read_file(file_name, task, is_ref=False): | |||
""" | |||
Read predict answers or reference answers from file. | |||
Args: | |||
file_name: the name of the file containing predict result or reference | |||
result. | |||
Returns: | |||
A dictionary mapping question_id to the result information. The result | |||
        information itself is also a dictionary which has four keys:
- question_type: type of the query. | |||
- yesno_answers: A list of yesno answers corresponding to 'answers'. | |||
- answers: A list of predicted answers. | |||
- entity_answers: A list, each element is also a list containing the entities | |||
tagged out from the corresponding answer string. | |||
""" | |||
def _open(file_name, mode, zip_obj=None): | |||
if zip_obj is not None: | |||
return zip_obj.open(file_name, mode) | |||
return open(file_name, mode) | |||
results = {} | |||
keys = ['answers', 'yesno_answers', 'entity_answers', 'question_type'] | |||
if is_ref: | |||
keys += ['source'] | |||
zf = zipfile.ZipFile(file_name, | |||
'r') if file_name.endswith('.zip') else None | |||
file_list = [file_name] if zf is None else zf.namelist() | |||
for fn in file_list: | |||
for line in _open(fn, 'r', zip_obj=zf): | |||
try: | |||
obj = json.loads(line.strip()) | |||
except ValueError: | |||
raise ValueError('Every line of data should be legal json') | |||
data_check(obj, task) | |||
qid = obj['question_id'] | |||
assert qid not in results, 'Duplicate question_id: {}'.format(qid) | |||
results[qid] = {} | |||
for k in keys: | |||
results[qid][k] = obj[k] | |||
return results | |||
def compute_bleu_rouge(pred_dict, ref_dict, bleu_order=4): | |||
""" | |||
Compute bleu and rouge scores. | |||
""" | |||
assert set(pred_dict.keys()) == set(ref_dict.keys()), \ | |||
'missing keys: {}'.format(set(ref_dict.keys()) - set(pred_dict.keys())) | |||
scores = {} | |||
bleu_scores, _ = Bleu(bleu_order).compute_score(ref_dict, pred_dict) | |||
for i, bleu_score in enumerate(bleu_scores): | |||
scores['Bleu-%d' % (i + 1)] = bleu_score | |||
# rouge_score, _ = Rouge().compute_score(ref_dict, pred_dict) | |||
rouge_score = Rouge().get_scores( | |||
list(map(lambda x: x[0], pred_dict.values())), | |||
list(map(lambda x: x[0], ref_dict.values()))) | |||
rouge_score = sum([d['rouge-l']['f'] | |||
for d in rouge_score]) / len(rouge_score) | |||
scores['Rouge-L'] = rouge_score | |||
return scores | |||
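Toy usage (both dicts map a qid to a single-answer list; the rouge package must be installed):

pred = {'1': ['the cat sat']}
ref = {'1': ['the cat sat on the mat']}
print(compute_bleu_rouge(pred, ref))  # {'Bleu-1': ..., 'Bleu-4': ..., 'Rouge-L': ...}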
def local_prf(pred_list, ref_list): | |||
""" | |||
Compute local precision recall and f1-score, | |||
given only one prediction list and one reference list | |||
""" | |||
common = Counter(pred_list) & Counter(ref_list) | |||
num_same = sum(common.values()) | |||
if num_same == 0: | |||
return 0, 0, 0 | |||
p = 1.0 * num_same / len(pred_list) | |||
r = 1.0 * num_same / len(ref_list) | |||
f1 = (2 * p * r) / (p + r) | |||
return p, r, f1 | |||
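Worked example: two of the three predicted tokens appear in the reference list:

p, r, f1 = local_prf(['a', 'b', 'c'], ['a', 'b', 'd', 'e'])
# p = 2/3, r = 2/4, f1 = 2*p*r/(p+r) = 4/7 ≈ 0.571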
def compute_prf(pred_dict, ref_dict): | |||
""" | |||
Compute precision recall and f1-score. | |||
""" | |||
# pred_question_ids = set(pred_dict.keys()) | |||
ref_question_ids = set(ref_dict.keys()) | |||
correct_preds, total_correct, total_preds = 0, 0, 0 | |||
for question_id in ref_question_ids: | |||
pred_entity_list = pred_dict.get(question_id, [[]]) | |||
assert len(pred_entity_list) == 1, \ | |||
'the number of entity list for question_id {} is not 1.'.format(question_id) | |||
pred_entity_list = pred_entity_list[0] | |||
all_ref_entity_lists = ref_dict[question_id] | |||
best_local_f1 = 0 | |||
best_ref_entity_list = None | |||
for ref_entity_list in all_ref_entity_lists: | |||
local_f1 = local_prf(pred_entity_list, ref_entity_list)[2] | |||
if local_f1 > best_local_f1: | |||
best_ref_entity_list = ref_entity_list | |||
best_local_f1 = local_f1 | |||
if best_ref_entity_list is None: | |||
if len(all_ref_entity_lists) > 0: | |||
best_ref_entity_list = sorted( | |||
all_ref_entity_lists, key=lambda x: len(x))[0] | |||
else: | |||
best_ref_entity_list = [] | |||
gold_entities = set(best_ref_entity_list) | |||
pred_entities = set(pred_entity_list) | |||
correct_preds += len(gold_entities & pred_entities) | |||
total_preds += len(pred_entities) | |||
total_correct += len(gold_entities) | |||
p = float(correct_preds) / total_preds if correct_preds > 0 else 0 | |||
r = float(correct_preds) / total_correct if correct_preds > 0 else 0 | |||
f1 = 2 * p * r / (p + r) if correct_preds > 0 else 0 | |||
return {'Precision': p, 'Recall': r, 'F1': f1} | |||
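Toy usage (each prediction holds exactly one entity list; references may hold several, and the best-F1 one is chosen per question):

pred = {'q1': [['apple', 'banana']]}
ref = {'q1': [['apple'], ['banana', 'cherry']]}
print(compute_prf(pred, ref))  # {'Precision': 0.5, 'Recall': 1.0, 'F1': 0.666...}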
def prepare_prf(pred_dict, ref_dict): | |||
""" | |||
Prepares data for calculation of prf scores. | |||
""" | |||
preds = {k: v['entity_answers'] for k, v in pred_dict.items()} | |||
refs = {k: v['entity_answers'] for k, v in ref_dict.items()} | |||
return preds, refs | |||
def filter_dict(result_dict, key_tag): | |||
""" | |||
    Filter a subset of the result_dict, where keys end with 'key_tag'.
""" | |||
filtered = {} | |||
for k, v in result_dict.items(): | |||
if k.endswith(key_tag): | |||
filtered[k] = v | |||
return filtered | |||
def get_metrics(pred_result, ref_result, task, source): | |||
""" | |||
Computes metrics. | |||
""" | |||
metrics = {} | |||
ref_result_filtered = {} | |||
pred_result_filtered = {} | |||
if source == 'both': | |||
ref_result_filtered = ref_result | |||
pred_result_filtered = pred_result | |||
else: | |||
for question_id, info in ref_result.items(): | |||
if info['source'] == source: | |||
ref_result_filtered[question_id] = info | |||
if question_id in pred_result: | |||
pred_result_filtered[question_id] = pred_result[ | |||
question_id] | |||
if task == 'main' or task == 'all' \ | |||
or task == 'description': | |||
pred_dict, ref_dict = prepare_bleu(pred_result_filtered, | |||
ref_result_filtered, task) | |||
metrics = compute_bleu_rouge(pred_dict, ref_dict) | |||
elif task == 'yesno': | |||
pred_dict, ref_dict = prepare_bleu(pred_result_filtered, | |||
ref_result_filtered, task) | |||
keys = ['Yes', 'No', 'Depends'] | |||
preds = [filter_dict(pred_dict, k) for k in keys] | |||
refs = [filter_dict(ref_dict, k) for k in keys] | |||
metrics = compute_bleu_rouge(pred_dict, ref_dict) | |||
for k, pred, ref in zip(keys, preds, refs): | |||
m = compute_bleu_rouge(pred, ref) | |||
k_metric = [(k + '|' + key, v) for key, v in m.items()] | |||
metrics.update(k_metric) | |||
elif task == 'entity': | |||
pred_dict, ref_dict = prepare_prf(pred_result_filtered, | |||
ref_result_filtered) | |||
pred_dict_bleu, ref_dict_bleu = prepare_bleu(pred_result_filtered, | |||
ref_result_filtered, task) | |||
metrics = compute_prf(pred_dict, ref_dict) | |||
metrics.update(compute_bleu_rouge(pred_dict_bleu, ref_dict_bleu)) | |||
else: | |||
raise ValueError('Illegal task name: {}'.format(task)) | |||
return metrics | |||
def prepare_bleu(pred_result, ref_result, task): | |||
""" | |||
Prepares data for calculation of bleu and rouge scores. | |||
""" | |||
pred_list, ref_list = [], [] | |||
qids = ref_result.keys() | |||
for qid in qids: | |||
if task == 'main': | |||
pred, ref = get_main_result(qid, pred_result, ref_result) | |||
elif task == 'yesno': | |||
pred, ref = get_yesno_result(qid, pred_result, ref_result) | |||
elif task == 'all': | |||
pred, ref = get_all_result(qid, pred_result, ref_result) | |||
elif task == 'entity': | |||
pred, ref = get_entity_result(qid, pred_result, ref_result) | |||
elif task == 'description': | |||
pred, ref = get_desc_result(qid, pred_result, ref_result) | |||
else: | |||
raise ValueError('Illegal task name: {}'.format(task)) | |||
if pred and ref: | |||
pred_list += pred | |||
ref_list += ref | |||
pred_dict = dict(pred_list) | |||
ref_dict = dict(ref_list) | |||
for qid, ans in ref_dict.items(): | |||
ref_dict[qid] = normalize(ref_dict[qid]) | |||
pred_dict[qid] = normalize(pred_dict.get(qid, [EMPTY])) | |||
if not ans or ans == [EMPTY]: | |||
del ref_dict[qid] | |||
del pred_dict[qid] | |||
for k, v in pred_dict.items(): | |||
assert len(v) == 1, \ | |||
'There should be only one predict answer. question_id: {}'.format(k) | |||
return pred_dict, ref_dict | |||
def get_main_result(qid, pred_result, ref_result): | |||
""" | |||
Prepare answers for task 'main'. | |||
Args: | |||
qid: question_id. | |||
        pred_result: A dict including each question_id's result information,
            read from args.pred_file.
        ref_result: A dict including each question_id's result information,
            read from args.ref_file.
Returns: | |||
Two lists, the first one contains predict result, the second | |||
one contains reference result of the same question_id. Each list has | |||
elements of tuple (question_id, answers), 'answers' is a list of strings. | |||
""" | |||
ref_ans = ref_result[qid]['answers'] | |||
if not ref_ans: | |||
ref_ans = [EMPTY] | |||
pred_ans = pred_result.get(qid, {}).get('answers', [])[:1] | |||
if not pred_ans: | |||
pred_ans = [EMPTY] | |||
return [(qid, pred_ans)], [(qid, ref_ans)] | |||
def get_entity_result(qid, pred_result, ref_result): | |||
""" | |||
Prepare answers for task 'entity'. | |||
Args: | |||
qid: question_id. | |||
        pred_result: A dict including each question_id's result information,
            read from args.pred_file.
        ref_result: A dict including each question_id's result information,
            read from args.ref_file.
Returns: | |||
Two lists, the first one contains predict result, the second | |||
one contains reference result of the same question_id. Each list has | |||
elements of tuple (question_id, answers), 'answers' is a list of strings. | |||
""" | |||
if ref_result[qid]['question_type'] != 'ENTITY': | |||
return None, None | |||
return get_main_result(qid, pred_result, ref_result) | |||
def get_desc_result(qid, pred_result, ref_result): | |||
""" | |||
Prepare answers for task 'description'. | |||
Args: | |||
qid: question_id. | |||
        pred_result: A dict including each question_id's result information,
            read from args.pred_file.
        ref_result: A dict including each question_id's result information,
            read from args.ref_file.
Returns: | |||
Two lists, the first one contains predict result, the second | |||
one contains reference result of the same question_id. Each list has | |||
elements of tuple (question_id, answers), 'answers' is a list of strings. | |||
""" | |||
if ref_result[qid]['question_type'] != 'DESCRIPTION': | |||
return None, None | |||
return get_main_result(qid, pred_result, ref_result) | |||
def get_yesno_result(qid, pred_result, ref_result): | |||
""" | |||
Prepare answers for task 'yesno'. | |||
Args: | |||
qid: question_id. | |||
        pred_result: A dict including each question_id's result information,
            read from args.pred_file.
        ref_result: A dict including each question_id's result information,
            read from args.ref_file.
Returns: | |||
Two lists, the first one contains predict result, the second | |||
one contains reference result of the same question_id. Each list has | |||
elements of tuple (question_id, answers), 'answers' is a list of strings. | |||
""" | |||
def _uniq(li, is_ref): | |||
uniq_li = [] | |||
left = [] | |||
keys = set() | |||
for k, v in li: | |||
if k not in keys: | |||
uniq_li.append((k, v)) | |||
keys.add(k) | |||
else: | |||
left.append((k, v)) | |||
if is_ref: | |||
dict_li = dict(uniq_li) | |||
for k, v in left: | |||
dict_li[k] += v | |||
uniq_li = [(k, v) for k, v in dict_li.items()] | |||
return uniq_li | |||
def _expand_result(uniq_li): | |||
expanded = uniq_li[:] | |||
keys = set([x[0] for x in uniq_li]) | |||
for k in YESNO_LABELS - keys: | |||
expanded.append((k, [EMPTY])) | |||
return expanded | |||
def _get_yesno_ans(qid, result_dict, is_ref=False): | |||
if qid not in result_dict: | |||
return [(str(qid) + '_' + k, v) for k, v in _expand_result([])] | |||
yesno_answers = result_dict[qid]['yesno_answers'] | |||
answers = result_dict[qid]['answers'] | |||
lbl_ans = _uniq([(k, [v]) for k, v in zip(yesno_answers, answers)], | |||
is_ref) | |||
ret = [(str(qid) + '_' + k, v) for k, v in _expand_result(lbl_ans)] | |||
return ret | |||
if ref_result[qid]['question_type'] != 'YES_NO': | |||
return None, None | |||
ref_ans = _get_yesno_ans(qid, ref_result, is_ref=True) | |||
pred_ans = _get_yesno_ans(qid, pred_result) | |||
return pred_ans, ref_ans | |||
def get_all_result(qid, pred_result, ref_result): | |||
""" | |||
Prepare answers for task 'all'. | |||
Args: | |||
qid: question_id. | |||
        pred_result: A dict including each question_id's result information,
            read from args.pred_file.
        ref_result: A dict including each question_id's result information,
            read from args.ref_file.
Returns: | |||
Two lists, the first one contains predict result, the second | |||
one contains reference result of the same question_id. Each list has | |||
elements of tuple (question_id, answers), 'answers' is a list of strings. | |||
""" | |||
if ref_result[qid]['question_type'] == 'YES_NO': | |||
return get_yesno_result(qid, pred_result, ref_result) | |||
return get_main_result(qid, pred_result, ref_result) | |||
def format_metrics(metrics, task, err_msg): | |||
""" | |||
    Format metrics. The 'err' field reports any error that occurred during evaluation.
Args: | |||
metrics: A dict object contains metrics for different tasks. | |||
task: Task name. | |||
err_msg: Exception raised during evaluation. | |||
Returns: | |||
Formatted result. | |||
""" | |||
result = {} | |||
sources = ['both', 'search', 'zhidao'] | |||
if err_msg is not None: | |||
return {'errorMsg': str(err_msg), 'errorCode': 1, 'data': []} | |||
data = [] | |||
if task != 'all' and task != 'main': | |||
sources = ['both'] | |||
if task == 'entity': | |||
metric_names = ['Bleu-4', 'Rouge-L'] | |||
metric_names_prf = ['F1', 'Precision', 'Recall'] | |||
for name in metric_names + metric_names_prf: | |||
for src in sources: | |||
obj = { | |||
'name': name, | |||
'value': round(metrics[src].get(name, 0) * 100, 2), | |||
'type': src, | |||
} | |||
data.append(obj) | |||
elif task == 'yesno': | |||
metric_names = ['Bleu-4', 'Rouge-L'] | |||
details = ['Yes', 'No', 'Depends'] | |||
src = sources[0] | |||
for name in metric_names: | |||
obj = { | |||
'name': name, | |||
'value': round(metrics[src].get(name, 0) * 100, 2), | |||
'type': 'All', | |||
} | |||
data.append(obj) | |||
for d in details: | |||
obj = { | |||
'name': name, | |||
'value': round(metrics[src].get(d + '|' + name, 0) * 100, | |||
2), | |||
'type': d | |||
} | |||
data.append(obj) | |||
else: | |||
metric_names = ['Bleu-4', 'Rouge-L'] | |||
for name in metric_names: | |||
for src in sources: | |||
obj = { | |||
'name': name, | |||
'value': round(metrics[src].get(name, 0) * 100, 2), | |||
'type': src | |||
} | |||
data.append(obj) | |||
result['data'] = data | |||
result['errorCode'] = 0 | |||
result['errorMsg'] = 'success' | |||
return result | |||
def main(args): | |||
""" | |||
Do evaluation. | |||
""" | |||
err = None | |||
metrics = {} | |||
try: | |||
pred_result = read_file(args.pred_file, args.task) | |||
ref_result = read_file(args.ref_file, args.task, is_ref=True) | |||
sources = ['both', 'search', 'zhidao'] | |||
if args.task not in set(['main', 'all']): | |||
sources = sources[:1] | |||
for source in sources: | |||
metrics[source] = get_metrics(pred_result, ref_result, args.task, | |||
source) | |||
except ValueError as ve: | |||
err = ve | |||
except AssertionError as ae: | |||
err = ae | |||
    print(
        json.dumps(
            format_metrics(metrics, args.task, err), ensure_ascii=False))
if __name__ == '__main__': | |||
parser = argparse.ArgumentParser() | |||
parser.add_argument('pred_file', help='predict file') | |||
parser.add_argument('ref_file', help='reference file') | |||
parser.add_argument( | |||
'task', help='task name: Main|Yes_No|All|Entity|Description') | |||
args = parser.parse_args() | |||
args.task = args.task.lower().replace('_', '') | |||
main(args) |
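Assuming the module is saved as dureader_eval.py (file name assumed), a typical invocation looks like:

# python dureader_eval.py <pred_file> <ref_file> <task>
# e.g. python dureader_eval.py pred.json ref.json Yes_No
# The task argument is lower-cased and underscores are stripped, so
# Main|Yes_No|All|Entity|Description normalize to the internal task names.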
@@ -22,8 +22,8 @@ class PalmForTextGeneration(TorchModel): | |||
""" | |||
super().__init__(model_dir, *args, **kwargs) | |||
from sofa.models.palm_v2 import (PalmForConditionalGeneration, | |||
Translator) | |||
from modelscope.models.nlp.palm_v2 import ( | |||
PalmForConditionalGeneration, Translator) | |||
self.model = PalmForConditionalGeneration.from_pretrained(model_dir) | |||
self.tokenizer = self.model.tokenizer | |||
self.generator = Translator(self.model) |
@@ -1,23 +0,0 @@ | |||
from modelscope.metainfo import Models | |||
from modelscope.models.builder import MODELS | |||
from modelscope.utils.constant import Tasks | |||
from .sbert_for_sequence_classification import \ | |||
SbertForSequenceClassificationBase | |||
__all__ = ['SbertForNLI'] | |||
@MODELS.register_module(Tasks.nli, module_name=Models.structbert) | |||
class SbertForNLI(SbertForSequenceClassificationBase): | |||
def __init__(self, model_dir: str, *args, **kwargs): | |||
"""initialize the text generation model from the `model_dir` path. | |||
Args: | |||
model_dir (str): the model path. | |||
model_cls (Optional[Any], optional): model loader, if None, use the | |||
default loader to load model weights, by default None. | |||
""" | |||
super().__init__( | |||
model_dir, *args, model_args={'num_labels': 3}, **kwargs) | |||
assert self.model.config.num_labels == 3 |
@@ -1,25 +0,0 @@ | |||
from modelscope.metainfo import Models | |||
from modelscope.models.builder import MODELS | |||
from modelscope.utils.constant import Tasks | |||
from .sbert_for_sequence_classification import \ | |||
SbertForSequenceClassificationBase | |||
__all__ = ['SbertForSentenceSimilarity'] | |||
@MODELS.register_module( | |||
Tasks.sentence_similarity, module_name=Models.structbert) | |||
class SbertForSentenceSimilarity(SbertForSequenceClassificationBase): | |||
def __init__(self, model_dir: str, *args, **kwargs): | |||
"""initialize the sentence similarity model from the `model_dir` path. | |||
Args: | |||
model_dir (str): the model path. | |||
model_cls (Optional[Any], optional): model loader, if None, use the | |||
default loader to load model weights, by default None. | |||
""" | |||
super().__init__( | |||
model_dir, *args, model_args={'num_labels': 2}, **kwargs) | |||
self.model_dir = model_dir | |||
assert self.model.config.num_labels == 2 |
@@ -1,22 +0,0 @@ | |||
from modelscope.metainfo import Models | |||
from modelscope.models.builder import MODELS | |||
from modelscope.utils.constant import Tasks | |||
from .sbert_for_sequence_classification import \ | |||
SbertForSequenceClassificationBase | |||
__all__ = ['SbertForSentimentClassification'] | |||
@MODELS.register_module( | |||
Tasks.sentiment_classification, module_name=Models.structbert) | |||
class SbertForSentimentClassification(SbertForSequenceClassificationBase): | |||
def __init__(self, model_dir: str, *args, **kwargs): | |||
"""initialize the text generation model from the `model_dir` path. | |||
Args: | |||
model_dir (str): the model path. | |||
""" | |||
super().__init__( | |||
model_dir, *args, model_args={'num_labels': 2}, **kwargs) | |||
assert self.model.config.num_labels == 2 |
@@ -1,82 +0,0 @@ | |||
import os | |||
from typing import Any, Dict | |||
import json | |||
import numpy as np | |||
import torch | |||
from sofa.models.sbert.modeling_sbert import SbertModel, SbertPreTrainedModel | |||
from torch import nn | |||
from modelscope.models import TorchModel | |||
class SbertTextClassfier(SbertPreTrainedModel): | |||
def __init__(self, config): | |||
super().__init__(config) | |||
self.num_labels = config.num_labels | |||
self.config = config | |||
self.encoder = SbertModel(config, add_pooling_layer=True) | |||
self.dropout = nn.Dropout(config.hidden_dropout_prob) | |||
self.classifier = nn.Linear(config.hidden_size, config.num_labels) | |||
def forward(self, | |||
input_ids=None, | |||
token_type_ids=None, | |||
labels=None, | |||
**kwargs): | |||
outputs = self.encoder( | |||
input_ids, | |||
token_type_ids=token_type_ids, | |||
return_dict=None, | |||
) | |||
pooled_output = outputs[1] | |||
pooled_output = self.dropout(pooled_output) | |||
logits = self.classifier(pooled_output) | |||
if labels is not None: | |||
loss_fct = nn.CrossEntropyLoss() | |||
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) | |||
return {'logits': logits, 'loss': loss} | |||
return {'logits': logits} | |||
def build(**kwags): | |||
return SbertTextClassfier.from_pretrained(model_dir, **model_args) | |||
class SbertForSequenceClassificationBase(TorchModel): | |||
def __init__(self, model_dir: str, model_args=None, *args, **kwargs): | |||
super().__init__(model_dir, *args, **kwargs) | |||
if model_args is None: | |||
model_args = {} | |||
self.model = SbertTextClassfier.from_pretrained( | |||
model_dir, **model_args) | |||
self.id2label = {} | |||
self.label_path = os.path.join(self.model_dir, 'label_mapping.json') | |||
if os.path.exists(self.label_path): | |||
with open(self.label_path) as f: | |||
self.label_mapping = json.load(f) | |||
self.id2label = { | |||
idx: name | |||
for name, idx in self.label_mapping.items() | |||
} | |||
def train(self): | |||
return self.model.train() | |||
def eval(self): | |||
return self.model.eval() | |||
def forward(self, input: Dict[str, Any]) -> Dict[str, np.ndarray]: | |||
input_ids = torch.tensor(input['input_ids'], dtype=torch.long) | |||
token_type_ids = torch.tensor( | |||
input['token_type_ids'], dtype=torch.long) | |||
return self.model.forward(input_ids, token_type_ids) | |||
def postprocess(self, input, **kwargs): | |||
logits = input['logits'] | |||
probs = logits.softmax(-1).cpu().numpy() | |||
pred = logits.argmax(-1).cpu().numpy() | |||
logits = logits.cpu().numpy() | |||
res = {'predictions': pred, 'probabilities': probs, 'logits': logits} | |||
return res |
@@ -1,64 +0,0 @@ | |||
from typing import Any, Dict, Union | |||
import numpy as np | |||
import torch | |||
from modelscope.metainfo import Models | |||
from modelscope.models import TorchModel | |||
from modelscope.models.base import Tensor | |||
from modelscope.models.builder import MODELS | |||
from modelscope.utils.constant import Tasks | |||
__all__ = ['SbertForTokenClassification'] | |||
@MODELS.register_module(Tasks.word_segmentation, module_name=Models.structbert) | |||
class SbertForTokenClassification(TorchModel): | |||
def __init__(self, model_dir: str, *args, **kwargs): | |||
"""initialize the word segmentation model from the `model_dir` path. | |||
Args: | |||
model_dir (str): the model path. | |||
model_cls (Optional[Any], optional): model loader, if None, use the | |||
default loader to load model weights, by default None. | |||
""" | |||
super().__init__(model_dir, *args, **kwargs) | |||
self.model_dir = model_dir | |||
import sofa | |||
self.model = sofa.SbertForTokenClassification.from_pretrained( | |||
self.model_dir) | |||
self.config = sofa.SbertConfig.from_pretrained(self.model_dir) | |||
def train(self): | |||
return self.model.train() | |||
def eval(self): | |||
return self.model.eval() | |||
def forward(self, input: Dict[str, | |||
Any]) -> Dict[str, Union[str, np.ndarray]]: | |||
"""return the result by the model | |||
Args: | |||
input (Dict[str, Any]): the preprocessed data | |||
Returns: | |||
Dict[str, Union[str,np.ndarray]]: results | |||
Example: | |||
{ | |||
'predictions': array([1,4]), # lable 0-negative 1-positive | |||
'logits': array([[-0.53860897, 1.5029076 ]], dtype=float32) # true value | |||
'text': str(今天), | |||
} | |||
""" | |||
input_ids = torch.tensor(input['input_ids']).unsqueeze(0) | |||
return {**self.model(input_ids), 'text': input['text']} | |||
def postprocess(self, input: Dict[str, Tensor], | |||
**kwargs) -> Dict[str, Tensor]: | |||
logits = input['logits'] | |||
pred = torch.argmax(logits[0], dim=-1) | |||
pred = pred.cpu().numpy() | |||
rst = {'predictions': pred, 'logits': logits, 'text': input['text']} | |||
return rst |
@@ -1,50 +0,0 @@ | |||
from typing import Any, Dict | |||
import numpy as np | |||
from modelscope.metainfo import Models | |||
from modelscope.models import TorchModel | |||
from modelscope.models.builder import MODELS | |||
from modelscope.utils.constant import Tasks | |||
__all__ = ['SbertForZeroShotClassification'] | |||
@MODELS.register_module( | |||
Tasks.zero_shot_classification, module_name=Models.structbert) | |||
class SbertForZeroShotClassification(TorchModel): | |||
def __init__(self, model_dir: str, *args, **kwargs): | |||
"""initialize the zero shot classification model from the `model_dir` path. | |||
Args: | |||
model_dir (str): the model path. | |||
""" | |||
super().__init__(model_dir, *args, **kwargs) | |||
from sofa import SbertForSequenceClassification | |||
self.model = SbertForSequenceClassification.from_pretrained(model_dir) | |||
def train(self): | |||
return self.model.train() | |||
def eval(self): | |||
return self.model.eval() | |||
def forward(self, input: Dict[str, Any]) -> Dict[str, np.ndarray]: | |||
"""return the result by the model | |||
Args: | |||
input (Dict[str, Any]): the preprocessed data | |||
Returns: | |||
Dict[str, np.ndarray]: results | |||
Example: | |||
{ | |||
'logits': array([[-0.53860897, 1.5029076 ]], dtype=float32) # true value | |||
} | |||
""" | |||
outputs = self.model(**input) | |||
logits = outputs['logits'].cpu().numpy() | |||
res = {'logits': logits} | |||
return res |
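# Hedged sketch: zero-shot classification is typically derived from these
# logits by scoring each candidate label as an NLI hypothesis against the
# premise and normalizing the entailment scores across labels; using index 1
# as the entailment dimension is an assumption here.
import numpy as np

logits = np.array([[-0.5, 1.5], [0.7, -0.2]])  # one row per candidate label
entail = logits[:, 1]
label_probs = np.exp(entail) / np.exp(entail).sum()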
@@ -1,85 +1,174 @@ | |||
import os | |||
from typing import Any, Dict | |||
from abc import abstractmethod | |||
import json | |||
import numpy as np | |||
from torch import nn | |||
from modelscope.metainfo import TaskModels | |||
from modelscope.metainfo import Models | |||
from modelscope.models.base import TorchModel | |||
from modelscope.models.builder import MODELS | |||
from modelscope.models.nlp.structbert import SbertPreTrainedModel | |||
from modelscope.models.nlp.veco import \ | |||
VecoForSequenceClassification as VecoForSequenceClassificationTransform | |||
from modelscope.outputs import OutputKeys | |||
from modelscope.utils.constant import Tasks | |||
from .task_model import SingleBackboneTaskModelBase | |||
from modelscope.utils.hub import parse_label_mapping | |||
from modelscope.utils.tensor_utils import (torch_nested_detach, | |||
torch_nested_numpify) | |||
__all__ = ['SequenceClassificationModel'] | |||
__all__ = ['SbertForSequenceClassification', 'VecoForSequenceClassification'] | |||
@MODELS.register_module( | |||
Tasks.sentiment_classification, module_name=TaskModels.text_classification) | |||
@MODELS.register_module( | |||
Tasks.text_classification, module_name=TaskModels.text_classification) | |||
class SequenceClassificationModel(SingleBackboneTaskModelBase): | |||
class SequenceClassificationBase(TorchModel): | |||
base_model_prefix: str = 'bert' | |||
def __init__(self, config, model_dir): | |||
super().__init__(model_dir) | |||
self.num_labels = config.num_labels | |||
self.config = config | |||
setattr(self, self.base_model_prefix, self.build_base_model()) | |||
self.dropout = nn.Dropout(config.hidden_dropout_prob) | |||
self.classifier = nn.Linear(config.hidden_size, config.num_labels) | |||
def __init__(self, model_dir: str, *args, **kwargs): | |||
"""initialize the sequence classification model from the `model_dir` path. | |||
@abstractmethod | |||
def build_base_model(self): | |||
"""Build the backbone model. | |||
Args: | |||
model_dir (str): the model path. | |||
Returns: the backbone instance. | |||
""" | |||
super().__init__(model_dir, *args, **kwargs) | |||
if 'base_model_prefix' in kwargs: | |||
self._base_model_prefix = kwargs['base_model_prefix'] | |||
backbone_cfg = self.cfg.backbone | |||
head_cfg = self.cfg.head | |||
# get the num_labels from label_mapping.json | |||
self.id2label = {} | |||
self.label_path = os.path.join(model_dir, 'label_mapping.json') | |||
if os.path.exists(self.label_path): | |||
with open(self.label_path) as f: | |||
self.label_mapping = json.load(f) | |||
self.id2label = { | |||
idx: name | |||
for name, idx in self.label_mapping.items() | |||
} | |||
head_cfg['num_labels'] = len(self.label_mapping) | |||
self.build_backbone(backbone_cfg) | |||
self.build_head(head_cfg) | |||
def forward(self, input: Dict[str, Any]) -> Dict[str, np.ndarray]: | |||
outputs = super().forward(input) | |||
sequence_output, pooled_output = self.extract_backbone_outputs(outputs) | |||
outputs = self.head.forward(pooled_output) | |||
if 'labels' in input: | |||
loss = self.compute_loss(outputs, input['labels']) | |||
outputs.update(loss) | |||
return outputs | |||
def extract_logits(self, outputs): | |||
return outputs[OutputKeys.LOGITS].cpu().detach() | |||
def extract_backbone_outputs(self, outputs): | |||
sequence_output = None | |||
pooled_output = None | |||
if hasattr(self.backbone, 'extract_sequence_outputs'): | |||
sequence_output = self.backbone.extract_sequence_outputs(outputs) | |||
if hasattr(self.backbone, 'extract_pooled_outputs'): | |||
pooled_output = self.backbone.extract_pooled_outputs(outputs) | |||
return sequence_output, pooled_output | |||
def compute_loss(self, outputs, labels): | |||
loss = self.head.compute_loss(outputs, labels) | |||
return loss | |||
pass | |||
@property | |||
def base_model(self): | |||
return getattr(self, self.base_model_prefix) | |||
def forward(self, **kwargs): | |||
labels = None | |||
if OutputKeys.LABEL in kwargs: | |||
labels = kwargs.pop(OutputKeys.LABEL) | |||
elif OutputKeys.LABELS in kwargs: | |||
labels = kwargs.pop(OutputKeys.LABELS) | |||
outputs = self.base_model.forward(**kwargs) | |||
# backbone model should return pooled_output as its second output | |||
pooled_output = outputs[1] | |||
pooled_output = self.dropout(pooled_output) | |||
logits = self.classifier(pooled_output) | |||
if labels is not None: | |||
loss_fct = nn.CrossEntropyLoss() | |||
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) | |||
return {OutputKeys.LOGITS: logits, OutputKeys.LOSS: loss} | |||
return {OutputKeys.LOGITS: logits} | |||
def postprocess(self, input, **kwargs): | |||
logits = self.extract_logits(input) | |||
probs = logits.softmax(-1).numpy() | |||
pred = logits.argmax(-1).numpy() | |||
logits = logits.numpy() | |||
logits = input[OutputKeys.LOGITS] | |||
probs = torch_nested_numpify(torch_nested_detach(logits.softmax(-1))) | |||
pred = torch_nested_numpify(torch_nested_detach(logits.argmax(-1))) | |||
logits = torch_nested_numpify(torch_nested_detach(logits)) | |||
res = { | |||
OutputKeys.PREDICTIONS: pred, | |||
OutputKeys.PROBABILITIES: probs, | |||
OutputKeys.LOGITS: logits | |||
} | |||
return res | |||
@MODELS.register_module( | |||
Tasks.sentence_similarity, module_name=Models.structbert) | |||
@MODELS.register_module( | |||
Tasks.sentiment_classification, module_name=Models.structbert) | |||
@MODELS.register_module(Tasks.nli, module_name=Models.structbert) | |||
@MODELS.register_module( | |||
Tasks.zero_shot_classification, module_name=Models.structbert) | |||
class SbertForSequenceClassification(SequenceClassificationBase, | |||
SbertPreTrainedModel): | |||
base_model_prefix: str = 'bert' | |||
supports_gradient_checkpointing = True | |||
_keys_to_ignore_on_load_missing = [r'position_ids'] | |||
def __init__(self, config, model_dir): | |||
if hasattr(config, 'base_model_prefix'): | |||
SbertForSequenceClassification.base_model_prefix = config.base_model_prefix | |||
super().__init__(config, model_dir) | |||
def build_base_model(self): | |||
from .structbert import SbertModel | |||
return SbertModel(self.config, add_pooling_layer=True) | |||
def forward(self, | |||
input_ids=None, | |||
attention_mask=None, | |||
token_type_ids=None, | |||
labels=None, | |||
**kwargs): | |||
return super().forward( | |||
input_ids=input_ids, | |||
attention_mask=attention_mask, | |||
token_type_ids=token_type_ids, | |||
labels=labels) | |||
@classmethod | |||
def _instantiate(cls, **kwargs): | |||
model_dir = kwargs.get('model_dir') | |||
num_labels = kwargs.get('num_labels') | |||
if num_labels is None: | |||
label2id = parse_label_mapping(model_dir) | |||
if label2id is not None and len(label2id) > 0: | |||
num_labels = len(label2id) | |||
model_args = {} if num_labels is None else {'num_labels': num_labels} | |||
return super(SbertPreTrainedModel, | |||
SbertForSequenceClassification).from_pretrained( | |||
pretrained_model_name_or_path=kwargs.get('model_dir'), | |||
model_dir=kwargs.get('model_dir'), | |||
**model_args) | |||
@MODELS.register_module(Tasks.sentence_similarity, module_name=Models.veco) | |||
@MODELS.register_module( | |||
Tasks.sentiment_classification, module_name=Models.veco) | |||
@MODELS.register_module(Tasks.nli, module_name=Models.veco) | |||
class VecoForSequenceClassification(TorchModel, | |||
VecoForSequenceClassificationTransform): | |||
def __init__(self, config, model_dir): | |||
super().__init__(model_dir) | |||
VecoForSequenceClassificationTransform.__init__(self, config) | |||
def forward(self, | |||
input_ids=None, | |||
attention_mask=None, | |||
token_type_ids=None, | |||
position_ids=None, | |||
head_mask=None, | |||
inputs_embeds=None, | |||
labels=None, | |||
output_attentions=None, | |||
output_hidden_states=None, | |||
**kwargs): | |||
return VecoForSequenceClassificationTransform.forward( | |||
self, | |||
input_ids=input_ids, | |||
attention_mask=attention_mask, | |||
token_type_ids=token_type_ids, | |||
position_ids=position_ids, | |||
head_mask=head_mask, | |||
inputs_embeds=inputs_embeds, | |||
output_attentions=output_attentions, | |||
output_hidden_states=output_hidden_states, | |||
labels=labels) | |||
@classmethod | |||
def _instantiate(cls, **kwargs): | |||
model_dir = kwargs.get('model_dir') | |||
num_labels = kwargs.get('num_labels') | |||
if num_labels is None: | |||
label2id = parse_label_mapping(model_dir) | |||
if label2id is not None and len(label2id) > 0: | |||
num_labels = len(label2id) | |||
model_args = {} if num_labels is None else {'num_labels': num_labels} | |||
return super(VecoForSequenceClassificationTransform, | |||
VecoForSequenceClassification).from_pretrained( | |||
pretrained_model_name_or_path=kwargs.get('model_dir'), | |||
model_dir=kwargs.get('model_dir'), | |||
**model_args) |
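# Hedged sketch: what the `_instantiate` classmethods above resolve when
# `num_labels` is omitted. `parse_label_mapping` reads the label mapping
# shipped with the model directory; the mapping and path are placeholders.
label2id = {'negative': 0, 'positive': 1}  # as parse_label_mapping would return
model_args = {'num_labels': len(label2id)}  # head size inferred from the labels
model = SbertForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path='path/to/model_dir',
    model_dir='path/to/model_dir',
    **model_args)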
@@ -0,0 +1,28 @@ | |||
from typing import TYPE_CHECKING | |||
from modelscope.utils.import_utils import LazyImportModule | |||
if TYPE_CHECKING: | |||
from .model import SpaceGenerator | |||
from .model import SpaceModelBase, SpaceTokenizer, SpaceConfig | |||
from .space_for_dialog_intent_prediction import SpaceForDialogIntent | |||
from .space_for_dialog_modeling import SpaceForDialogModeling | |||
from .space_for_dialog_state_tracking import SpaceForDialogStateTracking | |||
else: | |||
_import_structure = { | |||
'model': | |||
['SpaceGenerator', 'SpaceModelBase', 'SpaceTokenizer', 'SpaceConfig'], | |||
'space_for_dialog_intent_prediction': ['SpaceForDialogIntent'], | |||
'space_for_dialog_modeling': ['SpaceForDialogModeling'], | |||
'space_for_dialog_state_tracking': ['SpaceForDialogStateTracking'], | |||
} | |||
import sys | |||
sys.modules[__name__] = LazyImportModule( | |||
__name__, | |||
globals()['__file__'], | |||
_import_structure, | |||
module_spec=__spec__, | |||
extra_objects={}, | |||
) |
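# Hedged sketch (module path inferred from the diff): with LazyImportModule,
# importing the package stays cheap, and a submodule listed in
# `_import_structure` is only loaded when one of its attributes is accessed.
from modelscope.models.nlp import space  # does not import `.model` yet
generator_cls = space.SpaceGenerator     # first access triggers the lazy import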
@@ -0,0 +1,10 @@ | |||
from .configuration_space import SpaceConfig | |||
from .gen_unified_transformer import GenUnifiedTransformer | |||
from .generator import Generator as SpaceGenerator | |||
from .intent_unified_transformer import IntentUnifiedTransformer | |||
from .model_base import SpaceModelBase | |||
from .modeling_space import (SpaceForDST, SpaceForMaskedLM, | |||
SpaceForPreTraining, SpaceModel) | |||
from .tokenization_space import (BasicTokenizer, SpaceTokenizer, | |||
WordpieceTokenizer) | |||
from .unified_transformer import UnifiedTransformer |
@@ -0,0 +1,32 @@ | |||
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors. | |||
# Copyright 2018 The Google AI Language Team Authors. | |||
# Copyright 2020 The HuggingFace Inc. team. | |||
# Copyright (c) 2018, NVIDIA CORPORATION. | |||
# All rights reserved. | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
"""Space configuration, mainly copied from :class:`~transformers.configuration_xlm_roberta` """ | |||
from modelscope.models.nlp.structbert import SbertConfig | |||
from modelscope.utils import logger as logging | |||
logger = logging.get_logger(__name__) | |||
class SpaceConfig(SbertConfig): | |||
""" | |||
This class overrides [`SbertConfig`]. Please check the superclass for the appropriate | |||
documentation alongside usage examples. | |||
""" | |||
model_type = 'space' |
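# Hedged sketch: SpaceConfig only overrides `model_type`, so it constructs and
# serializes exactly like SbertConfig (field values illustrative).
config = SpaceConfig(hidden_size=768, num_hidden_layers=12)
assert config.model_type == 'space'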
@@ -0,0 +1,268 @@ | |||
# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team. | |||
# Copyright (c) 2018, NVIDIA CORPORATION. | |||
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors. | |||
# All rights reserved. | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
"""PyTorch Space model. mainly copied from :module:`~transformers.modeling_xlm_roberta`""" | |||
import torch | |||
from torch import nn | |||
from torch.nn import CrossEntropyLoss | |||
from transformers.file_utils import add_start_docstrings | |||
from modelscope.models.nlp.structbert.modeling_sbert import ( | |||
SbertForMaskedLM, SbertModel, SbertPreTrainedModel) | |||
from .configuration_space import SpaceConfig | |||
SPACE_START_DOCSTRING = r""" | |||
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic | |||
methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, | |||
pruning heads etc.) | |||
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) | |||
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to | |||
general usage and behavior. | |||
Parameters: | |||
config ([`SpaceConfig`]): Model configuration class with all the parameters of the | |||
model. Initializing with a config file does not load the weights associated with the model, only the | |||
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model | |||
weights. | |||
""" | |||
@add_start_docstrings( | |||
'The bare Space Model transformer outputting raw hidden-states without any specific head on top. ' | |||
'It is identical to the BERT model from Transformers',
SPACE_START_DOCSTRING, | |||
) | |||
class SpaceModel(SbertModel): | |||
""" | |||
This class overrides [`SbertModel`]. Please check the superclass for the appropriate | |||
documentation alongside usage examples. | |||
""" | |||
config_class = SpaceConfig | |||
@add_start_docstrings( | |||
""" | |||
Space Model transformer with dialog state tracking heads on top (an inform projection
layer with a dialog state layer and a set of slots including history information from
previous dialogs), e.g. for MultiWOZ 2.2 tasks.
""", | |||
SPACE_START_DOCSTRING, | |||
) | |||
class SpaceForDST(SbertPreTrainedModel): | |||
def __init__(self, config): | |||
super(SpaceForDST, self).__init__(config) | |||
self.slot_list = config.dst_slot_list | |||
self.class_types = config.dst_class_types | |||
self.class_labels = config.dst_class_labels | |||
self.token_loss_for_nonpointable = config.dst_token_loss_for_nonpointable | |||
self.refer_loss_for_nonpointable = config.dst_refer_loss_for_nonpointable | |||
self.class_aux_feats_inform = config.dst_class_aux_feats_inform | |||
self.class_aux_feats_ds = config.dst_class_aux_feats_ds | |||
self.class_loss_ratio = config.dst_class_loss_ratio | |||
# Only use refer loss if refer class is present in dataset. | |||
if 'refer' in self.class_types: | |||
self.refer_index = self.class_types.index('refer') | |||
else: | |||
self.refer_index = -1 | |||
self.bert = SpaceModel(config) | |||
self.dropout = nn.Dropout(config.dst_dropout_rate) | |||
self.dropout_heads = nn.Dropout(config.dst_heads_dropout_rate) | |||
if self.class_aux_feats_inform: | |||
self.add_module( | |||
'inform_projection', | |||
nn.Linear(len(self.slot_list), len(self.slot_list))) | |||
if self.class_aux_feats_ds: | |||
self.add_module( | |||
'ds_projection', | |||
nn.Linear(len(self.slot_list), len(self.slot_list))) | |||
aux_dims = len(self.slot_list) * ( | |||
self.class_aux_feats_inform + self.class_aux_feats_ds | |||
) # second term is 0, 1 or 2 | |||
for slot in self.slot_list: | |||
self.add_module( | |||
'class_' + slot, | |||
nn.Linear(config.hidden_size + aux_dims, self.class_labels)) | |||
self.add_module('token_' + slot, nn.Linear(config.hidden_size, 2)) | |||
self.add_module( | |||
'refer_' + slot, | |||
nn.Linear(config.hidden_size + aux_dims, | |||
len(self.slot_list) + 1)) | |||
self.init_weights() | |||
def forward(self, | |||
input_ids, | |||
input_mask=None, | |||
segment_ids=None, | |||
position_ids=None, | |||
head_mask=None, | |||
start_pos=None, | |||
end_pos=None, | |||
inform_slot_id=None, | |||
refer_id=None, | |||
class_label_id=None, | |||
diag_state=None): | |||
outputs = self.bert( | |||
input_ids, | |||
attention_mask=input_mask, | |||
token_type_ids=segment_ids, | |||
position_ids=position_ids, | |||
head_mask=head_mask) | |||
sequence_output = outputs[0] | |||
pooled_output = outputs[1] | |||
sequence_output = self.dropout(sequence_output) | |||
pooled_output = self.dropout(pooled_output) | |||
# TODO: establish proper format in labels already? | |||
if inform_slot_id is not None: | |||
inform_labels = torch.stack(list(inform_slot_id.values()), | |||
1).float() | |||
if diag_state is not None: | |||
diag_state_labels = torch.clamp( | |||
torch.stack(list(diag_state.values()), 1).float(), 0.0, 1.0) | |||
total_loss = 0 | |||
per_slot_per_example_loss = {} | |||
per_slot_class_logits = {} | |||
per_slot_start_logits = {} | |||
per_slot_end_logits = {} | |||
per_slot_refer_logits = {} | |||
for slot in self.slot_list: | |||
if self.class_aux_feats_inform and self.class_aux_feats_ds: | |||
pooled_output_aux = torch.cat( | |||
(pooled_output, self.inform_projection(inform_labels), | |||
self.ds_projection(diag_state_labels)), 1) | |||
elif self.class_aux_feats_inform: | |||
pooled_output_aux = torch.cat( | |||
(pooled_output, self.inform_projection(inform_labels)), 1) | |||
elif self.class_aux_feats_ds: | |||
pooled_output_aux = torch.cat( | |||
(pooled_output, self.ds_projection(diag_state_labels)), 1) | |||
else: | |||
pooled_output_aux = pooled_output | |||
class_logits = self.dropout_heads( | |||
getattr(self, 'class_' + slot)(pooled_output_aux)) | |||
token_logits = self.dropout_heads( | |||
getattr(self, 'token_' + slot)(sequence_output)) | |||
start_logits, end_logits = token_logits.split(1, dim=-1) | |||
start_logits = start_logits.squeeze(-1) | |||
end_logits = end_logits.squeeze(-1) | |||
refer_logits = self.dropout_heads( | |||
getattr(self, 'refer_' + slot)(pooled_output_aux)) | |||
per_slot_class_logits[slot] = class_logits | |||
per_slot_start_logits[slot] = start_logits | |||
per_slot_end_logits[slot] = end_logits | |||
per_slot_refer_logits[slot] = refer_logits | |||
# If there are no labels, don't compute loss | |||
if class_label_id is not None and start_pos is not None and end_pos is not None and refer_id is not None: | |||
# If we are on multi-GPU, splitting adds a dimension
if len(start_pos[slot].size()) > 1: | |||
start_pos[slot] = start_pos[slot].squeeze(-1) | |||
if len(end_pos[slot].size()) > 1: | |||
end_pos[slot] = end_pos[slot].squeeze(-1) | |||
# sometimes the start/end positions are outside our model inputs, we ignore these terms | |||
ignored_index = start_logits.size(1) # This is a single index | |||
start_pos[slot].clamp_(0, ignored_index) | |||
end_pos[slot].clamp_(0, ignored_index) | |||
class_loss_fct = CrossEntropyLoss(reduction='none') | |||
token_loss_fct = CrossEntropyLoss( | |||
reduction='none', ignore_index=ignored_index) | |||
refer_loss_fct = CrossEntropyLoss(reduction='none') | |||
start_loss = token_loss_fct(start_logits, start_pos[slot]) | |||
end_loss = token_loss_fct(end_logits, end_pos[slot]) | |||
token_loss = (start_loss + end_loss) / 2.0 | |||
token_is_pointable = (start_pos[slot] > 0).float() | |||
if not self.token_loss_for_nonpointable: | |||
token_loss *= token_is_pointable | |||
refer_loss = refer_loss_fct(refer_logits, refer_id[slot]) | |||
token_is_referrable = torch.eq(class_label_id[slot], | |||
self.refer_index).float() | |||
if not self.refer_loss_for_nonpointable: | |||
refer_loss *= token_is_referrable | |||
class_loss = class_loss_fct(class_logits, class_label_id[slot]) | |||
if self.refer_index > -1: | |||
per_example_loss = (self.class_loss_ratio) * class_loss + ( | |||
(1 - self.class_loss_ratio) / 2) * token_loss + ( | |||
(1 - self.class_loss_ratio) / 2) * refer_loss | |||
else: | |||
per_example_loss = self.class_loss_ratio * class_loss + ( | |||
1 - self.class_loss_ratio) * token_loss | |||
total_loss += per_example_loss.sum() | |||
per_slot_per_example_loss[slot] = per_example_loss | |||
# add hidden states and attention if they are here | |||
outputs = (total_loss, ) + ( | |||
per_slot_per_example_loss, | |||
per_slot_class_logits, | |||
per_slot_start_logits, | |||
per_slot_end_logits, | |||
per_slot_refer_logits, | |||
) + outputs[2:] | |||
return outputs | |||
@add_start_docstrings( | |||
'The Space Model with a `language modeling` head on top',
SPACE_START_DOCSTRING, | |||
) | |||
class SpaceForMaskedLM(SbertForMaskedLM): | |||
""" | |||
This class overrides [`SbertForMaskedLM`]. Please check the superclass for the | |||
appropriate documentation alongside usage examples. | |||
""" | |||
config_class = SpaceConfig | |||
@add_start_docstrings( | |||
""" | |||
Space Model with only one head on top as done during the pretraining: a `masked language modeling` head. | |||
""", | |||
SPACE_START_DOCSTRING, | |||
) | |||
class SpaceForPreTraining(SbertPreTrainedModel): | |||
def __init__(self, model_name_or_path: str): | |||
super(SpaceForPreTraining, self).__init__() | |||
self.bert_model = SpaceForMaskedLM.from_pretrained(model_name_or_path) | |||
def forward(self, input_ids: torch.Tensor, mlm_labels: torch.Tensor):
outputs = self.bert_model(input_ids, masked_lm_labels=mlm_labels) | |||
return outputs[0] |
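# Hedged sketch: SpaceForDST.forward above returns a flat tuple; a small
# helper makes the per-slot pieces explicit (ordering taken from the code,
# helper name assumed).
def unpack_dst_outputs(outputs):
    total_loss = outputs[0]                 # summed per-example loss
    per_slot_per_example_loss = outputs[1]  # dict: slot -> loss tensor
    per_slot_class_logits = outputs[2]      # dict: slot -> class logits
    per_slot_start_logits = outputs[3]      # dict: slot -> span start logits
    per_slot_end_logits = outputs[4]        # dict: slot -> span end logits
    per_slot_refer_logits = outputs[5]      # dict: slot -> refer logits
    return (total_loss, per_slot_per_example_loss, per_slot_class_logits,
            per_slot_start_logits, per_slot_end_logits, per_slot_refer_logits)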
@@ -0,0 +1,29 @@ | |||
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. | |||
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors. | |||
# All rights reserved. | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License | |||
"""Tokenization classes for Space. mainly copied from :module:`~transformers.tokenization_xlm_roberta`""" | |||
from modelscope.models.nlp.structbert import (BasicTokenizer, SbertTokenizer, | |||
WordpieceTokenizer) | |||
from modelscope.utils import logger as logging | |||
logger = logging.get_logger(__name__) | |||
class SpaceTokenizer(SbertTokenizer): | |||
""" | |||
This class overrides [`SbertTokenizer`]. Please check the superclass for the appropriate
documentation alongside usage examples. | |||
""" |
@@ -5,10 +5,9 @@ import torch | |||
import torch.nn as nn | |||
import torch.nn.functional as F | |||
from modelscope.models.nlp.backbones.space.model.model_base import \ | |||
SpaceModelBase | |||
from modelscope.models.nlp.backbones.space.modules.embedder import Embedder | |||
from modelscope.models.nlp.backbones.space.modules.transformer_block import \ | |||
from modelscope.models.nlp.space.model.model_base import SpaceModelBase | |||
from modelscope.models.nlp.space.modules.embedder import Embedder | |||
from modelscope.models.nlp.space.modules.transformer_block import \ | |||
TransformerBlock | |||
@@ -7,7 +7,7 @@ from modelscope.metainfo import Models | |||
from modelscope.models import TorchModel | |||
from modelscope.models.base import Tensor | |||
from modelscope.models.builder import MODELS | |||
from modelscope.models.nlp.backbones import SpaceGenerator, SpaceModelBase | |||
from modelscope.models.nlp.space import SpaceGenerator, SpaceModelBase | |||
from modelscope.preprocessors.space import IntentBPETextField | |||
from modelscope.utils.config import Config | |||
from modelscope.utils.constant import ModelFile, Tasks |
@@ -7,7 +7,7 @@ from modelscope.metainfo import Models | |||
from modelscope.models import TorchModel | |||
from modelscope.models.base import Tensor | |||
from modelscope.models.builder import MODELS | |||
from modelscope.models.nlp.backbones import SpaceGenerator, SpaceModelBase | |||
from modelscope.models.nlp.space import SpaceGenerator, SpaceModelBase | |||
from modelscope.preprocessors.space import MultiWOZBPETextField | |||
from modelscope.utils.config import Config | |||
from modelscope.utils.constant import ModelFile, Tasks |
@@ -21,7 +21,7 @@ class SpaceForDialogStateTracking(TorchModel): | |||
super().__init__(model_dir, *args, **kwargs) | |||
from sofa.models.space import SpaceConfig, SpaceForDST | |||
from modelscope.models.nlp.space.model import SpaceForDST, SpaceConfig | |||
self.model_dir = model_dir | |||
self.config = SpaceConfig.from_pretrained(self.model_dir) |
@@ -0,0 +1,45 @@ | |||
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors. | |||
# All rights reserved. | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
from typing import TYPE_CHECKING | |||
from modelscope.utils.import_utils import LazyImportModule | |||
if TYPE_CHECKING: | |||
from .configuration_sbert import SbertConfig | |||
from .modeling_sbert import (SbertForMaskedLM, SbertModel, | |||
SbertPreTrainedModel) | |||
from .tokenization_sbert import (BasicTokenizer, SbertTokenizer, | |||
WordpieceTokenizer) | |||
from .tokenization_sbert_fast import SbertTokenizerFast | |||
else: | |||
_import_structure = { | |||
'configuration_sbert': ['SbertConfig'], | |||
'modeling_sbert': | |||
['SbertForMaskedLM', 'SbertModel', 'SbertPreTrainedModel'], | |||
'tokenization_sbert': | |||
['BasicTokenizer', 'SbertTokenizer', 'WordpieceTokenizer'], | |||
'tokenization_sbert_fast': ['SbertTokenizerFast'], | |||
} | |||
import sys | |||
sys.modules[__name__] = LazyImportModule( | |||
__name__, | |||
globals()['__file__'], | |||
_import_structure, | |||
module_spec=__spec__, | |||
extra_objects={}, | |||
) |
@@ -59,7 +59,8 @@ def compute_adv_loss(embedding, | |||
""" | |||
Calculate the adv loss of the model. | |||
:param embedding: Original sentence embedding
:param model: The model or the forward function(including decoder/classifier), accept kwargs as input, output logits | |||
:param model: The model, or the forward function(including decoder/classifier), | |||
accept kwargs as input, output logits | |||
:param ori_logits: The original logits output from the model function
:param ori_loss: The original loss
:param adv_grad_factor: This factor will be multiplied by the KL loss grad and then the result will be added to
@@ -119,7 +120,8 @@ def compute_adv_loss_pair(embedding, | |||
""" | |||
Calculate the adv loss of the model. This function is used in the pair logits scenario.
:param embedding: Original sentence embedding
:param model: The model or the forward function(including decoder/classifier), accept kwargs as input, output logits | |||
:param model: The model, or the forward function(including decoder/classifier), | |||
accept kwargs as input, output logits | |||
:param start_logits: The original start logits output from the model function
:param end_logits: The original end logits output from the model function
:param ori_loss: The original loss |
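# Hedged sketch of the perturbation step the docstrings above describe: the
# KL-loss gradient w.r.t. the embedding is scaled by `adv_grad_factor` and
# added back. Normalization details of the real implementation may differ;
# the helper name is assumed.
import torch

def adv_perturb(embedding: torch.Tensor, kl_loss: torch.Tensor,
                adv_grad_factor: float = 1e-3) -> torch.Tensor:
    grad = torch.autograd.grad(kl_loss, embedding, retain_graph=True)[0]
    return embedding + adv_grad_factor * grad.detach()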
@@ -24,11 +24,12 @@ logger = logging.get_logger(__name__) | |||
class SbertConfig(PretrainedConfig): | |||
r""" | |||
This is the configuration class to store the configuration of a :class:`~sofa.models.SbertModel`. | |||
This is the configuration class to store the configuration | |||
of a :class:`~modelscope.models.nlp.structbert.SbertModel`. | |||
It is used to instantiate a SBERT model according to the specified arguments. | |||
Configuration objects inherit from :class:`~sofa.utils.PretrainedConfig` and can be used to control the model | |||
outputs. Read the documentation from :class:`~sofa.utils.PretrainedConfig` for more information. | |||
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model | |||
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. | |||
Args: | |||
@@ -99,11 +100,13 @@ class SbertConfig(PretrainedConfig): | |||
type_vocab_size=2, | |||
initializer_range=0.02, | |||
layer_norm_eps=1e-12, | |||
pad_token_id=0, | |||
position_embedding_type='absolute', | |||
use_cache=True, | |||
classifier_dropout=None, | |||
**kwargs): | |||
super().__init__(**kwargs) | |||
super().__init__(pad_token_id=pad_token_id, **kwargs) | |||
self.vocab_size = vocab_size | |||
self.hidden_size = hidden_size | |||
self.num_hidden_layers = num_hidden_layers |
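# Hedged sketch: the hunk above forwards `pad_token_id` into the transformers
# PretrainedConfig base class instead of dropping it, so it is kept on the
# config object and survives save/load round-trips (value illustrative).
from modelscope.models.nlp.structbert import SbertConfig

config = SbertConfig(pad_token_id=0)
assert config.pad_token_id == 0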
@@ -0,0 +1,516 @@ | |||
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors. | |||
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. | |||
# All rights reserved. | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
"""Tokenization classes for Sbert. mainly copied from :module:`~transformers.tokenization_bert`""" | |||
import collections | |||
import os | |||
import unicodedata | |||
from typing import List, Optional, Tuple | |||
from transformers.tokenization_utils import (PreTrainedTokenizer, _is_control, | |||
_is_punctuation, _is_whitespace) | |||
from modelscope.utils.logger import get_logger | |||
logger = get_logger(__name__) | |||
VOCAB_FILES_NAMES = {'vocab_file': 'vocab.txt'} | |||
PRETRAINED_VOCAB_FILES_MAP = {'vocab_file': {}} | |||
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { | |||
'chinese_sbert-large-std-512': 512, | |||
'english_sbert-large-std-512': 512, | |||
} | |||
PRETRAINED_INIT_CONFIGURATION = { | |||
'english_sbert-large-std-512': { | |||
'do_lower_case': True | |||
}, | |||
} | |||
def load_vocab(vocab_file): | |||
"""Loads a vocabulary file into a dictionary.""" | |||
vocab = collections.OrderedDict() | |||
with open(vocab_file, 'r', encoding='utf-8') as reader: | |||
tokens = reader.readlines() | |||
for index, token in enumerate(tokens): | |||
token = token.rstrip('\n') | |||
vocab[token] = index | |||
return vocab | |||
def whitespace_tokenize(text): | |||
"""Runs basic whitespace cleaning and splitting on a piece of text.""" | |||
text = text.strip() | |||
if not text: | |||
return [] | |||
tokens = text.split() | |||
return tokens | |||
class SbertTokenizer(PreTrainedTokenizer): | |||
r""" | |||
Construct a SBERT tokenizer. Based on WordPiece. | |||
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods. | |||
Users should refer to this superclass for more information regarding those methods. | |||
Args: | |||
vocab_file (:obj:`str`): | |||
File containing the vocabulary. | |||
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`): | |||
Whether or not to lowercase the input when tokenizing. | |||
do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`): | |||
Whether or not to do basic tokenization before WordPiece. | |||
never_split (:obj:`Iterable`, `optional`): | |||
Collection of tokens which will never be split during tokenization. Only has an effect when | |||
:obj:`do_basic_tokenize=True` | |||
unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`): | |||
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this | |||
token instead. | |||
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`): | |||
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for | |||
sequence classification or for a text and a question for question answering. It is also used as the last | |||
token of a sequence built with special tokens. | |||
pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`): | |||
The token used for padding, for example when batching sequences of different lengths. | |||
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`): | |||
The classifier token which is used when doing sequence classification (classification of the whole sequence | |||
instead of per-token classification). It is the first token of the sequence when built with special tokens. | |||
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`): | |||
The token used for masking values. This is the token used when training this model with masked language | |||
modeling. This is the token which the model will try to predict. | |||
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`): | |||
Whether or not to tokenize Chinese characters. | |||
This should likely be deactivated for Japanese (see this `issue | |||
<https://github.com/huggingface/transformers/issues/328>`__). | |||
strip_accents: (:obj:`bool`, `optional`): | |||
Whether or not to strip all accents. If this option is not specified, then it will be determined by the | |||
value for :obj:`lowercase` (as in the original BERT). | |||
""" | |||
vocab_files_names = VOCAB_FILES_NAMES | |||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP | |||
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION | |||
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES | |||
def __init__(self, | |||
vocab_file, | |||
do_lower_case=True, | |||
do_basic_tokenize=True, | |||
never_split=None, | |||
unk_token='[UNK]', | |||
sep_token='[SEP]', | |||
pad_token='[PAD]', | |||
cls_token='[CLS]', | |||
mask_token='[MASK]', | |||
tokenize_chinese_chars=True, | |||
strip_accents=None, | |||
**kwargs): | |||
super().__init__( | |||
do_lower_case=do_lower_case, | |||
do_basic_tokenize=do_basic_tokenize, | |||
never_split=never_split, | |||
unk_token=unk_token, | |||
sep_token=sep_token, | |||
pad_token=pad_token, | |||
cls_token=cls_token, | |||
mask_token=mask_token, | |||
tokenize_chinese_chars=tokenize_chinese_chars, | |||
strip_accents=strip_accents, | |||
**kwargs, | |||
) | |||
if not os.path.isfile(vocab_file): | |||
raise ValueError( | |||
f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained " | |||
'model use `tokenizer = SbertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`' | |||
) | |||
self.vocab = load_vocab(vocab_file) | |||
self.ids_to_tokens = collections.OrderedDict([ | |||
(ids, tok) for tok, ids in self.vocab.items() | |||
]) | |||
self.do_basic_tokenize = do_basic_tokenize | |||
if do_basic_tokenize: | |||
self.basic_tokenizer = BasicTokenizer( | |||
do_lower_case=do_lower_case, | |||
never_split=never_split, | |||
tokenize_chinese_chars=tokenize_chinese_chars, | |||
strip_accents=strip_accents, | |||
) | |||
self.wordpiece_tokenizer = WordpieceTokenizer( | |||
vocab=self.vocab, unk_token=self.unk_token) | |||
@property | |||
def do_lower_case(self): | |||
return self.basic_tokenizer.do_lower_case | |||
@property | |||
def vocab_size(self): | |||
return len(self.vocab) | |||
def get_vocab(self): | |||
return dict(self.vocab, **self.added_tokens_encoder) | |||
def _tokenize(self, text): | |||
split_tokens = [] | |||
if self.do_basic_tokenize: | |||
for token in self.basic_tokenizer.tokenize( | |||
text, never_split=self.all_special_tokens): | |||
# If the token is part of the never_split set | |||
if token in self.basic_tokenizer.never_split: | |||
split_tokens.append(token) | |||
else: | |||
split_tokens += self.wordpiece_tokenizer.tokenize(token) | |||
else: | |||
split_tokens = self.wordpiece_tokenizer.tokenize(text) | |||
return split_tokens | |||
def _convert_token_to_id(self, token): | |||
"""Converts a token (str) in an id using the vocab.""" | |||
return self.vocab.get(token, self.vocab.get(self.unk_token)) | |||
def _convert_id_to_token(self, index): | |||
"""Converts an index (integer) in a token (str) using the vocab.""" | |||
return self.ids_to_tokens.get(index, self.unk_token) | |||
def convert_tokens_to_string(self, tokens): | |||
"""Converts a sequence of tokens (string) in a single string.""" | |||
out_string = ' '.join(tokens).replace(' ##', '').strip() | |||
return out_string | |||
def build_inputs_with_special_tokens( | |||
self, | |||
token_ids_0: List[int], | |||
token_ids_1: Optional[List[int]] = None) -> List[int]: | |||
""" | |||
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and | |||
adding special tokens. A SBERT sequence has the following format: | |||
- single sequence: ``[CLS] X [SEP]`` | |||
- pair of sequences: ``[CLS] A [SEP] B [SEP]`` | |||
Args: | |||
token_ids_0 (:obj:`List[int]`): | |||
List of IDs to which the special tokens will be added. | |||
token_ids_1 (:obj:`List[int]`, `optional`): | |||
Optional second list of IDs for sequence pairs. | |||
Returns: | |||
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens. | |||
""" | |||
if token_ids_1 is None: | |||
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] | |||
cls = [self.cls_token_id] | |||
sep = [self.sep_token_id] | |||
return cls + token_ids_0 + sep + token_ids_1 + sep | |||
def get_special_tokens_mask( | |||
self, | |||
token_ids_0: List[int], | |||
token_ids_1: Optional[List[int]] = None, | |||
already_has_special_tokens: bool = False) -> List[int]: | |||
""" | |||
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding | |||
special tokens using the tokenizer ``prepare_for_model`` method. | |||
Args: | |||
token_ids_0 (:obj:`List[int]`): | |||
List of IDs. | |||
token_ids_1 (:obj:`List[int]`, `optional`): | |||
Optional second list of IDs for sequence pairs. | |||
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`): | |||
Whether or not the token list is already formatted with special tokens for the model. | |||
Returns: | |||
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. | |||
""" | |||
if already_has_special_tokens: | |||
return super().get_special_tokens_mask( | |||
token_ids_0=token_ids_0, | |||
token_ids_1=token_ids_1, | |||
already_has_special_tokens=True) | |||
if token_ids_1 is not None: | |||
return [1] + ([0] * len(token_ids_0)) + [1] + ( | |||
[0] * len(token_ids_1)) + [1] | |||
return [1] + ([0] * len(token_ids_0)) + [1] | |||
def create_token_type_ids_from_sequences( | |||
self, | |||
token_ids_0: List[int], | |||
token_ids_1: Optional[List[int]] = None) -> List[int]: | |||
""" | |||
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A SBERT sequence | |||
pair mask has the following format: | |||
:: | |||
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 | |||
| first sequence | second sequence | | |||
If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s). | |||
Args: | |||
token_ids_0 (:obj:`List[int]`): | |||
List of IDs. | |||
token_ids_1 (:obj:`List[int]`, `optional`): | |||
Optional second list of IDs for sequence pairs. | |||
Returns: | |||
:obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given | |||
sequence(s). | |||
""" | |||
sep = [self.sep_token_id] | |||
cls = [self.cls_token_id] | |||
if token_ids_1 is None: | |||
return len(cls + token_ids_0 + sep) * [0] | |||
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 | |||
+ sep) * [1] | |||
def save_vocabulary(self, | |||
save_directory: str, | |||
filename_prefix: Optional[str] = None) -> Tuple[str]: | |||
index = 0 | |||
if os.path.isdir(save_directory): | |||
vocab_file = os.path.join( | |||
save_directory, | |||
(filename_prefix + '-' if filename_prefix else '') | |||
+ VOCAB_FILES_NAMES['vocab_file']) | |||
else: | |||
vocab_file = (filename_prefix | |||
+ '-' if filename_prefix else '') + save_directory | |||
with open(vocab_file, 'w', encoding='utf-8') as writer: | |||
for token, token_index in sorted( | |||
self.vocab.items(), key=lambda kv: kv[1]): | |||
if index != token_index: | |||
logger.warning( | |||
f'Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive.' | |||
' Please check that the vocabulary is not corrupted!') | |||
index = token_index | |||
writer.write(token + '\n') | |||
index += 1 | |||
return (vocab_file, ) | |||
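# Hedged sketch: the two special-token helpers above produce the standard
# [CLS] A [SEP] B [SEP] layout and its segment ids; with toy ids cls=101, sep=102:
cls_id, sep_id = 101, 102
ids_a, ids_b = [7, 8], [9]
input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]        # [101, 7, 8, 102, 9, 102]
token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)  # [0, 0, 0, 0, 1, 1]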
class BasicTokenizer(object): | |||
""" | |||
Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.). | |||
Args: | |||
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`): | |||
Whether or not to lowercase the input when tokenizing. | |||
never_split (:obj:`Iterable`, `optional`): | |||
Collection of tokens which will never be split during tokenization. Only has an effect when | |||
:obj:`do_basic_tokenize=True` | |||
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`): | |||
Whether or not to tokenize Chinese characters. | |||
This should likely be deactivated for Japanese (see this `issue | |||
<https://github.com/huggingface/transformers/issues/328>`__). | |||
strip_accents: (:obj:`bool`, `optional`): | |||
Whether or not to strip all accents. If this option is not specified, then it will be determined by the | |||
value for :obj:`lowercase` (as in the original BERT). | |||
""" | |||
def __init__(self, | |||
do_lower_case=True, | |||
never_split=None, | |||
tokenize_chinese_chars=True, | |||
strip_accents=None): | |||
if never_split is None: | |||
never_split = [] | |||
self.do_lower_case = do_lower_case | |||
self.never_split = set(never_split) | |||
self.tokenize_chinese_chars = tokenize_chinese_chars | |||
self.strip_accents = strip_accents | |||
def tokenize(self, text, never_split=None): | |||
""" | |||
Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see | |||
WordPieceTokenizer. | |||
Args: | |||
**never_split**: (`optional`) list of str | |||
Kept for backward compatibility purposes. Now implemented directly at the base class level (see | |||
:func:`PreTrainedTokenizer.tokenize`) List of token not to split. | |||
""" | |||
# union() returns a new set by concatenating the two sets. | |||
never_split = self.never_split.union( | |||
set(never_split)) if never_split else self.never_split | |||
text = self._clean_text(text) | |||
# This was added on November 1st, 2018 for the multilingual and Chinese | |||
# models. This is also applied to the English models now, but it doesn't | |||
# matter since the English models were not trained on any Chinese data | |||
# and generally don't have any Chinese data in them (there are Chinese | |||
# characters in the vocabulary because Wikipedia does have some Chinese | |||
# words in the English Wikipedia.). | |||
if self.tokenize_chinese_chars: | |||
text = self._tokenize_chinese_chars(text) | |||
orig_tokens = whitespace_tokenize(text) | |||
split_tokens = [] | |||
for token in orig_tokens: | |||
if token not in never_split: | |||
if self.do_lower_case: | |||
token = token.lower() | |||
if self.strip_accents is not False: | |||
token = self._run_strip_accents(token) | |||
elif self.strip_accents: | |||
token = self._run_strip_accents(token) | |||
split_tokens.extend(self._run_split_on_punc(token, never_split)) | |||
output_tokens = whitespace_tokenize(' '.join(split_tokens)) | |||
return output_tokens | |||
def _run_strip_accents(self, text): | |||
"""Strips accents from a piece of text.""" | |||
text = unicodedata.normalize('NFD', text) | |||
output = [] | |||
for char in text: | |||
cat = unicodedata.category(char) | |||
if cat == 'Mn': | |||
continue | |||
output.append(char) | |||
return ''.join(output) | |||
def _run_split_on_punc(self, text, never_split=None): | |||
"""Splits punctuation on a piece of text.""" | |||
if never_split is not None and text in never_split: | |||
return [text] | |||
chars = list(text) | |||
i = 0 | |||
start_new_word = True | |||
output = [] | |||
while i < len(chars): | |||
char = chars[i] | |||
if _is_punctuation(char): | |||
output.append([char]) | |||
start_new_word = True | |||
else: | |||
if start_new_word: | |||
output.append([]) | |||
start_new_word = False | |||
output[-1].append(char) | |||
i += 1 | |||
return [''.join(x) for x in output] | |||
def _tokenize_chinese_chars(self, text): | |||
"""Adds whitespace around any CJK character.""" | |||
output = [] | |||
for char in text: | |||
cp = ord(char) | |||
if self._is_chinese_char(cp): | |||
output.append(' ') | |||
output.append(char) | |||
output.append(' ') | |||
else: | |||
output.append(char) | |||
return ''.join(output) | |||
def _is_chinese_char(self, cp): | |||
"""Checks whether CP is the codepoint of a CJK character.""" | |||
# This defines a "chinese character" as anything in the CJK Unicode block: | |||
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) | |||
# | |||
# Note that the CJK Unicode block is NOT all Japanese and Korean characters, | |||
# despite its name. The modern Korean Hangul alphabet is a different block, | |||
# as is Japanese Hiragana and Katakana. Those alphabets are used to write | |||
# space-separated words, so they are not treated specially and handled | |||
# like all of the other languages.
if ((0x4E00 <= cp <= 0x9FFF) or (0x3400 <= cp <= 0x4DBF) | |||
or (0x20000 <= cp <= 0x2A6DF) or (0x2A700 <= cp <= 0x2B73F) | |||
or (0x2B740 <= cp <= 0x2B81F) or (0x2B820 <= cp <= 0x2CEAF) | |||
or (0xF900 <= cp <= 0xFAFF) or (0x2F800 <= cp <= 0x2FA1F)): | |||
return True | |||
return False | |||
def _clean_text(self, text): | |||
"""Performs invalid character removal and whitespace cleanup on text.""" | |||
output = [] | |||
for char in text: | |||
cp = ord(char) | |||
if cp == 0 or cp == 0xFFFD or _is_control(char): | |||
continue | |||
if _is_whitespace(char): | |||
output.append(' ') | |||
else: | |||
output.append(char) | |||
return ''.join(output) | |||
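# Hedged sketch: BasicTokenizer surrounds CJK characters with whitespace
# before splitting, so mixed text tokenizes per-character for Chinese.
basic = BasicTokenizer(do_lower_case=True)
assert basic.tokenize('Hello今天') == ['hello', '今', '天']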
class WordpieceTokenizer(object): | |||
"""Runs WordPiece tokenization.""" | |||
def __init__(self, vocab, unk_token, max_input_chars_per_word=100): | |||
self.vocab = vocab | |||
self.unk_token = unk_token | |||
self.max_input_chars_per_word = max_input_chars_per_word | |||
def tokenize(self, text): | |||
""" | |||
Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform | |||
tokenization using the given vocabulary. | |||
For example, :obj:`input = "unaffable"` will return as output :obj:`["un", "##aff", "##able"]`.
Args: | |||
text: A single token or whitespace separated tokens. This should have | |||
already been passed through `BasicTokenizer`. | |||
Returns: | |||
A list of wordpiece tokens. | |||
""" | |||
output_tokens = [] | |||
for token in whitespace_tokenize(text): | |||
chars = list(token) | |||
if len(chars) > self.max_input_chars_per_word: | |||
output_tokens.append(self.unk_token) | |||
continue | |||
is_bad = False | |||
start = 0 | |||
sub_tokens = [] | |||
while start < len(chars): | |||
end = len(chars) | |||
cur_substr = None | |||
while start < end: | |||
substr = ''.join(chars[start:end]) | |||
if start > 0: | |||
substr = '##' + substr | |||
if substr in self.vocab: | |||
cur_substr = substr | |||
break | |||
end -= 1 | |||
if cur_substr is None: | |||
is_bad = True | |||
break | |||
sub_tokens.append(cur_substr) | |||
start = end | |||
if is_bad: | |||
output_tokens.append(self.unk_token) | |||
else: | |||
output_tokens.extend(sub_tokens) | |||
return output_tokens |
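# Hedged sketch: the greedy longest-match-first algorithm above, run on the
# docstring's own example with a toy vocabulary.
wp = WordpieceTokenizer(vocab={'un': 0, '##aff': 1, '##able': 2}, unk_token='[UNK]')
assert wp.tokenize('unaffable') == ['un', '##aff', '##able']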
@@ -0,0 +1,200 @@ | |||
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors. | |||
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. | |||
# All rights reserved. | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
"""Fast Tokenization classes for Sbert. mainly copied from :module:`~transformers.tokenization_bert_fast`""" | |||
from typing import List, Optional, Tuple | |||
import json | |||
import transformers | |||
from tokenizers import normalizers | |||
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast | |||
from modelscope.utils.logger import get_logger | |||
from .tokenization_sbert import SbertTokenizer | |||
logger = get_logger(__name__) | |||
VOCAB_FILES_NAMES = { | |||
'vocab_file': 'vocab.txt', | |||
'tokenizer_file': 'tokenizer.json' | |||
} | |||
PRETRAINED_VOCAB_FILES_MAP = { | |||
'vocab_file': {}, | |||
'tokenizer_file': {}, | |||
} | |||
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { | |||
'chinese_sbert-large-std-512': 512, | |||
'english_sbert-large-std-512': 512, | |||
} | |||
PRETRAINED_INIT_CONFIGURATION = { | |||
'english_sbert-large-std-512': { | |||
'do_lower_case': True | |||
}, | |||
} | |||
transformers.SLOW_TO_FAST_CONVERTERS[ | |||
'SbertTokenizer'] = transformers.SLOW_TO_FAST_CONVERTERS['BertTokenizer'] | |||
class SbertTokenizerFast(PreTrainedTokenizerFast): | |||
r""" | |||
Construct a "fast" SBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on WordPiece. | |||
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main | |||
methods. Users should refer to this superclass for more information regarding those methods. | |||
Args: | |||
vocab_file (:obj:`str`): | |||
File containing the vocabulary. | |||
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`): | |||
Whether or not to lowercase the input when tokenizing. | |||
unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`): | |||
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this | |||
token instead. | |||
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`): | |||
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for | |||
sequence classification or for a text and a question for question answering. It is also used as the last | |||
token of a sequence built with special tokens. | |||
pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`): | |||
The token used for padding, for example when batching sequences of different lengths. | |||
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`): | |||
The classifier token which is used when doing sequence classification (classification of the whole sequence | |||
instead of per-token classification). It is the first token of the sequence when built with special tokens. | |||
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`): | |||
The token used for masking values. This is the token used when training this model with masked language | |||
modeling. This is the token which the model will try to predict. | |||
clean_text (:obj:`bool`, `optional`, defaults to :obj:`True`): | |||
Whether or not to clean the text before tokenization by removing any control characters and replacing all | |||
whitespaces by the classic one. | |||
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`): | |||
Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see `this | |||
issue <https://github.com/huggingface/transformers/issues/328>`__). | |||
strip_accents: (:obj:`bool`, `optional`): | |||
Whether or not to strip all accents. If this option is not specified, then it will be determined by the | |||
value for :obj:`lowercase` (as in the original BERT). | |||
wordpieces_prefix: (:obj:`str`, `optional`, defaults to :obj:`"##"`): | |||
The prefix for subwords. | |||
""" | |||
vocab_files_names = VOCAB_FILES_NAMES | |||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP | |||
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION | |||
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES | |||
slow_tokenizer_class = SbertTokenizer | |||
def __init__(self, | |||
vocab_file=None, | |||
tokenizer_file=None, | |||
do_lower_case=True, | |||
unk_token='[UNK]', | |||
sep_token='[SEP]', | |||
pad_token='[PAD]', | |||
cls_token='[CLS]', | |||
mask_token='[MASK]', | |||
tokenize_chinese_chars=True, | |||
strip_accents=None, | |||
**kwargs): | |||
super().__init__( | |||
vocab_file, | |||
tokenizer_file=tokenizer_file, | |||
do_lower_case=do_lower_case, | |||
unk_token=unk_token, | |||
sep_token=sep_token, | |||
pad_token=pad_token, | |||
cls_token=cls_token, | |||
mask_token=mask_token, | |||
tokenize_chinese_chars=tokenize_chinese_chars, | |||
strip_accents=strip_accents, | |||
**kwargs, | |||
) | |||
pre_tok_state = json.loads( | |||
self.backend_tokenizer.normalizer.__getstate__()) | |||
if (pre_tok_state.get('lowercase', do_lower_case) != do_lower_case | |||
or pre_tok_state.get('strip_accents', | |||
strip_accents) != strip_accents): | |||
pre_tok_class = getattr(normalizers, pre_tok_state.pop('type')) | |||
pre_tok_state['lowercase'] = do_lower_case | |||
pre_tok_state['strip_accents'] = strip_accents | |||
self.backend_tokenizer.normalizer = pre_tok_class(**pre_tok_state) | |||
self.do_lower_case = do_lower_case | |||
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None): | |||
""" | |||
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and | |||
adding special tokens. A SBERT sequence has the following format: | |||
- single sequence: ``[CLS] X [SEP]`` | |||
- pair of sequences: ``[CLS] A [SEP] B [SEP]`` | |||
Args: | |||
token_ids_0 (:obj:`List[int]`): | |||
List of IDs to which the special tokens will be added. | |||
token_ids_1 (:obj:`List[int]`, `optional`): | |||
Optional second list of IDs for sequence pairs. | |||
Returns: | |||
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens. | |||
""" | |||
output = [self.cls_token_id] + token_ids_0 + [self.sep_token_id] | |||
if token_ids_1: | |||
output += token_ids_1 + [self.sep_token_id] | |||
return output | |||
def create_token_type_ids_from_sequences( | |||
self, | |||
token_ids_0: List[int], | |||
token_ids_1: Optional[List[int]] = None) -> List[int]: | |||
""" | |||
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A SBERT sequence | |||
pair mask has the following format: | |||
:: | |||
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 | |||
| first sequence | second sequence | | |||
If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s). | |||
Args: | |||
token_ids_0 (:obj:`List[int]`): | |||
List of IDs. | |||
token_ids_1 (:obj:`List[int]`, `optional`): | |||
Optional second list of IDs for sequence pairs. | |||
Returns: | |||
:obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given | |||
sequence(s). | |||
""" | |||
sep = [self.sep_token_id] | |||
cls = [self.cls_token_id] | |||
if token_ids_1 is None: | |||
return len(cls + token_ids_0 + sep) * [0] | |||
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 | |||
+ sep) * [1] | |||
def save_vocabulary(self, | |||
save_directory: str, | |||
filename_prefix: Optional[str] = None) -> Tuple[str]: | |||
files = self._tokenizer.model.save( | |||
save_directory, name=filename_prefix) | |||
return tuple(files) |
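A quick sanity check of the special-token layout documented above. This is a toy sketch with hypothetical token ids (cls=101, sep=102), not the real tokenizer:

```python
# Toy sketch: replicates the [CLS] X [SEP] / [CLS] A [SEP] B [SEP] layout
# and the matching token-type mask with made-up ids.
CLS, SEP = 101, 102

def build_inputs(ids_0, ids_1=None):
    out = [CLS] + ids_0 + [SEP]
    if ids_1:
        out += ids_1 + [SEP]
    return out

def token_type_ids(ids_0, ids_1=None):
    if ids_1 is None:
        return [0] * (len(ids_0) + 2)          # cls + tokens + sep
    return [0] * (len(ids_0) + 2) + [1] * (len(ids_1) + 1)

assert build_inputs([7, 8], [9]) == [101, 7, 8, 102, 9, 102]
assert token_type_ids([7, 8], [9]) == [0, 0, 0, 0, 1, 1]
```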
@@ -0,0 +1,86 @@ | |||
import os | |||
from typing import Any, Dict | |||
import json | |||
import numpy as np | |||
from modelscope.metainfo import TaskModels | |||
from modelscope.models.builder import MODELS | |||
from modelscope.models.nlp.task_models.task_model import \ | |||
SingleBackboneTaskModelBase | |||
from modelscope.outputs import OutputKeys | |||
from modelscope.utils.constant import Tasks | |||
__all__ = ['SequenceClassificationModel'] | |||
@MODELS.register_module( | |||
Tasks.sentiment_classification, module_name=TaskModels.text_classification) | |||
@MODELS.register_module( | |||
Tasks.text_classification, module_name=TaskModels.text_classification) | |||
class SequenceClassificationModel(SingleBackboneTaskModelBase): | |||
def __init__(self, model_dir: str, *args, **kwargs): | |||
"""initialize the sequence classification model from the `model_dir` path. | |||
Args: | |||
model_dir (str): the model path. | |||
""" | |||
super().__init__(model_dir, *args, **kwargs) | |||
if 'base_model_prefix' in kwargs: | |||
self._base_model_prefix = kwargs['base_model_prefix'] | |||
backbone_cfg = self.cfg.backbone | |||
head_cfg = self.cfg.head | |||
# get the num_labels from label_mapping.json | |||
self.id2label = {} | |||
self.label_path = os.path.join(model_dir, 'label_mapping.json') | |||
if os.path.exists(self.label_path): | |||
with open(self.label_path) as f: | |||
self.label_mapping = json.load(f) | |||
self.id2label = { | |||
idx: name | |||
for name, idx in self.label_mapping.items() | |||
} | |||
head_cfg['num_labels'] = len(self.label_mapping) | |||
self.build_backbone(backbone_cfg) | |||
self.build_head(head_cfg) | |||
def forward(self, input: Dict[str, Any]) -> Dict[str, np.ndarray]: | |||
outputs = super().forward(input) | |||
sequence_output, pooled_output = self.extract_backbone_outputs(outputs) | |||
outputs = self.head.forward(pooled_output) | |||
if 'labels' in input: | |||
loss = self.compute_loss(outputs, input['labels']) | |||
outputs.update(loss) | |||
return outputs | |||
def extract_logits(self, outputs): | |||
return outputs[OutputKeys.LOGITS].cpu().detach() | |||
def extract_backbone_outputs(self, outputs): | |||
sequence_output = None | |||
pooled_output = None | |||
if hasattr(self.backbone, 'extract_sequence_outputs'): | |||
sequence_output = self.backbone.extract_sequence_outputs(outputs) | |||
if hasattr(self.backbone, 'extract_pooled_outputs'): | |||
pooled_output = self.backbone.extract_pooled_outputs(outputs) | |||
return sequence_output, pooled_output | |||
def compute_loss(self, outputs, labels): | |||
loss = self.head.compute_loss(outputs, labels) | |||
return loss | |||
def postprocess(self, input, **kwargs): | |||
logits = self.extract_logits(input) | |||
probs = logits.softmax(-1).numpy() | |||
pred = logits.argmax(-1).numpy() | |||
logits = logits.numpy() | |||
res = { | |||
OutputKeys.PREDICTIONS: pred, | |||
OutputKeys.PROBABILITIES: probs, | |||
OutputKeys.LOGITS: logits | |||
} | |||
return res |
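For reference, the `postprocess` above boils down to a softmax/argmax over the logits. A minimal torch-only sketch with hypothetical logits:

```python
import torch

logits = torch.tensor([[2.0, 0.5, -1.0],    # hypothetical 3-class logits
                       [0.1, 0.2, 3.0]])
probs = logits.softmax(-1).numpy()          # -> OutputKeys.PROBABILITIES
pred = logits.argmax(-1).numpy()            # -> OutputKeys.PREDICTIONS
print(pred)                                 # [0 2]
```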
@@ -11,8 +11,8 @@ from modelscope.models.base import TorchModel | |||
from modelscope.models.builder import build_backbone, build_head | |||
from modelscope.utils.config import ConfigDict | |||
from modelscope.utils.constant import Fields, Tasks | |||
+from modelscope.utils.file_utils import func_receive_dict_inputs
from modelscope.utils.logger import get_logger
-from modelscope.utils.utils import if_func_receive_dict_inputs
logger = get_logger(__name__) | |||
@@ -424,12 +424,15 @@ class SingleBackboneTaskModelBase(BaseTaskModel): | |||
def forward(self, input: Dict[str, Any]) -> Dict[str, Any]: | |||
"""default forward method is the backbone-only forward""" | |||
-if if_func_receive_dict_inputs(self.backbone.forward):
+if func_receive_dict_inputs(self.backbone.forward):
outputs = self.backbone.forward(input) | |||
else: | |||
outputs = self.backbone.forward(**input) | |||
return outputs | |||
+def compute_loss(self, outputs: Dict[str, Any], labels):
+    raise NotImplementedError()
class EncoderDecoderTaskModelBase(BaseTaskModel): | |||
""" | |||
@@ -472,13 +475,13 @@ class EncoderDecoderTaskModelBase(BaseTaskModel): | |||
return getattr(self, self._decoder_prefix) | |||
def forward(self, input: Dict[str, Any]) -> Dict[str, Any]: | |||
-if if_func_receive_dict_inputs(self.encoder_.forward):
+if func_receive_dict_inputs(self.encoder_.forward):
encoder_outputs = self.encoder_.forward(input) | |||
else: | |||
encoder_outputs = self.encoder_.forward(**input) | |||
decoder_inputs = self.project_decoder_inputs_and_mediate( | |||
input, encoder_outputs) | |||
-if if_func_receive_dict_inputs(self.decoder_.forward):
+if func_receive_dict_inputs(self.decoder_.forward):
outputs = self.decoder_.forward(decoder_inputs) | |||
else: | |||
outputs = self.decoder_.forward(**decoder_inputs) |
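The renamed helper `func_receive_dict_inputs` decides between dict-style and kwargs-style forwarding. One plausible reimplementation via `inspect` (an assumption for illustration, not the actual modelscope helper):

```python
import inspect
from typing import Any, Dict

def func_receive_dict_inputs_sketch(fn) -> bool:
    # Plausible rule (assumed): a forward whose first non-self parameter
    # is literally named 'input'/'inputs' takes one dict; anything else
    # gets the dict unpacked as keyword arguments.
    params = [p.name for p in inspect.signature(fn).parameters.values()
              if p.name != 'self']
    return params[:1] in (['input'], ['inputs'])

def forward_dict(input: Dict[str, Any]):
    return input

def forward_kwargs(input_ids=None, attention_mask=None):
    return input_ids

assert func_receive_dict_inputs_sketch(forward_dict)
assert not func_receive_dict_inputs_sketch(forward_kwargs)
```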
@@ -0,0 +1,147 @@ | |||
from abc import abstractmethod | |||
from typing import Dict | |||
import numpy as np | |||
import torch | |||
from torch import nn | |||
from modelscope.metainfo import Models | |||
from modelscope.models.base import TorchModel | |||
from modelscope.models.builder import MODELS | |||
from modelscope.outputs import OutputKeys | |||
from modelscope.utils.constant import Tasks | |||
from modelscope.utils.hub import parse_label_mapping | |||
from modelscope.utils.tensor_utils import (torch_nested_detach, | |||
torch_nested_numpify) | |||
from .structbert import SbertPreTrainedModel | |||
__all__ = ['SbertForTokenClassification'] | |||
class TokenClassification(TorchModel): | |||
base_model_prefix: str = 'bert' | |||
def __init__(self, config, model_dir): | |||
super().__init__(model_dir) | |||
self.num_labels = config.num_labels | |||
self.config = config | |||
setattr(self, self.base_model_prefix, self.build_base_model()) | |||
classifier_dropout = ( | |||
config.classifier_dropout if config.classifier_dropout is not None | |||
else config.hidden_dropout_prob) | |||
self.dropout = nn.Dropout(classifier_dropout) | |||
self.classifier = nn.Linear(config.hidden_size, config.num_labels) | |||
@abstractmethod | |||
def build_base_model(self): | |||
"""Build the backbone model. | |||
Returns: the backbone instance. | |||
""" | |||
pass | |||
@property | |||
def base_model(self): | |||
return getattr(self, self.base_model_prefix) | |||
def compute_loss(self, logits, labels, **kwargs): | |||
"""Compute loss. | |||
For example, if the backbone is a pretrained model, there may be an 'attention_mask' parameter
used to skip useless tokens.
Args: | |||
logits: The logits from the classifier | |||
labels: The labels | |||
**kwargs: Other input params. | |||
Returns: Loss. | |||
""" | |||
pass | |||
def forward(self, **kwargs): | |||
labels = None | |||
if OutputKeys.LABEL in kwargs: | |||
labels = kwargs.pop(OutputKeys.LABEL) | |||
elif OutputKeys.LABELS in kwargs: | |||
labels = kwargs.pop(OutputKeys.LABELS) | |||
outputs = self.base_model(**kwargs) | |||
# base model should return the sequence_output as its first output | |||
sequence_output = outputs[0] | |||
sequence_output = self.dropout(sequence_output) | |||
logits = self.classifier(sequence_output) | |||
if labels is not None: | |||
loss = self.compute_loss(logits, labels, **kwargs) | |||
return {OutputKeys.LOGITS: logits, OutputKeys.LOSS: loss} | |||
return {OutputKeys.LOGITS: logits} | |||
def postprocess(self, input: Dict[str, np.ndarray], | |||
**kwargs) -> Dict[str, np.ndarray]: | |||
logits = input[OutputKeys.LOGITS] | |||
pred = torch.argmax(logits[0], dim=-1) | |||
pred = torch_nested_numpify(torch_nested_detach(pred)) | |||
logits = torch_nested_numpify(torch_nested_detach(logits)) | |||
rst = {OutputKeys.PREDICTIONS: pred, OutputKeys.LOGITS: logits} | |||
return rst | |||
@MODELS.register_module(Tasks.word_segmentation, module_name=Models.structbert) | |||
@MODELS.register_module( | |||
Tasks.token_classification, module_name=Models.structbert) | |||
class SbertForTokenClassification(TokenClassification, SbertPreTrainedModel): | |||
supports_gradient_checkpointing = True | |||
_keys_to_ignore_on_load_unexpected = [r'pooler'] | |||
def __init__(self, config, model_dir): | |||
if hasattr(config, 'base_model_prefix'): | |||
SbertForTokenClassification.base_model_prefix = config.base_model_prefix | |||
super().__init__(config, model_dir) | |||
def build_base_model(self): | |||
from .structbert import SbertModel | |||
return SbertModel(self.config, add_pooling_layer=False) | |||
def forward(self, | |||
input_ids=None, | |||
attention_mask=None, | |||
token_type_ids=None, | |||
labels=None, | |||
**kwargs): | |||
return super().forward( | |||
input_ids=input_ids, | |||
attention_mask=attention_mask, | |||
token_type_ids=token_type_ids, | |||
labels=labels) | |||
def compute_loss(self, logits, labels, attention_mask=None, **kwargs): | |||
loss_fct = nn.CrossEntropyLoss() | |||
# Only keep active parts of the loss | |||
if attention_mask is not None: | |||
active_loss = attention_mask.view(-1) == 1 | |||
active_logits = logits.view(-1, self.num_labels) | |||
active_labels = torch.where( | |||
active_loss, labels.view(-1), | |||
torch.tensor(loss_fct.ignore_index).type_as(labels)) | |||
return loss_fct(active_logits, active_labels) | |||
else: | |||
return loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) | |||
@classmethod | |||
def _instantiate(cls, **kwargs): | |||
model_dir = kwargs.get('model_dir') | |||
num_labels = kwargs.get('num_labels') | |||
if num_labels is None: | |||
label2id = parse_label_mapping(model_dir) | |||
if label2id is not None and len(label2id) > 0: | |||
num_labels = len(label2id) | |||
model_args = {} if num_labels is None else {'num_labels': num_labels} | |||
return super(SbertPreTrainedModel, | |||
SbertForTokenClassification).from_pretrained( | |||
pretrained_model_name_or_path=kwargs.get('model_dir'), | |||
model_dir=kwargs.get('model_dir'), | |||
**model_args) |
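A toy check of the attention-mask loss trick in `compute_loss` above; self-contained, with random logits:

```python
import torch
from torch import nn

loss_fct = nn.CrossEntropyLoss()
num_labels = 3
logits = torch.randn(1, 4, num_labels)          # (batch, seq_len, num_labels)
labels = torch.tensor([[1, 2, 0, 0]])
attention_mask = torch.tensor([[1, 1, 0, 0]])   # last two tokens are padding

active_loss = attention_mask.view(-1) == 1
active_labels = torch.where(
    active_loss, labels.view(-1),
    torch.tensor(loss_fct.ignore_index).type_as(labels))
loss = loss_fct(logits.view(-1, num_labels), active_labels)
# Only the two real tokens contribute; padded positions hit ignore_index.
```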
@@ -0,0 +1,43 @@ | |||
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors. | |||
# All rights reserved. | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
from typing import TYPE_CHECKING | |||
from modelscope.utils.import_utils import LazyImportModule | |||
if TYPE_CHECKING: | |||
from .configuration_veco import VecoConfig | |||
from .modeling_veco import (VecoForMaskedLM, VecoForSequenceClassification, | |||
VecoModel) | |||
from .tokenization_veco import VecoTokenizer | |||
from .tokenization_veco_fast import VecoTokenizerFast | |||
else: | |||
_import_structure = { | |||
'configuration_veco': ['VecoConfig'], | |||
'modeling_veco': | |||
['VecoForMaskedLM', 'VecoForSequenceClassification', 'VecoModel'], | |||
'tokenization_veco': ['VecoTokenizer'], | |||
'tokenization_veco_fast': ['VecoTokenizerFast'], | |||
} | |||
import sys | |||
sys.modules[__name__] = LazyImportModule( | |||
__name__, | |||
globals()['__file__'], | |||
_import_structure, | |||
module_spec=__spec__, | |||
extra_objects={}, | |||
) |
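A minimal sketch of the lazy-module idea used here. It mirrors the pattern, not the actual `LazyImportModule` implementation, which also handles module specs and extra objects:

```python
import importlib
import types

class LazyModuleSketch(types.ModuleType):
    """Map attribute names to submodules; import each submodule only on
    first attribute access instead of at package-import time."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        self._attr_to_module = {
            attr: mod
            for mod, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr):
        if attr not in self._attr_to_module:
            raise AttributeError(attr)
        module = importlib.import_module(
            f'.{self._attr_to_module[attr]}', self.__name__)
        return getattr(module, attr)
```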
@@ -0,0 +1,33 @@ | |||
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors. | |||
# Copyright 2018 The Google AI Language Team Authors. | |||
# Copyright 2020 The HuggingFace Inc. team. | |||
# Copyright (c) 2018, NVIDIA CORPORATION. | |||
# All rights reserved. | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
"""Veco configuration, mainly copied from :class:`~transformers.configuration_xlm_roberta` """ | |||
from transformers import RobertaConfig | |||
from modelscope.utils import logger as logging | |||
logger = logging.get_logger(__name__) | |||
class VecoConfig(RobertaConfig): | |||
""" | |||
This class overrides [`RobertaConfig`]. Please check the superclass for the appropriate | |||
documentation alongside usage examples. | |||
""" | |||
model_type = 'veco' |
@@ -0,0 +1,143 @@ | |||
# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team. | |||
# Copyright (c) 2018, NVIDIA CORPORATION. | |||
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors. | |||
# All rights reserved. | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
"""PyTorch Veco model. mainly copied from :module:`~transformers.modeling_xlm_roberta`""" | |||
from transformers import (RobertaForMaskedLM, RobertaForMultipleChoice, | |||
RobertaForQuestionAnswering, | |||
RobertaForSequenceClassification, | |||
RobertaForTokenClassification, RobertaModel) | |||
from transformers.file_utils import add_start_docstrings | |||
from modelscope.metainfo import Models | |||
from modelscope.models.builder import BACKBONES | |||
from modelscope.utils import logger as logging | |||
from modelscope.utils.constant import Fields | |||
from .configuration_veco import VecoConfig | |||
logger = logging.get_logger(__name__) | |||
VECO_PRETRAINED_MODEL_ARCHIVE_LIST = [] | |||
VECO_START_DOCSTRING = r""" | |||
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic | |||
methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, | |||
pruning heads etc.) | |||
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) | |||
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to | |||
general usage and behavior. | |||
Parameters: | |||
config ([`VecoConfig`]): Model configuration class with all the parameters of the | |||
model. Initializing with a config file does not load the weights associated with the model, only the | |||
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model | |||
weights. | |||
""" | |||
@add_start_docstrings( | |||
'The bare Veco Model transformer outputting raw hidden-states without any specific head on top.', | |||
VECO_START_DOCSTRING, | |||
) | |||
class VecoModel(RobertaModel): | |||
""" | |||
This class overrides [`RobertaModel`]. Please check the superclass for the appropriate | |||
documentation alongside usage examples. | |||
""" | |||
config_class = VecoConfig | |||
@add_start_docstrings( | |||
""" | |||
Veco Model transformer with a sequence classification/regression head on top (a linear layer on top of the | |||
pooled output) e.g. for GLUE tasks. | |||
""", | |||
VECO_START_DOCSTRING, | |||
) | |||
class VecoForSequenceClassification(RobertaForSequenceClassification): | |||
""" | |||
This class overrides [`RobertaForSequenceClassification`]. Please check the superclass for the | |||
appropriate documentation alongside usage examples. | |||
""" | |||
config_class = VecoConfig | |||
@add_start_docstrings( | |||
""" | |||
Veco Model transformer with a masked language model head on top (a linear layer on top of the | |||
pooled output). | |||
""", | |||
VECO_START_DOCSTRING, | |||
) | |||
class VecoForMaskedLM(RobertaForMaskedLM): | |||
""" | |||
This class overrides [`RobertaForMaskedLM`]. Please check the superclass for the | |||
appropriate documentation alongside usage examples. | |||
""" | |||
config_class = VecoConfig | |||
@add_start_docstrings( | |||
""" | |||
Veco Model with a multiple choice classification head on top (a linear layer on top of the pooled output and | |||
a softmax) e.g. for RocStories/SWAG tasks. | |||
""", | |||
VECO_START_DOCSTRING, | |||
) | |||
class VecoForMultipleChoice(RobertaForMultipleChoice): | |||
""" | |||
This class overrides [`RobertaForMultipleChoice`]. Please check the superclass for the | |||
appropriate documentation alongside usage examples. | |||
""" | |||
config_class = VecoConfig | |||
@add_start_docstrings( | |||
""" | |||
Veco Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. | |||
for Named-Entity-Recognition (NER) tasks. | |||
""", | |||
VECO_START_DOCSTRING, | |||
) | |||
class VecoForTokenClassification(RobertaForTokenClassification): | |||
""" | |||
This class overrides [`RobertaForTokenClassification`]. Please check the superclass for the | |||
appropriate documentation alongside usage examples. | |||
""" | |||
config_class = VecoConfig | |||
@add_start_docstrings( | |||
""" | |||
Veco Model with a span classification head on top for extractive question-answering tasks like SQuAD (a | |||
linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). | |||
""", | |||
VECO_START_DOCSTRING, | |||
) | |||
class VecoForQuestionAnswering(RobertaForQuestionAnswering): | |||
""" | |||
This class overrides [`RobertaForQuestionAnswering`]. Please check the superclass for the | |||
appropriate documentation alongside usage examples. | |||
""" | |||
config_class = VecoConfig |
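Since the Veco heads only override `config_class`, the standard transformers workflow applies. A hedged usage sketch, assuming the `modelscope.models.nlp.veco` package path shown in the lazy `__init__` above:

```python
import torch
from modelscope.models.nlp.veco import (VecoConfig,
                                        VecoForSequenceClassification)

config = VecoConfig(num_labels=3)              # RobertaConfig defaults otherwise
model = VecoForSequenceClassification(config)  # randomly initialized weights
input_ids = torch.tensor([[0, 5, 6, 2]])       # <s> ... </s> with toy ids
out = model(input_ids=input_ids)
print(out.logits.shape)                        # torch.Size([1, 3])
```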
@@ -0,0 +1,321 @@ | |||
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. | |||
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors. | |||
# All rights reserved. | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License | |||
"""Tokenization classes for Veco. mainly copied from :module:`~transformers.tokenization_xlm_roberta`""" | |||
import os | |||
from shutil import copyfile | |||
from typing import Any, Dict, List, Optional, Tuple | |||
import sentencepiece as spm | |||
from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer | |||
from modelscope.utils import logger as logging | |||
logger = logging.get_logger(__name__) | |||
SPIECE_UNDERLINE = '▁' | |||
VOCAB_FILES_NAMES = {'vocab_file': 'sentencepiece.bpe.model'} | |||
PRETRAINED_VOCAB_FILES_MAP = {'vocab_file': {}} | |||
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {} | |||
class VecoTokenizer(PreTrainedTokenizer): | |||
""" | |||
Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`]. Based on | |||
[SentencePiece](https://github.com/google/sentencepiece). | |||
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. | |||
Users should refer to this superclass for more information regarding those methods. | |||
Args: | |||
vocab_file (`str`): | |||
Path to the vocabulary file. | |||
bos_token (`str`, *optional*, defaults to `"<s>"`): | |||
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. | |||
<Tip> | |||
When building a sequence using special tokens, this is not the token that is used for the beginning of | |||
sequence. The token used is the `cls_token`. | |||
</Tip> | |||
eos_token (`str`, *optional*, defaults to `"</s>"`): | |||
The end of sequence token. | |||
<Tip> | |||
When building a sequence using special tokens, this is not the token that is used for the end of | |||
sequence. The token used is the `sep_token`. | |||
</Tip> | |||
sep_token (`str`, *optional*, defaults to `"</s>"`): | |||
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for | |||
sequence classification or for a text and a question for question answering. It is also used as the last | |||
token of a sequence built with special tokens. | |||
cls_token (`str`, *optional*, defaults to `"<s>"`): | |||
The classifier token which is used when doing sequence classification (classification of the whole sequence | |||
instead of per-token classification). It is the first token of the sequence when built with special tokens. | |||
unk_token (`str`, *optional*, defaults to `"<unk>"`): | |||
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this | |||
token instead. | |||
pad_token (`str`, *optional*, defaults to `"<pad>"`): | |||
The token used for padding, for example when batching sequences of different lengths. | |||
mask_token (`str`, *optional*, defaults to `"<mask>"`): | |||
The token used for masking values. This is the token used when training this model with masked language | |||
modeling. This is the token which the model will try to predict. | |||
additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`): | |||
Additional special tokens used by the tokenizer. | |||
sp_model_kwargs (`dict`, *optional*): | |||
Will be passed to the `SentencePieceProcessor.__init__()` method. | |||
The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) | |||
can be used, among other things, to set: | |||
- `enable_sampling`: Enable subword regularization. | |||
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout. | |||
- `nbest_size = {0,1}`: No sampling is performed. | |||
- `nbest_size > 1`: samples from the nbest_size results. | |||
- `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice) | |||
using forward-filtering-and-backward-sampling algorithm. | |||
- `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for | |||
BPE-dropout. | |||
Attributes: | |||
sp_model (`SentencePieceProcessor`): | |||
The *SentencePiece* processor that is used for every conversion (string, tokens and IDs). | |||
""" | |||
vocab_files_names = VOCAB_FILES_NAMES | |||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP | |||
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES | |||
model_input_names = ['input_ids', 'attention_mask'] | |||
def __init__(self, | |||
vocab_file, | |||
bos_token='<s>', | |||
eos_token='</s>', | |||
sep_token='</s>', | |||
cls_token='<s>', | |||
unk_token='<unk>', | |||
pad_token='<pad>', | |||
mask_token='<mask>', | |||
sp_model_kwargs: Optional[Dict[str, Any]] = None, | |||
**kwargs) -> None: | |||
# Mask token behave like a normal word, i.e. include the space before it | |||
mask_token = AddedToken( | |||
mask_token, lstrip=True, rstrip=False) if isinstance( | |||
mask_token, str) else mask_token | |||
self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs | |||
super().__init__( | |||
bos_token=bos_token, | |||
eos_token=eos_token, | |||
unk_token=unk_token, | |||
sep_token=sep_token, | |||
cls_token=cls_token, | |||
pad_token=pad_token, | |||
mask_token=mask_token, | |||
sp_model_kwargs=self.sp_model_kwargs, | |||
**kwargs, | |||
) | |||
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) | |||
self.sp_model.Load(str(vocab_file)) | |||
self.vocab_file = vocab_file | |||
# Original fairseq vocab and spm vocab must be "aligned": | |||
# Vocab | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |||
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ---- | |||
# fairseq | '<s>' | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's' | '▁de' | '-' | |||
# spm | '<unk>' | '<s>' | '</s>' | ',' | '.' | '▁' | 's' | '▁de' | '-' | '▁a' | |||
# Mimic fairseq token-to-id alignment for the first 4 tokens
self.fairseq_tokens_to_ids = { | |||
'<s>': 0, | |||
'<pad>': 1, | |||
'</s>': 2, | |||
'<unk>': 3 | |||
} | |||
# The first "real" token "," has position 4 in the original fairseq vocab and position 3 in the spm vocab | |||
self.fairseq_offset = 1 | |||
self.fairseq_tokens_to_ids['<mask>'] = len( | |||
self.sp_model) + self.fairseq_offset | |||
self.fairseq_ids_to_tokens = { | |||
v: k | |||
for k, v in self.fairseq_tokens_to_ids.items() | |||
} | |||
def __getstate__(self): | |||
state = self.__dict__.copy() | |||
state['sp_model'] = None | |||
state['sp_model_proto'] = self.sp_model.serialized_model_proto() | |||
return state | |||
def __setstate__(self, d): | |||
self.__dict__ = d | |||
# for backward compatibility | |||
if not hasattr(self, 'sp_model_kwargs'): | |||
self.sp_model_kwargs = {} | |||
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) | |||
self.sp_model.LoadFromSerializedProto(self.sp_model_proto) | |||
def build_inputs_with_special_tokens( | |||
self, | |||
token_ids_0: List[int], | |||
token_ids_1: Optional[List[int]] = None) -> List[int]: | |||
""" | |||
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and | |||
adding special tokens. A Veco sequence has the following format:
- single sequence: `<s> X </s>` | |||
- pair of sequences: `<s> A </s></s> B </s>` | |||
Args: | |||
token_ids_0 (`List[int]`): | |||
List of IDs to which the special tokens will be added. | |||
token_ids_1 (`List[int]`, *optional*): | |||
Optional second list of IDs for sequence pairs. | |||
Returns: | |||
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens. | |||
""" | |||
if token_ids_1 is None: | |||
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] | |||
cls = [self.cls_token_id] | |||
sep = [self.sep_token_id] | |||
return cls + token_ids_0 + sep + sep + token_ids_1 + sep | |||
def get_special_tokens_mask( | |||
self, | |||
token_ids_0: List[int], | |||
token_ids_1: Optional[List[int]] = None, | |||
already_has_special_tokens: bool = False) -> List[int]: | |||
""" | |||
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding | |||
special tokens using the tokenizer `prepare_for_model` method. | |||
Args: | |||
token_ids_0 (`List[int]`): | |||
List of IDs. | |||
token_ids_1 (`List[int]`, *optional*): | |||
Optional second list of IDs for sequence pairs. | |||
already_has_special_tokens (`bool`, *optional*, defaults to `False`): | |||
Whether or not the token list is already formatted with special tokens for the model. | |||
Returns: | |||
`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. | |||
""" | |||
if already_has_special_tokens: | |||
return super().get_special_tokens_mask( | |||
token_ids_0=token_ids_0, | |||
token_ids_1=token_ids_1, | |||
already_has_special_tokens=True) | |||
if token_ids_1 is None: | |||
return [1] + ([0] * len(token_ids_0)) + [1] | |||
return [1] + ([0] * len(token_ids_0)) + [1, 1] + ( | |||
[0] * len(token_ids_1)) + [1] | |||
def create_token_type_ids_from_sequences( | |||
self, | |||
token_ids_0: List[int], | |||
token_ids_1: Optional[List[int]] = None) -> List[int]: | |||
""" | |||
Create a mask from the two sequences passed to be used in a sequence-pair classification task. Veco does | |||
not make use of token type ids, therefore a list of zeros is returned. | |||
Args: | |||
token_ids_0 (`List[int]`): | |||
List of IDs. | |||
token_ids_1 (`List[int]`, *optional*): | |||
Optional second list of IDs for sequence pairs. | |||
Returns: | |||
`List[int]`: List of zeros. | |||
""" | |||
sep = [self.sep_token_id] | |||
cls = [self.cls_token_id] | |||
if token_ids_1 is None: | |||
return len(cls + token_ids_0 + sep) * [0] | |||
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0] | |||
@property | |||
def vocab_size(self): | |||
return len( | |||
self.sp_model) + self.fairseq_offset + 1 # Add the <mask> token | |||
def get_vocab(self): | |||
vocab = { | |||
self.convert_ids_to_tokens(i): i | |||
for i in range(self.vocab_size) | |||
} | |||
vocab.update(self.added_tokens_encoder) | |||
return vocab | |||
def _tokenize(self, text: str) -> List[str]: | |||
return self.sp_model.encode(text, out_type=str) | |||
def _convert_token_to_id(self, token): | |||
"""Converts a token (str) in an id using the vocab.""" | |||
if token in self.fairseq_tokens_to_ids: | |||
return self.fairseq_tokens_to_ids[token] | |||
spm_id = self.sp_model.PieceToId(token) | |||
# Need to return unknown token if the SP model returned 0 | |||
return spm_id + self.fairseq_offset if spm_id else self.unk_token_id | |||
def _convert_id_to_token(self, index): | |||
"""Converts an index (integer) in a token (str) using the vocab.""" | |||
if index in self.fairseq_ids_to_tokens: | |||
return self.fairseq_ids_to_tokens[index] | |||
return self.sp_model.IdToPiece(index - self.fairseq_offset) | |||
def convert_tokens_to_string(self, tokens): | |||
"""Converts a sequence of tokens (strings for sub-words) in a single string.""" | |||
out_string = ''.join(tokens).replace(SPIECE_UNDERLINE, ' ').strip() | |||
return out_string | |||
def save_vocabulary(self, | |||
save_directory: str, | |||
filename_prefix: Optional[str] = None) -> Tuple[str]: | |||
if not os.path.isdir(save_directory): | |||
logger.error( | |||
f'Vocabulary path ({save_directory}) should be a directory') | |||
return | |||
out_vocab_file = os.path.join( | |||
save_directory, (filename_prefix + '-' if filename_prefix else '') | |||
+ VOCAB_FILES_NAMES['vocab_file']) | |||
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file): | |||
copyfile(self.vocab_file, out_vocab_file) | |||
return (out_vocab_file, ) |
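A toy illustration of the fairseq/spm id alignment above, with a fake spm lookup standing in for `sp_model.PieceToId` (spm reserves id 0 for unknown pieces):

```python
fairseq_tokens_to_ids = {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3}
fairseq_offset = 1
fake_spm_piece_to_id = {',': 3, '.': 4, '▁': 5}   # fake PieceToId table

def convert_token_to_id(token):
    if token in fairseq_tokens_to_ids:
        return fairseq_tokens_to_ids[token]
    spm_id = fake_spm_piece_to_id.get(token, 0)   # 0 means unknown in spm
    return spm_id + fairseq_offset if spm_id else fairseq_tokens_to_ids['<unk>']

assert convert_token_to_id('<pad>') == 1
assert convert_token_to_id(',') == 4      # spm id 3 + offset 1
assert convert_token_to_id('oov') == 3    # falls back to <unk>
```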
@@ -0,0 +1,213 @@ | |||
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. | |||
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors. | |||
# All rights reserved. | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License | |||
"""Fast Tokenization classes for Veco. mainly copied from :module:`~transformers.tokenization_xlm_roberta_fast`""" | |||
import os | |||
from shutil import copyfile | |||
from typing import List, Optional, Tuple | |||
import transformers | |||
from transformers.file_utils import is_sentencepiece_available | |||
from transformers.tokenization_utils import AddedToken | |||
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast | |||
from modelscope.utils import logger as logging | |||
if is_sentencepiece_available(): | |||
from .tokenization_veco import VecoTokenizer | |||
else: | |||
VecoTokenizer = None | |||
logger = logging.get_logger(__name__) | |||
VOCAB_FILES_NAMES = { | |||
'vocab_file': 'sentencepiece.bpe.model', | |||
'tokenizer_file': 'tokenizer.json' | |||
} | |||
PRETRAINED_VOCAB_FILES_MAP = { | |||
'vocab_file': {}, | |||
'tokenizer_file': {}, | |||
} | |||
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {} | |||
transformers.SLOW_TO_FAST_CONVERTERS[ | |||
'VecoTokenizer'] = transformers.SLOW_TO_FAST_CONVERTERS[ | |||
'XLMRobertaTokenizer'] | |||
class VecoTokenizerFast(PreTrainedTokenizerFast): | |||
""" | |||
Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`]. | |||
Based on [BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models). | |||
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main | |||
methods. Users should refer to this superclass for more information regarding those methods. | |||
Args: | |||
vocab_file (`str`): | |||
Path to the vocabulary file. | |||
bos_token (`str`, *optional*, defaults to `"<s>"`): | |||
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. | |||
<Tip> | |||
When building a sequence using special tokens, this is not the token that is used for the beginning of | |||
sequence. The token used is the `cls_token`. | |||
</Tip> | |||
eos_token (`str`, *optional*, defaults to `"</s>"`): | |||
The end of sequence token. | |||
<Tip> | |||
When building a sequence using special tokens, this is not the token that is used for the end of | |||
sequence. The token used is the `sep_token`. | |||
</Tip> | |||
sep_token (`str`, *optional*, defaults to `"</s>"`): | |||
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for | |||
sequence classification or for a text and a question for question answering. It is also used as the last | |||
token of a sequence built with special tokens. | |||
cls_token (`str`, *optional*, defaults to `"<s>"`): | |||
The classifier token which is used when doing sequence classification (classification of the whole sequence | |||
instead of per-token classification). It is the first token of the sequence when built with special tokens. | |||
unk_token (`str`, *optional*, defaults to `"<unk>"`): | |||
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this | |||
token instead. | |||
pad_token (`str`, *optional*, defaults to `"<pad>"`): | |||
The token used for padding, for example when batching sequences of different lengths. | |||
mask_token (`str`, *optional*, defaults to `"<mask>"`): | |||
The token used for masking values. This is the token used when training this model with masked language | |||
modeling. This is the token which the model will try to predict. | |||
additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`): | |||
Additional special tokens used by the tokenizer. | |||
""" | |||
vocab_files_names = VOCAB_FILES_NAMES | |||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP | |||
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES | |||
model_input_names = ['input_ids', 'attention_mask'] | |||
slow_tokenizer_class = VecoTokenizer | |||
def __init__(self, | |||
vocab_file=None, | |||
tokenizer_file=None, | |||
bos_token='<s>', | |||
eos_token='</s>', | |||
sep_token='</s>', | |||
cls_token='<s>', | |||
unk_token='<unk>', | |||
pad_token='<pad>', | |||
mask_token='<mask>', | |||
**kwargs): | |||
# Mask token behave like a normal word, i.e. include the space before it | |||
mask_token = AddedToken( | |||
mask_token, lstrip=True, rstrip=False) if isinstance( | |||
mask_token, str) else mask_token | |||
super().__init__( | |||
vocab_file, | |||
tokenizer_file=tokenizer_file, | |||
bos_token=bos_token, | |||
eos_token=eos_token, | |||
sep_token=sep_token, | |||
cls_token=cls_token, | |||
unk_token=unk_token, | |||
pad_token=pad_token, | |||
mask_token=mask_token, | |||
**kwargs, | |||
) | |||
self.vocab_file = vocab_file | |||
self.can_save_slow_tokenizer = False if not self.vocab_file else True | |||
def build_inputs_with_special_tokens( | |||
self, | |||
token_ids_0: List[int], | |||
token_ids_1: Optional[List[int]] = None) -> List[int]: | |||
""" | |||
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and | |||
adding special tokens. A Veco sequence has the following format:
- single sequence: `<s> X </s>` | |||
- pair of sequences: `<s> A </s></s> B </s>` | |||
Args: | |||
token_ids_0 (`List[int]`): | |||
List of IDs to which the special tokens will be added. | |||
token_ids_1 (`List[int]`, *optional*): | |||
Optional second list of IDs for sequence pairs. | |||
Returns: | |||
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens. | |||
""" | |||
if token_ids_1 is None: | |||
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] | |||
cls = [self.cls_token_id] | |||
sep = [self.sep_token_id] | |||
return cls + token_ids_0 + sep + sep + token_ids_1 + sep | |||
def create_token_type_ids_from_sequences( | |||
self, | |||
token_ids_0: List[int], | |||
token_ids_1: Optional[List[int]] = None) -> List[int]: | |||
""" | |||
Create a mask from the two sequences passed to be used in a sequence-pair classification task. Veco does | |||
not make use of token type ids, therefore a list of zeros is returned. | |||
Args: | |||
token_ids_0 (`List[int]`): | |||
List of IDs. | |||
token_ids_1 (`List[int]`, *optional*): | |||
Optional second list of IDs for sequence pairs. | |||
Returns: | |||
`List[int]`: List of zeros. | |||
""" | |||
sep = [self.sep_token_id] | |||
cls = [self.cls_token_id] | |||
if token_ids_1 is None: | |||
return len(cls + token_ids_0 + sep) * [0] | |||
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0] | |||
def save_vocabulary(self, | |||
save_directory: str, | |||
filename_prefix: Optional[str] = None) -> Tuple[str]: | |||
if not self.can_save_slow_tokenizer: | |||
raise ValueError( | |||
'Your fast tokenizer does not have the necessary information to save the vocabulary for a slow ' | |||
'tokenizer.') | |||
if not os.path.isdir(save_directory): | |||
logger.error( | |||
f'Vocabulary path ({save_directory}) should be a directory.') | |||
return | |||
out_vocab_file = os.path.join( | |||
save_directory, (filename_prefix + '-' if filename_prefix else '') | |||
+ VOCAB_FILES_NAMES['vocab_file']) | |||
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file): | |||
copyfile(self.vocab_file, out_vocab_file) | |||
return (out_vocab_file, ) |
@@ -517,3 +517,10 @@ class MsDataset: | |||
def to_hf_dataset(self) -> Dataset: | |||
self._hf_ds.reset_format() | |||
return self._hf_ds | |||
@staticmethod | |||
def interleave_datasets(datasets: List[Any], | |||
probabilities: Optional[List[float]] = None, | |||
seed: Optional[int] = None): | |||
from datasets import interleave_datasets | |||
return interleave_datasets(datasets, probabilities, seed) |
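A hedged usage sketch of the new wrapper. It requires the `datasets` package; the toy data and the `modelscope.msdatasets` import path are illustrative assumptions:

```python
from datasets import Dataset
from modelscope.msdatasets import MsDataset  # assumed import path

ds_a = Dataset.from_dict({'text': ['a', 'b', 'c']})
ds_b = Dataset.from_dict({'text': ['x', 'y', 'z']})

# Sample rows from ds_a with probability 0.7 and ds_b with 0.3.
mixed = MsDataset.interleave_datasets([ds_a, ds_b],
                                      probabilities=[0.7, 0.3], seed=42)
print([row['text'] for row in mixed])
```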
@@ -9,6 +9,7 @@ class OutputKeys(object): | |||
SCORES = 'scores' | |||
LABEL = 'label' | |||
LABELS = 'labels' | |||
+INPUT_IDS = 'input_ids'
LABEL_POS = 'label_pos' | |||
POSES = 'poses' | |||
CAPTION = 'caption' | |||
@@ -9,9 +9,8 @@ if TYPE_CHECKING: | |||
from .dialog_state_tracking_pipeline import DialogStateTrackingPipeline | |||
from .fill_mask_pipeline import FillMaskPipeline | |||
from .named_entity_recognition_pipeline import NamedEntityRecognitionPipeline | |||
-from .nli_pipeline import NLIPipeline
from .sentence_similarity_pipeline import SentenceSimilarityPipeline
-from .sentiment_classification_pipeline import SentimentClassificationPipeline
+from .pair_sentence_classification_pipeline import PairSentenceClassificationPipeline
+from .single_sentence_classification_pipeline import SingleSentenceClassificationPipeline
from .sequence_classification_pipeline import SequenceClassificationPipeline | |||
from .text_generation_pipeline import TextGenerationPipeline | |||
from .translation_pipeline import TranslationPipeline | |||
@@ -28,10 +27,10 @@ else: | |||
'dialog_modeling_pipeline': ['DialogModelingPipeline'], | |||
'dialog_state_tracking_pipeline': ['DialogStateTrackingPipeline'], | |||
'fill_mask_pipeline': ['FillMaskPipeline'], | |||
-'nli_pipeline': ['NLIPipeline'],
'sentence_similarity_pipeline': ['SentenceSimilarityPipeline'],
-'sentiment_classification_pipeline':
-['SentimentClassificationPipeline'],
+'single_sentence_classification_pipeline':
+['SingleSentenceClassificationPipeline'],
+'pair_sentence_classification_pipeline':
+['PairSentenceClassificationPipeline'],
'sequence_classification_pipeline': ['SequenceClassificationPipeline'], | |||
'text_generation_pipeline': ['TextGenerationPipeline'], | |||
'word_segmentation_pipeline': ['WordSegmentationPipeline'], | |||
@@ -5,11 +5,10 @@ import torch | |||
from modelscope.metainfo import Pipelines | |||
from modelscope.models import Model | |||
-from modelscope.models.nlp.masked_language import MaskedLanguageModelBase
+from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import FillMaskPreprocessor
+from modelscope.preprocessors import FillMaskPreprocessor, Preprocessor
from modelscope.utils.config import Config | |||
from modelscope.utils.constant import ModelFile, Tasks | |||
@@ -21,18 +20,18 @@ _type_map = {'veco': 'roberta', 'sbert': 'bert'} | |||
class FillMaskPipeline(Pipeline): | |||
def __init__(self, | |||
-model: Union[MaskedLanguageModelBase, str],
-preprocessor: Optional[FillMaskPreprocessor] = None,
-first_sequence='sentense',
+model: Union[Model, str],
+preprocessor: Optional[Preprocessor] = None,
+first_sequence='sentence',
**kwargs): | |||
"""use `model` and `preprocessor` to create a nlp fill mask pipeline for prediction | |||
Args: | |||
-model (MaskedLanguageModelBase): a model instance
-preprocessor (FillMaskPreprocessor): a preprocessor instance
+model (Model): a model instance
+preprocessor (Preprocessor): a preprocessor instance
""" | |||
fill_mask_model = model if isinstance( | |||
-model, MaskedLanguageModelBase) else Model.from_pretrained(model)
+model, Model) else Model.from_pretrained(model)
if preprocessor is None: | |||
preprocessor = FillMaskPreprocessor( | |||
@@ -73,7 +72,7 @@ class FillMaskPipeline(Pipeline): | |||
def forward(self, inputs: Dict[str, Any], | |||
**forward_params) -> Dict[str, Any]: | |||
with torch.no_grad(): | |||
-return super().forward(inputs, **forward_params)
+return self.model(inputs, **forward_params)
def postprocess(self, inputs: Dict[str, Tensor]) -> Dict[str, Tensor]: | |||
"""process the prediction results | |||
@@ -85,8 +84,8 @@ class FillMaskPipeline(Pipeline): | |||
Dict[str, str]: the prediction results | |||
""" | |||
import numpy as np | |||
-logits = inputs['logits'].detach().cpu().numpy()
-input_ids = inputs['input_ids'].detach().cpu().numpy()
+logits = inputs[OutputKeys.LOGITS].detach().cpu().numpy()
+input_ids = inputs[OutputKeys.INPUT_IDS].detach().cpu().numpy()
pred_ids = np.argmax(logits, axis=-1) | |||
model_type = self.model.config.model_type | |||
process_type = model_type if model_type in self.mask_id else _type_map[ | |||
@@ -4,11 +4,10 @@ import torch | |||
from modelscope.metainfo import Pipelines | |||
from modelscope.models import Model | |||
-from modelscope.models.nlp import TransformerCRFForNamedEntityRecognition
from modelscope.outputs import OutputKeys
-from modelscope.pipelines.base import Pipeline, Tensor
+from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import NERPreprocessor
+from modelscope.preprocessors import NERPreprocessor, Preprocessor
from modelscope.utils.constant import Tasks | |||
__all__ = ['NamedEntityRecognitionPipeline'] | |||
@@ -20,13 +19,12 @@ __all__ = ['NamedEntityRecognitionPipeline'] | |||
class NamedEntityRecognitionPipeline(Pipeline): | |||
def __init__(self, | |||
-model: Union[TransformerCRFForNamedEntityRecognition, str],
-preprocessor: Optional[NERPreprocessor] = None,
+model: Union[Model, str],
+preprocessor: Optional[Preprocessor] = None,
**kwargs): | |||
model = model if isinstance(model, | |||
-TransformerCRFForNamedEntityRecognition
-) else Model.from_pretrained(model)
+Model) else Model.from_pretrained(model)
if preprocessor is None: | |||
preprocessor = NERPreprocessor(model.model_dir) | |||
model.eval() | |||
@@ -1,73 +0,0 @@ | |||
import uuid | |||
from typing import Any, Dict, Union | |||
import numpy as np | |||
import torch | |||
from modelscope.metainfo import Pipelines | |||
from modelscope.models import Model | |||
from modelscope.models.nlp import SbertForNLI | |||
from modelscope.outputs import OutputKeys | |||
from modelscope.pipelines.base import Pipeline | |||
from modelscope.pipelines.builder import PIPELINES | |||
from modelscope.preprocessors import NLIPreprocessor | |||
from modelscope.utils.constant import Tasks | |||
__all__ = ['NLIPipeline'] | |||
@PIPELINES.register_module(Tasks.nli, module_name=Pipelines.nli) | |||
class NLIPipeline(Pipeline): | |||
def __init__(self, | |||
model: Union[SbertForNLI, str], | |||
preprocessor: NLIPreprocessor = None, | |||
first_sequence='first_sequence', | |||
second_sequence='second_sequence', | |||
**kwargs): | |||
"""use `model` and `preprocessor` to create a nlp text classification pipeline for prediction | |||
Args: | |||
model (SbertForNLI): a model instance | |||
preprocessor (NLIPreprocessor): a preprocessor instance | |||
""" | |||
assert isinstance(model, str) or isinstance(model, SbertForNLI), \ | |||
'model must be a single str or SbertForNLI' | |||
model = model if isinstance( | |||
model, SbertForNLI) else Model.from_pretrained(model) | |||
if preprocessor is None: | |||
preprocessor = NLIPreprocessor( | |||
model.model_dir, | |||
first_sequence=first_sequence, | |||
second_sequence=second_sequence) | |||
model.eval() | |||
super().__init__(model=model, preprocessor=preprocessor, **kwargs) | |||
assert len(model.id2label) > 0 | |||
def forward(self, inputs: Dict[str, Any], | |||
**forward_params) -> Dict[str, Any]: | |||
with torch.no_grad(): | |||
return super().forward(inputs, **forward_params) | |||
def postprocess(self, | |||
inputs: Dict[str, Any], | |||
topk: int = 5) -> Dict[str, str]: | |||
"""process the prediction results | |||
Args: | |||
inputs (Dict[str, Any]): _description_ | |||
Returns: | |||
Dict[str, str]: the prediction results | |||
""" | |||
probs = inputs['probabilities'][0] | |||
num_classes = probs.shape[0] | |||
topk = min(topk, num_classes) | |||
top_indices = np.argpartition(probs, -topk)[-topk:] | |||
cls_ids = top_indices[np.argsort(probs[top_indices])] | |||
probs = probs[cls_ids].tolist() | |||
cls_names = [self.model.id2label[cid] for cid in cls_ids] | |||
return {OutputKeys.SCORES: probs, OutputKeys.LABELS: cls_names} |
@@ -0,0 +1,37 @@ | |||
from typing import Union | |||
from modelscope.models.base import Model | |||
from ...metainfo import Pipelines | |||
from ...preprocessors import (PairSentenceClassificationPreprocessor, | |||
Preprocessor) | |||
from ...utils.constant import Tasks | |||
from ..builder import PIPELINES | |||
from .sequence_classification_pipeline_base import \ | |||
SequenceClassificationPipelineBase | |||
__all__ = ['PairSentenceClassificationPipeline'] | |||
@PIPELINES.register_module(Tasks.nli, module_name=Pipelines.nli) | |||
@PIPELINES.register_module( | |||
Tasks.sentence_similarity, module_name=Pipelines.sentence_similarity) | |||
class PairSentenceClassificationPipeline(SequenceClassificationPipelineBase): | |||
def __init__(self, | |||
model: Union[Model, str], | |||
preprocessor: Preprocessor = None, | |||
first_sequence='first_sequence', | |||
second_sequence='second_sequence', | |||
**kwargs): | |||
"""use `model` and `preprocessor` to create a nlp pair sentence classification pipeline for prediction | |||
Args: | |||
model (Model): a model instance | |||
preprocessor (Preprocessor): a preprocessor instance | |||
""" | |||
if preprocessor is None: | |||
preprocessor = PairSentenceClassificationPreprocessor( | |||
model.model_dir if isinstance(model, Model) else model, | |||
first_sequence=first_sequence, | |||
second_sequence=second_sequence) | |||
super().__init__(model=model, preprocessor=preprocessor, **kwargs) |
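A hedged usage sketch for the new pair-sentence pipeline. The model id and the tuple input format below are assumptions for illustration, not guaranteed by this diff:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Model id is illustrative; the pair is fed as (first_sequence, second_sequence).
nli = pipeline(Tasks.nli, model='damo/nlp_structbert_nli_chinese-base')
result = nli(('今天天气不错', '今天天气很好'))
print(result)  # -> {'scores': [...], 'labels': [...]}
```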
@@ -1,73 +0,0 @@ | |||
from typing import Any, Dict, Union | |||
import numpy as np | |||
import torch | |||
from modelscope.metainfo import Pipelines | |||
from modelscope.models import Model | |||
from modelscope.models.nlp import SbertForSentenceSimilarity | |||
from modelscope.outputs import OutputKeys | |||
from modelscope.pipelines.base import Input, Pipeline | |||
from modelscope.pipelines.builder import PIPELINES | |||
from modelscope.preprocessors import SentenceSimilarityPreprocessor | |||
from modelscope.utils.constant import Tasks | |||
__all__ = ['SentenceSimilarityPipeline'] | |||
@PIPELINES.register_module( | |||
Tasks.sentence_similarity, module_name=Pipelines.sentence_similarity) | |||
class SentenceSimilarityPipeline(Pipeline): | |||
def __init__(self, | |||
model: Union[Model, str], | |||
preprocessor: SentenceSimilarityPreprocessor = None, | |||
first_sequence='first_sequence', | |||
second_sequence='second_sequence', | |||
**kwargs): | |||
"""use `model` and `preprocessor` to create a nlp sentence similarity pipeline for prediction | |||
Args: | |||
model (SbertForSentenceSimilarity): a model instance | |||
preprocessor (SentenceSimilarityPreprocessor): a preprocessor instance | |||
""" | |||
assert isinstance(model, str) or isinstance(model, SbertForSentenceSimilarity), \ | |||
'model must be a single str or SbertForSentenceSimilarity' | |||
sc_model = model if isinstance( | |||
model, | |||
SbertForSentenceSimilarity) else Model.from_pretrained(model) | |||
if preprocessor is None: | |||
preprocessor = SentenceSimilarityPreprocessor( | |||
sc_model.model_dir, | |||
first_sequence=first_sequence, | |||
second_sequence=second_sequence) | |||
sc_model.eval() | |||
super().__init__(model=sc_model, preprocessor=preprocessor, **kwargs) | |||
assert hasattr(self.model, 'id2label'), \ | |||
'id2label map should be initalizaed in init function.' | |||
def forward(self, inputs: Dict[str, Any], | |||
**forward_params) -> Dict[str, Any]: | |||
with torch.no_grad(): | |||
return super().forward(inputs, **forward_params) | |||
def postprocess(self, inputs: Dict[str, Any], | |||
**postprocess_params) -> Dict[str, str]: | |||
"""process the prediction results | |||
Args: | |||
inputs (Dict[str, Any]): _description_ | |||
Returns: | |||
Dict[str, str]: the prediction results | |||
""" | |||
probs = inputs['probabilities'][0] | |||
num_classes = probs.shape[0] | |||
top_indices = np.argpartition(probs, -num_classes)[-num_classes:] | |||
cls_ids = top_indices[np.argsort(-probs[top_indices], axis=-1)] | |||
probs = probs[cls_ids].tolist() | |||
cls_names = [self.model.id2label[cid] for cid in cls_ids] | |||
b = 0 | |||
return {OutputKeys.SCORES: probs[b], OutputKeys.LABELS: cls_names[b]} |
@@ -1,74 +0,0 @@ | |||
from typing import Any, Dict, Union | |||
import numpy as np | |||
import torch | |||
from modelscope.metainfo import Pipelines | |||
from modelscope.models import Model | |||
from modelscope.models.nlp import SequenceClassificationModel | |||
from modelscope.outputs import OutputKeys | |||
from modelscope.pipelines.base import Pipeline | |||
from modelscope.pipelines.builder import PIPELINES | |||
from modelscope.preprocessors import SentimentClassificationPreprocessor | |||
from modelscope.utils.constant import Tasks | |||
__all__ = ['SentimentClassificationPipeline'] | |||
@PIPELINES.register_module( | |||
Tasks.sentiment_classification, | |||
module_name=Pipelines.sentiment_classification) | |||
class SentimentClassificationPipeline(Pipeline): | |||
def __init__(self, | |||
model: Union[SequenceClassificationModel, str], | |||
preprocessor: SentimentClassificationPreprocessor = None, | |||
first_sequence='first_sequence', | |||
second_sequence='second_sequence', | |||
**kwargs): | |||
"""use `model` and `preprocessor` to create a nlp text classification pipeline for prediction | |||
Args: | |||
model (SequenceClassificationModel): a model instance | |||
preprocessor (SentimentClassificationPreprocessor): a preprocessor instance | |||
""" | |||
assert isinstance(model, str) or isinstance(model, SequenceClassificationModel), \ | |||
'model must be a single str or SentimentClassification' | |||
model = model if isinstance( | |||
model, | |||
SequenceClassificationModel) else Model.from_pretrained(model) | |||
if preprocessor is None: | |||
preprocessor = SentimentClassificationPreprocessor( | |||
model.model_dir, | |||
first_sequence=first_sequence, | |||
second_sequence=second_sequence) | |||
model.eval() | |||
super().__init__(model=model, preprocessor=preprocessor, **kwargs) | |||
assert len(model.id2label) > 0 | |||
def forward(self, inputs: Dict[str, Any], | |||
**forward_params) -> Dict[str, Any]: | |||
with torch.no_grad(): | |||
return super().forward(inputs, **forward_params) | |||
def postprocess(self, | |||
inputs: Dict[str, Any], | |||
topk: int = 5) -> Dict[str, str]: | |||
"""process the prediction results | |||
Args: | |||
inputs (Dict[str, Any]): _description_ | |||
Returns: | |||
Dict[str, str]: the prediction results | |||
""" | |||
probs = inputs['probabilities'][0] | |||
num_classes = probs.shape[0] | |||
topk = min(topk, num_classes) | |||
top_indices = np.argpartition(probs, -topk)[-topk:] | |||
cls_ids = top_indices[np.argsort(probs[top_indices])] | |||
probs = probs[cls_ids].tolist() | |||
cls_names = [self.model.id2label[cid] for cid in cls_ids] | |||
return {OutputKeys.SCORES: probs, OutputKeys.LABELS: cls_names} |
@@ -0,0 +1,60 @@
from typing import Any, Dict, Union

import numpy as np
import torch

from modelscope.models.base import Model
from modelscope.outputs import OutputKeys
from ...preprocessors import Preprocessor
from ..base import Pipeline


class SequenceClassificationPipelineBase(Pipeline):

    def __init__(self, model: Union[Model, str], preprocessor: Preprocessor,
                 **kwargs):
        """use `model` and `preprocessor` to create an NLP text classification pipeline for prediction

        Args:
            model (str or Model): a model instance
            preprocessor (Preprocessor): a preprocessor instance
        """
        assert isinstance(model, str) or isinstance(model, Model), \
            'model must be a single str or Model'
        model = model if isinstance(model,
                                    Model) else Model.from_pretrained(model)
        assert preprocessor is not None
        model.eval()
        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
        self.id2label = kwargs.get('id2label')
        if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
            self.id2label = self.preprocessor.id2label
        assert self.id2label is not None, 'Cannot convert id to the original label, please pass in the mapping ' \
                                          'as a parameter or make sure the preprocessor has the attribute.'

    def forward(self, inputs: Dict[str, Any],
                **forward_params) -> Dict[str, Any]:
        with torch.no_grad():
            return self.model(inputs, **forward_params)

    def postprocess(self,
                    inputs: Dict[str, Any],
                    topk: int = 5) -> Dict[str, str]:
        """process the prediction results

        Args:
            inputs (Dict[str, Any]): the model outputs
            topk (int): the number of top probabilities to return

        Returns:
            Dict[str, str]: the prediction results
        """
        probs = inputs[OutputKeys.PROBABILITIES][0]
        num_classes = probs.shape[0]
        topk = min(topk, num_classes)
        top_indices = np.argpartition(probs, -topk)[-topk:]
        cls_ids = top_indices[np.argsort(probs[top_indices])]
        probs = probs[cls_ids].tolist()
        cls_names = [self.id2label[cid] for cid in cls_ids]
        return {OutputKeys.SCORES: probs, OutputKeys.LABELS: cls_names}
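The top-k postprocess above avoids a full sort of the probability vector. A standalone sketch of the same selection logic, using a toy probability vector and label map (assumed values, not real model output):

import numpy as np

probs = np.array([0.05, 0.6, 0.1, 0.2, 0.05])
id2label = {0: 'anger', 1: 'joy', 2: 'fear', 3: 'surprise', 4: 'sadness'}

topk = min(3, probs.shape[0])
# argpartition places the topk largest entries in the last topk slots, unordered
top_indices = np.argpartition(probs, -topk)[-topk:]
# a small argsort then orders just those topk entries (ascending by score)
cls_ids = top_indices[np.argsort(probs[top_indices])]
print([(id2label[i], float(probs[i])) for i in cls_ids])
# [('fear', 0.1), ('surprise', 0.2), ('joy', 0.6)]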
@@ -0,0 +1,35 @@
from typing import Union

from ...metainfo import Pipelines
from ...models import Model
from ...preprocessors import (Preprocessor,
                              SingleSentenceClassificationPreprocessor)
from ...utils.constant import Tasks
from ..builder import PIPELINES
from .sequence_classification_pipeline_base import \
    SequenceClassificationPipelineBase

__all__ = ['SingleSentenceClassificationPipeline']


@PIPELINES.register_module(
    Tasks.sentiment_classification,
    module_name=Pipelines.sentiment_classification)
class SingleSentenceClassificationPipeline(SequenceClassificationPipelineBase):

    def __init__(self,
                 model: Union[Model, str],
                 preprocessor: Preprocessor = None,
                 first_sequence='first_sequence',
                 **kwargs):
        """use `model` and `preprocessor` to create an NLP single-sentence classification pipeline for prediction

        Args:
            model (Model): a model instance
            preprocessor (Preprocessor): a preprocessor instance
        """
        if preprocessor is None:
            preprocessor = SingleSentenceClassificationPreprocessor(
                model.model_dir if isinstance(model, Model) else model,
                first_sequence=first_sequence)
        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
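A hedged usage sketch of the registered pipeline; the model id is a placeholder, and `pipeline()` is assumed to resolve the sentiment task to this class through the PIPELINES registry:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# '<model-id-or-local-dir>' is a placeholder, not a tested model id
pipe = pipeline(Tasks.sentiment_classification, model='<model-id-or-local-dir>')
result = pipe('这家餐厅的菜很好吃')
# per the base class postprocess, result carries 'scores' and 'labels' keys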
@@ -3,7 +3,7 @@ from typing import Any, Dict, Optional, Union
import torch

from modelscope.metainfo import Pipelines
from modelscope.models.base import TorchModel
from modelscope.models.base import Model
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import TextGenerationPreprocessor

@@ -17,7 +17,7 @@ __all__ = ['TextGenerationPipeline']
class TextGenerationPipeline(Pipeline):

    def __init__(self,
                 model: Union[TorchModel, str],
                 model: Union[Model, str],
                 preprocessor: Optional[TextGenerationPreprocessor] = None,
                 **kwargs):
        """use `model` and `preprocessor` to create a nlp text generation pipeline for prediction
@@ -26,8 +26,8 @@ class TextGenerationPipeline(Pipeline):
            model (PalmForTextGeneration): a model instance
            preprocessor (TextGenerationPreprocessor): a preprocessor instance
        """
        model = model if isinstance(
            model, TorchModel) else TorchModel.from_pretrained(model)
        model = model if isinstance(model,
                                    Model) else Model.from_pretrained(model)
        if preprocessor is None:
            preprocessor = TextGenerationPreprocessor(
                model.model_dir,
@@ -4,11 +4,9 @@ from typing import Any, Dict
import numpy as np
import tensorflow as tf

from modelscope.hub.snapshot_download import snapshot_download
from modelscope.metainfo import Pipelines
from modelscope.models.nlp import CsanmtForTranslation
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.utils.constant import ModelFile, Tasks
from modelscope.utils.logger import get_logger
@@ -4,11 +4,11 @@ import torch
from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.models.nlp import SbertForTokenClassification
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import TokenClassificationPreprocessor
from modelscope.preprocessors import (Preprocessor,
                                      TokenClassificationPreprocessor)
from modelscope.utils.constant import Tasks

__all__ = ['WordSegmentationPipeline']

@@ -18,33 +18,35 @@ __all__ = ['WordSegmentationPipeline']
    Tasks.word_segmentation, module_name=Pipelines.word_segmentation)
class WordSegmentationPipeline(Pipeline):

    def __init__(
            self,
            model: Union[SbertForTokenClassification, str],
            preprocessor: Optional[TokenClassificationPreprocessor] = None,
            **kwargs):
    def __init__(self,
                 model: Union[Model, str],
                 preprocessor: Optional[Preprocessor] = None,
                 **kwargs):
        """use `model` and `preprocessor` to create a nlp word segmentation pipeline for prediction

        Args:
            model (StructBertForTokenClassification): a model instance
            preprocessor (TokenClassificationPreprocessor): a preprocessor instance
            model (Model): a model instance
            preprocessor (Preprocessor): a preprocessor instance
        """
        model = model if isinstance(
            model,
            SbertForTokenClassification) else Model.from_pretrained(model)
        model = model if isinstance(model,
                                    Model) else Model.from_pretrained(model)
        if preprocessor is None:
            preprocessor = TokenClassificationPreprocessor(model.model_dir)
        model.eval()
        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
        self.tokenizer = preprocessor.tokenizer
        self.config = model.config
        assert len(self.config.id2label) > 0
        self.id2label = self.config.id2label
        self.id2label = kwargs.get('id2label')
        if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
            self.id2label = self.preprocessor.id2label
        assert self.id2label is not None, 'Cannot convert id to the original label, please pass in the mapping ' \
                                          'as a parameter or make sure the preprocessor has the attribute.'

    def forward(self, inputs: Dict[str, Any],
                **forward_params) -> Dict[str, Any]:
        text = inputs.pop(OutputKeys.TEXT)
        with torch.no_grad():
            return super().forward(inputs, **forward_params)
            return {
                **self.model(inputs, **forward_params), OutputKeys.TEXT: text
            }

    def postprocess(self, inputs: Dict[str, Any],
                    **postprocess_params) -> Dict[str, str]:
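The forward change above threads the raw text around the model call so postprocess can align the predicted labels back to the original characters. A minimal dependency-free sketch of the pattern (fake model and hypothetical keys, for illustration only):

def forward_with_text_passthrough(model, inputs):
    text = inputs.pop('text')            # the model must not receive the raw string
    outputs = model(inputs)
    return {**outputs, 'text': text}     # re-attach the text for postprocess

fake_model = lambda d: {'predictions': [0, 1, 1]}   # stands in for the real forward
print(forward_with_text_passthrough(
    fake_model, {'text': '今天天气不错', 'input_ids': [1, 2, 3]}))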
@@ -5,11 +5,11 @@ from scipy.special import softmax
from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.models.nlp import SbertForZeroShotClassification
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import ZeroShotClassificationPreprocessor
from modelscope.preprocessors import (Preprocessor,
                                      ZeroShotClassificationPreprocessor)
from modelscope.utils.constant import Tasks

__all__ = ['ZeroShotClassificationPipeline']

@@ -21,19 +21,18 @@ __all__ = ['ZeroShotClassificationPipeline']
class ZeroShotClassificationPipeline(Pipeline):

    def __init__(self,
                 model: Union[SbertForZeroShotClassification, str],
                 preprocessor: ZeroShotClassificationPreprocessor = None,
                 model: Union[Model, str],
                 preprocessor: Preprocessor = None,
                 **kwargs):
        """use `model` and `preprocessor` to create a nlp text classification pipeline for prediction
        """use `model` and `preprocessor` to create an NLP zero-shot text classification pipeline for prediction

        Args:
            model (SbertForZeroShotClassification): a model instance
            preprocessor (SentimentClassificationPreprocessor): a preprocessor instance
            model (Model): a model instance
            preprocessor (Preprocessor): a preprocessor instance
        """
        assert isinstance(model, str) or isinstance(model, SbertForZeroShotClassification), \
            'model must be a single str or SbertForZeroShotClassification'
        model = model if isinstance(
            model,
            SbertForZeroShotClassification) else Model.from_pretrained(model)
        assert isinstance(model, str) or isinstance(model, Model), \
            'model must be a single str or Model'
        model = model if isinstance(model,
                                    Model) else Model.from_pretrained(model)
        self.entailment_id = 0
        self.contradiction_id = 2
        if preprocessor is None:

@@ -58,7 +57,7 @@ class ZeroShotClassificationPipeline(Pipeline):
    def forward(self, inputs: Dict[str, Any],
                **forward_params) -> Dict[str, Any]:
        with torch.no_grad():
            return super().forward(inputs, **forward_params)
            return self.model(inputs, **forward_params)

    def postprocess(self,
                    inputs: Dict[str, Any],
@@ -70,7 +69,7 @@ class ZeroShotClassificationPipeline(Pipeline):
        Returns:
            Dict[str, Any]: the prediction results
        """
        logits = inputs['logits']
        logits = inputs[OutputKeys.LOGITS]
        if multi_label or len(candidate_labels) == 1:
            logits = logits[..., [self.contradiction_id, self.entailment_id]]
        scores = softmax(logits, axis=-1)[..., 1]
@@ -18,11 +18,11 @@ if TYPE_CHECKING:
        MPlugVisualQuestionAnsweringPreprocessor)
    from .nlp import (Tokenize, SequenceClassificationPreprocessor,
                      TextGenerationPreprocessor,
                      TokenClassificationPreprocessor, NLIPreprocessor,
                      SentimentClassificationPreprocessor,
                      SentenceSimilarityPreprocessor, FillMaskPreprocessor,
                      ZeroShotClassificationPreprocessor, NERPreprocessor,
                      TextErrorCorrectionPreprocessor)
                      TokenClassificationPreprocessor,
                      SingleSentenceClassificationPreprocessor,
                      PairSentenceClassificationPreprocessor,
                      FillMaskPreprocessor, ZeroShotClassificationPreprocessor,
                      NERPreprocessor, TextErrorCorrectionPreprocessor)
    from .space import (DialogIntentPredictionPreprocessor,
                        DialogModelingPreprocessor,
                        DialogStateTrackingPreprocessor)
@@ -46,8 +46,8 @@ else:
        'nlp': [
            'Tokenize', 'SequenceClassificationPreprocessor',
            'TextGenerationPreprocessor', 'TokenClassificationPreprocessor',
            'NLIPreprocessor', 'SentimentClassificationPreprocessor',
            'SentenceSimilarityPreprocessor', 'FillMaskPreprocessor',
            'SingleSentenceClassificationPreprocessor',
            'PairSentenceClassificationPreprocessor', 'FillMaskPreprocessor',
            'ZeroShotClassificationPreprocessor', 'NERPreprocessor',
            'TextErrorCorrectionPreprocessor'
        ],
@@ -1,5 +1,5 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
import os
from abc import ABC, abstractmethod
from typing import Any, Dict

@@ -10,6 +10,8 @@ class Preprocessor(ABC):

    def __init__(self, *args, **kwargs):
        self._mode = ModeKeys.INFERENCE
        self.device = int(
            os.environ['LOCAL_RANK']) if 'LOCAL_RANK' in os.environ else None
        pass

    @abstractmethod
@@ -2,14 +2,14 @@
import os.path as osp
import uuid
from typing import Any, Dict, Optional, Union
from typing import Any, Dict, Iterable, Optional, Tuple, Union

from transformers import AutoTokenizer

from modelscope.metainfo import Preprocessors
from modelscope.models import Model
from modelscope.metainfo import Models, Preprocessors
from modelscope.outputs import OutputKeys
from modelscope.utils.constant import Fields, InputFields, ModeKeys
from modelscope.utils.hub import parse_label_mapping
from modelscope.utils.hub import get_model_type, parse_label_mapping
from modelscope.utils.type_assert import type_assert
from .base import Preprocessor
from .builder import PREPROCESSORS

@@ -17,8 +17,8 @@ from .builder import PREPROCESSORS
__all__ = [
    'Tokenize', 'SequenceClassificationPreprocessor',
    'TextGenerationPreprocessor', 'TokenClassificationPreprocessor',
    'NLIPreprocessor', 'SentimentClassificationPreprocessor',
    'FillMaskPreprocessor', 'SentenceSimilarityPreprocessor',
    'PairSentenceClassificationPreprocessor',
    'SingleSentenceClassificationPreprocessor', 'FillMaskPreprocessor',
    'ZeroShotClassificationPreprocessor', 'NERPreprocessor',
    'TextErrorCorrectionPreprocessor'
]
@@ -38,99 +38,6 @@ class Tokenize(Preprocessor):
        return data


class NLPPreprocessorBase(Preprocessor):

    def __init__(self, model_dir: str, *args, **kwargs):
        """preprocess the data via the vocab.txt from the `model_dir` path

        Args:
            model_dir (str): model path
        """
        super().__init__(*args, **kwargs)
        self.model_dir: str = model_dir
        self.first_sequence: str = kwargs.pop('first_sequence',
                                              'first_sequence')
        self.second_sequence = kwargs.pop('second_sequence', 'second_sequence')
        self.tokenize_kwargs = kwargs
        self.tokenizer = self.build_tokenizer(model_dir)
        self.label2id = parse_label_mapping(self.model_dir)

    def build_tokenizer(self, model_dir):
        from sofa import SbertTokenizer
        return SbertTokenizer.from_pretrained(model_dir)

    @type_assert(object, object)
    def __call__(self, data: Union[str, tuple, Dict]) -> Dict[str, Any]:
        """process the raw input data

        Args:
            data (tuple): [sentence1, sentence2]
                sentence1 (str): a sentence
                    Example:
                        'you are so handsome.'
                sentence2 (str): a sentence
                    Example:
                        'you are so beautiful.'

        Returns:
            Dict[str, Any]: the preprocessed data
        """
        text_a, text_b = None, None
        if isinstance(data, str):
            text_a = data
        elif isinstance(data, tuple):
            assert len(data) == 2
            text_a, text_b = data
        elif isinstance(data, dict):
            text_a = data.get(self.first_sequence)
            text_b = data.get(self.second_sequence, None)
        rst = self.tokenizer(text_a, text_b, **self.tokenize_kwargs)
        if self._mode == ModeKeys.TRAIN:
            rst = {k: v.squeeze() for k, v in rst.items()}
        if self.label2id is not None and 'label' in data:
            rst['label'] = self.label2id[str(data['label'])]
        return rst


@PREPROCESSORS.register_module(
    Fields.nlp, module_name=Preprocessors.nli_tokenizer)
class NLIPreprocessor(NLPPreprocessorBase):

    def __init__(self, model_dir: str, *args, **kwargs):
        kwargs['truncation'] = True
        kwargs['padding'] = False
        kwargs['return_tensors'] = 'pt'
        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
        super().__init__(model_dir, *args, **kwargs)


@PREPROCESSORS.register_module(
    Fields.nlp, module_name=Preprocessors.sen_cls_tokenizer)
class SentimentClassificationPreprocessor(NLPPreprocessorBase):

    def __init__(self, model_dir: str, *args, **kwargs):
        kwargs['truncation'] = True
        kwargs['padding'] = 'max_length'
        kwargs['return_tensors'] = 'pt'
        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
        super().__init__(model_dir, *args, **kwargs)


@PREPROCESSORS.register_module(
    Fields.nlp, module_name=Preprocessors.sen_sim_tokenizer)
class SentenceSimilarityPreprocessor(NLPPreprocessorBase):

    def __init__(self, model_dir: str, *args, **kwargs):
        kwargs['truncation'] = True
        kwargs['padding'] = False if 'padding' not in kwargs else kwargs[
            'padding']
        kwargs['return_tensors'] = 'pt'
        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
        super().__init__(model_dir, *args, **kwargs)


@PREPROCESSORS.register_module(
    Fields.nlp, module_name=Preprocessors.bert_seq_cls_tokenizer)
class SequenceClassificationPreprocessor(Preprocessor):
@@ -197,32 +104,193 @@ class SequenceClassificationPreprocessor(Preprocessor):
        return rst


class NLPTokenizerPreprocessorBase(Preprocessor):

    def __init__(self, model_dir: str, pair: bool, mode: str, **kwargs):
        """preprocess the data via the vocab.txt from the `model_dir` path

        Args:
            model_dir (str): model path
        """
        super().__init__(**kwargs)
        self.model_dir: str = model_dir
        self.first_sequence: str = kwargs.pop('first_sequence',
                                              'first_sequence')
        self.second_sequence = kwargs.pop('second_sequence', 'second_sequence')
        self.pair = pair
        self._mode = mode
        self.label = kwargs.pop('label', OutputKeys.LABEL)
        self.label2id = None
        if 'label2id' in kwargs:
            self.label2id = kwargs.pop('label2id')
        if self.label2id is None:
            self.label2id = parse_label_mapping(self.model_dir)
        self.tokenize_kwargs = kwargs
        self.tokenizer = self.build_tokenizer(model_dir)

    @property
    def id2label(self):
        if self.label2id is not None:
            return {id: label for label, id in self.label2id.items()}
        return None

    def build_tokenizer(self, model_dir):
        model_type = get_model_type(model_dir)
        if model_type in (Models.structbert, Models.gpt3, Models.palm):
            from modelscope.models.nlp.structbert import SbertTokenizerFast
            return SbertTokenizerFast.from_pretrained(model_dir)
        elif model_type == Models.veco:
            from modelscope.models.nlp.veco import VecoTokenizerFast
            return VecoTokenizerFast.from_pretrained(model_dir)
        else:
            return AutoTokenizer.from_pretrained(model_dir)

    def __call__(self, data: Union[str, Tuple, Dict]) -> Dict[str, Any]:
        """process the raw input data

        Args:
            data (tuple): [sentence1, sentence2]
                sentence1 (str): a sentence
                    Example:
                        'you are so handsome.'
                sentence2 (str): a sentence
                    Example:
                        'you are so beautiful.'

        Returns:
            Dict[str, Any]: the preprocessed data
        """
        text_a, text_b, labels = self.parse_text_and_label(data)
        output = self.tokenizer(
            text_a,
            text_b,
            return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
            **self.tokenize_kwargs)
        self.labels_to_id(labels, output)
        return output

    def parse_text_and_label(self, data):
        text_a, text_b, labels = None, None, None
        if isinstance(data, str):
            text_a = data
        elif isinstance(data, tuple) or isinstance(data, list):
            if len(data) == 3:
                text_a, text_b, labels = data
            elif len(data) == 2:
                if self.pair:
                    text_a, text_b = data
                else:
                    text_a, labels = data
        elif isinstance(data, dict):
            text_a = data.get(self.first_sequence)
            text_b = data.get(self.second_sequence)
            labels = data.get(self.label)
        return text_a, text_b, labels

    def labels_to_id(self, labels, output):

        def label_can_be_mapped(label):
            return isinstance(label, str) or isinstance(label, int)

        if labels is not None:
            if isinstance(labels, Iterable) and all([label_can_be_mapped(label) for label in labels]) \
                    and self.label2id is not None:
                output[OutputKeys.LABEL] = [
                    self.label2id[str(label)] for label in labels
                ]
            elif label_can_be_mapped(labels) and self.label2id is not None:
                output[OutputKeys.LABEL] = self.label2id[str(labels)]
            else:
                output[OutputKeys.LABEL] = labels
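The two-element case in parse_text_and_label is the subtle one: the same tuple shape means a sentence pair for pair models but a (sentence, label) sample for single-sentence models. A pure-python sketch of that dispatch (simplified standalone function, not the method itself):

def parse_text_and_label(data, pair, first='first_sequence',
                         second='second_sequence', label='label'):
    text_a, text_b, labels = None, None, None
    if isinstance(data, str):
        text_a = data
    elif isinstance(data, (tuple, list)):
        if len(data) == 3:
            text_a, text_b, labels = data
        elif len(data) == 2:
            # ambiguous: a sentence pair for pair models,
            # a (sentence, label) sample for single-sentence models
            text_a, text_b = data if pair else (data[0], None)
            labels = None if pair else data[1]
    elif isinstance(data, dict):
        text_a, text_b, labels = data.get(first), data.get(second), data.get(label)
    return text_a, text_b, labels

print(parse_text_and_label(('premise', 'hypothesis'), pair=True))
# ('premise', 'hypothesis', None)
print(parse_text_and_label(('a sentence', 1), pair=False))
# ('a sentence', None, 1)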
@PREPROCESSORS.register_module(
    Fields.nlp, module_name='bert-seq-cls-tokenizer-finetune')
class SentenceSimilarityFinetunePreprocessor(SentenceSimilarityPreprocessor):
    """Sentence similarity preprocessor in the finetune scenario

    Mainly added the label mapping procedure.
    """

    def __init__(self, model_dir: str, *args, **kwargs):
        kwargs['padding'] = 'max_length'
        super().__init__(model_dir, *args, **kwargs)


@PREPROCESSORS.register_module(
    Fields.nlp, module_name=Preprocessors.nli_tokenizer)
@PREPROCESSORS.register_module(
    Fields.nlp, module_name=Preprocessors.sen_sim_tokenizer)
class PairSentenceClassificationPreprocessor(NLPTokenizerPreprocessorBase):

    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
        kwargs['truncation'] = kwargs.get('truncation', True)
        kwargs['padding'] = kwargs.get(
            'padding', False if mode == 'inference' else 'max_length')
        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
        super().__init__(model_dir, pair=True, mode=mode, **kwargs)
@PREPROCESSORS.register_module(
    Fields.nlp, module_name=Preprocessors.sen_cls_tokenizer)
class SingleSentenceClassificationPreprocessor(NLPTokenizerPreprocessorBase):

    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
        kwargs['truncation'] = kwargs.get('truncation', True)
        kwargs['padding'] = kwargs.get(
            'padding', False if mode == 'inference' else 'max_length')
        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
        super().__init__(model_dir, pair=False, mode=mode, **kwargs)
@PREPROCESSORS.register_module(
    Fields.nlp, module_name=Preprocessors.zero_shot_cls_tokenizer)
class ZeroShotClassificationPreprocessor(NLPTokenizerPreprocessorBase):

    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
        """preprocess the data via the vocab.txt from the `model_dir` path

        Args:
            model_dir (str): model path
        """
        self.sequence_length = kwargs.pop('sequence_length', 512)
        super().__init__(model_dir, pair=False, mode=mode, **kwargs)

    def __call__(self, data: Union[str, Dict], hypothesis_template: str,
                 candidate_labels: list) -> Dict[str, Any]:
        """process the raw input data

        Args:
            data (str or dict): a sentence
                Example:
                    'you are so handsome.'

        Returns:
            Dict[str, Any]: the preprocessed data
        """
        if isinstance(data, dict):
            data = data.get(self.first_sequence)

        pairs = [[data, hypothesis_template.format(label)]
                 for label in candidate_labels]
        features = self.tokenizer(
            pairs,
            padding=True,
            truncation=True,
            max_length=self.sequence_length,
            truncation_strategy='only_first',
            return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None)
        return features
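The zero-shot preprocessor reduces classification to NLI: one premise is paired with one templated hypothesis per candidate label, and the entailment probability of pair i becomes the score of label i. A dependency-free sketch of the pair construction (toy sentence and labels):

sentence = 'The team released a new open-source NLP toolkit.'
hypothesis_template = 'This text is about {}.'
candidate_labels = ['technology', 'sports', 'politics']

# each pair is later scored by the NLI head; entailment prob of pairs[i]
# becomes the score of candidate_labels[i]
pairs = [[sentence, hypothesis_template.format(label)]
         for label in candidate_labels]
for p in pairs:
    print(p)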
@PREPROCESSORS.register_module(
    Fields.nlp, module_name=Preprocessors.text_gen_tokenizer)
class TextGenerationPreprocessor(NLPPreprocessorBase):
class TextGenerationPreprocessor(NLPTokenizerPreprocessorBase):

    def __init__(self, model_dir: str, tokenizer=None, *args, **kwargs):
    def __init__(self,
                 model_dir: str,
                 tokenizer=None,
                 mode=ModeKeys.INFERENCE,
                 **kwargs):
        self.tokenizer = self.build_tokenizer(
            model_dir) if tokenizer is None else tokenizer
        kwargs['truncation'] = True
        kwargs['padding'] = True
        kwargs['return_tensors'] = 'pt'
        kwargs['return_token_type_ids'] = False
        kwargs['truncation'] = kwargs.get('truncation', True)
        kwargs['padding'] = kwargs.get('padding', True)
        kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
                                                     False)
        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
        super().__init__(model_dir, *args, **kwargs)
        super().__init__(model_dir, pair=False, mode=mode, **kwargs)

    @staticmethod
    def get_roberta_tokenizer_dir(model_dir: str) -> Optional[str]:
@@ -240,19 +308,13 @@ class TextGenerationPreprocessor(NLPPreprocessorBase):
                roberta_tokenizer_dir, do_lower_case=False)
        return super().build_tokenizer(model_dir)


@PREPROCESSORS.register_module(
    Fields.nlp, module_name='palm-text-gen-tokenizer-finetune')
class TextGenerationFinetunePreprocessor(TextGenerationPreprocessor):

    @type_assert(object, dict)
    def __call__(self, data: dict) -> Dict[str, Any]:
    def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
        if self._mode == 'inference':
            return super().__call__(data)
        src_txt = data['src_txt']
        tgt_txt = data['tgt_txt']
        src_rst = super().__call__(src_txt)
        tgt_rst = super().__call__(tgt_txt)
        src_rst = {k: v.squeeze() for k, v in src_rst.items()}
        tgt_rst = {k: v.squeeze() for k, v in tgt_rst.items()}
        return {
            'src': src_rst['input_ids'],
@@ -261,87 +323,69 @@ class TextGenerationFinetunePreprocessor(TextGenerationPreprocessor):
        }
@PREPROCESSORS.register_module(Fields.nlp)
class FillMaskPreprocessor(NLPPreprocessorBase):
@PREPROCESSORS.register_module(Fields.nlp, module_name=Preprocessors.fill_mask)
class FillMaskPreprocessor(NLPTokenizerPreprocessorBase):

    def __init__(self, model_dir: str, *args, **kwargs):
        kwargs['truncation'] = True
        kwargs['padding'] = 'max_length'
        kwargs['return_tensors'] = 'pt'
    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
        kwargs['truncation'] = kwargs.get('truncation', True)
        kwargs['padding'] = kwargs.get('padding', 'max_length')
        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
        kwargs['return_token_type_ids'] = True
        super().__init__(model_dir, *args, **kwargs)

    def build_tokenizer(self, model_dir):
        from modelscope.utils.hub import get_model_type
        model_type = get_model_type(model_dir)
        if model_type in ['sbert', 'structbert', 'bert']:
            from sofa import SbertTokenizer
            return SbertTokenizer.from_pretrained(model_dir, use_fast=False)
        elif model_type == 'veco':
            from sofa import VecoTokenizer
            return VecoTokenizer.from_pretrained(model_dir, use_fast=False)
        else:
            # TODO Only support veco & sbert
            raise RuntimeError(f'Unsupported model type: {model_type}')
        kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
                                                     True)
        super().__init__(model_dir, pair=False, mode=mode, **kwargs)
@PREPROCESSORS.register_module(
    Fields.nlp, module_name=Preprocessors.token_cls_tokenizer)
class TokenClassificationPreprocessor(NLPPreprocessorBase):

    def __init__(self, model_dir: str, *args, **kwargs):
        super().__init__(model_dir, *args, **kwargs)

    @type_assert(object, str)
    def __call__(self, data: Union[str, Dict]) -> Dict[str, Any]:
        """process the raw input data

        Args:
            data (str): a sentence
                Example:
                    'you are so handsome.'

        Returns:
            Dict[str, Any]: the preprocessed data
        """
        # preprocess the data for the model input
        if isinstance(data, dict):
            data = data[self.first_sequence]
        text = data.replace(' ', '').strip()
        tokens = []
        for token in text:
            token = self.tokenizer.tokenize(token)
            tokens.extend(token)
        input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
        input_ids = self.tokenizer.build_inputs_with_special_tokens(input_ids)
        attention_mask = [1] * len(input_ids)
        token_type_ids = [0] * len(input_ids)
        return {
            'text': text,
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'token_type_ids': token_type_ids
        }


@PREPROCESSORS.register_module(
    Fields.nlp,
    module_name=Preprocessors.word_segment_text_to_label_preprocessor)
class WordSegmentationBlankSetToLabelPreprocessor(Preprocessor):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.first_sequence: str = kwargs.pop('first_sequence',
                                              'first_sequence')
        self.label = kwargs.pop('label', OutputKeys.LABELS)

    def __call__(self, data: str) -> Union[Dict[str, Any], Tuple]:
        data = data.split(' ')
        data = list(filter(lambda x: len(x) > 0, data))

        def produce_train_sample(words):
            chars = []
            labels = []
            for word in words:
                chars.extend(list(word))
                if len(word) == 1:
                    labels.append('S-CWS')
                else:
                    labels.extend(['B-CWS'] + ['I-CWS'] * (len(word) - 2)
                                  + ['E-CWS'])
            assert len(chars) == len(labels)
            return chars, labels

        chars, labels = produce_train_sample(data)
        return {
            self.first_sequence: chars,
            self.label: labels,
        }
@PREPROCESSORS.register_module(
    Fields.nlp, module_name=Preprocessors.zero_shot_cls_tokenizer)
class ZeroShotClassificationPreprocessor(NLPPreprocessorBase):

    def __init__(self, model_dir: str, *args, **kwargs):
        """preprocess the data via the vocab.txt from the `model_dir` path

        Args:
            model_dir (str): model path
        """
        self.sequence_length = kwargs.pop('sequence_length', 512)
        super().__init__(model_dir, *args, **kwargs)

    @type_assert(object, str)
    def __call__(self, data, hypothesis_template: str,
                 candidate_labels: list) -> Dict[str, Any]:
        if isinstance(data, dict):
            data = data.get(self.first_sequence)
        pairs = [[data, hypothesis_template.format(label)]
                 for label in candidate_labels]
        features = self.tokenizer(
            pairs,
            padding=True,
            truncation=True,
            max_length=self.sequence_length,
            return_tensors='pt',
            truncation_strategy='only_first')
        return features

@@ -352,20 +396,74 @@ class ZeroShotClassificationPreprocessor(NLPPreprocessorBase):

@PREPROCESSORS.register_module(
    Fields.nlp, module_name=Preprocessors.token_cls_tokenizer)
class TokenClassificationPreprocessor(NLPTokenizerPreprocessorBase):

    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
        kwargs['truncation'] = kwargs.get('truncation', True)
        kwargs['padding'] = kwargs.get(
            'padding', False if mode == ModeKeys.INFERENCE else 'max_length')
        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
        kwargs['is_split_into_words'] = kwargs.pop(
            'is_split_into_words',
            False if mode == ModeKeys.INFERENCE else True)
        self.label_all_tokens = kwargs.pop('label_all_tokens', False)
        super().__init__(model_dir, pair=False, mode=mode, **kwargs)

    @type_assert(object, str)
    def __call__(self, data: Union[str, Dict]) -> Dict[str, Any]:
        """process the raw input data

        Args:
            data (str): a sentence
                Example:
                    'you are so handsome.'

        Returns:
            Dict[str, Any]: the preprocessed data
        """
        # new code to deal with labels
        # tokenized_inputs = self.tokenizer(data, truncation=True, is_split_into_words=True)
        text_a = None
        labels_list = None
        if isinstance(data, str):
            text_a = data
        elif isinstance(data, dict):
            text_a = data.get(self.first_sequence)
            labels_list = data.get(self.label)
        tokenized_inputs = self.tokenizer(
            text_a,
            return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
            **self.tokenize_kwargs)

        if labels_list is not None:
            assert self.label2id is not None
            # Map that sends a B-Xxx label to its I-Xxx counterpart
            b_to_i_label = []
            label_enumerate_values = [
                k for k, v in sorted(
                    self.label2id.items(), key=lambda item: item[1])
            ]
            for idx, label in enumerate(label_enumerate_values):
                if label.startswith('B-') and label.replace(
                        'B-', 'I-') in label_enumerate_values:
                    b_to_i_label.append(
                        label_enumerate_values.index(
                            label.replace('B-', 'I-')))
                else:
                    b_to_i_label.append(idx)

            label_row = [self.label2id[lb] for lb in labels_list]
            word_ids = tokenized_inputs.word_ids()
            previous_word_idx = None
            label_ids = []
            for word_idx in word_ids:
                if word_idx is None:
                    label_ids.append(-100)
                elif word_idx != previous_word_idx:
                    label_ids.append(label_row[word_idx])
                else:
                    if self.label_all_tokens:
                        label_ids.append(b_to_i_label[label_row[word_idx]])
                    else:
                        label_ids.append(-100)
                previous_word_idx = word_idx
            labels = label_ids
            tokenized_inputs['labels'] = labels
        # new code end

        if self._mode == ModeKeys.INFERENCE:
            tokenized_inputs[OutputKeys.TEXT] = text_a
        return tokenized_inputs
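The subword alignment above is easiest to see on a tiny example. A self-contained sketch with a hand-written word_ids list standing in for a fast tokenizer (assumed values):

label2id = {'B-CWS': 0, 'I-CWS': 1, 'E-CWS': 2, 'S-CWS': 3}
labels_list = ['B-CWS', 'E-CWS', 'S-CWS']   # one label per word
word_ids = [None, 0, 0, 1, 2, None]         # None marks special tokens

label_row = [label2id[lb] for lb in labels_list]
label_ids, previous = [], None
for word_idx in word_ids:
    if word_idx is None:
        label_ids.append(-100)                 # ignored by the loss
    elif word_idx != previous:
        label_ids.append(label_row[word_idx])  # first subword keeps the label
    else:
        label_ids.append(-100)                 # label_all_tokens=False branch
    previous = word_idx
print(label_ids)  # [-100, 0, -100, 2, 3, -100]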
@PREPROCESSORS.register_module(
@@ -24,7 +24,7 @@ class DialogStateTrackingPreprocessor(Preprocessor):
        """
        super().__init__(*args, **kwargs)

        from sofa.models.space import SpaceConfig, SpaceTokenizer
        from modelscope.models.nlp.space import SpaceConfig, SpaceTokenizer
        self.model_dir: str = model_dir
        self.config = SpaceConfig.from_pretrained(self.model_dir)
        self.tokenizer = SpaceTokenizer.from_pretrained(self.model_dir)
@@ -7,12 +7,14 @@ if TYPE_CHECKING:
    from .base import TaskDataset
    from .builder import TASK_DATASETS, build_task_dataset
    from .torch_base_dataset import TorchTaskDataset
    from .veco_dataset import VecoDataset
else:
    _import_structure = {
        'base': ['TaskDataset'],
        'builder': ['TASK_DATASETS', 'build_task_dataset'],
        'torch_base_dataset': ['TorchTaskDataset'],
        'veco_dataset': ['VecoDataset'],
    }

    import sys
@@ -1,6 +1,6 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from abc import ABC, abstractmethod
from typing import Any, List, Tuple
from typing import Any, List, Tuple, Union


class TaskDataset(ABC):
@@ -8,7 +8,7 @@ class TaskDataset(ABC):
    """

    def __init__(self,
                 datasets: Tuple[Any, List[Any]],
                 datasets: Union[Any, List[Any]],
                 mode,
                 preprocessor=None,
                 **kwargs):
@@ -18,7 +18,7 @@ class TaskDataset(ABC):
        self._inner_dataset = self.prepare_dataset(datasets)

    @abstractmethod
    def prepare_dataset(self, datasets: Tuple[Any, List[Any]]) -> Any:
    def prepare_dataset(self, datasets: Union[Any, List[Any]]) -> Any:
        """Prepare a dataset.

        User can process the input datasets in a whole dataset perspective.
@@ -1,5 +1,5 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from typing import Any, List, Tuple
from typing import Any, List, Tuple, Union

from torch.utils.data import ConcatDataset, Dataset

@@ -14,7 +14,7 @@ class TorchTaskDataset(TaskDataset, Dataset):
    """

    def __init__(self,
                 datasets: Tuple[Any, List[Any]],
                 datasets: Union[Any, List[Any]],
                 mode,
                 preprocessor=None,
                 **kwargs):
@@ -26,7 +26,7 @@ class TorchTaskDataset(TaskDataset, Dataset):
    def __len__(self):
        return len(self._inner_dataset)

    def prepare_dataset(self, datasets: Tuple[Any, List[Any]]) -> Any:
    def prepare_dataset(self, datasets: Union[Any, List[Any]]) -> Any:
        """Prepare a dataset.

        User can process the input datasets in a whole dataset perspective.
@@ -0,0 +1,76 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from typing import Any, List, Union

import numpy as np
from datasets import Dataset, IterableDataset, concatenate_datasets

from modelscope.metainfo import Models
from modelscope.utils.constant import Tasks
from .builder import TASK_DATASETS
from .torch_base_dataset import TorchTaskDataset


@TASK_DATASETS.register_module(module_name=Models.veco, group_key=Tasks.nli)
class VecoDataset(TorchTaskDataset):

    def __init__(self,
                 datasets: Union[Any, List[Any]],
                 mode,
                 preprocessor=None,
                 **kwargs):
        self.seed = kwargs.get('seed', 42)
        self.permutation = None
        self.datasets = None
        super().__init__(datasets, mode, preprocessor, **kwargs)

    def switch_dataset(self, idx):
        """Switch datasets in evaluation.

        Veco evaluates the datasets one by one.

        Args:
            idx: The index of the dataset to switch to
        """
        if self.mode == 'train':
            raise ValueError(
                'Switching datasets is only supported in the evaluation loop')
        if idx >= len(self.datasets):
            raise ValueError(
                'Index is bigger than the number of the datasets.')
        self._inner_dataset = self.datasets[idx]

    def __getitem__(self, item):
        if self.permutation is not None:
            item = self.permutation[item]
        return super().__getitem__(item)

    def prepare_dataset(self, datasets: Union[Any, List[Any]]) -> Any:
        """Compose all the datasets.

        If the mode is 'train', all datasets are mixed together; if the mode
        is 'eval', the datasets are kept separate and the first one is returned.

        Args:
            datasets: The datasets to be composed.

        Returns: The final dataset.
        """
        if not isinstance(datasets, (list, tuple)):
            datasets = [datasets]
        if self.mode == 'train':
            if len(datasets) == 1:
                return datasets[0]
            elif all([
                    isinstance(dataset, (Dataset, IterableDataset))
                    for dataset in datasets
            ]):
                dataset = concatenate_datasets(list(datasets))
                return dataset.shuffle(seed=self.seed)
            else:
                generator = np.random.default_rng(self.seed)
                _len = sum([len(dataset) for dataset in datasets])
                self.permutation = generator.permutation(_len)
                return super().prepare_dataset(datasets)
        else:
            self.datasets = datasets
            return self.datasets[0]
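A hedged usage sketch of the eval-mode flow, with toy lists standing in for the per-language eval splits (the real driver is VecoTrainer.evaluate, and real usage would pass tokenized datasets):

from modelscope.task_datasets import VecoDataset

en, fr, de = [{'text': 'en'}] * 8, [{'text': 'fr'}] * 8, [{'text': 'de'}] * 8
eval_ds = VecoDataset([en, fr, de], mode='eval')

for idx in range(len(eval_ds.datasets)):
    eval_ds.switch_dataset(idx)   # point _inner_dataset at split idx
    # ... run one evaluation pass over eval_ds here ...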
@@ -4,4 +4,5 @@ from .cv import (ImageInstanceSegmentationTrainer,
                 ImagePortraitEnhancementTrainer)
from .multi_modal import CLIPTrainer
from .nlp import SequenceClassificationTrainer
from .nlp_trainer import NlpEpochBasedTrainer, VecoTrainer
from .trainer import EpochBasedTrainer
@@ -32,6 +32,7 @@ class EvaluationHook(Hook):
    def do_evaluate(self, trainer):
        """Evaluate the results."""
        eval_res = trainer.evaluate()
        trainer.data_loader = trainer.train_dataloader
        for name, val in eval_res.items():
            trainer.log_buffer.output[name] = val
@@ -21,9 +21,6 @@ class LrSchedulerHook(Hook):
    def __init__(self, by_epoch=True, warmup=None) -> None:
        super().__init__()
        self.by_epoch = by_epoch
        if not self.by_epoch:
            raise ValueError('We only support ``by_epoch=True`` now!')
        self.warmup = warmup
        self.warmup_lr_scheduler = None

@@ -49,6 +46,11 @@ class LrSchedulerHook(Hook):
        return lr

    def before_train_iter(self, trainer):
        if not self.by_epoch:
            if self.warmup_lr_scheduler is not None:
                self.warmup_lr_scheduler.step()
            else:
                trainer.lr_scheduler.step()
        trainer.log_buffer.output[LogKeys.LR] = self._get_log_lr(trainer)

    def before_train_epoch(self, trainer):
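With the by_epoch=True restriction lifted, the scheduler can now be stepped once per iteration. A minimal torch sketch of iteration-level warmup stepping (illustrative only, not the hook's actual wiring):

import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# linear warmup over the first 10 iterations
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda it: min(1.0, it / 10))

for it in range(15):
    opt.step()      # loss/backward omitted for brevity
    sched.step()    # stepped every iteration, as before_train_iter now allows
print(opt.param_groups[0]['lr'])  # back at the base lr (0.1) after warmup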
@@ -0,0 +1,192 @@
import os
from typing import Callable, Dict, Optional, Tuple, Union

import torch
from torch import nn
from torch.utils.data import Dataset

from modelscope.hub.snapshot_download import snapshot_download
from modelscope.metrics.builder import build_metric
from modelscope.models.base import Model, TorchModel
from modelscope.msdatasets import MsDataset
from modelscope.preprocessors import Preprocessor, build_preprocessor
from modelscope.utils.config import Config, ConfigDict
from modelscope.utils.constant import (DEFAULT_MODEL_REVISION, ModeKeys,
                                       ModelFile, Tasks)
from .base import TRAINERS
from .trainer import EpochBasedTrainer


@TRAINERS.register_module(module_name='NlpEpochBasedTrainer')
class NlpEpochBasedTrainer(EpochBasedTrainer):

    def __init__(
            self,
            model: Optional[Union[TorchModel, nn.Module, str]] = None,
            cfg_file: Optional[str] = None,
            cfg_modify_fn: Optional[Callable] = None,
            arg_parse_fn: Optional[Callable] = None,
            data_collator: Optional[Callable] = None,
            train_dataset: Optional[Union[MsDataset, Dataset]] = None,
            eval_dataset: Optional[Union[MsDataset, Dataset]] = None,
            preprocessor: Optional[Preprocessor] = None,
            optimizers: Tuple[torch.optim.Optimizer,
                              torch.optim.lr_scheduler._LRScheduler] = (None,
                                                                        None),
            model_revision: Optional[str] = DEFAULT_MODEL_REVISION,
            **kwargs):
        """Add code to adapt to nlp models.

        Args:
            cfg_modify_fn: An input fn which is used to modify the cfg read out of the file.
        """

        if isinstance(model, str):
            if os.path.exists(model):
                model_dir = model if os.path.isdir(model) else os.path.dirname(
                    model)
            else:
                model_dir = snapshot_download(model, revision=model_revision)
            cfg_file = os.path.join(model_dir, ModelFile.CONFIGURATION)
        else:
            assert cfg_file is not None, 'Config file should not be None if model is an nn.Module class'
            model_dir = os.path.dirname(cfg_file)

        self.cfg_modify_fn = cfg_modify_fn
        self.cfg = self.rebuild_config(Config.from_file(cfg_file))
        try:
            labels = self.cfg.dataset.train.labels
        except AttributeError:
            labels = None

        self.label2id = None
        self.num_labels = None
        if labels is not None and len(labels) > 0:
            self.label2id = {label: idx for idx, label in enumerate(labels)}
            self.id2label = {idx: label for idx, label in enumerate(labels)}
            self.num_labels = len(labels)

        def build_dataset_keys(cfg):
            if cfg is not None:
                input_keys = {
                    'first_sequence': getattr(cfg, 'first_sequence', None),
                    'second_sequence': getattr(cfg, 'second_sequence', None),
                    'label': getattr(cfg, 'label', None),
                }
            else:
                input_keys = {}
            return {k: v for k, v in input_keys.items() if v is not None}

        self.train_keys = build_dataset_keys(
            self.cfg.dataset.train if hasattr(self.cfg, 'dataset')
            and hasattr(self.cfg.dataset, 'train') else None)
        # TODO eval may have special keys, which is not supported yet,
        # because there is only one preprocessor in the trainer and it only supports one group of keys.
        self.eval_keys = self.train_keys

        super().__init__(
            model=model_dir,
            cfg_file=cfg_file,
            arg_parse_fn=arg_parse_fn,
            data_collator=data_collator,
            preprocessor=preprocessor,
            optimizers=optimizers,
            model_revision=model_revision,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            **kwargs)

    def rebuild_config(self, cfg: Config):
        if self.cfg_modify_fn is not None:
            return self.cfg_modify_fn(cfg)
        return cfg

    def build_model(self) -> Union[nn.Module, TorchModel]:
        """Instantiate a pytorch model and return.

        By default, we will create a model using config from the configuration file.
        You can override this method in a subclass.
        """
        model_args = {} if self.num_labels is None else {
            'num_labels': self.num_labels
        }
        model = Model.from_pretrained(
            self.model_dir, cfg_dict=self.cfg, **model_args)
        if not isinstance(model, nn.Module) and hasattr(model, 'model'):
            return model.model
        elif isinstance(model, nn.Module):
            return model

    def build_preprocessor(self) -> Preprocessor:
        """Build the preprocessor.

        User can override this method to implement custom logic.

        Returns: The preprocessor instance.
        """
        model_args = {} if self.label2id is None else {
            'label2id': self.label2id
        }
        cfg = ConfigDict({
            **getattr(self.cfg, 'preprocessor'),
            'model_dir':
            self.model_dir,
            **model_args,
            'mode':
            ModeKeys.TRAIN,
            **self.train_keys,
        })
        return build_preprocessor(cfg, Tasks.find_field_by_task(self.cfg.task))
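rebuild_config gives cfg_modify_fn the last word over the configuration read from configuration.json. A hedged sketch of the hook in isolation (Config built from an inline dict; the field names are illustrative):

from modelscope.utils.config import Config

def cfg_modify_fn(cfg):
    cfg.train.max_epochs = 3                      # override file values in code
    cfg.train.dataloader.batch_size_per_gpu = 16
    return cfg

cfg = Config(dict(train=dict(max_epochs=1, dataloader=dict(batch_size_per_gpu=32))))
print(cfg_modify_fn(cfg).train.max_epochs)  # 3
# passing cfg_modify_fn=cfg_modify_fn to NlpEpochBasedTrainer applies the same
# rewrite to the model's configuration before datasets and model are built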
@TRAINERS.register_module(module_name='VecoTrainer')
class VecoTrainer(NlpEpochBasedTrainer):

    def evaluate(self, checkpoint_path=None):
        """Veco evaluates the datasets one by one.
        """
        from modelscope.task_datasets import VecoDataset
        self.model.eval()
        self._mode = ModeKeys.EVAL
        metric_values = {}

        if self.eval_dataset is None:
            val_data = self.cfg.dataset.val
            self.eval_dataset = self.build_dataset(
                val_data, mode=ModeKeys.EVAL)

        idx = 0
        dataset_cnt = 1
        if isinstance(self.eval_dataset, VecoDataset):
            self.eval_dataset.switch_dataset(idx)
            dataset_cnt = len(self.eval_dataset.datasets)

        while True:
            self.eval_dataloader = self._build_dataloader_with_dataset(
                self.eval_dataset, **self.cfg.evaluation.get('dataloader', {}))
            self.data_loader = self.eval_dataloader

            metric_classes = [
                build_metric(metric, default_args={'trainer': self})
                for metric in self.metrics
            ]
            self.evaluation_loop(self.eval_dataloader, checkpoint_path,
                                 metric_classes)

            for m_idx, metric_cls in enumerate(metric_classes):
                if f'eval_dataset[{idx}]' not in metric_values:
                    metric_values[f'eval_dataset[{idx}]'] = {}
                metric_values[f'eval_dataset[{idx}]'][
                    self.metrics[m_idx]] = metric_cls.evaluate()

            idx += 1
            if idx < dataset_cnt:
                self.eval_dataset.switch_dataset(idx)
            else:
                break

        return metric_values
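For reference, the return value groups metric results per dataset index; with two eval splits and one configured metric it would look roughly like this (the metric name and numbers are made up):

metric_values = {
    'eval_dataset[0]': {'seq-cls-metric': {'accuracy': 0.81}},
    'eval_dataset[1]': {'seq-cls-metric': {'accuracy': 0.74}},
}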
@@ -22,7 +22,8 @@ from modelscope.models.base import Model, TorchModel
from modelscope.msdatasets.ms_dataset import MsDataset
from modelscope.preprocessors import build_preprocessor
from modelscope.preprocessors.base import Preprocessor
from modelscope.task_datasets import TorchTaskDataset, build_task_dataset
from modelscope.task_datasets.builder import build_task_dataset
from modelscope.task_datasets.torch_base_dataset import TorchTaskDataset
from modelscope.trainers.hooks.builder import HOOKS
from modelscope.trainers.hooks.priority import Priority, get_priority
from modelscope.trainers.lrscheduler.builder import build_lr_scheduler
@@ -30,12 +31,12 @@ from modelscope.trainers.optimizer.builder import build_optimizer
from modelscope.utils.config import Config, ConfigDict
from modelscope.utils.constant import (DEFAULT_MODEL_REVISION, Hubs, ModeKeys,
                                       ModelFile, Tasks, TrainerStages)
from modelscope.utils.file_utils import func_receive_dict_inputs
from modelscope.utils.logger import get_logger
from modelscope.utils.registry import build_from_cfg
from modelscope.utils.tensor_utils import torch_default_data_collator
from modelscope.utils.torch_utils import (broadcast, create_device,
                                          get_dist_info, init_dist)
from modelscope.utils.utils import if_func_receive_dict_inputs
from .base import BaseTrainer
from .builder import TRAINERS
from .default_config import DEFAULT_CONFIG
@@ -87,6 +88,7 @@ class EpochBasedTrainer(BaseTrainer):
                              None),
            model_revision: Optional[str] = DEFAULT_MODEL_REVISION,
            **kwargs):

        if isinstance(model, str):
            if os.path.exists(model):
                self.model_dir = model if os.path.isdir(
@@ -108,9 +110,9 @@ class EpochBasedTrainer(BaseTrainer):
            self.model = model

        super().__init__(cfg_file, arg_parse_fn)

        # add default config
        self.cfg.merge_from_dict(self._get_default_config(), force=False)
        self.cfg = self.rebuild_config(self.cfg)

        if 'work_dir' in kwargs:
            self.work_dir = kwargs['work_dir']
@@ -130,9 +132,9 @@ class EpochBasedTrainer(BaseTrainer):
        self.device = create_device(device_name == 'cpu')

        self.train_dataset = self.to_task_dataset(
            train_dataset, mode='train', preprocessor=self.preprocessor)
            train_dataset, mode=ModeKeys.TRAIN, preprocessor=self.preprocessor)
        self.eval_dataset = self.to_task_dataset(
            eval_dataset, mode='eval', preprocessor=self.preprocessor)
            eval_dataset, mode=ModeKeys.EVAL, preprocessor=self.preprocessor)

        self.data_collator = data_collator if data_collator is not None else torch_default_data_collator
        self.metrics = self.get_metrics()
@@ -168,6 +170,14 @@ class EpochBasedTrainer(BaseTrainer):
        if not is_parallel(self.model) and self._dist:
            self.model = self.to_parallel(self.model)

    def rebuild_config(self, cfg: Config):
        """A method used to rebuild the config; any subclass can override it.

        Returns: The rebuilt config
        """
        return cfg

    @property
    def mode(self):
        return self._mode

@@ -203,7 +213,7 @@ class EpochBasedTrainer(BaseTrainer):
        return self._max_epochs * len(self.data_loader)

    def to_task_dataset(self,
                        datasets: Tuple[Dataset, List[Dataset]],
                        datasets: Union[Dataset, List[Dataset]],
                        mode: str,
                        preprocessor: Optional[Preprocessor] = None):
        """Build the task specific dataset processor for this trainer.
@@ -229,17 +239,13 @@ class EpochBasedTrainer(BaseTrainer):
                cfg = ConfigDict(
                    type=self.cfg.task, mode=mode, datasets=datasets)
                return build_task_dataset(cfg, self.cfg.task)
            elif isinstance(datasets,
                            Dataset) or (isinstance(datasets, List)
                                         and isinstance(datasets[0], Dataset)):
            else:
                cfg = ConfigDict(
                    type=self.cfg.model.type, mode=mode, datasets=datasets)
                    type=self.cfg.model.type,
                    mode=mode,
                    datasets=datasets,
                    preprocessor=preprocessor)
                return build_task_dataset(cfg, self.cfg.task)
            else:
                raise ValueError(
                    f'invalid datasets type: {type(datasets)}, '
                    f'expected `MsDataset`, `torch.utils.data.Dataset` or list of them.'
                )
        except Exception:
            if isinstance(datasets, (List, Tuple)) or preprocessor is not None:
                return TorchTaskDataset(
@@ -262,8 +268,11 @@ class EpochBasedTrainer(BaseTrainer):
        # TODO @wenmeng.zwm @jiangnana.jnn add support for different preprocessors
        # when they are different ones in training and evaluation
        cfg = ConfigDict({
            **getattr(self.cfg, 'preprocessor'), 'model_dir':
            self.model_dir
            **getattr(self.cfg, 'preprocessor'),
            'model_dir':
            self.model_dir,
            'mode':
            ModeKeys.TRAIN,
        })
        return build_preprocessor(cfg, Tasks.find_field_by_task(self.cfg.task))
@@ -324,6 +333,8 @@ class EpochBasedTrainer(BaseTrainer):
            **self.cfg.evaluation.get('dataloader', {}))
        self.data_loader = self.eval_dataloader
        metric_classes = [build_metric(metric) for metric in self.metrics]
        for m in metric_classes:
            m.trainer = self
        metric_values = self.evaluation_loop(self.eval_dataloader,
                                             checkpoint_path, metric_classes)
@@ -338,10 +349,9 @@ class EpochBasedTrainer(BaseTrainer):
        """Instantiate a pytorch model and return.

        By default, we will create a model using config from configuration file. You can
        subclass and override this method in a subclass.
        override this method in a subclass.
        """
        # TODO temp implementation, waiting for @zhangzhicheng
        model = Model.from_pretrained(self.model_dir)
        if not isinstance(model, nn.Module) and hasattr(model, 'model'):
            return model.model
@@ -412,9 +422,8 @@ class EpochBasedTrainer(BaseTrainer):
        self._mode = ModeKeys.TRAIN
        inputs = self.collate_fn(inputs)
        # call model forward but not __call__ to skip postprocess
        if isinstance(
                inputs,
                Mapping) and not if_func_receive_dict_inputs(model.forward):
        if isinstance(inputs,
                      Mapping) and not func_receive_dict_inputs(model.forward):
            train_outputs = model.forward(**inputs)
        else:
            train_outputs = model.forward(inputs)
@@ -495,7 +504,7 @@ class EpochBasedTrainer(BaseTrainer):
        if self.eval_dataset is None:
            val_data = self.cfg.dataset.val
            self.eval_dataset = self.build_dataset(
                val_data, mode=ModeKeys.TRAIN)
                val_data, mode=ModeKeys.EVAL)

        batch_size = self.cfg.evaluation.batch_size
        workers = self.cfg.evaluation.workers
@@ -523,7 +532,8 @@ class EpochBasedTrainer(BaseTrainer):
            )
        torch_dataset = dataset.to_torch_dataset(
            preprocessors=self.preprocessor, )
        return torch_dataset
        dataset = self.to_task_dataset(torch_dataset, mode)
        return dataset

    def create_optimizer_and_scheduler(self):
        """Create optimizer and lr scheduler
@@ -10,9 +10,9 @@ import torch
from torch import distributed as dist
from tqdm import tqdm

from modelscope.utils.file_utils import func_receive_dict_inputs
from modelscope.utils.torch_utils import (broadcast, get_dist_info, is_master,
                                          make_tmp_dir)
from modelscope.utils.utils import if_func_receive_dict_inputs


def single_gpu_test(model,
@@ -37,18 +37,19 @@ def single_gpu_test(model,
        if data_collate_fn is not None:
            data = data_collate_fn(data)
        with torch.no_grad():
            if isinstance(data,
                          Mapping) and not if_func_receive_dict_inputs(
                              model.forward):
                result = model(**data)
            if isinstance(data, Mapping) and not func_receive_dict_inputs(
                    model.forward):
                result = model.forward(**data)
            else:
                result = model(data)
                result = model.forward(data)
        if metric_classes is not None:
            for metric_cls in metric_classes:
                metric_cls.add(result, data)

        batch_size = len(result)
        if isinstance(data, dict):
            batch_size = len(next(iter(data.values())))
        else:
            batch_size = len(data)
        for _ in range(batch_size):
            pbar.update()

@@ -101,16 +102,18 @@ def multi_gpu_test(model,
            data = data_collate_fn(data)
        data_list.append(data)
        with torch.no_grad():
            if isinstance(data,
                          Mapping) and not if_func_receive_dict_inputs(
                              model.forward):
                result = model(**data)
            if isinstance(data, Mapping) and not func_receive_dict_inputs(
                    model.forward):
                result = model.forward(**data)
            else:
                result = model(data)
                result = model.forward(data)
        results.append(result)

        if rank == 0:
            batch_size = len(result)
            if isinstance(data, dict):
                batch_size = len(next(iter(data.values())))
            else:
                batch_size = len(data)
            batch_size_all = batch_size * world_size
            count += batch_size_all
            if count > len(dataset):
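The batch-size fix appears in both loops. Extracted as a helper, the rule is: for dict batches, len(data) would count the keys, so the length of one value is used instead. A standalone sketch (the helper name is hypothetical):

def infer_batch_size(data):
    if isinstance(data, dict):
        # any value works: all values share the batch dimension
        return len(next(iter(data.values())))
    return len(data)

print(infer_batch_size({'input_ids': [[1, 2], [3, 4]],
                        'attention_mask': [[1, 1], [1, 1]]}))  # 2
print(infer_batch_size([('a',), ('b',), ('c',)]))              # 3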