You can not select more than 25 topics Topics must start with a chinese character,a letter or number, can include dashes ('-') and can be up to 35 characters long.

test_multilingual_word_segmentation.py 3.3 kB

[to #42322933] Refactor NLP and fix some user feedbacks 1. Abstract keys of dicts needed by nlp metric classes into the init method 2. Add Preprocessor.save_pretrained to save preprocessor information 3. Abstract the config saving function, which can lead to normally saving in the direct call of from_pretrained, and the modification of cfg one by one when training. 4. Remove SbertTokenizer and VecoTokenizer, use transformers' tokenizers instead 5. Use model/preprocessor's from_pretrained in all nlp pipeline classes. 6. Add model_kwargs and preprocessor_kwargs in all nlp pipeline classes 7. Add base classes for fill-mask and text-classification preprocessor, as a demo for later changes 8. Fix user feedback: Re-train the model in continue training scenario 9. Fix user feedback: Too many checkpoint saved 10. Simplify the nlp-trainer 11. Fix user feedback: Split the default trainer's __init__ method, which makes user easier to override 12. Add safe_get to Config class ---------------------------- Another refactor from version 36 ------------------------- 13. Name all nlp transformers' preprocessors from TaskNamePreprocessor to TaskNameTransformersPreprocessor, for example: TextClassificationPreprocessor -> TextClassificationTransformersPreprocessor 14. Add a base class per task for all nlp tasks' preprocessors which has at least two sub-preprocessors 15. Add output classes of nlp models 16. Refactor the logic for token-classification 17. Fix bug: checkpoint_hook does not support pytorch_model.pt 18. Fix bug: Pipeline name does not match with task name, so inference will not succeed after training NOTE: This is just a stop bleeding solution, the root cause is the uncertainty of the relationship between models and pipelines Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10723513 * add save_pretrained to preprocessor * save preprocessor config in hook * refactor label-id mapping fetching logic * test ok on sentence-similarity * run on finetuning * fix bug * pre-commit passed * fix bug * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/preprocessors/nlp/nlp_base.py * add params to init * 1. support max ckpt num 2. support ignoring others but bin file in continue training 3. add arguments to some nlp metrics * Split trainer init impls to overridable methods * remove some obsolete tokenizers * unfinished * support input params in pipeline * fix bugs * fix ut bug * fix bug * fix ut bug * fix ut bug * fix ut bug * add base class for some preprocessors * Merge commit '379867739548f394d0fa349ba07afe04adf4c8b6' into feat/refactor_config * compatible with old code * fix ut bug * fix ut bugs * fix bug * add some comments * fix ut bug * add a requirement * fix pre-commit * Merge commit '0451b3d3cb2bebfef92ec2c227b2a3dd8d01dc6a' into feat/refactor_config * fixbug * Support function type in registry * fix ut bug * fix bug * Merge commit '5f719e542b963f0d35457e5359df879a5eb80b82' into feat/refactor_config # Conflicts: # modelscope/pipelines/nlp/multilingual_word_segmentation_pipeline.py # modelscope/pipelines/nlp/named_entity_recognition_pipeline.py # modelscope/pipelines/nlp/word_segmentation_pipeline.py # modelscope/utils/hub.py * remove obsolete file * rename init args * rename params * fix merge bug * add default preprocessor config for ner-model * move a method a util file * remove unused config * Fix a bug in pbar * bestckptsaver:change default ckpt numbers to 1 * 1. Add assert to max_epoch 2. split init_dist and get_device 3. change cmp func name * Fix bug * fix bug * fix bug * unfinished refactoring * unfinished * uw * uw * uw * uw * Merge branch 'feat/refactor_config' into feat/refactor_trainer # Conflicts: # modelscope/preprocessors/nlp/document_segmentation_preprocessor.py # modelscope/preprocessors/nlp/faq_question_answering_preprocessor.py # modelscope/preprocessors/nlp/relation_extraction_preprocessor.py # modelscope/preprocessors/nlp/text_generation_preprocessor.py * uw * uw * unify nlp task outputs * uw * uw * uw * uw * change the order of text cls pipeline * refactor t5 * refactor tg task preprocessor * fix * unfinished * temp * refactor code * unfinished * unfinished * unfinished * unfinished * uw * Merge branch 'feat/refactor_config' into feat/refactor_trainer * smoke test pass * ut testing * pre-commit passed * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/models/nlp/bert/document_segmentation.py # modelscope/pipelines/nlp/__init__.py # modelscope/pipelines/nlp/document_segmentation_pipeline.py * merge master * unifnished * Merge branch 'feat/fix_bug_pipeline_name' into feat/refactor_config * fix bug * fix ut bug * support ner batch inference * fix ut bug * fix bug * support batch inference on three nlp tasks * unfinished * fix bug * fix bug * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/models/base/base_model.py # modelscope/pipelines/nlp/conversational_text_to_sql_pipeline.py # modelscope/pipelines/nlp/dialog_intent_prediction_pipeline.py # modelscope/pipelines/nlp/dialog_modeling_pipeline.py # modelscope/pipelines/nlp/dialog_state_tracking_pipeline.py # modelscope/pipelines/nlp/document_segmentation_pipeline.py # modelscope/pipelines/nlp/faq_question_answering_pipeline.py # modelscope/pipelines/nlp/feature_extraction_pipeline.py # modelscope/pipelines/nlp/fill_mask_pipeline.py # modelscope/pipelines/nlp/information_extraction_pipeline.py # modelscope/pipelines/nlp/named_entity_recognition_pipeline.py # modelscope/pipelines/nlp/sentence_embedding_pipeline.py # modelscope/pipelines/nlp/summarization_pipeline.py # modelscope/pipelines/nlp/table_question_answering_pipeline.py # modelscope/pipelines/nlp/text2text_generation_pipeline.py # modelscope/pipelines/nlp/text_classification_pipeline.py # modelscope/pipelines/nlp/text_error_correction_pipeline.py # modelscope/pipelines/nlp/text_generation_pipeline.py # modelscope/pipelines/nlp/text_ranking_pipeline.py # modelscope/pipelines/nlp/token_classification_pipeline.py # modelscope/pipelines/nlp/word_segmentation_pipeline.py # modelscope/pipelines/nlp/zero_shot_classification_pipeline.py # modelscope/trainers/nlp_trainer.py * pre-commit passed * fix bug * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/preprocessors/__init__.py * fix bug * fix bug * fix bug * fix bug * fix bug * fixbug * pre-commit passed * fix bug * fixbug * fix bug * fix bug * fix bug * fix bug * self review done * fixbug * fix bug * fix bug * fix bugs * remove sub-token offset mapping * fix name bug * add some tests * 1. support batch inference of text-generation,text2text-generation,token-classification,text-classification 2. add corresponding UTs * add old logic back * tmp save * add tokenize by words logic back * move outputs file back * revert veco token-classification back * fix typo * Fix description * Merge commit '4dd99b8f6e4e7aefe047c68a1bedd95d3ec596d6' into feat/refactor_config * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/pipelines/builder.py
2 years ago
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374
  1. # Copyright (c) Alibaba, Inc. and its affiliates.
  2. import unittest
  3. from modelscope.hub.snapshot_download import snapshot_download
  4. from modelscope.models import Model
  5. from modelscope.models.nlp import TransformerCRFForWordSegmentation
  6. from modelscope.pipelines import pipeline
  7. from modelscope.pipelines.nlp import WordSegmentationThaiPipeline
  8. from modelscope.preprocessors import WordSegmentationPreprocessorThai
  9. from modelscope.utils.constant import Tasks
  10. from modelscope.utils.demo_utils import DemoCompatibilityCheck
  11. from modelscope.utils.regress_test_utils import MsRegressTool
  12. from modelscope.utils.test_utils import test_level
  13. class WordSegmentationTest(unittest.TestCase, DemoCompatibilityCheck):
  14. def setUp(self) -> None:
  15. self.task = Tasks.word_segmentation
  16. self.model_id = 'damo/nlp_xlmr_word-segmentation_thai'
  17. sentence = 'รถคันเก่าก็ยังเก็บเอาไว้ยังไม่ได้ขาย'
  18. regress_tool = MsRegressTool(baseline=False)
  19. @unittest.skipUnless(test_level() >= 0, 'skip test in current test level')
  20. def test_run_by_direct_model_download(self):
  21. cache_path = snapshot_download(self.model_id)
  22. tokenizer = WordSegmentationPreprocessorThai(cache_path)
  23. model = TransformerCRFForWordSegmentation.from_pretrained(cache_path)
  24. pipeline1 = WordSegmentationThaiPipeline(model, preprocessor=tokenizer)
  25. pipeline2 = pipeline(
  26. Tasks.word_segmentation, model=model, preprocessor=tokenizer)
  27. print(f'sentence: {self.sentence}\n'
  28. f'pipeline1:{pipeline1(input=self.sentence)}')
  29. print(f'pipeline2: {pipeline2(input=self.sentence)}')
  30. @unittest.skipUnless(test_level() >= 0, 'skip test in current test level')
  31. def test_run_with_model_from_modelhub(self):
  32. model = Model.from_pretrained(self.model_id)
  33. tokenizer = WordSegmentationPreprocessorThai(model.model_dir)
  34. pipeline_ins = pipeline(
  35. task=Tasks.word_segmentation, model=model, preprocessor=tokenizer)
  36. print(pipeline_ins(input=self.sentence))
  37. @unittest.skipUnless(test_level() >= 0, 'skip test in current test level')
  38. def test_run_with_model_name(self):
  39. pipeline_ins = pipeline(
  40. task=Tasks.word_segmentation, model=self.model_id)
  41. print(pipeline_ins(input=self.sentence))
  42. @unittest.skipUnless(test_level() >= 0, 'skip test in current test level')
  43. def test_run_with_model_name_batch(self):
  44. pipeline_ins = pipeline(
  45. task=Tasks.word_segmentation, model=self.model_id)
  46. print(
  47. pipeline_ins(
  48. input=[self.sentence, self.sentence[:10], self.sentence[6:]],
  49. batch_size=2))
  50. @unittest.skipUnless(test_level() >= 0, 'skip test in current test level')
  51. def test_run_with_model_name_batch_iter(self):
  52. pipeline_ins = pipeline(
  53. task=Tasks.word_segmentation, model=self.model_id, padding=False)
  54. print(
  55. pipeline_ins(
  56. input=[self.sentence, self.sentence[:10], self.sentence[6:]]))
  57. @unittest.skip('demo compatibility test is only enabled on a needed-basis')
  58. def test_demo_compatibility(self):
  59. self.compatibility_check()
  60. if __name__ == '__main__':
  61. unittest.main()