The number of Ascend accelerators is allocated automatically based on the device_num set in the HCCL config file; you do not need to specify it yourself.
For example, to generate the launch commands for distributed training of the BERT model on Ascend accelerators, run the following command in the /bert/ directory:
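As an illustration, the snippet below is a minimal sketch of how the accelerator count could be derived from such a file. It assumes the common Ascend rank-table layout with a server_list of device entries; the actual schema of your HCCL config, and the launcher's own parsing code, may differ.

```python
import json

def count_devices(hccl_config_path):
    """Count the Ascend devices listed in an HCCL rank-table JSON file.

    Assumes the usual layout: a top-level "server_list", where each server
    holds a "device" array with one entry per accelerator.
    """
    with open(hccl_config_path) as f:
        rank_table = json.load(f)
    return sum(len(server["device"]) for server in rank_table["server_list"])

# Example: a 2-device config such as hccl_2p_56_x.x.x.x.json would return 2.
```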
python ./scripts/ascend_distributed_launcher/get_distribute_pretrain_cmd.py --run_script_dir ./run_pretrain.py --hyper_parameter_config_dir ./scripts/ascend_distributed_launcher/hyper_parameter_config.ini --data_dir /path/dataset/ --hccl_config_dir model_zoo/utils/hccl_tools/hccl_2p_56_x.x.x.x.json
output:
hccl_config_dir: model_zoo/utils/hccl_tools/hccl_2p_56_x.x.x.x.json
the number of logical core: 192
avg_core_per_rank: 96
rank_size: 2
start training for rank 0, device 5:
rank_id: 0
device_id: 5
core nums: 0-95
epoch_size: 8
data_dir: /data/small_512/
schema_dir:
log file dir: ./LOG5/log.txt
start training for rank 1, device 6:
rank_id: 1
device_id: 6
core nums: 96-191
epoch_size: 8
data_dir: /data/small_512/
schema_dir:
log file dir: ./LOG6/log.txt
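The log above shows each rank being pinned to its own slice of the logical CPU cores and writing to its own LOG&lt;device_id&gt; directory. The following is a simplified, hypothetical sketch of that launch logic, not the actual get_distribute_pretrain_cmd.py; the environment variables, the run_pretrain.py flags, and the taskset-based core pinning are assumptions inferred from the output shown.

```python
import json
import multiprocessing
import os
import subprocess

def launch_ranks(hccl_config, run_script="./run_pretrain.py", data_dir="/path/dataset/"):
    """Start one training process per Ascend device listed in the HCCL config,
    pinning each rank to an equal share of the logical CPU cores."""
    with open(hccl_config) as f:
        devices = json.load(f)["server_list"][0]["device"]

    rank_size = len(devices)                        # e.g. 2
    logical_cores = multiprocessing.cpu_count()     # e.g. 192
    avg_core_per_rank = logical_cores // rank_size  # e.g. 96

    for rank_id, dev in enumerate(devices):
        device_id = dev["device_id"]
        first_core = rank_id * avg_core_per_rank
        core_range = f"{first_core}-{first_core + avg_core_per_rank - 1}"  # e.g. "0-95"

        # Environment variables commonly used for MindSpore distributed training
        # on Ascend (assumed here, not taken from the launcher's source).
        env = dict(os.environ,
                   RANK_SIZE=str(rank_size),
                   RANK_ID=str(rank_id),
                   DEVICE_ID=str(device_id),
                   RANK_TABLE_FILE=os.path.abspath(hccl_config))

        log_dir = f"./LOG{device_id}"
        os.makedirs(log_dir, exist_ok=True)

        # Pin the rank to its core range with taskset and send output to its log file.
        cmd = ["taskset", "-c", core_range, "python", run_script,
               "--device_id", str(device_id), "--data_dir", data_dir]
        with open(os.path.join(log_dir, "log.txt"), "w") as log_file:
            subprocess.Popen(cmd, env=env, stdout=log_file, stderr=subprocess.STDOUT)
```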
Note that hccl_2p_56_x.x.x.x.json can be generated with hccl_tools.py.
For hyperparameters, note that you should customize hyper_parameter_config.ini for your run. Also note that the following hyperparameters are not allowed to be configured there:
For other models, note that you should customize the run_script option and the corresponding hyper_parameter_config.ini.
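As an illustration of how such an .ini file of hyperparameters can be mapped onto a training script's command line, the sketch below uses Python's standard configparser. The section layout, key names, and the idea that each key becomes a --key option are assumptions for illustration, not the launcher's actual implementation.

```python
import configparser

def ini_to_cli_options(ini_path="./scripts/ascend_distributed_launcher/hyper_parameter_config.ini"):
    """Turn flat key=value pairs from an .ini file into command-line options.

    Assumes each key maps to a --key option of the training script,
    e.g. epoch_size=8 becomes ["--epoch_size", "8"].
    """
    cfg = configparser.ConfigParser()
    cfg.read(ini_path)
    options = []
    for section in cfg.sections():
        for key, value in cfg.items(section):
            options += [f"--{key}", value]
    return options
```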