@@ -0,0 +1,92 @@ | |||
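# How to Train Models on the OpenI (Qizhi) Platform - GPU Version
- Training is set up differently for a single dataset on the OpenI cluster, multiple datasets on the OpenI cluster, and a single dataset on the C2Net (intelligent computing network) cluster, so pick the example that matches your case:
    - For single-dataset training on the OpenI cluster, see the code comments in [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train.py)
    - For multi-dataset training on the OpenI cluster, see the code comments in [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_multidataset.py)
    - For single-dataset training on the C2Net cluster, see the code comments in [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_c2net.py)
- On the OpenI cluster, single-dataset and multi-dataset jobs differ in where the dataset is placed:
    - With a single dataset, as in this example, the contents of MNISTDataset_torch.zip are placed under /dataset/
    - With multiple datasets, the contents of MNISTDataset_torch.zip are placed under /dataset/MNISTDataset_torch/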
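
The path difference between single-dataset and multi-dataset jobs can be absorbed by a small lookup in the training script. The following is a minimal sketch under the layout described above; `resolve_dataset_root` is a hypothetical helper and is not part of the example repository, and temporary directories stand in for /dataset.

```python
import os
import tempfile


def resolve_dataset_root(base_dir, dataset_name):
    """Return the directory that holds the extracted dataset.

    Multi-dataset jobs place each archive in base_dir/<dataset_name>/,
    while single-dataset jobs place the contents directly in base_dir.
    (Hypothetical helper, for illustration only.)
    """
    nested = os.path.join(base_dir, dataset_name)
    return nested if os.path.isdir(nested) else base_dir


# Temporary directories stand in for /dataset in the two job types
with tempfile.TemporaryDirectory() as single, tempfile.TemporaryDirectory() as multi:
    os.makedirs(os.path.join(single, "train"))
    os.makedirs(os.path.join(multi, "MNISTDataset_torch", "train"))

    single_root = resolve_dataset_root(single, "MNISTDataset_torch")
    multi_root = resolve_dataset_root(multi, "MNISTDataset_torch")

    single_ok = single_root == single
    multi_ok = multi_root == os.path.join(multi, "MNISTDataset_torch")
    print(single_ok, multi_ok)
```

In a real job the result would be passed on as the `--traindata` and `--testdata` roots; the example scripts instead hard-code the defaults for each case.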
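## 1 Overview
- This project uses LeNet5-MNIST-PyTorch as an example to show briefly how to run training tasks with PyTorch on the OpenI AI collaboration platform. It covers single-dataset training, multi-dataset training, and C2Net training, and is meant to give AI developers a working OpenI training example.
- Users can create their own training tasks directly from the dataset and code files provided in this project.
## 2 Preparation
- To get ready to use the OpenI platform, create an OpenI account, clone the code to your own account, and upload the dataset. The beginner bootcamp course series in the [OpenI_Learning](https://git.openi.org.cn/zeizei/OpenI_Learning) project walks through each of these steps.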
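### 2.1 Data Preparation
#### Getting the Dataset
- To try out this example you do not need to upload the dataset again: MnistDataset_torch.zip has already been published as a public dataset and can be referenced directly. The datasets can also be downloaded from this project's dataset page to inspect their structure: [download MNISTDataset_torch.zip](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/datasets?type=0), [download mnist_epoch1_0.73.pkl.zip](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/datasets?type=0).
- Data file description
    - The MNISTData dataset consists of 28×28 grayscale images in 10 classes. The training set contains 60,000 images and the test set contains 10,000 images.
    - The directory structure of the dataset archives is as follows: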
> MNISTDataset_torch.zip
> ├── test
> │   └── MNIST
> │       ├── raw
> │       │   ├── t10k-images-idx3-ubyte
> │       │   ├── t10k-labels-idx1-ubyte
> │       │   ├── train-images-idx3-ubyte
> │       │   └── train-labels-idx1-ubyte
> │       └── processed
> │           ├── test.pt
> │           └── training.pt
> └── train
>     └── MNIST
>         ├── raw
>         │   ├── t10k-images-idx3-ubyte
>         │   ├── t10k-labels-idx1-ubyte
>         │   ├── train-images-idx3-ubyte
>         │   └── train-labels-idx1-ubyte
>         └── processed
>             ├── test.pt
>             └── training.pt
>
> mnist_epoch1_0.73.pkl.zip
> └── mnist_epoch1_0.73.pkl
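
The files under `raw` can be sanity-checked by reading their headers. The sketch below assumes they follow the standard MNIST IDX layout (a big-endian magic number followed by one 32-bit size per dimension); `read_idx_header` is a hypothetical helper, and the demonstration writes a header-only stand-in file rather than touching the real dataset.

```python
import os
import struct
import tempfile


def read_idx_header(path):
    """Return (magic, dims) read from the header of an IDX file."""
    with open(path, "rb") as f:
        magic = struct.unpack(">I", f.read(4))[0]
        ndim = magic & 0xFF  # the lowest byte stores the number of dimensions
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
    return magic, dims


# Build a header-only stand-in for t10k-images-idx3-ubyte (no pixel data)
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "t10k-images-idx3-ubyte")
    with open(sample, "wb") as f:
        f.write(struct.pack(">IIII", 2051, 10000, 28, 28))
    magic, dims = read_idx_header(sample)

print(magic, dims)
```

For an image file the dimensions are the image count, rows, and columns, so the test archive described above should report 10000 images of 28 by 28 pixels.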
#### Uploading the Dataset
GPU training runs on GPU hardware, so datasets must be uploaded to the GPU section of the platform. (This step is not needed for this example; simply select the public dataset MNISTDataset_torch.zip.)
### 2.2 Preparing the Scripts
#### Example Code
- The example code can be downloaded from this repository: [code download](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU)
- Code file description
    - [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train.py): the script for single-dataset training. See the code comments in [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train.py) for details.
    - [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_multidataset.py): the script for multi-dataset training. See the code comments in [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_multidataset.py) for details.
    - [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_c2net.py): the script for C2Net training. See the code comments in [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_c2net.py) for details.
    - [model.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/model.py): the network being trained, used by single-dataset, multi-dataset, and C2Net training.
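
The network in model.py feeds 256 features into its first fully connected layer. The short calculation below shows where that number comes from for 28×28 MNIST inputs, using the usual output-size formula for unpadded convolution and pooling; the helper names are made up for this sketch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel):
    """Spatial output size of max pooling whose stride equals its kernel size."""
    return (size - kernel) // kernel + 1


size = 28                               # MNIST images are 28x28
size = pool_out(conv_out(size, 5), 2)   # conv1 (5x5) + pool1: 28 -> 24 -> 12
size = pool_out(conv_out(size, 5), 2)   # conv2 (5x5) + pool2: 12 -> 8 -> 4
flat_features = 16 * size * size        # conv2 produces 16 channels
print(flat_features)
```

If the input resolution changes, the input size of `fc1` has to change with it, which is why this value is worth recomputing rather than copying.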
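## 3 Creating a Training Task
Once the data and scripts are ready, create a training task to run the PyTorch script. First-time users can follow this example.
### Training Page Example
Because of A100 compatibility, the A100 requires CUDA 11 or later. The platform already provides an A100-ready CUDA base image, so you only need to select the corresponding public image:
![avatar](Example_picture/适用A100的基础镜像.png)
Reference settings for the training page:
![avatar](Example_picture/基础镜像.png)
Table 1: Parameters on the training job creation page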
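| Parameter | Description |
| ----------------- | ----------- |
| Compute resource | Select CPU/GPU |
| Code branch | Select the branch of the repository to use; master can be selected by default |
| Image | Select an image that has already been debugged in the debugging environment. For the current version, select the base image: the platform provides an A100-ready CUDA base image, such as dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191 |
| Startup file | Select the startup script train.py in the code directory |
| Dataset | Select the public dataset MnistDataset_torch.zip already uploaded to the OpenI platform |
| Run parameters | Add run parameters to pass values to other script arguments, such as epoch_size |
| Resource specification | Select a specification that includes the number of GPUs you need |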
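
Run parameters entered on the page arrive in the script as command-line arguments. The sketch below mirrors how the example scripts read them with `argparse`; the simulated command line and the `--some_platform_flag` argument are made up for illustration. Using `parse_known_args` means that arguments the script does not declare are collected instead of causing an error.

```python
import argparse

parser = argparse.ArgumentParser(description="run-parameter sketch")
parser.add_argument("--epoch_size", type=int, default=1, help="number of epochs to train")
parser.add_argument("--batch_size", type=int, default=256, help="number of samples per batch")

# Simulated command line; "--some_platform_flag" is a made-up undeclared argument
argv = ["--epoch_size", "3", "--some_platform_flag", "value"]
args, unknown = parser.parse_known_args(argv)

print(args.epoch_size, args.batch_size, unknown)
```

Here `epoch_size` takes the value passed in, `batch_size` falls back to its default, and the undeclared pair ends up in `unknown`.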
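## 4 Viewing the Results
### 4.1 View the run log on the training job page
Currently a training task's log only contains what the code writes with print; see the print calls in the example train.py.
### 4.2 Download the model file after training finishes
![avatar](Example_picture/结果下载.png)
## If you have any questions about the example code, feel free to open an issue in this project.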
@@ -0,0 +1,76 @@ | |||
#!/usr/bin/python | |||
#coding=utf-8 | |||
'''
GPU inference example.

If the code contains Chinese comments, add these two lines at the top of the file:
    #!/usr/bin/python
    #coding=utf-8

Because of A100 compatibility, use the platform's recommended CUDA 11 image,
then adapt the code and submit the image.
The image used in this example is: dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191

In the inference environment:
    The selected dataset is placed in the /dataset directory automatically.
    If MnistDataset_torch.zip is selected, the test data is under /dataset/test.
    The selected model file is placed in the /model directory.
    Results are written to /result, and the Qizhi platform offers the files
    under /result for download.

Note: the inference environment currently has no internet access, so public
images cannot be used. Submit the image to the Qizhi platform first, and upload
the inference dataset to the platform beforehand as well.
'''
import os
import argparse

import torch
from torchvision.datasets import mnist
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor

# Inference settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
# Name of the model file inside /model
parser.add_argument('--modelname', help='model name')

if __name__ == '__main__':
    args, unknown = parser.parse_known_args()

    print('cuda is available:{}'.format(torch.cuda.is_available()))
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    test_dataset = mnist.MNIST(root='/dataset/test', train=False, transform=ToTensor(),
                               download=False)
    test_loader = DataLoader(test_dataset, batch_size=256)

    # If the file name is fixed, model_path can be hard-coded instead
    model_path = os.path.join('/model', args.modelname)
    # map_location lets a model saved on a GPU load on a CPU-only machine too
    model = torch.load(model_path, map_location=device).to(device)
    model.eval()

    correct = 0
    _sum = 0
    # No gradients are needed for inference
    with torch.no_grad():
        for idx, (test_x, test_label) in enumerate(test_loader):
            predict_y = model(test_x.to(device).float())
            predict_ys = predict_y.argmax(dim=-1).cpu()
            correct += (predict_ys == test_label).sum().item()
            _sum += test_label.shape[0]
    print('accuracy: {:.2f}'.format(correct / _sum))

    # Write the result to /result
    filename = 'result.txt'
    file_path = os.path.join('/result', filename)
    with open(file_path, 'w') as file:
        file.write('accuracy: {:.2f}'.format(correct / _sum))
@@ -0,0 +1,35 @@ | |||
from torch.nn import Module
from torch import nn


class Model(Module):
    """LeNet-5 style network for 1x28x28 MNIST images."""

    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2)
        # 16 channels * 4 * 4 spatial positions = 256 flattened features
        self.fc1 = nn.Linear(256, 120)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(120, 84)
        self.relu4 = nn.ReLU()
        self.fc3 = nn.Linear(84, 10)
        self.relu5 = nn.ReLU()

    def forward(self, x):
        y = self.conv1(x)
        y = self.relu1(y)
        y = self.pool1(y)
        y = self.conv2(y)
        y = self.relu2(y)
        y = self.pool2(y)
        # Flatten everything except the batch dimension
        y = y.view(y.shape[0], -1)
        y = self.fc1(y)
        y = self.relu3(y)
        y = self.fc2(y)
        y = self.relu4(y)
        y = self.fc3(y)
        y = self.relu5(y)
        return y
@@ -0,0 +1 @@ | |||
from einops import rearrange |
@@ -0,0 +1,86 @@ | |||
#!/usr/bin/python | |||
#coding=utf-8 | |||
'''
If the code contains Chinese comments, add these two lines at the top of the file:
    #!/usr/bin/python
    #coding=utf-8

Because of A100 compatibility, use the platform's recommended CUDA 11 image before
using the training environment, then adapt the code and submit the image.
The image used in this example is: dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191

In the training environment, the uploaded dataset is placed in the /dataset directory automatically.
If a single dataset is selected:
    with MnistDataset_torch.zip selected, the data is under /dataset/train and /dataset/test.
If multiple datasets are selected:
    with MnistDataset_torch.zip and checkpoint_epoch1_0.73.zip selected, the data is under
    /dataset/MnistDataset_torch/train, /dataset/MnistDataset_torch/test
    and /dataset/checkpoint_epoch1_0.73/mnist_epoch1_0.73.pkl

Models are downloaded from /model by default. Save the model output to /model,
and the Qizhi platform will offer the files under /model for download.
'''
from model import Model
import torch
from torchvision.datasets import mnist
from torch.nn import CrossEntropyLoss
from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
import argparse

# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
# The dataset is placed under /dataset
parser.add_argument('--traindata', default="/dataset/train", help='path to train dataset')
parser.add_argument('--testdata', default="/dataset/test", help='path to test dataset')
parser.add_argument('--epoch_size', type=int, default=1, help='number of epochs to train')
parser.add_argument('--batch_size', type=int, default=256, help='number of samples per batch')

if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    # Log output: anything printed here shows up in the training job log
    print('cuda is available:{}'.format(torch.cuda.is_available()))
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    train_dataset = mnist.MNIST(root=args.traindata, train=True, transform=ToTensor(), download=False)
    test_dataset = mnist.MNIST(root=args.testdata, train=False, transform=ToTensor(), download=False)
    train_loader = DataLoader(train_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    model = Model().to(device)
    sgd = SGD(model.parameters(), lr=1e-1)
    cost = CrossEntropyLoss()
    epoch = args.epoch_size
    print('epoch_size is:{}'.format(epoch))
    for _epoch in range(epoch):
        print('the {} epoch_size begin'.format(_epoch + 1))
        model.train()
        for idx, (train_x, train_label) in enumerate(train_loader):
            train_x = train_x.to(device)
            train_label = train_label.to(device)
            sgd.zero_grad()
            predict_y = model(train_x.float())
            loss = cost(predict_y, train_label.long())
            if idx % 10 == 0:
                print('idx: {}, loss: {}'.format(idx, loss.item()))
            loss.backward()
            sgd.step()

        # Evaluate on the test set after each epoch
        correct = 0
        _sum = 0
        model.eval()
        with torch.no_grad():
            for idx, (test_x, test_label) in enumerate(test_loader):
                predict_y = model(test_x.to(device).float())
                predict_ys = predict_y.argmax(dim=-1).cpu()
                correct += (predict_ys == test_label).sum().item()
                _sum += test_label.shape[0]
        print('accuracy: {:.2f}'.format(correct / _sum))
        # The model output is saved under /model
        torch.save(model, '/model/mnist_epoch{}_{:.2f}.pkl'.format(_epoch+1, correct / _sum))
@@ -0,0 +1,78 @@ | |||
#!/usr/bin/python | |||
#coding=utf-8 | |||
'''
If the code contains Chinese comments, add these two lines at the top of the file:
    #!/usr/bin/python
    #coding=utf-8

In the training environment:
    the code is placed in the /tmp/code directory automatically,
    the uploaded dataset is placed in the /tmp/dataset directory automatically, and
    models are downloaded from /tmp/output by default. Save the model output to /tmp/output,
    and the Qizhi platform will offer the files under /tmp/output for download.
'''
from model import Model
import torch
from torchvision.datasets import mnist
from torch.nn import CrossEntropyLoss
from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
import argparse

# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
# The dataset is placed under /tmp/dataset
parser.add_argument('--traindata', default="/tmp/dataset/train", help='path to train dataset')
parser.add_argument('--testdata', default="/tmp/dataset/test", help='path to test dataset')
parser.add_argument('--epoch_size', type=int, default=1, help='number of epochs to train')
parser.add_argument('--batch_size', type=int, default=256, help='number of samples per batch')

if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    # Log output: anything printed here shows up in the training job log
    print('cuda is available:{}'.format(torch.cuda.is_available()))
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    train_dataset = mnist.MNIST(root=args.traindata, train=True, transform=ToTensor(), download=False)
    test_dataset = mnist.MNIST(root=args.testdata, train=False, transform=ToTensor(), download=False)
    train_loader = DataLoader(train_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    model = Model().to(device)
    sgd = SGD(model.parameters(), lr=1e-1)
    cost = CrossEntropyLoss()
    epoch = args.epoch_size
    print('epoch_size is:{}'.format(epoch))
    for _epoch in range(epoch):
        print('the {} epoch_size begin'.format(_epoch + 1))
        model.train()
        for idx, (train_x, train_label) in enumerate(train_loader):
            train_x = train_x.to(device)
            train_label = train_label.to(device)
            sgd.zero_grad()
            predict_y = model(train_x.float())
            loss = cost(predict_y, train_label.long())
            if idx % 10 == 0:
                print('idx: {}, loss: {}'.format(idx, loss.item()))
            loss.backward()
            sgd.step()

        # Evaluate on the test set after each epoch
        correct = 0
        _sum = 0
        model.eval()
        with torch.no_grad():
            for idx, (test_x, test_label) in enumerate(test_loader):
                predict_y = model(test_x.to(device).float())
                predict_ys = predict_y.argmax(dim=-1).cpu()
                correct += (predict_ys == test_label).sum().item()
                _sum += test_label.shape[0]
        print('accuracy: {:.2f}'.format(correct / _sum))
        # The model output is saved under /tmp/output
        torch.save(model, '/tmp/output/mnist_epoch{}_{:.2f}.pkl'.format(_epoch+1, correct / _sum))
@@ -0,0 +1,116 @@ | |||
#!/usr/bin/python | |||
#coding=utf-8 | |||
'''
If the code contains Chinese comments, add these two lines at the top of the file:
    #!/usr/bin/python
    #coding=utf-8

1. Structure of the datasets used in this multi-dataset example:
    MnistDataset_torch.zip
    ├── test
    └── train

    checkpoint_epoch1_0.73.zip
    └── mnist_epoch1_0.73.pkl

2. Because of A100 compatibility, use the platform's recommended CUDA 11 image before
   using the training environment, then adapt the code and submit the image.
   The image used in this example is: dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191

In the training environment, the uploaded dataset is placed in the /dataset directory automatically.
Note: the paths differ between selecting a single dataset and selecting multiple datasets.

(1) If a single dataset is selected, for example MnistDataset_torch.zip,
    the data is under /dataset/train and /dataset/test.
    Layout of the single dataset in the training image in this example:
    dataset
    ├── test
    └── train

(2) If multiple datasets are selected, for example MnistDataset_torch.zip and checkpoint_epoch1_0.73.zip,
    the data is under /dataset/MnistDataset_torch/train, /dataset/MnistDataset_torch/test
    and /dataset/checkpoint_epoch1_0.73/mnist_epoch1_0.73.pkl
    Layout of the multiple datasets in the training image in this example:
    dataset
    ├── MnistDataset_torch
    │   ├── test
    │   └── train
    └── checkpoint_epoch1_0.73
        └── mnist_epoch1_0.73.pkl

Models are downloaded from /model by default. Save the model output to /model,
and the Qizhi platform will offer the files under /model for download.
'''
from model import Model
import torch
from torchvision.datasets import mnist
from torch.nn import CrossEntropyLoss
from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
import argparse

# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
# With multiple datasets, each dataset is placed under /dataset/<dataset name>
parser.add_argument('--traindata', default="/dataset/MnistDataset_torch/train", help='path to train dataset')
parser.add_argument('--testdata', default="/dataset/MnistDataset_torch/test", help='path to test dataset')
parser.add_argument('--checkpoint', default="/dataset/checkpoint_epoch1_0.73/mnist_epoch1_0.73.pkl", help='checkpoint file')
parser.add_argument('--epoch_size', type=int, default=1, help='number of epochs to train')
parser.add_argument('--batch_size', type=int, default=256, help='number of samples per batch')
# Name of the model file
parser.add_argument('--modelname', help='model name')

if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    # Log output: anything printed here shows up in the training job log
    print('cuda is available:{}'.format(torch.cuda.is_available()))
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    train_dataset = mnist.MNIST(root=args.traindata, train=True, transform=ToTensor(), download=False)
    test_dataset = mnist.MNIST(root=args.testdata, train=False, transform=ToTensor(), download=False)
    train_loader = DataLoader(train_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    model = Model().to(device)
    sgd = SGD(model.parameters(), lr=1e-1)
    cost = CrossEntropyLoss()
    epoch = args.epoch_size
    print('epoch_size is:{}'.format(epoch))

    # Load the trained model
    # path = args.checkpoint
    # checkpoint = torch.load(path, map_location=device)
    # model.load_state_dict(checkpoint)

    for _epoch in range(epoch):
        print('the {} epoch_size begin'.format(_epoch + 1))
        model.train()
        for idx, (train_x, train_label) in enumerate(train_loader):
            train_x = train_x.to(device)
            train_label = train_label.to(device)
            sgd.zero_grad()
            predict_y = model(train_x.float())
            loss = cost(predict_y, train_label.long())
            if idx % 10 == 0:
                print('idx: {}, loss: {}'.format(idx, loss.item()))
            loss.backward()
            sgd.step()

        # Evaluate on the test set after each epoch
        correct = 0
        _sum = 0
        model.eval()
        with torch.no_grad():
            for idx, (test_x, test_label) in enumerate(test_loader):
                predict_y = model(test_x.to(device).float())
                predict_ys = predict_y.argmax(dim=-1).cpu()
                correct += (predict_ys == test_label).sum().item()
                _sum += test_label.shape[0]
        print('accuracy: {:.2f}'.format(correct / _sum))
        # The model output is saved under /model
        torch.save(model, '/model/mnist_epoch{}_{:.2f}.pkl'.format(_epoch+1, correct / _sum))
@@ -0,0 +1,99 @@ | |||
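# How to Train Models on the OpenI (Qizhi) Platform - NPU Version
- **Single-dataset and multi-dataset training are set up differently on the OpenI cluster and the C2Net cluster. Choose the one training mode you need and mind the differences (the environments below are training environments by default)**:
    - For single-dataset training on one or more cards on the OpenI cluster, see the code comments in [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/train.py)
    - For single-dataset, single-card inference on the OpenI cluster, see the code comments in [inference.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/inference.py)
    - For multi-dataset training on one or more cards on the OpenI cluster, see the code comments in [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/train_for_multidataset.py)
    - For single-dataset training on one or more cards on the C2Net cluster, see the code comments in [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/train_for_c2net.py)
    - For more on distributed training, see the official MindSpore tutorial: [MindSpore distributed training tutorial](https://www.mindspore.cn/tutorial/training/zh-CN/r1.2/advanced_use/distributed_training_ascend.html)
- **Differences between single-dataset and multi-dataset jobs on the NPU OpenI cluster**:
    - The parameters differ:
      a single dataset is passed through --data_url;
      multiple datasets are passed through --multi_data_url, and --data_url must still be kept
    - The dataset locations differ:
      with a single dataset, as in this example, MNISTData.zip is available under /cache/data;
      with multiple datasets, MNISTData.zip is available under /cache/data/MNISTData/
- **Differences between the NPU OpenI cluster and the C2Net cluster**:
    - The OpenI cluster requires moxing to copy data between OBS and the training image
    - The C2Net cluster does not require moxing to copy data
- **Differences between the debugging image and the training image on the NPU OpenI cluster**:
    - To run multi-card parallel training in the debugging environment, see the [debugging-environment multi-card parallel example](https://git.openi.org.cn/OpenIOSSG/MNIST_Example_NPU_Debug)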
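
The single-dataset and multi-dataset differences listed above can be sketched as follows. This is only an illustration under stated assumptions: the argument names `--data_url` and `--multi_data_url` and the /cache/data paths come from this README, while `local_dataset_dir` and the placeholder argument values are hypothetical. The sketch does not parse the contents of `--multi_data_url`, because its format is not described here.

```python
import argparse
import posixpath

parser = argparse.ArgumentParser(description="NPU dataset argument sketch")
# Both arguments are declared so that either job type can start the script
parser.add_argument("--data_url", default="", help="single-dataset location passed by the platform")
parser.add_argument("--multi_data_url", default="", help="multi-dataset description passed by the platform")


def local_dataset_dir(args, dataset_name, cache_root="/cache/data"):
    """Directory the dataset is read from inside the training image.

    Hypothetical helper: multi-dataset jobs use one sub-directory per dataset,
    single-dataset jobs use cache_root directly.
    """
    if args.multi_data_url:
        return posixpath.join(cache_root, dataset_name)
    return cache_root


# Placeholder values stand in for what the platform would pass
single_args, _ = parser.parse_known_args(["--data_url", "placeholder"])
multi_args, _ = parser.parse_known_args(
    ["--data_url", "placeholder", "--multi_data_url", "placeholder"])

single_dir = local_dataset_dir(single_args, "MNISTData")
multi_dir = local_dataset_dir(multi_args, "MNISTData")
print(single_dir)
print(multi_dir)
```

The example scripts linked above remain the reference for how the data is actually copied; this sketch only covers how the local read path differs between the two job types.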
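## 1 Overview
- This project uses LeNet-MNIST as an example to show briefly how to run training tasks with MindSpore on the OpenI AI collaboration platform. It provides example code for single-dataset training, multi-dataset training, C2Net training, and single-dataset inference, and is meant to give AI developers a working OpenI NPU training example. If you have any questions about the example code, feel free to open an issue in this project.
- Users can create their own training tasks directly from the dataset and code files provided in this project.
- The OpenI platform connects to ModelArts and OBS, bringing datasets, code, and training resource pools together on the OpenI AI collaboration platform for developers to use.
- ModelArts is Huawei Cloud's one-stop AI development platform for developers. It integrates a pool of Ascend AI processors, and users can try MindSpore on ModelArts.
- OBS is the storage service provided by Huawei Cloud.
## 2 Preparation
- To get ready to use the OpenI platform, create an OpenI account, clone the code to your own account, and upload the dataset. The beginner bootcamp course series in the [OpenI_Learning](https://git.openi.org.cn/zeizei/OpenI_Learning) project walks through each of these steps.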
### 2.1 Data Preparation
#### Downloading the Dataset
- The dataset can be downloaded from this project's dataset page: [dataset download](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/datasets?type=1)
- Data file description
    - The MNISTData dataset consists of 28×28 grayscale images in 10 classes. The training set contains 60,000 images and the test set contains 10,000 images.
    - The directory structure of the dataset archives is as follows:
> MNIST_Data.zip
> ├── test
> │   ├── t10k-images-idx3-ubyte
> │   └── t10k-labels-idx1-ubyte
> └── train
>     ├── train-images-idx3-ubyte
>     └── train-labels-idx1-ubyte
>
> checkpoint_lenet-1_1875.zip
> └── checkpoint_lenet-1_1875.ckpt
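#### Uploading the Dataset
- This example is developed with MindSpore and runs on NPU chips, so datasets must be uploaded to the NPU section of the platform.\
[Note: to try out this example you do not need to upload the dataset again, because the dataset MNIST_Example used here has already been published as a public dataset and can be referenced directly, or used after liking and bookmarking it.]
- As shown below:
- ![avatar](Example_Picture/数据集上传位置.png)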
### 2.2 Preparing the Scripts
#### Example Code
- The example code can be downloaded from this repository: [code download](https://git.openi.org.cn/OpenIOSSG/MNIST_Example)
- Code file description
    - [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/train.py): the script for single-dataset training on the OpenI cluster. It copies the dataset from OBS into the training image, sets the number of epochs, and copies the trained model data back to OBS. See the code comments in [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/train.py) for details.
    - [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/train_for_c2net.py): the script for C2Net training, including setting the number of epochs. See the code comments in [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/train_for_c2net.py) for details.
    - [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/train_for_multidataset.py): the script for multi-dataset training on the OpenI cluster. It copies the datasets from OBS into the training image, sets the number of epochs, and copies the trained model data back to OBS. See the code comments in [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/train_for_multidataset.py) for details.
    - [inference.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/inference.py): the script for inference on the OpenI cluster. It copies the dataset and the ckpt file from OBS into the inference image and copies the output result back to OBS. See the code comments in [inference.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/inference.py) for details.
    - [config.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/config.py): the network configuration, used by the single-dataset, multi-dataset, and C2Net training scripts.
    - [dataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/dataset.py): preprocesses the raw dataset into a dataset that can be used for training; used by the single-dataset, multi-dataset, and C2Net training scripts.
    - [lenet.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/lenet.py): the network being trained, used by the single-dataset, multi-dataset, and C2Net training scripts.
    - [dataset_distributes.py](https://git.openi.org.cn/OpenIOSSG/MNIST_Example/src/branch/master/dataset_distributes.py): preprocesses the raw dataset into a dataset that can be used for single-machine multi-card training.
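
The preprocessing in dataset.py applies two rescale steps in a row: one with `1.0 / 255.0` and one with `1 / 0.3081` and `-0.1307 / 0.3081`. The sketch below checks, in plain Python and without MindSpore, that the two steps together equal the familiar normalization `(x / 255 - 0.1307) / 0.3081`. It assumes each rescale step computes `output = input * rescale + shift`; the function names are made up for this check.

```python
# Constants copied from dataset.py
rescale, shift = 1.0 / 255.0, 0.0
rescale_nml, shift_nml = 1 / 0.3081, -1 * 0.1307 / 0.3081


def two_step(pixel):
    """Apply the two rescale operations in the order dataset.py maps them."""
    scaled = pixel * rescale + shift            # bring 0..255 into 0..1
    return scaled * rescale_nml + shift_nml     # subtract mean, divide by std


def direct(pixel):
    """Single-expression normalization with mean 0.1307 and std 0.3081."""
    return (pixel / 255.0 - 0.1307) / 0.3081


max_diff = max(abs(two_step(p) - direct(p)) for p in range(256))
print(max_diff < 1e-9)
```

Folding the mean and standard deviation into a scale and a shift lets the pipeline express normalization with the same rescale operation it already uses for the 0 to 1 conversion.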
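## 3 Creating a Training Task
- Once the data and scripts are ready, create a training task to actually run the MindSpore script. First-time users can follow this example.
### Create a training job with MindSpore as the training framework; the page is shown in the screenshot below.
![avatar](Example_Picture/新建训练任务页面.png)
Table 1: Parameters on the training job creation page
| Parameter | Description |
| ----------------- | ----------- |
| Code branch | Select the branch of the repository to use; master can be selected by default. |
| AI engine | Select [Ascend-Powered-Engine] and the MindSpore version you need (the screenshot in this example shows [Mindspore-1.3.0-python3.7-aarch64]; make sure to use scripts that match the selected version). |
| Startup file | Select the startup script in the code directory. |
| Dataset | Select a dataset that has been uploaded to the OpenI platform. |
| Run parameters | For a single dataset, the data storage location and the training output location correspond to the run parameters data_url and train_url. For multiple datasets, additionally add the parameter multi_data_url and declare it in the code. Adding run parameters lets you pass values to other script arguments, such as epoch_size. Only these other arguments need to be entered here: data_url and train_url are added to the run parameters by default, so you do not specify them again on the page and only need to declare them in the code. |
| Resource pool | Select the specification [Ascend: 1 * Ascend 910 CPU:24 核 256GiB], which means a single machine with a single card. |
<!-- Note: to use the CPU on the OpenI platform, add the run parameter device_target=CPU on the training page; otherwise the default is Ascend, as shown below
![avatar](Example_Picture/运行参数界面.png) -->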
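## 4 Viewing the Results
### 4.1 View the run log on the training job page
![avatar](Example_Picture/查看日志页面.png)
### 4.2 Download the model file after training finishes
![avatar](Example_Picture/模型下载页面.png)
## If you have any questions about the example code, feel free to open an issue in this project.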
@@ -0,0 +1,33 @@ | |||
# Copyright 2020 Huawei Technologies Co., Ltd | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================ | |||
""" | |||
network config setting, will be used in train.py | |||
""" | |||
from easydict import EasyDict as edict

mnist_cfg = edict({
    'num_classes': 10,
    'lr': 0.01,
    'momentum': 0.9,
    'epoch_size': 10,
    'batch_size': 32,
    'buffer_size': 1000,
    'image_height': 32,
    'image_width': 32,
    'save_checkpoint_steps': 1875,
    'keep_checkpoint_max': 150,
    'air_name': "lenet",
})
@@ -0,0 +1,60 @@ | |||
# Copyright 2020 Huawei Technologies Co., Ltd | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================ | |||
""" | |||
Produce the dataset | |||
""" | |||
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as CV
import mindspore.dataset.transforms.c_transforms as C
from mindspore.dataset.vision import Inter
from mindspore.common import dtype as mstype


def create_dataset(data_path, batch_size=32, repeat_size=1,
                   num_parallel_workers=1):
    """
    create dataset for train or test
    """
    # define dataset
    mnist_ds = ds.MnistDataset(data_path)

    resize_height, resize_width = 32, 32
    rescale = 1.0 / 255.0
    shift = 0.0
    rescale_nml = 1 / 0.3081
    shift_nml = -1 * 0.1307 / 0.3081

    # define map operations
    resize_op = CV.Resize((resize_height, resize_width), interpolation=Inter.LINEAR)  # Bilinear mode
    rescale_nml_op = CV.Rescale(rescale_nml, shift_nml)
    rescale_op = CV.Rescale(rescale, shift)
    hwc2chw_op = CV.HWC2CHW()
    type_cast_op = C.TypeCast(mstype.int32)

    # apply map operations on images
    mnist_ds = mnist_ds.map(operations=type_cast_op, input_columns="label", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=resize_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_nml_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=hwc2chw_op, input_columns="image", num_parallel_workers=num_parallel_workers)

    # apply DatasetOps
    buffer_size = 10000
    mnist_ds = mnist_ds.shuffle(buffer_size=buffer_size)  # 10000 as in LeNet train script
    mnist_ds = mnist_ds.batch(batch_size, drop_remainder=True)
    mnist_ds = mnist_ds.repeat(repeat_size)

    return mnist_ds
@@ -0,0 +1,54 @@ | |||
""" | |||
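Produce the dataset:
Unlike the single-device case, the dataset interface must be given the num_shards and shard_id
arguments, which are the number of cards and the logical index of the current card.
It is recommended to obtain them through the HCCL interface:
    get_rank: gets the ID of the current device in the cluster.
    get_group_size: gets the number of devices in the cluster.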
""" | |||
import mindspore.dataset as ds | |||
import mindspore.dataset.vision.c_transforms as CV | |||
import mindspore.dataset.transforms.c_transforms as C | |||
from mindspore.dataset.vision import Inter | |||
from mindspore.common import dtype as mstype | |||
from mindspore.communication.management import get_rank, get_group_size | |||
def create_dataset_parallel(data_path, batch_size=32, repeat_size=1, | |||
                            num_parallel_workers=1, shard_id=0, num_shards=8): | |||
    """ | |||
    create dataset for train or test | |||
    """ | |||
    resize_height, resize_width = 32, 32 | |||
    rescale = 1.0 / 255.0 | |||
    shift = 0.0 | |||
    rescale_nml = 1 / 0.3081 | |||
    shift_nml = -1 * 0.1307 / 0.3081 | |||
    # Override shard_id and num_shards with the rank of the current device | |||
    # and the number of devices in the cluster; the argument values are ignored. | |||
    shard_id = get_rank() | |||
    num_shards = get_group_size() | |||
    # define dataset | |||
    mnist_ds = ds.MnistDataset(data_path, num_shards=num_shards, shard_id=shard_id) | |||
    # define map operations | |||
    resize_op = CV.Resize((resize_height, resize_width), interpolation=Inter.LINEAR)  # Bilinear mode | |||
    rescale_nml_op = CV.Rescale(rescale_nml, shift_nml) | |||
    rescale_op = CV.Rescale(rescale, shift) | |||
    hwc2chw_op = CV.HWC2CHW() | |||
    type_cast_op = C.TypeCast(mstype.int32) | |||
    # apply map operations on images | |||
    mnist_ds = mnist_ds.map(operations=type_cast_op, input_columns="label", num_parallel_workers=num_parallel_workers) | |||
    mnist_ds = mnist_ds.map(operations=resize_op, input_columns="image", num_parallel_workers=num_parallel_workers) | |||
    mnist_ds = mnist_ds.map(operations=rescale_op, input_columns="image", num_parallel_workers=num_parallel_workers) | |||
    mnist_ds = mnist_ds.map(operations=rescale_nml_op, input_columns="image", num_parallel_workers=num_parallel_workers) | |||
    mnist_ds = mnist_ds.map(operations=hwc2chw_op, input_columns="image", num_parallel_workers=num_parallel_workers) | |||
    # apply DatasetOps | |||
    buffer_size = 10000 | |||
    mnist_ds = mnist_ds.shuffle(buffer_size=buffer_size)  # 10000 as in LeNet train script | |||
    mnist_ds = mnist_ds.batch(batch_size, drop_remainder=True) | |||
    mnist_ds = mnist_ds.repeat(repeat_size) | |||
    return mnist_ds |
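The role of `num_shards` and `shard_id` in the file above can be pictured with a small stand-alone sketch. It assumes a simple strided split, which is an illustration only: the helper name `shard_indices` is hypothetical, and the real assignment is decided inside MindSpore and may differ.

```python
def shard_indices(num_samples, num_shards, shard_id):
    """Return the sample indices one shard would read under a strided split.

    Illustrative assumption only: the actual split is done inside the
    dataset library and may differ (for example after shuffling or padding).
    """
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    return list(range(shard_id, num_samples, num_shards))


if __name__ == "__main__":
    # 10 samples split across 4 hypothetical cards.
    for rank in range(4):
        print(rank, shard_indices(10, 4, rank))
```

The point of the sketch is that every card reads a disjoint subset and the subsets together cover the dataset, which is why each process must pass its own rank as `shard_id`.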
@@ -0,0 +1,139 @@ | |||
""" | |||
######################## single-dataset inference lenet example ######################## | |||
This example is a single-dataset inference tutorial. | |||
######################## Instructions for using the inference environment ######################## | |||
1、Inference task requires predefined functions | |||
(1)Copy single dataset from obs to inference image. | |||
function ObsToEnv(obs_data_url, data_dir) | |||
(2)Copy ckpt file from obs to inference image. | |||
function ObsUrlToEnv(obs_ckpt_url, ckpt_url) | |||
(3)Copy the output result to obs. | |||
function EnvToObs(train_dir, obs_train_url) | |||
2、Four parameters need to be defined. | |||
--data_url is the dataset you selected on the Qizhi platform | |||
--ckpt_url is the weight file you selected on the Qizhi platform | |||
--data_url, --ckpt_url, --result_url and --device_target must all be defined in a single-dataset inference task, | |||
otherwise an error will be reported. | |||
You do not need to add these parameters to the running parameters on the Qizhi platform: | |||
the platform passes them in the background, so you only need to define them in your code. | |||
3、How the dataset is used | |||
The inference task receives the dataset through data_url and copies it to data_dir (i.e. '/cache/data'), | |||
which is the path used to read the dataset inside the image. | |||
For details, please refer to the following sample code. | |||
""" | |||
import os | |||
import argparse | |||
import moxing as mox | |||
import mindspore.nn as nn | |||
from mindspore import context | |||
from mindspore.train.serialization import load_checkpoint, load_param_into_net | |||
from mindspore.train import Model | |||
from mindspore.nn.metrics import Accuracy | |||
from mindspore import Tensor | |||
import numpy as np | |||
from glob import glob | |||
from dataset import create_dataset | |||
from config import mnist_cfg as cfg | |||
from lenet import LeNet5 | |||
### Copy single dataset from obs to inference image ### | |||
def ObsToEnv(obs_data_url, data_dir): | |||
try: | |||
mox.file.copy_parallel(obs_data_url, data_dir) | |||
print("Successfully Download {} to {}".format(obs_data_url, data_dir)) | |||
except Exception as e: | |||
print('moxing download {} to {} failed: '.format(obs_data_url, data_dir) + str(e)) | |||
return | |||
### Copy ckpt file from obs to inference image### | |||
### To copy a folder, use mox.file.copy_parallel; to copy a single file, use mox.file.copy. | |||
### The checkpoint is a single file, so mox.file.copy is used here. | |||
def ObsUrlToEnv(obs_ckpt_url, ckpt_url): | |||
try: | |||
mox.file.copy(obs_ckpt_url, ckpt_url) | |||
print("Successfully Download {} to {}".format(obs_ckpt_url,ckpt_url)) | |||
except Exception as e: | |||
print('moxing download {} to {} failed: '.format(obs_ckpt_url, ckpt_url) + str(e)) | |||
return | |||
### Copy the output result to obs### | |||
def EnvToObs(train_dir, obs_train_url): | |||
try: | |||
mox.file.copy_parallel(train_dir, obs_train_url) | |||
print("Successfully Upload {} to {}".format(train_dir,obs_train_url)) | |||
except Exception as e: | |||
print('moxing upload {} to {} failed: '.format(train_dir,obs_train_url) + str(e)) | |||
return | |||
### --data_url, --ckpt_url, --result_url and --device_target must all be defined in an inference task, | |||
### otherwise an error will be reported. | |||
### There is no need to add these parameters to the running parameters of the Qizhi platform, | |||
### because they are predefined in the background, you only need to define them in your code. | |||
parser = argparse.ArgumentParser(description='MindSpore Lenet Example') | |||
parser.add_argument('--data_url', | |||
type=str, | |||
default= '/cache/data/', | |||
help='path where the dataset is saved') | |||
parser.add_argument('--ckpt_url', | |||
help='model to save/load', | |||
default= '/cache/checkpoint.ckpt') | |||
parser.add_argument('--result_url', | |||
help='result folder to save/load', | |||
default= '/cache/result/') | |||
parser.add_argument('--device_target', type=str, default="Ascend", choices=['Ascend', 'GPU', 'CPU'], | |||
help='device where the code will be implemented (default: Ascend)') | |||
if __name__ == "__main__": | |||
args = parser.parse_args() | |||
###Initialize the data and result directories in the inference image### | |||
data_dir = '/cache/data' | |||
result_dir = '/cache/result' | |||
ckpt_url = '/cache/checkpoint.ckpt' | |||
if not os.path.exists(data_dir): | |||
os.makedirs(data_dir) | |||
if not os.path.exists(result_dir): | |||
os.makedirs(result_dir) | |||
###Copy dataset from obs to inference image | |||
ObsToEnv(args.data_url, data_dir) | |||
###Copy ckpt file from obs to inference image | |||
ObsUrlToEnv(args.ckpt_url, ckpt_url) | |||
context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target) | |||
network = LeNet5(cfg.num_classes) | |||
net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean") | |||
repeat_size = cfg.epoch_size | |||
net_opt = nn.Momentum(network.trainable_params(), cfg.lr, cfg.momentum) | |||
model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}) | |||
print("============== Starting Testing ==============") | |||
param_dict = load_checkpoint(os.path.join(ckpt_url)) | |||
load_param_into_net(network, param_dict) | |||
ds_test = create_dataset(os.path.join(data_dir, "test"), batch_size=1).create_dict_iterator() | |||
data = next(ds_test) | |||
images = data["image"].asnumpy() | |||
labels = data["label"].asnumpy() | |||
print('Tensor:', Tensor(data['image'])) | |||
output = model.predict(Tensor(data['image'])) | |||
predicted = np.argmax(output.asnumpy(), axis=1) | |||
pred = np.argmax(output.asnumpy(), axis=1) | |||
print('predicted:', predicted) | |||
print('pred:', pred) | |||
print(f'Predicted: "{predicted[0]}", Actual: "{labels[0]}"') | |||
filename = 'result.txt' | |||
file_path = os.path.join(result_dir, filename) | |||
with open(file_path, 'a+') as file: | |||
file.write(" {}: {:.2f} \n".format("Predicted", predicted[0])) | |||
###Copy result data from the local running environment back to obs, | |||
###and download it in the inference task corresponding to the Qizhi platform | |||
EnvToObs(result_dir, args.result_url) |
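The script above predicts a single sample by taking the argmax of the network output. A minimal sketch of the same step over a small batch, written in plain Python so it runs without MindSpore or NumPy, is shown below. The helper names `argmax_per_row` and `accuracy` and the sample logits are hypothetical and only illustrate the idea.

```python
def argmax_per_row(logits):
    """Return the index of the largest value in each row.

    Mirrors what np.argmax(..., axis=1) computes; on ties the first index wins.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(logits, labels):
    """Fraction of rows whose argmax equals the corresponding label."""
    predicted = argmax_per_row(logits)
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Made-up logits for three samples and three classes.
    sample_logits = [[0.1, 2.0, -1.0], [3.0, 0.0, 0.5], [0.2, 0.1, 0.9]]
    sample_labels = [1, 0, 1]
    print(argmax_per_row(sample_logits), accuracy(sample_logits, sample_labels))
```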
@@ -0,0 +1,158 @@ | |||
""" | |||
######################## multi-dataset inference lenet example ######################## | |||
This example is a multi-dataset inference tutorial. | |||
######################## Instructions for using the inference environment ######################## | |||
1、Inference task requires predefined functions | |||
(1)Copy multi dataset from obs to inference image. | |||
function MultiObsToEnv(obs_data_url, data_dir) | |||
(2)Copy ckpt file from obs to inference image. | |||
function ObsUrlToEnv(obs_ckpt_url, ckpt_url) | |||
(3)Copy the output result to obs. | |||
function EnvToObs(train_dir, obs_train_url) | |||
2、Five parameters need to be defined. | |||
--data_url is the first dataset you selected on the Qizhi platform | |||
--multi_data_url describes all the datasets you selected on the Qizhi platform (passed as JSON) | |||
--ckpt_url is the weight file you selected on the Qizhi platform | |||
--result_url is the output location | |||
--data_url, --multi_data_url, --ckpt_url, --result_url and --device_target must all be defined in a multi-dataset inference task, | |||
otherwise an error will be reported. | |||
You do not need to add these parameters to the running parameters on the Qizhi platform: | |||
the platform passes them in the background, so you only need to define them in your code. | |||
3、How the dataset is used | |||
Multi-dataset tasks use multi_data_url as input, and data_dir + dataset name + file or folder name in the dataset | |||
is the path of the dataset in the inference image. | |||
For example, the path of the test folder in the MNISTData dataset in this example is | |||
data_dir + "/MNISTData" + "/test" | |||
For details, please refer to the following sample code. | |||
""" | |||
import os | |||
import argparse | |||
import moxing as mox | |||
import mindspore.nn as nn | |||
from mindspore import context | |||
from mindspore.train.serialization import load_checkpoint, load_param_into_net | |||
from mindspore.train import Model | |||
from mindspore.nn.metrics import Accuracy | |||
from mindspore import Tensor | |||
import numpy as np | |||
from glob import glob | |||
from dataset import create_dataset | |||
from config import mnist_cfg as cfg | |||
from lenet import LeNet5 | |||
import json | |||
### Copy multiple datasets from obs to inference image ### | |||
def MultiObsToEnv(multi_data_url, data_dir): | |||
#--multi_data_url is json data, need to do json parsing for multi_data_url | |||
multi_data_json = json.loads(multi_data_url) | |||
for i in range(len(multi_data_json)): | |||
path = data_dir + "/" + multi_data_json[i]["dataset_name"] | |||
if not os.path.exists(path): | |||
os.makedirs(path) | |||
try: | |||
mox.file.copy_parallel(multi_data_json[i]["dataset_url"], path) | |||
print("Successfully Download {} to {}".format(multi_data_json[i]["dataset_url"],path)) | |||
except Exception as e: | |||
print('moxing download {} to {} failed: '.format( | |||
multi_data_json[i]["dataset_url"], path) + str(e)) | |||
return | |||
### Copy ckpt file from obs to inference image### | |||
### To copy a folder, use mox.file.copy_parallel; to copy a single file, use mox.file.copy. | |||
### The checkpoint is a single file, so mox.file.copy is used here. | |||
def ObsUrlToEnv(obs_ckpt_url, ckpt_url): | |||
try: | |||
mox.file.copy(obs_ckpt_url, ckpt_url) | |||
print("Successfully Download {} to {}".format(obs_ckpt_url,ckpt_url)) | |||
except Exception as e: | |||
print('moxing download {} to {} failed: '.format(obs_ckpt_url, ckpt_url) + str(e)) | |||
return | |||
### Copy the output result to obs### | |||
def EnvToObs(train_dir, obs_train_url): | |||
try: | |||
mox.file.copy_parallel(train_dir, obs_train_url) | |||
print("Successfully Upload {} to {}".format(train_dir,obs_train_url)) | |||
except Exception as e: | |||
print('moxing upload {} to {} failed: '.format(train_dir,obs_train_url) + str(e)) | |||
return | |||
### --data_url,--multi_data_url,--ckpt_url,--result_url,--device_target,These 5 parameters must be defined first in a multi dataset inference task, | |||
### otherwise an error will be reported. | |||
### There is no need to add these parameters to the running parameters of the Qizhi platform, | |||
### because they are predefined in the background, you only need to define them in your code. | |||
parser = argparse.ArgumentParser(description='MindSpore Lenet Example') | |||
parser.add_argument('--data_url', | |||
type=str, | |||
default= '/cache/data1/', | |||
help='path where the dataset is saved') | |||
parser.add_argument('--multi_data_url', | |||
type=str, | |||
default= '/cache/data/', | |||
help='path where the dataset is saved') | |||
parser.add_argument('--ckpt_url', | |||
help='model to save/load', | |||
default= '/cache/checkpoint.ckpt') | |||
parser.add_argument('--result_url', | |||
help='result folder to save/load', | |||
default= '/cache/result/') | |||
parser.add_argument('--device_target', type=str, default="Ascend", choices=['Ascend', 'GPU', 'CPU'], | |||
help='device where the code will be implemented (default: Ascend)') | |||
if __name__ == "__main__": | |||
args = parser.parse_args() | |||
###Initialize the data and result directories in the inference image### | |||
data_dir = '/cache/data' | |||
result_dir = '/cache/result' | |||
ckpt_url = '/cache/checkpoint.ckpt' | |||
if not os.path.exists(data_dir): | |||
os.makedirs(data_dir) | |||
if not os.path.exists(result_dir): | |||
os.makedirs(result_dir) | |||
###Copy multiple dataset from obs to inference image | |||
MultiObsToEnv(args.multi_data_url, data_dir) | |||
###Copy ckpt file from obs to inference image | |||
ObsUrlToEnv(args.ckpt_url, ckpt_url) | |||
context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target) | |||
network = LeNet5(cfg.num_classes) | |||
net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean") | |||
repeat_size = cfg.epoch_size | |||
net_opt = nn.Momentum(network.trainable_params(), cfg.lr, cfg.momentum) | |||
model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}) | |||
print("============== Starting Testing ==============") | |||
param_dict = load_checkpoint(os.path.join(ckpt_url)) | |||
load_param_into_net(network, param_dict) | |||
ds_test = create_dataset(os.path.join(data_dir + "/MNISTData", "test"), batch_size=1).create_dict_iterator() | |||
data = next(ds_test) | |||
images = data["image"].asnumpy() | |||
labels = data["label"].asnumpy() | |||
print('Tensor:', Tensor(data['image'])) | |||
output = model.predict(Tensor(data['image'])) | |||
predicted = np.argmax(output.asnumpy(), axis=1) | |||
pred = np.argmax(output.asnumpy(), axis=1) | |||
print('predicted:', predicted) | |||
print('pred:', pred) | |||
print(f'Predicted: "{predicted[0]}", Actual: "{labels[0]}"') | |||
filename = 'result.txt' | |||
file_path = os.path.join(result_dir, filename) | |||
with open(file_path, 'a+') as file: | |||
file.write(" {}: {:.2f} \n".format("Predicted", predicted[0])) | |||
###Copy result data from the local running environment back to obs, | |||
###and download it in the inference task corresponding to the Qizhi platform | |||
EnvToObs(result_dir, args.result_url) |
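The JSON handling in `MultiObsToEnv` can be tried without OBS access. The sketch below only plans the copies: it parses a `multi_data_url` string and returns the source URL and local target folder for each dataset. The helper name `plan_dataset_copies` and the sample URLs are hypothetical; only the `dataset_name` and `dataset_url` keys come from the script above.

```python
import json


def plan_dataset_copies(multi_data_url, data_dir):
    """Map each dataset URL to the local folder it would be copied into.

    multi_data_url is expected to be a JSON list of objects that carry
    "dataset_name" and "dataset_url" keys, as in the script above.
    """
    entries = json.loads(multi_data_url)
    base = data_dir.rstrip("/")
    return [(entry["dataset_url"], base + "/" + entry["dataset_name"])
            for entry in entries]


if __name__ == "__main__":
    # Made-up URLs, for illustration only.
    sample = json.dumps([
        {"dataset_name": "MNISTData", "dataset_url": "s3://example-bucket/a/MNISTData/"},
        {"dataset_name": "checkpoint_lenet-1_1875", "dataset_url": "s3://example-bucket/b/ckpt/"},
    ])
    for url, path in plan_dataset_copies(sample, "/cache/data"):
        print(url, "->", path)
```

Separating the planning from the copy makes the path logic testable locally; the actual transfer would still go through `mox.file.copy_parallel` as in the script.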
@@ -0,0 +1,60 @@ | |||
# Copyright 2020 Huawei Technologies Co., Ltd | |||
# | |||
# Licensed under the Apache License, Version 2.0 (the "License"); | |||
# you may not use this file except in compliance with the License. | |||
# You may obtain a copy of the License at | |||
# | |||
# http://www.apache.org/licenses/LICENSE-2.0 | |||
# | |||
# Unless required by applicable law or agreed to in writing, software | |||
# distributed under the License is distributed on an "AS IS" BASIS, | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
# ============================================================================ | |||
"""LeNet.""" | |||
import mindspore.nn as nn | |||
from mindspore.common.initializer import Normal | |||
class LeNet5(nn.Cell): | |||
    """ | |||
    LeNet network | |||
    Args: | |||
        num_class (int): Number of classes. Default: 10. | |||
        num_channel (int): Number of channels. Default: 1. | |||
        include_top (bool): Whether to append the flatten and fully connected layers. Default: True. | |||
    Returns: | |||
        Tensor, output tensor | |||
    Examples: | |||
        >>> LeNet5(num_class=10) | |||
    """ | |||
    def __init__(self, num_class=10, num_channel=1, include_top=True): | |||
        super(LeNet5, self).__init__() | |||
        self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid') | |||
        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid') | |||
        self.relu = nn.ReLU() | |||
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2) | |||
        self.include_top = include_top | |||
        if self.include_top: | |||
            self.flatten = nn.Flatten() | |||
            self.fc1 = nn.Dense(16 * 5 * 5, 120, weight_init=Normal(0.02)) | |||
            self.fc2 = nn.Dense(120, 84, weight_init=Normal(0.02)) | |||
            self.fc3 = nn.Dense(84, num_class, weight_init=Normal(0.02)) | |||
    def construct(self, x): | |||
        x = self.conv1(x) | |||
        x = self.relu(x) | |||
        x = self.max_pool2d(x) | |||
        x = self.conv2(x) | |||
        x = self.relu(x) | |||
        x = self.max_pool2d(x) | |||
        if not self.include_top: | |||
            return x | |||
        x = self.flatten(x) | |||
        x = self.relu(self.fc1(x)) | |||
        x = self.relu(self.fc2(x)) | |||
        x = self.fc3(x) | |||
        return x |
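The `16 * 5 * 5` input size of `fc1` follows from the spatial sizes produced by the two convolution and pooling stages. The sketch below reproduces that arithmetic in plain Python, assuming unpadded ('valid') 5x5 convolutions with stride 1 and 2x2 pooling with stride 2 and no padding, as configured above. The helper names are hypothetical.

```python
def conv_valid(size, kernel, stride=1):
    """Output size of an unpadded ('valid') convolution along one dimension."""
    return (size - kernel) // stride + 1


def pool(size, kernel=2, stride=2):
    """Output size of an unpadded pooling layer along one dimension."""
    return (size - kernel) // stride + 1


def lenet5_feature_size(input_size=32, channels=16):
    """Number of features entering the first fully connected layer."""
    size = conv_valid(input_size, 5)  # conv1: 32 -> 28
    size = pool(size)                 # pool:  28 -> 14
    size = conv_valid(size, 5)        # conv2: 14 -> 10
    size = pool(size)                 # pool:  10 -> 5
    return channels * size * size


if __name__ == "__main__":
    print(lenet5_feature_size(32))  # matches 16 * 5 * 5
    print(lenet5_feature_size(28))  # a raw 28x28 image would not match fc1
```

This is also why the dataset pipeline resizes MNIST images to 32x32: a 28x28 input would yield 16 * 4 * 4 features, which does not match `fc1`.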
@@ -0,0 +1,201 @@ | |||
""" | |||
######################## single-dataset train lenet example ######################## | |||
This example is a single-dataset training tutorial. If it is a multi-dataset, please refer to the multi-dataset training | |||
tutorial train_for_multidataset.py. This example cannot be used for multi-datasets! | |||
######################## Instructions for using the training environment ######################## | |||
The image of the debugging environment and the image of the training environment are two different images, | |||
and the working local directories are different. In the training task, you need to pay attention to the following points. | |||
1、The structure of the dataset uploaded for single dataset training in this example | |||
MNISTData.zip | |||
├── test | |||
└── train | |||
2、Single dataset training requires predefined functions | |||
(1)Copy single dataset from obs to training image | |||
function ObsToEnv(obs_data_url, data_dir) | |||
(2)Copy the output to obs | |||
function EnvToObs(train_dir, obs_train_url) | |||
(3)Download the input from Qizhi and initialize the context | |||
function DownloadFromQizhi(obs_data_url, data_dir) | |||
(4)Upload the output to Qizhi | |||
function UploadToQizhi(train_dir, obs_train_url) | |||
3、Three parameters need to be defined | |||
--data_url is the dataset you selected on the Qizhi platform | |||
--data_url, --train_url and --device_target must all be defined in a single-dataset task, | |||
otherwise an error will be reported. | |||
You do not need to add these parameters to the running parameters on the Qizhi platform: | |||
the platform passes them in the background, so you only need to define them in your code. | |||
4、How the dataset is used | |||
A single-dataset task receives the dataset through data_url and copies it to data_dir (i.e. '/cache/data'), | |||
which is the path used to read the dataset inside the image. | |||
For details, please refer to the following sample code. | |||
""" | |||
import os | |||
import argparse | |||
import moxing as mox | |||
from config import mnist_cfg as cfg | |||
from dataset import create_dataset | |||
from dataset_distributed import create_dataset_parallel | |||
from lenet import LeNet5 | |||
import mindspore.nn as nn | |||
from mindspore import context | |||
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor, TimeMonitor | |||
from mindspore.train import Model | |||
from mindspore.nn.metrics import Accuracy | |||
from mindspore.context import ParallelMode | |||
from mindspore.communication.management import init, get_rank | |||
import mindspore.ops as ops | |||
import time | |||
### Copy single dataset from obs to training image### | |||
def ObsToEnv(obs_data_url, data_dir): | |||
try: | |||
mox.file.copy_parallel(obs_data_url, data_dir) | |||
print("Successfully Download {} to {}".format(obs_data_url, data_dir)) | |||
except Exception as e: | |||
print('moxing download {} to {} failed: '.format(obs_data_url, data_dir) + str(e)) | |||
#Create a flag file to signal that the data has been copied from obs. | |||
#During multi-card training the other cards wait for this file instead of copying the dataset again. | |||
f = open("/cache/download_input.txt", 'w') | |||
f.close() | |||
try: | |||
if os.path.exists("/cache/download_input.txt"): | |||
print("download_input succeed") | |||
except Exception as e: | |||
print("download_input failed") | |||
return | |||
### Copy the output to obs### | |||
def EnvToObs(train_dir, obs_train_url): | |||
try: | |||
mox.file.copy_parallel(train_dir, obs_train_url) | |||
print("Successfully Upload {} to {}".format(train_dir,obs_train_url)) | |||
except Exception as e: | |||
print('moxing upload {} to {} failed: '.format(train_dir,obs_train_url) + str(e)) | |||
return | |||
def DownloadFromQizhi(obs_data_url, data_dir): | |||
device_num = int(os.getenv('RANK_SIZE')) | |||
if device_num == 1: | |||
ObsToEnv(obs_data_url,data_dir) | |||
context.set_context(mode=context.GRAPH_MODE,device_target=args.device_target) | |||
if device_num > 1: | |||
# set device_id and init for multi-card training | |||
context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target, device_id=int(os.getenv('ASCEND_DEVICE_ID'))) | |||
context.reset_auto_parallel_context() | |||
context.set_auto_parallel_context(device_num = device_num, parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True, parameter_broadcast=True) | |||
init() | |||
#The obs data only needs to be copied once per node, so only card 0 of each 8-card node copies it | |||
local_rank=int(os.getenv('RANK_ID')) | |||
if local_rank%8==0: | |||
ObsToEnv(obs_data_url,data_dir) | |||
#If the flag file does not exist, the copy has not finished yet, | |||
#so wait for card 0 to finish copying the data | |||
while not os.path.exists("/cache/download_input.txt"): | |||
time.sleep(1) | |||
return | |||
def UploadToQizhi(train_dir, obs_train_url): | |||
device_num = int(os.getenv('RANK_SIZE')) | |||
local_rank=int(os.getenv('RANK_ID')) | |||
if device_num == 1: | |||
EnvToObs(train_dir, obs_train_url) | |||
if device_num > 1: | |||
if local_rank%8==0: | |||
EnvToObs(train_dir, obs_train_url) | |||
return | |||
### --data_url,--train_url,--device_target,These 3 parameters must be defined first in a single dataset, | |||
### otherwise an error will be reported. | |||
###There is no need to add these parameters to the running parameters of the Qizhi platform, | |||
###because they are predefined in the background, you only need to define them in your code. | |||
parser = argparse.ArgumentParser(description='MindSpore Lenet Example') | |||
parser.add_argument('--data_url', | |||
help='path to training/inference dataset folder', | |||
default= '/cache/data/') | |||
parser.add_argument('--train_url', | |||
help='output folder to save/load', | |||
default= '/cache/output/') | |||
parser.add_argument( | |||
'--device_target', | |||
type=str, | |||
default="Ascend", | |||
choices=['Ascend', 'CPU'], | |||
help='device where the code will be implemented (default: Ascend),if to use the CPU on the Qizhi platform:device_target=CPU') | |||
parser.add_argument('--epoch_size', | |||
type=int, | |||
default=5, | |||
help='Training epochs.') | |||
if __name__ == "__main__": | |||
args = parser.parse_args() | |||
data_dir = '/cache/data' | |||
train_dir = '/cache/output' | |||
if not os.path.exists(data_dir): | |||
os.makedirs(data_dir) | |||
if not os.path.exists(train_dir): | |||
os.makedirs(train_dir) | |||
###Initialize and copy data to training image | |||
DownloadFromQizhi(args.data_url, data_dir) | |||
###The dataset path is used here:data_dir +"/train" | |||
device_num = int(os.getenv('RANK_SIZE')) | |||
if device_num == 1: | |||
ds_train = create_dataset(os.path.join(data_dir, "train"), cfg.batch_size) | |||
if device_num > 1: | |||
ds_train = create_dataset_parallel(os.path.join(data_dir, "train"), cfg.batch_size) | |||
if ds_train.get_dataset_size() == 0: | |||
raise ValueError("Please check dataset size > 0 and batch_size <= dataset size") | |||
network = LeNet5(cfg.num_classes) | |||
net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean") | |||
net_opt = nn.Momentum(network.trainable_params(), cfg.lr, cfg.momentum) | |||
time_cb = TimeMonitor(data_size=ds_train.get_dataset_size()) | |||
if args.device_target != "Ascend": | |||
model = Model(network, | |||
net_loss, | |||
net_opt, | |||
metrics={"accuracy": Accuracy()}) | |||
else: | |||
model = Model(network, | |||
net_loss, | |||
net_opt, | |||
metrics={"accuracy": Accuracy()}, | |||
amp_level="O2") | |||
config_ck = CheckpointConfig( | |||
save_checkpoint_steps=cfg.save_checkpoint_steps, | |||
keep_checkpoint_max=cfg.keep_checkpoint_max) | |||
#Note that this method saves the model file on each card. You need to specify the save path on each card. | |||
# In this example, get_rank() is added to distinguish different paths. | |||
if device_num == 1: | |||
outputDirectory = train_dir + "/" | |||
if device_num > 1: | |||
outputDirectory = train_dir + "/" + str(get_rank()) + "/" | |||
ckpoint_cb = ModelCheckpoint(prefix="checkpoint_lenet", | |||
directory=outputDirectory, | |||
config=config_ck) | |||
print("============== Starting Training ==============") | |||
epoch_size = cfg['epoch_size'] | |||
if (args.epoch_size): | |||
epoch_size = args.epoch_size | |||
print('epoch_size is: ', epoch_size) | |||
model.train(epoch_size, | |||
ds_train, | |||
callbacks=[time_cb, ckpoint_cb, | |||
LossMonitor()]) | |||
###Copy the trained output data from the local running environment back to obs, | |||
###and download it in the training task corresponding to the Qizhi platform | |||
UploadToQizhi(train_dir,args.train_url) |
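The multi-card download in `DownloadFromQizhi` relies on one card per node copying the data while the others poll for a flag file. A minimal, MindSpore-free sketch of that pattern follows. The helper names `is_node_leader` and `wait_for_flag` are hypothetical, and the `timeout` argument is an addition for illustration: the script above waits indefinitely.

```python
import os
import tempfile
import time


def is_node_leader(rank_id, cards_per_node=8):
    """True for the first card of each node (the script assumes 8 cards per node)."""
    return rank_id % cards_per_node == 0


def wait_for_flag(flag_path, poll_interval=1.0, timeout=None):
    """Poll until flag_path exists; return False if the timeout expires first."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while not os.path.exists(flag_path):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        flag = os.path.join(tmp, "download_input.txt")
        if is_node_leader(0):
            # The leader would copy the dataset here, then create the flag.
            open(flag, "w").close()
        print(wait_for_flag(flag, poll_interval=0.01, timeout=1.0))
```

A bounded wait is one possible design choice: it lets a non-leader card fail with a clear message if the leader's copy never completes, instead of hanging.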
@@ -0,0 +1,99 @@ | |||
""" | |||
######################## train lenet dataparallel example ######################## | |||
train lenet and get network model files(.ckpt) | |||
Training on the intelligent computing network (C2Net) currently supports single-dataset training only and does not | |||
require the obs copy step. You only need to define the two paths below and use them directly: | |||
train_dir = '/cache/output' #The location of the output | |||
data_dir = '/cache/dataset' #The location of the dataset | |||
""" | |||
import os | |||
import argparse | |||
from dataset import create_dataset | |||
from dataset_distributed import create_dataset_parallel | |||
import moxing as mox | |||
from config import mnist_cfg as cfg | |||
from lenet import LeNet5 | |||
import mindspore.nn as nn | |||
from mindspore import context | |||
from mindspore.common import set_seed | |||
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor, TimeMonitor | |||
from mindspore.train import Model | |||
from mindspore.nn.metrics import Accuracy | |||
from mindspore.context import ParallelMode | |||
from mindspore.communication.management import init, get_rank, get_group_size | |||
import mindspore.ops as ops | |||
parser = argparse.ArgumentParser(description='MindSpore Lenet Example') | |||
parser.add_argument( | |||
'--device_target', | |||
type=str, | |||
default="Ascend", | |||
choices=['Ascend', 'CPU'], | |||
help='device where the code will be implemented (default: Ascend),if to use the CPU on the Qizhi platform:device_target=CPU') | |||
parser.add_argument('--epoch_size', | |||
type=int, | |||
default=5, | |||
help='Training epochs.') | |||
if __name__ == "__main__": | |||
args = parser.parse_args() | |||
###define two parameters and then call it directly### | |||
data_dir = '/cache/dataset' | |||
train_dir = '/cache/output' | |||
device_num = int(os.getenv('RANK_SIZE')) | |||
if device_num == 1: | |||
context.set_context(mode=context.GRAPH_MODE,device_target=args.device_target) | |||
ds_train = create_dataset(os.path.join(data_dir, "train"), cfg.batch_size) | |||
if device_num > 1: | |||
# set device_id and init for multi-card training | |||
context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target, device_id=int(os.getenv('ASCEND_DEVICE_ID'))) | |||
context.reset_auto_parallel_context() | |||
context.set_auto_parallel_context(device_num = device_num, parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True, parameter_broadcast=True) | |||
init() | |||
ds_train = create_dataset_parallel(os.path.join(data_dir, "train"), cfg.batch_size) | |||
if ds_train.get_dataset_size() == 0: | |||
raise ValueError( | |||
"Please check dataset size > 0 and batch_size <= dataset size") | |||
network = LeNet5(cfg.num_classes) | |||
net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean") | |||
net_opt = nn.Momentum(network.trainable_params(), cfg.lr, cfg.momentum) | |||
time_cb = TimeMonitor(data_size=ds_train.get_dataset_size()) | |||
if args.device_target != "Ascend": | |||
model = Model(network, | |||
net_loss, | |||
net_opt, | |||
metrics={"accuracy": Accuracy()}) | |||
else: | |||
model = Model(network, | |||
net_loss, | |||
net_opt, | |||
metrics={"accuracy": Accuracy()}, | |||
amp_level="O2") | |||
config_ck = CheckpointConfig( | |||
save_checkpoint_steps=cfg.save_checkpoint_steps, | |||
keep_checkpoint_max=cfg.keep_checkpoint_max) | |||
#Note that this method saves the model file on each card. You need to specify the save path on each card. | |||
# In the example, get_rank() is added to distinguish different paths. | |||
if device_num == 1: | |||
outputDirectory = train_dir + "/" | |||
if device_num > 1: | |||
outputDirectory = train_dir + "/" + str(get_rank()) + "/" | |||
ckpoint_cb = ModelCheckpoint(prefix="checkpoint_lenet", | |||
directory=outputDirectory, | |||
config=config_ck) | |||
print("============== Starting Training ==============") | |||
epoch_size = cfg['epoch_size'] | |||
if (args.epoch_size): | |||
epoch_size = args.epoch_size | |||
print('epoch_size is: ', epoch_size) | |||
model.train(epoch_size,ds_train, callbacks=[time_cb, ckpoint_cb, LossMonitor()], dataset_sink_mode=False) | |||
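The training scripts read `RANK_SIZE` with `int(os.getenv('RANK_SIZE'))`, which raises a `TypeError` when the variable is not set, for example during local debugging. The sketch below shows one hedged way to read it with a default and to derive the per-card checkpoint directory in the same way as the scripts. Both helper names are hypothetical, and falling back to a single device is an assumption rather than platform behaviour.

```python
import os


def env_int(name, default):
    """Read an integer environment variable, falling back to default if unset or empty."""
    value = os.environ.get(name)
    if value is None or value.strip() == "":
        return default
    return int(value)


def checkpoint_directory(train_dir, device_num, rank):
    """Per-card output folder: train_dir/ on one card, train_dir/<rank>/ otherwise."""
    if device_num > 1:
        return train_dir + "/" + str(rank) + "/"
    return train_dir + "/"


if __name__ == "__main__":
    device_num = env_int("RANK_SIZE", 1)  # assumed default: single device
    print(device_num, checkpoint_directory("/cache/output", device_num, 0))
```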
@@ -0,0 +1,220 @@ | |||
""" | |||
######################## multi-dataset train lenet example ######################## | |||
This example is a multi-dataset training tutorial. If it is a single dataset, please refer to the single dataset | |||
training tutorial train.py. This example cannot be used for a single dataset! | |||
""" | |||
""" | |||
######################## Instructions for using the training environment ######################## | |||
1、(1)The structure of the dataset uploaded for multi-dataset training in this example | |||
MNISTData.zip | |||
├── test | |||
└── train | |||
checkpoint_lenet-1_1875.zip | |||
├── checkpoint_lenet-1_1875.ckpt | |||
(2)The dataset structure in the training image for multiple datasets in this example | |||
workroot | |||
├── MNISTData | |||
| ├── test | |||
| └── train | |||
└── checkpoint_lenet-1_1875 | |||
├── checkpoint_lenet-1_1875.ckpt | |||
2、Multi-dataset training requires predefined functions | |||
(1)Copy multi-dataset from obs to training image | |||
function MultiObsToEnv(multi_data_url, data_dir) | |||
(2)Copy the output to obs | |||
function EnvToObs(train_dir, obs_train_url) | |||
(3)Download the input from Qizhi and initialize the context | |||
function DownloadFromQizhi(multi_data_url, data_dir) | |||
(4)Upload the output to Qizhi | |||
function UploadToQizhi(train_dir, obs_train_url) | |||
3、Four parameters need to be defined | |||
--data_url is the first dataset you selected on the Qizhi platform | |||
--multi_data_url describes all the datasets you selected on the Qizhi platform (passed as JSON) | |||
--data_url, --multi_data_url, --train_url and --device_target must all be defined in a multi-dataset task, | |||
otherwise an error will be reported. | |||
You do not need to add these parameters to the running parameters on the Qizhi platform: | |||
the platform passes them in the background, so you only need to define them in your code. | |||
4、How the dataset is used | |||
Multi-dataset tasks use multi_data_url as input, and data_dir + dataset name + file or folder name in the dataset | |||
is the path of the dataset in the training image. | |||
For example, the path of the train folder in the MNISTData dataset in this example is | |||
data_dir + "/MNISTData" + "/train" | |||
For details, please refer to the following sample code. | |||
""" | |||
import os | |||
import argparse | |||
import moxing as mox | |||
from config import mnist_cfg as cfg | |||
from dataset import create_dataset | |||
from dataset_distributed import create_dataset_parallel | |||
from lenet import LeNet5 | |||
import json | |||
import mindspore.nn as nn | |||
from mindspore import context | |||
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor, TimeMonitor | |||
from mindspore.train import Model | |||
from mindspore.nn.metrics import Accuracy | |||
from mindspore import load_checkpoint, load_param_into_net | |||
from mindspore.context import ParallelMode | |||
from mindspore.communication.management import init, get_rank | |||
import time | |||
### Copy multiple datasets from obs to training image ###
def MultiObsToEnv(multi_data_url, data_dir):
    # --multi_data_url is a JSON string, so it must be parsed before use
    multi_data_json = json.loads(multi_data_url)
    for dataset in multi_data_json:
        path = data_dir + "/" + dataset["dataset_name"]
        if not os.path.exists(path):
            os.makedirs(path)
        try:
            mox.file.copy_parallel(dataset["dataset_url"], path)
            print("Successfully Download {} to {}".format(dataset["dataset_url"], path))
        except Exception as e:
            print('moxing download {} to {} failed: '.format(
                dataset["dataset_url"], path) + str(e))
    # Create a flag file to mark that the data has been copied from obs to the training image.
    # During multi-card training the other cards wait for this file instead of copying the dataset again.
    with open("/cache/download_input.txt", 'w'):
        pass
    if os.path.exists("/cache/download_input.txt"):
        print("download_input succeed")
    else:
        print("download_input failed")
    return

### Copy the output model to obs ###
def EnvToObs(train_dir, obs_train_url):
    try:
        mox.file.copy_parallel(train_dir, obs_train_url)
        print("Successfully Upload {} to {}".format(train_dir,
                                                    obs_train_url))
    except Exception as e:
        print('moxing upload {} to {} failed: '.format(train_dir,
                                                       obs_train_url) + str(e))
    return

def DownloadFromQizhi(multi_data_url, data_dir):
    # Note: this function reads the module-level `args` parsed in the main block.
    device_num = int(os.getenv('RANK_SIZE'))
    if device_num == 1:
        MultiObsToEnv(multi_data_url, data_dir)
        context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target)
    if device_num > 1:
        # set device_id and init for multi-card training
        context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target,
                            device_id=int(os.getenv('ASCEND_DEVICE_ID')))
        context.reset_auto_parallel_context()
        context.set_auto_parallel_context(device_num=device_num, parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True, parameter_broadcast=True)
        init()
        # The obs data only needs to be copied once per group of 8 cards,
        # so only the cards whose rank is a multiple of 8 copy the data.
        local_rank = int(os.getenv('RANK_ID'))
        if local_rank % 8 == 0:
            MultiObsToEnv(multi_data_url, data_dir)
        # If the flag file does not exist, the copy has not finished yet,
        # so wait for the copying card to finish.
        while not os.path.exists("/cache/download_input.txt"):
            time.sleep(1)
    return

def UploadToQizhi(train_dir, obs_train_url):
    device_num = int(os.getenv('RANK_SIZE'))
    local_rank = int(os.getenv('RANK_ID'))
    if device_num == 1:
        EnvToObs(train_dir, obs_train_url)
    if device_num > 1:
        if local_rank % 8 == 0:
            EnvToObs(train_dir, obs_train_url)
    return

parser = argparse.ArgumentParser(description='MindSpore Lenet Example')
### --data_url, --multi_data_url, --train_url and --device_target must all be defined in a multi-dataset task,
### otherwise an error will be reported.
### There is no need to add these parameters to the running parameters of the Qizhi platform.
### They are predefined in the background, so you only need to define them in your code.
parser.add_argument('--data_url',
                    help='path to training/inference dataset folder',
                    default='/cache/data1/')
parser.add_argument('--multi_data_url',
                    help='path to multi dataset',
                    default='/cache/data/')
parser.add_argument('--train_url',
                    help='model folder to save/load',
                    default='/cache/output/')
parser.add_argument(
    '--device_target',
    type=str,
    default="Ascend",
    choices=['Ascend', 'CPU'],
    help='device where the code will run (default: Ascend); to use the CPU on the Qizhi platform, set device_target=CPU')
parser.add_argument('--epoch_size',
                    type=int,
                    default=5,
                    help='Training epochs.')

if __name__ == "__main__":
    args = parser.parse_args()
    data_dir = '/cache/data'
    train_dir = '/cache/output'
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    if not os.path.exists(train_dir):
        os.makedirs(train_dir)
    ### Initialize and copy data to training image
    DownloadFromQizhi(args.multi_data_url, data_dir)
    ### The dataset path is used here: data_dir + "/MNISTData" + "/train"
    device_num = int(os.getenv('RANK_SIZE'))
    if device_num == 1:
        ds_train = create_dataset(os.path.join(data_dir + "/MNISTData", "train"), cfg.batch_size)
    if device_num > 1:
        ds_train = create_dataset_parallel(os.path.join(data_dir + "/MNISTData", "train"), cfg.batch_size)
    if ds_train.get_dataset_size() == 0:
        raise ValueError(
            "Please check dataset size > 0 and batch_size <= dataset size")
    network = LeNet5(cfg.num_classes)
    net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
    net_opt = nn.Momentum(network.trainable_params(), cfg.lr, cfg.momentum)
    time_cb = TimeMonitor(data_size=ds_train.get_dataset_size())
    ### The dataset path is used here: data_dir + "/checkpoint_lenet-1_1875" + "/checkpoint_lenet-1_1875.ckpt"
    load_param_into_net(network, load_checkpoint(os.path.join(data_dir + "/checkpoint_lenet-1_1875",
                                                              "checkpoint_lenet-1_1875.ckpt")))
    if args.device_target != "Ascend":
        model = Model(network, net_loss, net_opt, metrics={"accuracy": Accuracy()})
    else:
        model = Model(network, net_loss, net_opt, metrics={"accuracy": Accuracy()}, amp_level="O2")
    config_ck = CheckpointConfig(save_checkpoint_steps=cfg.save_checkpoint_steps,
                                 keep_checkpoint_max=cfg.keep_checkpoint_max)
    # Note that this method saves the model file on each card, so each card needs its own save path.
    # In this example, get_rank() is added to the path to keep the cards apart.
    if device_num == 1:
        outputDirectory = train_dir + "/"
    if device_num > 1:
        outputDirectory = train_dir + "/" + str(get_rank()) + "/"
    ckpoint_cb = ModelCheckpoint(prefix="checkpoint_lenet",
                                 directory=outputDirectory,
                                 config=config_ck)
    print("============== Starting Training ==============")
    epoch_size = cfg['epoch_size']
    if args.epoch_size:
        epoch_size = args.epoch_size
    print('epoch_size is: ', epoch_size)
    model.train(epoch_size,
                ds_train,
                callbacks=[time_cb, ckpoint_cb,
                           LossMonitor()])
    ### Copy the trained output from the local running environment back to obs,
    ### so that it can be downloaded from the corresponding training task on the Qizhi platform
    UploadToQizhi(train_dir, args.train_url)