
train.py 5.9 kB

#! /usr/bin/python
# -*- coding: utf-8 -*-
"""
tl train
========

(Alpha release - usage might change later)

The tensorlayer.cli.train module provides the ``tl train`` subcommand.
It helps the user bootstrap a TensorFlow/TensorLayer program for distributed training
using multiple GPU cards or CPUs on a single computer.

You first need to set the `CUDA_VISIBLE_DEVICES <http://acceleware.com/blog/cudavisibledevices-masking-gpus>`_
environment variable to tell ``tl train`` which GPUs are available. If CUDA_VISIBLE_DEVICES is not given,
``tl train`` will try its best to discover all available GPUs.

In distributed training, each TensorFlow program needs a TF_CONFIG environment variable that describes
the cluster. It also needs a master daemon to monitor all trainers.
``tl train`` automatically manages both tasks.
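For example, with one parameter server and two workers, the worker with index 1
receives a ``TF_CONFIG`` like the following (a sketch; ports start at
``PORT_BASE = 10000``):

.. code-block:: json

    {
        "cluster": {
            "ps": ["localhost:10000"],
            "worker": ["localhost:10001", "localhost:10002"]
        },
        "task": {"type": "worker", "index": 1}
    }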
Usage
-----

tl train [-h] [-p NUM_PSS] [-c CPU_TRAINERS] <file> [args [args ...]]

.. code-block:: bash

    # example of using GPUs 0 and 1 for training mnist
    CUDA_VISIBLE_DEVICES="0,1" tl train example/tutorial_mnist_distributed.py

    # example of using CPU trainers for inception v3
    tl train -c 16 example/tutorial_imagenet_inceptionV3_distributed.py

    # example of using GPU trainers for inception v3 with customized arguments;
    # as CUDA_VISIBLE_DEVICES is not given, tl tries to discover all available GPUs
    tl train example/tutorial_imagenet_inceptionV3_distributed.py -- --batch_size 16

Command-line Arguments
----------------------

- ``file``: python file path.
- ``NUM_PSS``: The number of parameter servers.
- ``CPU_TRAINERS``: The number of CPU trainers.
  It is recommended that ``NUM_PSS + CPU_TRAINERS <= cpu count``.
- ``args``: Any parameter after ``--`` is passed to the python program.
Notes
-----

A parallel training program requires multiple parameter servers
that help the parallel trainers exchange intermediate gradients.
The best number of parameter servers is often proportional to the
size of your model as well as the number of CPUs available.
You can control the number of parameter servers using the ``-p`` parameter.

If you have a single computer with many CPU cores, you can use the ``-c`` parameter
to enable CPU-only parallel training.
We do not support mixed GPU-CPU co-training because GPUs and
CPUs run at different speeds; using them together in training would
incur stragglers.
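As a sketch, a 16-core machine could dedicate 4 cores to parameter servers
and 8 to CPU trainers, keeping ``NUM_PSS + CPU_TRAINERS <= cpu count``:

.. code-block:: bash

    tl train -p 4 -c 8 example/tutorial_mnist_distributed.py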
"""
import argparse
import json
import multiprocessing
import os
import platform
import re
import subprocess
import sys

PORT_BASE = 10000
def _get_gpu_ids():
    # Prefer an explicit CUDA_VISIBLE_DEVICES; otherwise probe /dev for
    # nvidia device nodes (Linux/macOS only).
    if 'CUDA_VISIBLE_DEVICES' in os.environ:
        return [int(x) for x in os.environ.get('CUDA_VISIBLE_DEVICES', '').split(',')]
    if platform.system() in ['Darwin', 'Linux']:
        return [int(d.replace('nvidia', '')) for d in os.listdir('/dev') if re.match(r'^nvidia\d+$', d)]
    else:
        print('Please set CUDA_VISIBLE_DEVICES (see http://acceleware.com/blog/cudavisibledevices-masking-gpus)')
        return []


GPU_IDS = _get_gpu_ids()
def create_tf_config(cluster_spec, task_type, task_index):
    # Build the dict that is serialized into the TF_CONFIG environment variable.
    return {
        'cluster': cluster_spec,
        'task': {
            'type': task_type,
            'index': task_index
        },
    }
def create_tf_jobs(cluster_spec, prog, args):
    # Pin each GPU worker to one GPU; parameter servers and CPU trainers get none.
    gpu_assignment = dict((('worker', idx), gpu_idx) for (idx, gpu_idx) in enumerate(GPU_IDS))
    for job_type in cluster_spec:
        for task_index in range(len(cluster_spec[job_type])):
            new_env = os.environ.copy()
            new_env.update(
                {
                    'CUDA_VISIBLE_DEVICES': str(gpu_assignment.get((job_type, task_index), '')),
                    'TF_CONFIG': json.dumps(create_tf_config(cluster_spec, job_type, task_index)),
                }
            )
            yield subprocess.Popen(['python3', prog] + args, env=new_env)
def validate_arguments(args):
    if args.num_pss < 1:
        print('Value error: must have at least one parameter server.')
        exit(1)
    if not GPU_IDS:
        num_cpus = multiprocessing.cpu_count()
        if args.cpu_trainers > num_cpus:
            print('Value error: there are %s available CPUs but you are requiring %s.' % (num_cpus, args.cpu_trainers))
            exit(1)
    if not os.path.isfile(args.file):
        print('Value error: model training file does not exist')
        exit(1)
def main(args):
    validate_arguments(args)
    num_workers = len(GPU_IDS) if GPU_IDS else args.cpu_trainers
    print('Using program %s with args %s' % (args.file, ' '.join(args.args)))
    print('Using %d workers, %d parameter servers, %d GPUs.' % (num_workers, args.num_pss, len(GPU_IDS)))
    cluster_spec = {
        'ps': ['localhost:%d' % (PORT_BASE + i) for i in range(args.num_pss)],
        'worker': ['localhost:%d' % (PORT_BASE + args.num_pss + i) for i in range(num_workers)]
    }
    processes = list(create_tf_jobs(cluster_spec, args.file, args.args))
    try:
        print('Press ENTER to exit the training ...')
        sys.stdin.readline()
    except KeyboardInterrupt:  # https://docs.python.org/3/library/exceptions.html#KeyboardInterrupt
        print('Keyboard interrupt received')
    finally:
        print('stopping all subprocesses ...')
        for p in processes:
            p.kill()
        for p in processes:
            p.wait()
        print('END')
def build_arg_parser(parser):
    parser.add_argument('-p', '--pss', dest='num_pss', type=int, default=1, help='number of parameter servers')
    parser.add_argument('-c', '--cpu_trainers', dest='cpu_trainers', type=int, default=1, help='number of CPU trainers')
    parser.add_argument('file', help='model training file path')
    parser.add_argument('args', nargs='*', type=str, help='arguments to <file>')
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    build_arg_parser(parser)
    args = parser.parse_args()
    main(args)
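Each launched trainer receives its role through the TF_CONFIG environment variable set above. As a minimal, framework-agnostic sketch (the `parse_tf_config` helper is illustrative, not part of TensorLayer), a trainer could recover its role like this:

```python
import json
import os


def parse_tf_config(env=os.environ):
    """Return (cluster_spec, task_type, task_index) parsed from TF_CONFIG."""
    config = json.loads(env.get('TF_CONFIG', '{}'))
    task = config.get('task', {})
    return config.get('cluster', {}), task.get('type'), task.get('index')


# Example using the same shape that create_tf_config produces:
example = {
    'cluster': {'ps': ['localhost:10000'], 'worker': ['localhost:10001', 'localhost:10002']},
    'task': {'type': 'worker', 'index': 1},
}
cluster, task_type, task_index = parse_tf_config({'TF_CONFIG': json.dumps(example)})
print(task_type, task_index)  # worker 1
```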

TensorLayer 3.0 is a deep learning library that supports multiple deep learning frameworks as computation backends. It plans to be compatible with TensorFlow, PyTorch, MindSpore, and Paddle.