r"""
Module ``torch.distributed.launch``.

``torch.distributed.launch`` is a module that spawns multiple distributed
training processes on each of the training nodes.

.. warning::

    This module is deprecated in favor of :ref:`torchrun <launcher-api>`
    and will be removed in a future release.

The utility can be used for single-node distributed training, in which one or
more processes per node will be spawned. The utility can be used for either
CPU training or GPU training. If the utility is used for GPU training,
each distributed process will operate on a single GPU, which can
improve single-node training performance. It can also be used for
multi-node distributed training, by spawning multiple processes on each node,
to improve multi-node training performance as well.
This is especially beneficial for systems with multiple InfiniBand
interfaces that have direct-GPU support, since all of them can be utilized for
aggregated communication bandwidth.

In both single-node and multi-node distributed training, this utility
launches the given number of processes per node
(``--nproc-per-node``). If used for GPU training, this number
(``nproc_per_node``) must be less than or equal to the number of GPUs on the
current system, and each process will operate on a single GPU from *GPU 0 to
GPU (nproc_per_node - 1)*.

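The process layout described above can be sketched as follows. This is a
minimal illustration, not code from this module: ``enumerate_processes`` and
its ``num_gpus`` parameter are hypothetical, and the sketch assumes that every
node runs the same number of processes and that global ranks are assigned
contiguously by node rank.

```python
def enumerate_processes(nnodes, nproc_per_node, num_gpus=None):
    # Hypothetical helper for illustration only; not part of torch.distributed.
    # For GPU training, the per-node process count must not exceed the GPU count.
    if num_gpus is not None and nproc_per_node > num_gpus:
        raise ValueError(
            f"nproc_per_node={nproc_per_node} exceeds the {num_gpus} GPUs available"
        )
    layout = []
    for node_rank in range(nnodes):
        for local_rank in range(nproc_per_node):
            layout.append(
                {
                    "node_rank": node_rank,
                    # The local rank doubles as the GPU index on that node.
                    "local_rank": local_rank,
                    # Assumption: contiguous global ranks, ordered by node rank.
                    "global_rank": node_rank * nproc_per_node + local_rank,
                }
            )
    return layout


layout = enumerate_processes(nnodes=2, nproc_per_node=4, num_gpus=4)
for proc in layout:
    print(proc)
```

Under these assumptions, local ranks repeat on every node while global ranks
do not, which is why the local rank is suitable for choosing a GPU but not for
identifying a process across the whole job.
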
**How to use this module:**

1. Single-Node multi-process distributed training

::

    python -m torch.distributed.launch --nproc-per-node=NUM_GPUS_YOU_HAVE
               YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
               arguments of your training script)

2. Multi-Node multi-process distributed training: (e.g. two nodes)


Node 1: *(IP: 192.168.1.1, and has a free port: 1234)*

::

    python -m torch.distributed.launch --nproc-per-node=NUM_GPUS_YOU_HAVE
               --nnodes=2 --node-rank=0 --master-addr="192.168.1.1"
               --master-port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
               and all other arguments of your training script)

Node 2:

::

    python -m torch.distributed.launch --nproc-per-node=NUM_GPUS_YOU_HAVE
               --nnodes=2 --node-rank=1 --master-addr="192.168.1.1"
               --master-port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
               and all other arguments of your training script)

3. To look up what optional arguments this module offers:

::

    python -m torch.distributed.launch --help


**Important Notices:**

1. This utility and multi-process distributed (single-node or
multi-node) GPU training currently achieve the best performance only with
the NCCL distributed backend. Thus the NCCL backend is the recommended backend
for GPU training.

2. In your training program, you must parse the command-line argument
``--local-rank=LOCAL_PROCESS_RANK``, which will be provided by this module.
If your training program uses GPUs, you should ensure that your code only
runs on the GPU device of LOCAL_PROCESS_RANK. This can be done by:

Parsing the local_rank argument

::

    >>> # xdoctest: +SKIP
    >>> import argparse
    >>> parser = argparse.ArgumentParser()
    >>> parser.add_argument("--local-rank", type=int)
    >>> args = parser.parse_args()

Set your device to local rank using either

::

    >>> torch.cuda.set_device(args.local_rank)  # before your code runs

or

::

    >>> with torch.cuda.device(args.local_rank):
    >>>    # your code to run
    >>>    ...

3. In your training program, you should call the following function
at the beginning to start the distributed backend. It is strongly recommended
that ``init_method=env://``. Other init methods (e.g. ``tcp://``) may work,
but ``env://`` is the one that is officially supported by this module.

::

    >>> torch.distributed.init_process_group(backend='YOUR BACKEND',
    >>>                                      init_method='env://')

4. In your training program, you can either use regular distributed functions
or use the :func:`torch.nn.parallel.DistributedDataParallel` module. If your
training program uses GPUs for training and you would like to use the
:func:`torch.nn.parallel.DistributedDataParallel` module,
here is how to configure it.

::

    >>> model = torch.nn.parallel.DistributedDataParallel(model,
    >>>                                                   device_ids=[args.local_rank],
    >>>                                                   output_device=args.local_rank)

Please ensure that the ``device_ids`` argument is set to the only GPU device id
that your code will be operating on. This is generally the local rank of the
process. In other words, ``device_ids`` needs to be ``[args.local_rank]``,
and ``output_device`` needs to be ``args.local_rank`` in order to use this
utility.

5. Another way to pass ``local_rank`` to the subprocesses is via the
environment variable ``LOCAL_RANK``. This behavior is enabled when you launch
the script with the ``--use-env`` flag (a switch that takes no value). You must
adjust the subprocess example above to replace ``args.local_rank`` with
``int(os.environ['LOCAL_RANK'])``, since environment values are strings; the
launcher will not pass ``--local-rank`` when you specify this flag.

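The two ways of receiving the local rank described in notices 2 and 5 can be
handled in one place. The following is a minimal sketch, not part of this
module: ``resolve_local_rank`` is a hypothetical helper, and the fallback to
``0`` when neither source is present is an assumption made so that the script
can also run without a launcher.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    # Hypothetical helper; not provided by torch.distributed.launch.
    # argv=None makes argparse read sys.argv[1:]; environ=None means os.environ.
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser()
    # Accept both the hyphenated and the underscored spelling of the flag.
    parser.add_argument("--local-rank", "--local_rank", type=int, default=None)
    # parse_known_args leaves the training script's own arguments untouched.
    args, _unknown = parser.parse_known_args(argv)

    if args.local_rank is not None:
        # Launched without --use-env: the rank arrives as an argument.
        return args.local_rank
    # Launched with --use-env (or with torchrun): the rank is in LOCAL_RANK.
    # Assumption: default to 0 when the variable is absent.
    return int(environ.get("LOCAL_RANK", 0))


print(resolve_local_rank(["--local-rank=3"], {}))
print(resolve_local_rank([], {"LOCAL_RANK": "5"}))
```

With a helper like this, the device index passed to ``torch.cuda.set_device``
or to ``device_ids`` can come from one call regardless of whether
``--use-env`` was given.
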
.. warning::

    ``local_rank`` is NOT globally unique: it is only unique per process
    on a machine. Thus, don't use it to decide if you should, e.g.,
    write to a networked filesystem. See
    https://github.com/pytorch/pytorch/issues/12042 for an example of
    how things can go wrong if you don't do this correctly.

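One way to act on this warning is to base job-wide decisions on the global
rank instead. The sketch below is illustrative only: the helper names are
hypothetical, and it assumes the launcher exports ``RANK`` (the global rank)
alongside ``LOCAL_RANK``, as it does for ``env://`` initialization; both
default to ``0`` here so the code also runs outside a launcher.

```python
import os


def _read_rank(name, environ=None):
    # Hypothetical helper: read an integer rank from the environment.
    environ = os.environ if environ is None else environ
    return int(environ.get(name, 0))


def should_write_shared_file(environ=None):
    # Exactly one process in the whole job has global rank 0, so this is
    # safe for storage that every node can see (e.g. a networked filesystem).
    return _read_rank("RANK", environ) == 0


def should_write_node_local_file(environ=None):
    # One process per machine has local rank 0, so this is only appropriate
    # for storage that is private to each node.
    return _read_rank("LOCAL_RANK", environ) == 0


# First process on a second node, assuming 4 processes per node.
second_node_env = {"RANK": "4", "LOCAL_RANK": "0"}
print(should_write_node_local_file(second_node_env))
print(should_write_shared_file(second_node_env))
```

In the example environment the process has local rank 0 but global rank 4, so
gating a shared-filesystem write on the local rank would let one process per
node write the same file.
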

"""

import logging
import warnings

from torch.distributed.run import get_args_parser, run


logger = logging.getLogger(__name__)


def parse_args(args):
    parser = get_args_parser()
    parser.add_argument(
        "--use-env",
        "--use_env",
        default=False,
        action="store_true",
        help="Use environment variable to pass "
        "'local rank'. For legacy reasons, the default value is False. "
        "If set to True, the script will not pass "
        "--local-rank as argument, and will instead set LOCAL_RANK.",
    )
    return parser.parse_args(args)


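def launch(args):
    # --no-python requires --use-env: the local rank is then passed through
    # the LOCAL_RANK environment variable rather than a --local-rank argument.
    if args.no_python and not args.use_env:
        raise ValueError(
            "When using the '--no-python' flag,"
            " you must also set the '--use-env' flag."
        )
    run(args)

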
def main(args=None):
    warnings.warn(
        "The module torch.distributed.launch is deprecated\n"
        "and will be removed in future. Use torchrun.\n"
        "Note that --use-env is set by default in torchrun.\n"
        "If your script expects `--local-rank` argument to be set, please\n"
        "change it to read from `os.environ['LOCAL_RANK']` instead. See \n"
        "https://pytorch.org/docs/stable/distributed.html#launch-utility for \n"
        "further instructions\n",
        FutureWarning,
    )
    args = parse_args(args)
    launch(args)


if __name__ == "__main__":
    main()