Run on Clusters#

Basic Steps#

Mars can be deployed on a cluster. First, you need to run

pip install pymars

on every node in the cluster. This will install dependencies needed for distributed execution on your cluster. After that, you may select a node as supervisor which also integrated web service, leaving other nodes as workers.

The supervisor can be started with the following command:

mars-supervisor -H <host_name> -p <supervisor_port> -w <web_port>

Web service will be started as well.

Workers can be started with the following command:

mars-worker -H <host_name> -p <worker_port> -s <supervisor_ip>:<supervisor_port>

After all Mars processes are started, you can open a Python console and run

import mars
import mars.tensor as mt
import mars.dataframe as md
# create a default session that connects to the cluster
mars.new_session('http://<web_ip>:<web_port>')
a = mt.random.rand(2000, 2000, chunk_size=200)
b = mt.inner(a, a)
b.execute()  # submit tensor to cluster
df = md.DataFrame(a).sum()
df.execute()  # submit DataFrame to cluster

You can open a web browser and type http://<web_ip>:<web_port> to open Mars UI to look up resource usage of workers and execution progress of the task submitted just now.

Using Command Lines#

When running Mars with command line, you can specify arguments to control the behavior of Mars processes. All Mars services have common arguments listed below.

Argument	Description
`-H`	Service IP binding, `0.0.0.0` by default
`-p`	Port of the service. If absent, a randomized port will be used
`-f`	Path to service configuration file. Absent when use default configuration.
`-s`	List of supervisor endpoints, separated by commas. Useful for workers and webs to spot supervisors, or when you want to run more than one supervisor
`--log-level`	Log level, can be `debug`, `info`, `warning`, `error`
`--log-format`	Log format, can be Python logging format
`--log-conf`	Python logging configuration file, `logging.conf` by default
`--use-uvloop`	Whether to use `uvloop` to accelerate, `auto` by default

Extra arguments for supervisors are listed below.

Argument	Description
`-w`	Port of web service in supervisor

Extra arguments for workers are listed below. Details about memory tuning can be found at the next section.

Argument	Description
`--n-cpu`	Number of CPU cores to use. If absent, the value will be the available number of cores
`--n-io-process`	Number of IO processes for network operations. 1 by default
`--cuda-devices`	Index of CUDA devices to use. If not specified, all devices will be used. Specifying an empty string will ignore all devices

For instance, if you want to start a Mars cluster with two supervisors and two workers, you can run commands below (memory and CPU tunings are omitted):

On Supervisor 1 (192.168.1.10):

mars-supervisor -H 192.168.1.10 -p 7001 -w 7005 -s 192.168.1.10:7001,192.168.1.11:7002

On Supervisor 2 (192.168.1.11):

mars-supervisor -H 192.168.1.11 -p 7002 -s 192.168.1.10:7001,192.168.1.11:7002

On Worker 1 (192.168.1.20):

mars-worker -H 192.168.1.20 -p 7003 -s 192.168.1.10:7001,192.168.1.11:7002

On Worker 2 (192.168.1.21):

mars-worker -H 192.168.1.21 -p 7004 -s 192.168.1.10:7001,192.168.1.11:7002