Run on Clusters

Basic Steps

Mars can be deployed on a cluster. First, you need to run

pip install pymars

on every node in the cluster. This installs the dependencies needed for distributed execution. After that, select one node to act as the supervisor, which also hosts the web service, and use the remaining nodes as workers.

The supervisor can be started with the following command:

mars-supervisor -H <host_name> -p <supervisor_port> -w <web_port>

The web service is started along with the supervisor.

Workers can be started with the following command:

mars-worker -H <host_name> -p <worker_port> -s <supervisor_ip>:<supervisor_port>

After all Mars processes are started, you can open a Python console and run

import mars
import mars.tensor as mt
import mars.dataframe as md
# create a default session that connects to the cluster
mars.new_session('http://<web_ip>:<web_port>')
a = mt.random.rand(2000, 2000, chunk_size=200)
b = mt.inner(a, a)
b.execute()  # submit tensor to cluster
df = md.DataFrame(a).sum()
df.execute()  # submit DataFrame to cluster

Open http://<web_ip>:<web_port> in a web browser to reach the Mars UI, where you can inspect the resource usage of workers and the execution progress of the tasks you just submitted.
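When the steps above are scripted, it can help to confirm that the supervisor's web port accepts connections before creating a session. The following is a minimal sketch using only the Python standard library. `wait_for_port` is a hypothetical helper, not part of Mars, and the timeout values are arbitrary assumptions. It only checks that something is listening on the port, not that the Mars web service has finished initializing.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper, not part of Mars: it only verifies that a process is
    listening on the port.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A script could call `wait_for_port` with the web IP and web port and only call `mars.new_session` once it returns `True`.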

Using Command Lines

When running Mars from the command line, you can pass arguments to control the behavior of Mars processes. All Mars services accept the common arguments listed below.

Argument	Description
`-H`	Service IP binding, `0.0.0.0` by default
`-p`	Port of the service. If absent, a randomized port will be used
`-f`	Path to the service configuration file. If absent, the default configuration is used
`-s`	List of supervisor endpoints, separated by commas. Useful for workers and web services to locate supervisors, or when you want to run more than one supervisor
`--log-level`	Log level, can be `debug`, `info`, `warning`, `error`
`--log-format`	Log format, given in Python logging format syntax
`--log-conf`	Python logging configuration file, `logging.conf` by default
`--use-uvloop`	Whether to use `uvloop` to accelerate, `auto` by default
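To illustrate the value format expected by `-s`, the sketch below splits a comma-separated endpoint string into host and port pairs. `parse_endpoints` is a hypothetical helper written for this example; how Mars itself parses the argument is not shown here and may differ.

```python
def parse_endpoints(value):
    """Split a comma-separated endpoint string into (host, port) tuples.

    Hypothetical helper for illustration, not the Mars implementation.
    """
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas or stray whitespace
        host, _, port = item.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"expected <host>:<port>, got {item!r}")
        endpoints.append((host, int(port)))
    return endpoints


print(parse_endpoints("192.168.1.10:7001,192.168.1.11:7002"))
# prints [('192.168.1.10', 7001), ('192.168.1.11', 7002)]
```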

Extra arguments for supervisors are listed below.

Argument	Description
`-w`	Port of web service in supervisor

Extra arguments for workers are listed below. Details about memory tuning can be found in the next section.

Argument	Description
`--n-cpu`	Number of CPU cores to use. If absent, the number of available cores is used
`--n-io-process`	Number of IO processes for network operations. 1 by default
`--cuda-devices`	Indices of CUDA devices to use. If not specified, all devices will be used. Specifying an empty string will ignore all devices

For instance, to start a Mars cluster with two supervisors and two workers, run the commands below (memory and CPU tuning options are omitted):

On Supervisor 1 (192.168.1.10):

mars-supervisor -H 192.168.1.10 -p 7001 -w 7005 -s 192.168.1.10:7001,192.168.1.11:7002

On Supervisor 2 (192.168.1.11):

mars-supervisor -H 192.168.1.11 -p 7002 -s 192.168.1.10:7001,192.168.1.11:7002

On Worker 1 (192.168.1.20):

mars-worker -H 192.168.1.20 -p 7003 -s 192.168.1.10:7001,192.168.1.11:7002

On Worker 2 (192.168.1.21):

mars-worker -H 192.168.1.21 -p 7004 -s 192.168.1.10:7001,192.168.1.11:7002
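Because every process receives the same `-s` list, the four commands above can be generated from a small topology description rather than typed by hand. The sketch below is one way to do that. `build_commands`, `SUPERVISORS` and `WORKERS` are hypothetical names introduced for this example, and only the flags shown above (`-H`, `-p`, `-w`, `-s`) are used.

```python
import shlex

# Topology from the example above: (host, service_port, web_port or None)
SUPERVISORS = [("192.168.1.10", 7001, 7005), ("192.168.1.11", 7002, None)]
WORKERS = [("192.168.1.20", 7003), ("192.168.1.21", 7004)]


def build_commands(supervisors, workers):
    """Return one shell command string per node, supervisors first."""
    # Every process receives the same comma-separated supervisor list via -s
    endpoints = ",".join(f"{host}:{port}" for host, port, _ in supervisors)
    commands = []
    for host, port, web_port in supervisors:
        argv = ["mars-supervisor", "-H", host, "-p", str(port)]
        if web_port is not None:
            argv += ["-w", str(web_port)]
        argv += ["-s", endpoints]
        commands.append(shlex.join(argv))
    for host, port in workers:
        argv = ["mars-worker", "-H", host, "-p", str(port), "-s", endpoints]
        commands.append(shlex.join(argv))
    return commands


for command in build_commands(SUPERVISORS, WORKERS):
    print(command)
```

Each printed line still has to be run on its own node, for example over SSH. The sketch only builds the command strings and does not start any process.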
