Metrics#
Mars has a unified metrics API and three different backends.
A Unified Metrics API#
The Mars metrics API is defined in mars/metrics/api.py and provides four metric types:
Counter is a cumulative metric that represents a monotonically increasing number.
Gauge is a single numerical value.
Meter is the rate at which a set of events occurs. It can be used to track QPS or TPS.
Histogram is a statistic that records the average value over a window of data.
These types can be used as follows:
from mars.metrics.api import Metrics

# All four metric types share the same parameter list:
# 1. Declarative method: Metrics.counter(name: str, description: str = "", tag_keys: Optional[Tuple[str]] = None)
# 2. Record method: record(value=1, tags: Optional[Dict[str, str]] = None)

# Counter, without and with tags
c1 = Metrics.counter('counter1', 'A counter')
c1.record(1)
c2 = Metrics.counter('counter2', 'A counter', ('service', 'tenant'))
c2.record(1, {'service': 'mars', 'tenant': 'test'})

# Gauge, without and with tags
g1 = Metrics.gauge('gauge1')
g1.record(1)
g2 = Metrics.gauge('gauge2', 'A gauge', ('service', 'tenant'))
g2.record(1, {'service': 'mars', 'tenant': 'test'})

# Meter, without and with tags
m1 = Metrics.meter('meter1')
m1.record(1)
m2 = Metrics.meter('meter2', 'A meter', ('service', 'tenant'))
m2.record(1, {'service': 'mars', 'tenant': 'test'})

# Histogram, without and with tags
h1 = Metrics.histogram('histogram1')
h1.record(1)
h2 = Metrics.histogram('histogram2', 'A histogram', ('service', 'tenant'))
h2.record(1, {'service': 'mars', 'tenant': 'test'})
Note: If tag_keys is declared, tags must be specified when invoking the record method, and the keys of tags must be consistent with tag_keys.
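The consistency rule in the note can be illustrated with a small stand-alone sketch. The class TaggedMetric below is hypothetical and is not the Mars implementation. Raising ValueError on a mismatch is also an assumption, since the text above only says that the keys must be consistent.

```python
from typing import Dict, Optional, Tuple


class TaggedMetric:
    """Hypothetical stand-in for a metric; not the Mars implementation."""

    def __init__(self, name: str, description: str = "",
                 tag_keys: Optional[Tuple[str, ...]] = None):
        self.name = name
        self.description = description
        self.tag_keys = tuple(tag_keys or ())
        self.records = []  # (value, tags) pairs, kept only for illustration

    def record(self, value=1, tags: Optional[Dict[str, str]] = None):
        tags = tags or {}
        # The keys passed at record time must match the declared tag_keys.
        # Raising ValueError here is an assumption made for this sketch.
        if set(tags) != set(self.tag_keys):
            raise ValueError(
                f"tags keys {sorted(tags)} do not match "
                f"tag_keys {sorted(self.tag_keys)}"
            )
        self.records.append((value, tags))


c = TaggedMetric("counter2", "A counter", ("service", "tenant"))
c.record(1, {"service": "mars", "tenant": "test"})  # keys match: accepted

try:
    c.record(1, {"service": "mars"})  # "tenant" is missing: rejected
except ValueError as exc:
    print(exc)
```

In this sketch a metric declared without tag_keys accepts only records without tags, which mirrors the untagged examples above.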
Three different Backends#
Mars metrics supports three backends:
console is used for debugging and simply prints the value.
prometheus is an open-source systems monitoring and alerting toolkit.
ray is a metric backend that runs only on the Ray engine.
Console#
The default metric backend is console. It simply logs the value when the log level is debug.
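To make the behaviour concrete, here is a minimal sketch of a console-style metric built on the standard logging module. ConsoleGauge and the logger name are hypothetical and do not come from Mars. The sketch only shows that the value is emitted at debug level, so nothing appears unless debug logging is enabled.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("console_metrics_sketch")


class ConsoleGauge:
    """Hypothetical console-style metric: it only logs the recorded value."""

    def __init__(self, name):
        self.name = name

    def record(self, value=1, tags=None):
        # logger.debug emits nothing unless the effective level is DEBUG or lower.
        logger.debug("%s: value=%s, tags=%s", self.name, value, tags)


logging.basicConfig()           # attach a default stderr handler
logger.setLevel(logging.DEBUG)  # without this, the record below is silent
ConsoleGauge("gauge1").record(1)
```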
Prometheus#
Firstly, download Prometheus. For details, please refer to Prometheus Getting Started.
Secondly, create a new Mars session with the Prometheus backend configured as follows:
In [1]: import mars
In [2]: session = mars.new_session(
   ...:     n_worker=1,
   ...:     n_cpu=2,
   ...:     web=True,
   ...:     config={"metrics.backend": "prometheus"}
   ...: )
Finished startup prometheus http server and port is 15768
Finished startup prometheus http server and port is 44303
Finished startup prometheus http server and port is 63391
Finished startup prometheus http server and port is 13722
Web service started at http://0.0.0.0:15518
Thirdly, configure Prometheus to scrape the ports printed above. For more configuration options, please refer to Prometheus Configuration.
scrape_configs:
  - job_name: 'mars'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:15768', 'localhost:44303', 'localhost:63391', 'localhost:13722']
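The ports in the startup output may differ between sessions, so the targets list may need to be rebuilt each time. The sketch below extracts the ports from the startup messages and renders an equivalent scrape configuration. The helper names scrape_targets and render_scrape_config are hypothetical, and the sketch assumes the messages keep the wording shown in the session output.

```python
import re

# Startup output copied from the session example.
startup_log = """\
Finished startup prometheus http server and port is 15768
Finished startup prometheus http server and port is 44303
Finished startup prometheus http server and port is 63391
Finished startup prometheus http server and port is 13722
Web service started at http://0.0.0.0:15518
"""

# Assumes the message wording stays as shown in the session output.
PORT_RE = re.compile(r"prometheus http server and port is (\d+)")


def scrape_targets(log_text, host="localhost"):
    """Return host:port targets for every Prometheus port found in the log."""
    return [f"{host}:{port}" for port in PORT_RE.findall(log_text)]


def render_scrape_config(targets, job_name="mars", interval="5s"):
    """Render a minimal scrape_configs section as YAML text."""
    quoted = ", ".join(f"'{t}'" for t in targets)
    return (
        "scrape_configs:\n"
        f"  - job_name: '{job_name}'\n"
        f"    scrape_interval: {interval}\n"
        "    static_configs:\n"
        f"      - targets: [{quoted}]\n"
    )


targets = scrape_targets(startup_log)
print(render_scrape_config(targets))
```

The web service port (15518) is deliberately not matched, since only the Prometheus HTTP server ports are scrape targets.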
Then start Prometheus:
$ prometheus --config.file=promconfig_mars.yaml
level=info ts=2022-06-07T13:05:01.484Z caller=main.go:296 msg="no time or size retention was set so using the default time retention" duration=15d
level=info ts=2022-06-07T13:05:01.484Z caller=main.go:332 msg="Starting Prometheus" version="(version=2.13.1, branch=non-git, revision=non-git)"
level=info ts=2022-06-07T13:05:01.484Z caller=main.go:333 build_context="(go=go1.13.1, user=brew@Mojave.local, date=20191018-01:13:04)"
level=info ts=2022-06-07T13:05:01.485Z caller=main.go:334 host_details=(darwin)
level=info ts=2022-06-07T13:05:01.485Z caller=main.go:335 fd_limits="(soft=256, hard=unlimited)"
level=info ts=2022-06-07T13:05:01.485Z caller=main.go:336 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2022-06-07T13:05:01.487Z caller=main.go:657 msg="Starting TSDB ..."
level=info ts=2022-06-07T13:05:01.488Z caller=web.go:450 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2022-06-07T13:05:01.494Z caller=head.go:514 component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2022-06-07T13:05:01.495Z caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=1
level=info ts=2022-06-07T13:05:01.495Z caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=1 maxSegment=1
level=info ts=2022-06-07T13:05:01.497Z caller=main.go:672 fs_type=1a
level=info ts=2022-06-07T13:05:01.497Z caller=main.go:673 msg="TSDB started"
level=info ts=2022-06-07T13:05:01.497Z caller=main.go:743 msg="Loading configuration file" filename=promconfig_mars.yaml
level=info ts=2022-06-07T13:05:01.501Z caller=main.go:771 msg="Completed loading of configuration file" filename=promconfig_mars.yaml
level=info ts=2022-06-07T13:05:01.501Z caller=main.go:626 msg="Server is ready to receive web requests."
Fourthly, run a Mars task:
In [3]: import numpy as np
In [4]: import mars.dataframe as md
In [5]: df1 = md.DataFrame(np.random.randint(0, 3, size=(10, 4)),
   ...:                    columns=list('ABCD'), chunk_size=5)
   ...: df2 = md.DataFrame(np.random.randint(0, 3, size=(10, 4)),
   ...:                    columns=list('ABCD'), chunk_size=5)
   ...:
   ...: r = md.merge(df1, df2, on='A').execute()
Finally, check the metrics in the Prometheus web UI at http://localhost:9090.
Ray#
We can configure metrics.backend when creating a Ray cluster or a new session.
Metrics Naming Convention#
We propose a naming convention for metrics as follows:
namespace.[component].metric_name[_units]
namespace could be mars.
component could be supervisor, worker, band, etc., and can be omitted.
units is the metric unit, such as seconds when recording time. If there is no suitable unit, use _count when the metric type is Counter, or _number when the metric type is Gauge.
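As a quick illustration of the convention, the sketch below assembles names from their parts. The helper metric_name and the example metric names (execution_time, submitted_task, stored_object) are hypothetical and chosen only to show the pattern.

```python
from typing import Optional


def metric_name(metric: str, component: Optional[str] = None,
                units: Optional[str] = None, namespace: str = "mars") -> str:
    """Build a name of the form namespace.[component].metric_name[_units]."""
    parts = [namespace]
    if component:  # the component may be omitted
        parts.append(component)
    parts.append(f"{metric}_{units}" if units else metric)
    return ".".join(parts)


# Hypothetical metric names, for illustration only.
print(metric_name("execution_time", "band", "seconds"))     # mars.band.execution_time_seconds
print(metric_name("submitted_task", "supervisor", "count")) # mars.supervisor.submitted_task_count
print(metric_name("stored_object", units="number"))         # mars.stored_object_number
```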