Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/architecture/.nav.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2,3 +2,4 @@ flatten_single_child_sections: true
nav:
- "Introduction": index.md
- "Sizing": sizing.md
- "Latency": latency.md
253 changes: 253 additions & 0 deletions docs/architecture/latency.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,253 @@
# Latency

## Introduction

The DSS keeps its data across all instances of a DSS pool using a single
distributed database cluster (e.g. CockroachDB, Yugabyte).

Because the cluster is distributed and strongly consistent, the physical
distance between nodes directly affects how fast the DSS can answer. Latency is
therefore not a detail: it drives the responsiveness and the throughput of the
whole service.

## Impact on performance

### Time to answer a request

When a call is made to the DSS, the core service queries the database. The
database is not a single node: it is a cluster spread across DSS participants.
To stay strongly consistent, a write must reach a quorum of nodes before it is
acknowledged, and a read may also need to contact other nodes depending on where
the data lives.

Each of these internal hops adds to the round-trip latency between nodes. A single
API call can trigger more than one database operation, so the latencies stack
up. If the nodes are far apart, every consensus round pays that distance, and
the time to answer grows accordingly.

![](../assets/generated/latency_tta.png)
/// caption
A generic request. Each
arrow adds cumulative latency
///

### Bandwidth

Latency and bandwidth interact. With high latency, each operation (assuming no parallelism) takes longer
to complete, so fewer operations finish per second, which lowers effective
throughput even when raw bandwidth is available. Synchronization traffic between
nodes also competes for that bandwidth.

This matters for high-volume cases such as many subscriptions or large query
results: the more data must be transferred and synchronized between distant
nodes, the more latency limits how much can realistically be moved in a given
time window. There is a practical ceiling on how much can be transferred before
responsiveness degrades.

![](../assets/latency_impact.png)
/// caption
**Simulated** throughput vs
latency on a virtual link, practical throughput is even lower.
///

### Example of latency vs performance

The following graph shows, in a controlled environment, the impact of latency on
simple requests.

Tests have been done with DSS version v0.22.0, CockroachDB, 3 USS with one node,
on a virtual machine with ample resources.

Latency is injected with
`tc qdisc add dev eth0 root netem delay Xms 2ms distribution paretonormal`, with
X ranging from 0 to 50, in steps of 5ms. No loss is applied, nor latency between
a DSS and its datastore.

Performance is measured by creating and deleting RID ISAs, without any
subscriptions, as it performs simple queries and doesn't create congestion
issues.

Queries are done against all DSS, with 3 processes in parallel (for a total
of 9) and at the same time to remove as many variations as possible. Notice
there is still some expected variance, the goal here is to show global trends.

![](../assets/latency_rid.png)
/// caption
Results of testing RID ISA calls with
various latencies.
///

This is just an example of one call, but it shows that even with a simple
operation, there is a limit on how fast queries can be processed, no matter
optimizations performed.

## Recommendations

Keep nodes as close as possible, ideally within the same region, to minimize the
round-trip cost of consensus and synchronization.

At the same time, you should probably keep distinct areas, datacenters and
providers for redundancy. A set of servers in a single rack will be very fast,
but have very low redundancy, for example if the rack loses power. On the
opposite side of the options, servers spread throughout the world will offer
maximum redundancy, for example against country-level issues, but will be very
slow. The goal is a balance: geographically close enough for low latency,
diverse enough for redundancy.

Think and plan for the failures you want to handle, as a pool, and then spread
nodes across failure domains accordingly for resilience.

## Typical scenarios examples

This section shows sample deployments, their latencies and their results on a
specific test to have as reference real-life examples.

Notice that values here are only a snapshot when they have been run. DSS code
has and will improve in the future, cloud providers network resources will
evolve with time and even placement of your virtual machine could randomly
impact latency and performances.

All examples are run with version v0.22.0, one node per location, with virtual
machines with ample resources to focus on showing latency variations.
CockroachDB encryption was off to simplify tests.

Locust instances are always located next to the DSS being tested, on the same
machine. The test ran is
`docker run -e AUTH_SPEC="DummyOAuth(http://172.17.0.1:8085/token,localhost)" -p 8089:8089 -v .:/app/ interuss/monitoring-dev uv run locust -f loadtest/locust_files/ISA.py -H http://172.17.0.1 -u 10 --uss-base-url http://dss.localututm`.
The test has been chosen to be light and to be able to run it across a wide
number of configurations, with the possibility of adding one subscription to
force 'synchronization' between nodes. Others, more complex tests (like
FlightsInSubs which create one implicit subscription for each call) generate too
many contentions with version v0.22.0 to be meaningful.

### US-West

Cluster has been deployed on the west coast, in two providers. Regions between
providers are close, and the two machines in the same provider are in the same
region, but different availability zone.

Latency measured by CockroachDB is as follows:

![](../assets/generated/latency_west.png)

Results after 2 minutes are:

| Node | q/s | 50th latency | 95th latency |
| ---- | ----- | ------------ | ------------ |
| N1 | 19.35 | 2ms | 11ms |
| N2 | 19.01 | 3ms | 14ms |
| N3 | 18.2 | 17ms | 97ms |

showing good performances in general.

Notice primary CockroachDB node is probably located in 'Provider 1'. Other test
runs showed swapped latency, but that is to be expected.

With one subscription added in the database, forcing a synchronized update:

| Node | q/s | 50th latency | 95th latency | Failures/s |
| ---- | ----- | ------------ | ------------ | ---------- |
| N1 | 19.1 | 2ms | 17ms | 0 |
| N2 | 18.81 | 3ms | 14ms | 0 |
| N3 | 17.53 | 16ms | 120ms | 0 |

While this shows a slight decrease in performance (approximately 2 percent for
q/s and 50th latency), the extra, synchronized step needed is almost invisible.

### Across US

Cluster has been deployed across the US, in two providers: one node in the west,
one node in the east and, in a different provider, in the center.

Latency measured by CockroachDB is as follows:

![](../assets/generated/latency_coast_to_coast.png)

Results after 2 minutes are:

| Node | q/s | 50th latency | 95th latency |
| ---- | ----- | ------------ | ------------ |
| N1 | 17.42 | 2ms | 120ms |
| N2 | 17.08 | 64ms | 350ms |
| N3 | 17.95 | 35ms | 170ms |

That's a 7% loss in performances, 700% increased latency for the 50th percentile
and more than 1000% of increase for the 95% percentile.

With one subscription added in the database, forcing a synchronized update:

| Node | q/s | 50th latency | 95th latency | Failures/s |
| ---- | ---- | ------------ | ------------ | ---------- |
| N1 | 17.6 | 2ms | 92ms | 0 |
| N2 | 4.2 | 2200ms | 2500ms | 0 |
| N3 | 12.3 | 54ms | 100ms | 0 |

That's a 40% loss in performances, 24000% increased latency for the 50th
percentile and more than 6000% of increase for the 95% percentile, compared to
the same step in previous scenario.

The impact is way more visible there, as one is below 5q/s and very high
latency. CockroachDB leader still performs well.

### Across Atlantic Ocean

Cluster has been deployed across the US and in Europe, in two providers: one
node in the west, one in Europe and, in a different provider, in the center.

Latency measured by CockroachDB is as follows:

![](../assets/generated/latency_europe.png)

Results after 2 minutes are:

| Node | q/s | 50th latency | 95th latency |
| ---- | ---- | ------------ | ------------ |
| N1 | 18.4 | 2ms | 180ms |
| N2 | 14.2 | 150ms | 630ms |
| N3 | 18.4 | 34ms | 240ms |

That's a 9% loss in performances, 1650% increased latency for the 50th
percentile and more than 2000% of increase for the 95% percentile.

With one subscription added in the database, forcing a synchronized update:

| Node | q/s | 50th latency | 95th latency | Failures/s |
| ---- | ---- | ------------ | ------------ | ---------- |
| N1 | 10.4 | 570ms | 4600ms | 0.15 |
| N2 | 1.85 | 10000ms | 10000ms | 0.5 |
| N3 | 5.45 | 1200ms | 10000ms | 0.18 |

That's a 69% loss in performances, 122000% increased latency for the 50th
percentile and more than 35000% of increase for the 95% percentile, compared to
the same step in the first scenario. Compared to the non-synchronized case, we
lost about 63% of queries per second.

The DSS is almost unusable there, especially for locations without the
CockroachDB leader, but even direct queries to it are affected. Most queries in
Europe timeout.

### Summary

![](../assets/latency_summary.png)

The DSS runs on a 3-node CockroachDB cluster, tested across 3 geographic setups.
The further apart the nodes, the higher the inter-node latency (8.9ms -> 38.9ms
-> 94ms), which directly hits performance. Without a synchronized step, the
impact stays moderate: q/s drops only ~7% (across US) to ~9% (Atlantic). But
once a synchronized operation is forced (adding a subscription), the gap
explodes with distance. In US-West it's invisible, coast-to-coast q/s falls ~40%
with 50th latencies around ~750ms, and across the Atlantic the system becomes
nearly unusable: average q/s down to ~6 (−69%), latencies capped at 10000ms, and
failures appearing only there (up to 0.5/s).

This test was done on endpoints performing relatively simple queries. More
complex ones, like SCD are even more affected.

## Conclusion

Repartition of nodes is a trade-off of latency against redundancy. Because the
cluster is distributed and strongly consistent, node distance directly shapes
both response time and effective throughput. Placing nodes close together keeps
the service fast, while spreading them across failure domains keeps it
resilient. Choosing where nodes live is mainly about finding the right balance
between these two for the operational needs of the DSS pool.
Binary file added docs/assets/generated/latency_coast_to_coast.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/assets/generated/latency_europe.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/assets/generated/latency_tta.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/assets/generated/latency_west.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
14 changes: 14 additions & 0 deletions docs/assets/latency_coast_to_coast.gv
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
digraph time_to_answer {
rankdir=LR;
node [shape=circle, colorscheme=paired8, width=2.0, fixedsize=true];
graph [dpi = 300];

N1 [label="N1\nProvider 1\nRegion W",color=1]
N2 [label="N2\nProvider 1\nRegion E",color=2]
N3 [label="N3\nProvider 2\nRegion C",color=3]

N1 -> N2 [label="~58.8ms",dir=both]
N1 -> N3 [label="~31.1ms",dir=both]
N2 -> N3 [label="~26.7ms",dir=both]

}
14 changes: 14 additions & 0 deletions docs/assets/latency_europe.gv
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
digraph time_to_answer {
rankdir=LR;
node [shape=circle, colorscheme=paired8, width=2.0, fixedsize=true];
graph [dpi = 300];

N1 [label="N1\nProvider 1\nRegion W",color=1]
N2 [label="N2\nProvider 1\nRegion E",color=2]
N3 [label="N3\nProvider 2\nRegion C",color=3]

N1 -> N2 [label="~141.9ms",dir=both]
N1 -> N3 [label="~31.2ms",dir=both]
N2 -> N3 [label="~109ms",dir=both]

}
Binary file added docs/assets/latency_impact.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/assets/latency_rid.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/assets/latency_summary.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
28 changes: 28 additions & 0 deletions docs/assets/latency_tta.gv
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
digraph time_to_answer {
rankdir=LR;
node [shape=box, colorscheme=paired8];
graph [dpi = 300];

client [label="Client"];
core [label="DSS"];

subgraph cluster_db {
label="Distributed DB cluster";
style=dashed;
leader [label="Node\n(leaseholder)"];
peer1 [label="Peer node"];
peer2 [label="Peer node"];
}

client -> core [label="API call"];
core -> leader [label="query"];

leader -> peer1 [label="consensus\nround-trip"];
leader -> peer2 [label="consensus\nround-trip"];
peer1 -> leader [label="ack"];
peer2 -> leader [label="ack (quorum)"];

leader -> core [label="result"];
core -> client [label="response"];

}
14 changes: 14 additions & 0 deletions docs/assets/latency_west.gv
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
digraph time_to_answer {
rankdir=LR;
node [shape=circle, colorscheme=paired8, width=2.0, fixedsize=true];
graph [dpi = 300];

N1 [label="N1\nProvider 1\nRegion 1\nZone 1",color=1]
N2 [label="N2\nProvider 1\nRegion 1\nZone 2",color=2]
N3 [label="N3\nProvider 2\nRegion 1",color=3]

N1 -> N2 [label="~0.78ms",dir=both]
N1 -> N3 [label="~13.4ms",dir=both]
N2 -> N3 [label="~12.6ms",dir=both]

}
1 change: 1 addition & 0 deletions docs/deployment_checklist.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ This checklist outlines the major decisions and steps required to deploy a non-l
* This repository provides Terraform configurations for [Amazon Web Services (EKS)](infrastructure/aws.md) and [Google Cloud (GKE)](infrastructure/google.md) to deploy a Kubernetes cluster (the infrastructure into which the Services will be deployed).
* This repository provides [Tanka](https://github.com/interuss/dss/blob/master/deploy/services/tanka/) files and [Helm Charts](https://github.com/interuss/dss/blob/master/deploy/services/helm-charts/dss) to be used to deploy Services into a Kubernetes cluster. Terraform will automatically generate these configurations if needed.
* You may also choose to deploy manually or use custom configuration tools.
* Latency is an important factor to achieve good performances. Review the [latency documentation](architecture/latency.md) and plan with others participants of your DSS pool.
* [ ] Prepare sufficient resources for the services.
* In particular, review the [CockroachDB recommendations](https://www.cockroachlabs.com/docs/v24.1/recommended-production-settings#cloud-specific-recommendations) and [YugabyteDB recommendations](https://docs.yugabyte.com/stable/deploy/checklist/#public-clouds); the datastore will consume the majority of the resources.
* Example sizing is also describled in [sizing](architecture/sizing.md).
Expand Down
3 changes: 3 additions & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -104,5 +104,8 @@ markdown_extensions:

- attr_list # Enable creation of custom anchors

- pymdownx.blocks.caption


extra_javascript:
- assets/checkbox.js
Loading