There are two principals with potentially conflicting interests:
The relay is operated by LINBIT. For the share-mode path, the relay is explicitly not trusted to enforce access limits -- limits are enforced by the customer-side producer shim, which the customer can audit.
A third party is anyone who is neither the customer nor a LINBIT support engineer: a random internet host, another customer, a compromised CDN, etc. The following layers prevent a third party from observing or interfering with a session.
The broker socket (/run/linbit-tunl/<id>.sock) is
a Unix domain socket on the relay's filesystem. It is not reachable from
the internet by any path. The relay API (relay-api.py)
listens on 127.0.0.1:8080 inside the relay pod, but a small
set of customer-bootstrap endpoints is exposed publicly via nginx (port
443) -- see "Relay API security" below and
docs/deployment.html for the exact whitelist. Everything else
is loopback-only.
Before the broker is even started, the customer's ssh
authenticates to the relay as the tunl user. A third party
cannot connect as tunl without either:
If a third party obtains the session ID but not the token, they cannot produce valid credentials to connect.
Support engineers SSH to the relay as the support user.
That account's authorized_keys is managed by LINBIT; a
third party has no path in. Even if they reached the relay host somehow,
the broker socket requires membership in the linbit-tunl OS
group, which is also LINBIT-controlled.
If a third party tried to interpose a fake relay between the customer
and the real relay, the customer script would reject the connection: all
SSH invocations use StrictHostKeyChecking=yes with a
known_hosts file pinned to the relay's host CA. A fake
relay cannot produce a valid host certificate without the CA private
key.
All data between the customer and the relay travels inside the customer's SSH session. All data between support and the relay travels inside the support SSH session. On the relay itself the data moves through a local Unix socket. There is no segment of the path that is unencrypted or exposed to the network.
The session ID is not itself a secret -- it is spoken aloud on a
support call and appears in the relay session registry. Knowing it is
not sufficient to join a session: a third party still needs valid SSH
credentials (either as tunl with a token-derived cert, or
as support with an authorized key). The ID is a session
label, not an access credential.
When the relay does not enforce one-time tokens
(TUNL_REQUIRE_TOKEN=0), the four-word session ID is itself
the primary authentication credential the customer presents at the
tunl SSH layer. A third party who learns the ID before the
legitimate customer has connected can then authenticate as
tunl and impersonate a customer.
The same analysis applies to any other path by which an attacker
could end up authenticated as tunl for a session-id they do
not legitimately own -- including the (rate-limit-and-math-bounded)
successful brute force discussed in known limitation 3 below. The
consequences described here are the consequence ceiling for either
path.
Two things to be clear about:
Such a third party cannot interfere with the legitimate customer's system. The producer end of the broker is the customer; the relay only fans tmux output out to the support side. It does not have a channel into a customer host that has not chosen to open one. At no point does obtaining a session ID give an attacker the ability to inject keystrokes, read files, or run code on the customer's machine. The customer-side allowlist documented above sits behind the trust boundary and would refuse arbitrary commands even if the attacker somehow reached it.
The realistic attack is impersonation against
support. The attacker connects to the relay using the leaked
session ID, presents themselves as the customer's producer, and waits
for a support engineer to attach via tunl join. The
engineer believes they are looking at the customer's terminal but is
instead looking at one the attacker controls. What they gain in that
situation:
docs/escape-filter.html) -- enough to mislead the engineer
about what is on the "customer's" screen.
In short, it is a social-engineering pivot against the support
engineer, not a path to the customer. Mitigations are operational:
prefer the one-time-token mode (TUNL_REQUIRE_TOKEN=1) for
any context where the ID may have leaked; out-of-band confirmation with
the customer ("are you on the call? did you start the session just
now?") before typing anything sensitive; and the brute-force defences on
the relay sshd that make grinding random IDs impractical.
The customer starts the linbit-tunl.py process. It runs
under their credentials on their machine. The producer shim in
ShareSession is the sole trust enforcement point on the
customer side.
_cmd_forwarder in linbit-tunl.py reads
commands from the relay over SSH stdio and applies a strict
allowlist:
| Command | Effect | Allowed |
|---|---|---|
| SNAPSHOT | Triggers capture-pane, injects current screen into stream | yes |
| KEYS <role> <hex> | Injects keystrokes into a registered pane | yes |
| RESIZE <C>x<R> | Resizes the tmux window | yes |
| STATUS <text> | Writes text to the chat-pane status header | yes |
| UPGRADE <port> | Requests ControlMaster port forward | yes (restricted: no) |
| SUPPORT_KEY <key> | Accumulates a support public key | yes (tunnel flow only) |
| TUNNEL_CONFIRMED | Installs accumulated keys in authorized_keys | yes (tunnel flow only) |
| anything else | Logged and dropped | no |
KEYS is further restricted: <role>
must resolve via the producer's registered pane-role map (currently
main and chat; the lookup is a dict and
unknown roles silently drop). The relay cannot name a pane the producer
did not register.
In --restricted mode, UPGRADE,
SUPPORT_KEY, and TUNNEL_CONFIRMED are also
rejected.
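The allowlist and the extra KEYS and --restricted checks can be sketched roughly as follows. This is a minimal illustration, not the actual _cmd_forwarder code: the function name filter_command, the PANE_ROLES values, the return shape, and the hex-validity check are assumptions, and the "tunnel flow only" condition from the table is not modelled.

```python
# Hypothetical sketch of an allowlist filter; all names are illustrative only.
ALWAYS_ALLOWED = {"SNAPSHOT", "KEYS", "RESIZE", "STATUS"}
TUNNEL_ONLY = {"UPGRADE", "SUPPORT_KEY", "TUNNEL_CONFIRMED"}
# Assumed role -> tmux pane id map; only the role names come from the text.
PANE_ROLES = {"main": "%0", "chat": "%1"}


def filter_command(line: str, restricted: bool = False) -> tuple[bool, str]:
    """Return (allowed, reason) for one command line received from the relay."""
    parts = line.strip().split(" ", 2)
    verb = parts[0]
    if verb in TUNNEL_ONLY:
        if restricted:
            return False, "rejected in --restricted mode"
        return True, "ok"
    if verb not in ALWAYS_ALLOWED:
        # Anything not named in the table is dropped.
        return False, "not on allowlist"
    if verb == "KEYS":
        # KEYS <role> <hex>: the role must be one the producer registered.
        if len(parts) != 3 or parts[1] not in PANE_ROLES:
            return False, "unknown pane role"
        try:
            bytes.fromhex(parts[2])  # assumption: payload must decode as hex
        except ValueError:
            return False, "payload is not hex"
    return True, "ok"
```

The point of the default-deny shape is that a new relay-side command has no effect until the customer-side script is changed to name it.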
Every inbound command from the relay is logged verbatim to
/tmp/linbit-tunl-<hash>/ssh.log, including the
timestamp and whether it was allowed or rejected. The log appends across
reconnects. The customer can inspect it mid-session or after. The
session ID does not appear in the path; <hash> is the
first 12 hex digits of SHA-256(session_id).
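The path derivation described above could look like the following sketch. The helper names are hypothetical, and UTF-8 encoding of the session ID before hashing is an assumption; the text only states that the first 12 hex digits of SHA-256(session_id) are used.

```python
import hashlib


def session_hash(session_id: str) -> str:
    """First 12 hex digits of SHA-256 over the session ID (UTF-8 assumed)."""
    return hashlib.sha256(session_id.encode("utf-8")).hexdigest()[:12]


def ssh_log_path(session_id: str) -> str:
    """Build the audit-log path without embedding the session ID itself."""
    return f"/tmp/linbit-tunl-{session_hash(session_id)}/ssh.log"
```

Twelve hex digits are 48 bits of the digest, enough to keep directories for different sessions apart without revealing the ID in the path.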
A compromised relay can only:
main or chat
pane (the customer's working shell and the chat pane they already
see)
A compromised relay cannot:
PTY ECHO mode is a kernel terminal attribute, not a VT sequence. When
ECHO is disabled (password prompts, sudo, etc.), keystrokes
are not written to PTY output and therefore do not appear in
%output events. They do not appear in the asciinema cast
file.
There is no per-keystroke audit log on the customer side -- the
customer's .cast is the authoritative recording, and
ECHO-off content stays out of it by construction.
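To illustrate that ECHO is a termios attribute rather than something carried in the byte stream, here is a small sketch using Python's standard termios and pty modules on a scratch PTY. The helper names are hypothetical and it is not part of linbit-tunl.py; it only shows where the flag lives.

```python
import os
import pty
import termios


def echo_enabled(fd: int) -> bool:
    """Report whether the ECHO local-mode flag is set on the terminal fd."""
    lflag = termios.tcgetattr(fd)[3]  # index 3 is the local-modes field
    return bool(lflag & termios.ECHO)


def set_echo(fd: int, enabled: bool) -> None:
    """Set or clear ECHO, as a password prompt would before reading input."""
    attrs = termios.tcgetattr(fd)
    if enabled:
        attrs[3] |= termios.ECHO
    else:
        attrs[3] &= ~termios.ECHO
    termios.tcsetattr(fd, termios.TCSANOW, attrs)


def demo() -> tuple[bool, bool]:
    """Open a scratch PTY, turn ECHO off then on, and report both states."""
    master_fd, slave_fd = pty.openpty()
    try:
        set_echo(slave_fd, False)
        off_state = echo_enabled(slave_fd)
        set_echo(slave_fd, True)
        on_state = echo_enabled(slave_fd)
        return off_state, on_state
    finally:
        os.close(master_fd)
        os.close(slave_fd)
```

Because the flag is kernel state on the PTY, a stream-level filter never has to recognise a password prompt; the bytes simply are not echoed into the output.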
Tracing caveat: when TUNL_TRACE_WIRE or
TUNL_TRACE_ALL=1 is set, the raw inbound CTRL frames
(including support-side KEYS payloads, which carry the
bytes the support engineer typed) are written verbatim to
<session_dir>/wire.broker.bin for diagnostic replay.
Do not enable wire tracing on a session where you want password-prompt
content to stay off disk. See docs/debuggability.html and
relay/tunl_trace.py for the trace controls.
The tmux session name on the customer machine uses a short SHA-256
hash of the session ID (linbit-{sha256[:12]}), not the
session ID itself. The session ID is not visible in ps
output to other users on the host.
The customer script authenticates to the relay as the
tunl user:
CA-signed ephemeral certificate: An ephemeral
ed25519 keypair is generated per session. The public key is submitted to
POST /api/v1/sign with a one-time token. The relay CA signs
it with a short TTL (default 60 min), a force-command
critical option, restricted port-forward permissions, and a
per-session principal
(linbit-tunl-<session-id>). A cert expiring
mid-session does not drop an established connection.
sshd accepts a cert if its principal list intersects with
AuthorizedPrincipalsCommand output; that command is
relay/check-principal.py, which queries the relay API for
the session-id encoded in the cert's key_id and emits the
principal only when the session is in pending/active status. This binds
the cert to one specific session at the SSH-auth layer (B5.1 fix,
2026-05-07): a cert signed for session A1 cannot authenticate as A2 even
within its TTL. cmd_produce additionally re-checks
SSH_USER_AUTH against $SSH_ORIGINAL_COMMAND
for defence-in-depth.
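The principal check can be sketched as below. This is an assumption-laden illustration, not relay/check-principal.py: lookup_status is a hypothetical stand-in for the relay-API query, and the sketch assumes the key_id carries the bare session-id, which the text does not specify.

```python
from typing import Callable, Optional

LIVE_STATUSES = {"pending", "active"}


def principals_for(
    key_id: str,
    lookup_status: Callable[[str], Optional[str]],
) -> list[str]:
    """Return the principals to print for sshd, one per output line.

    lookup_status is a hypothetical callable standing in for the relay-API
    query; it returns the session status, or None for an unknown session.
    """
    session_id = key_id  # assumption: key_id is exactly the session-id
    if lookup_status(session_id) in LIVE_STATUSES:
        return [f"linbit-tunl-{session_id}"]
    # An empty list means sshd finds no intersection and rejects the cert.
    return []
```

A cert whose principal names session A1 then fails for A2, because the command only ever emits the principal of the session it was asked about.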
Keyboard-interactive credential: The customer
types (or the script provides via SSH_ASKPASS) one of two
credential forms. sshd delegates validation to
relay/validate-token.py via PAM pam_exec,
which calls the relay API.
PAM/cmd_produce binding (B5.1 fix,
2026-05-07): on a successful session-id ACCEPT,
validate-token.py writes the validated session-id to
/run/linbit-tunl/auth/<ppid>.sid (mode 0640, group
linbit-tunl). <ppid> is the OpenSSH
priv-monitor pid -- the common parent of both the PAM-time
validate-token.py process and the post-auth cmd_produce
ForceCommand. cmd_produce reads + unlinks that file at
startup and rejects when its content disagrees with the session-id
claimed in $SSH_ORIGINAL_COMMAND. This binds the
PAM-validated id to the session-id used downstream and prevents an
authenticated tunl user from claiming any other pending
session. The token-auth path does not write a hand-off (tokens are not
API-bound to a specific session-id); for token-auth users the
cross-check is currently skipped.
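A rough sketch of the hand-off follows, with hypothetical function names. It omits the group ownership change to linbit-tunl, and it treats a missing file as a mismatch, whereas the text says the token-auth path currently skips the cross-check.

```python
import hmac
import os
from pathlib import Path


def write_handoff(auth_dir: Path, monitor_pid: int, session_id: str) -> Path:
    """PAM side: record the validated session-id under the monitor's pid."""
    path = auth_dir / f"{monitor_pid}.sid"
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o640)
    with os.fdopen(fd, "w") as fh:
        fh.write(session_id)
    return path


def consume_handoff(auth_dir: Path, monitor_pid: int, claimed_sid: str) -> bool:
    """ForceCommand side: read and unlink the file, then compare ids."""
    path = auth_dir / f"{monitor_pid}.sid"
    try:
        validated = path.read_text().strip()
        path.unlink()  # single use: a second read finds nothing
    except FileNotFoundError:
        return False  # sketch-only policy; see the lead-in
    return hmac.compare_digest(validated.encode(), claimed_sid.encode())
```

Keying the file on the shared parent pid is what lets two otherwise unrelated processes agree on which authentication they belong to.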
- GET /api/v1/pending/{sid}. Validated against the pending
registry; not consumed at PAM time, so the customer can
reconnect within the session lifetime.
- GET /api/v1/expected/{token}. Also validated only;
consumption is deferred.
Token consumption happens at the next relay-API call:
POST /api/v1/sign consumes a token by deleting
it from _expected_tokens (when REQUIRE_TOKEN=1
and the session-id is not yet in _known_session_ids);
POST /api/v1/sessions does the same on the on-demand path.
After consumption the session-id is whitelisted in
_known_session_ids and subsequent API calls for that
session skip the token requirement.
Brute-force defence (B3.3, 2026-05-06 audit, updated for the nginx-on-443 deployment):
- sshd_config.d/tunl.conf carries
MaxAuthTries 3, LoginGraceTime 20, and a
global MaxStartups 5:50:10.
- relay/fail2ban/ ships a filter + jail
that bans source IPs after 5 PAM rejects within 5 minutes (currently
dead weight pending PROXY-protocol-to-sshd; see known limitation
6).
- nginx enforces limit_conn 10 per source IP on
the 443 stream block, and limit_req 30r/m per source IP on
/api/v1/sign. These see real source IPs even though sshd
doesn't, and cover both legs.
- POST /api/v1/sign has a per-session-id 2 s
application-level cooldown; at that rate, sweeping the full 10^6-token
space against a leaked sid takes ~23 days, against a 60-min token TTL.
With these caps in place, the ~30-bit session-id keyspace is not directly brute-forceable from any single source within the pending TTL. See "Consequences of a successful brute force" in known limitation 3 for what an attacker would actually obtain if they beat the math anyway.
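The ~23-day figure can be reproduced with a few lines of arithmetic. The sketch assumes an attacker who already knows the session-id, makes exactly one guess per cooldown window, and faces a uniformly random 6-digit token; the function names are illustrative.

```python
TOKEN_SPACE = 10**6      # 6-digit token, per the text
COOLDOWN_SECONDS = 2     # per-session-id cooldown on /sign, per the text
TOKEN_TTL_MINUTES = 60   # default token TTL, per the text


def full_sweep_days(space: int = TOKEN_SPACE,
                    cooldown_s: float = COOLDOWN_SECONDS) -> float:
    """Time to try every token once at one guess per cooldown window."""
    return space * cooldown_s / 86_400


def attempts_within_ttl(ttl_minutes: int = TOKEN_TTL_MINUTES,
                        cooldown_s: float = COOLDOWN_SECONDS) -> int:
    """How many guesses fit inside one token lifetime."""
    return int(ttl_minutes * 60 // cooldown_s)


def hit_probability(space: int = TOKEN_SPACE) -> float:
    """Chance that at least one guess within the TTL is the right token."""
    return min(1.0, attempts_within_ttl() / space)
```

With the defaults, a full sweep takes about 23.1 days, while one 60-minute TTL admits 1,800 guesses, a success chance of about 0.18%.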
The relay's host CA public key is embedded in the customer script at
build time (RELAY_CA_PUBKEY). All SSH invocations use a
temp known_hosts file with a @cert-authority
entry pointing at that CA. StrictHostKeyChecking=yes is in
effect for all relay connections. A script that was not updated with the
real CA key will fail to connect rather than silently trust any
host.
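A minimal sketch of the pinning step, assuming a helper that writes the temporary known_hosts file and returns the matching ssh options. The function names, the "*" host pattern, and the temp-file prefix are assumptions; the @cert-authority marker and the two ssh options are standard OpenSSH.

```python
import os
import tempfile


def write_pinned_known_hosts(ca_pubkey: str, host_pattern: str = "*") -> str:
    """Write a known_hosts file that trusts only hosts certified by the CA.

    host_pattern "*" is an assumption; a real script might pin the relay's
    hostname instead.
    """
    line = f"@cert-authority {host_pattern} {ca_pubkey.strip()}\n"
    fd, path = tempfile.mkstemp(prefix="tunl-known-hosts-")
    with os.fdopen(fd, "w") as fh:
        fh.write(line)
    return path


def ssh_host_check_args(known_hosts_path: str) -> list[str]:
    """ssh options that make the pinned file the only source of host trust."""
    return [
        "-o", f"UserKnownHostsFile={known_hosts_path}",
        "-o", "StrictHostKeyChecking=yes",
    ]
```

With StrictHostKeyChecking=yes, ssh refuses a host it cannot verify instead of prompting, so a missing or placeholder CA key fails closed.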
Sessions are isolated by the broker socket path
(TUNL_SHARE_DIR/<session-id>.sock, default
/run/linbit-tunl/<id>.sock). The socket is created
with permissions 0o660 and owned by the user running the
broker (typically a dedicated tunl-share service account).
Consumers must be in the linbit-tunl group to connect.
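Creating a socket with those permissions might look like the sketch below. The function name is hypothetical, and the change of group ownership to linbit-tunl is left out because it needs the real group to exist.

```python
import os
import socket


def create_broker_socket(path: str) -> socket.socket:
    """Bind a Unix stream socket that only owner and group may connect to."""
    if os.path.exists(path):
        os.unlink(path)  # remove a stale socket from an earlier run
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    # Tighten the umask so the socket is never briefly world-accessible.
    old_umask = os.umask(0o117)
    try:
        server.bind(path)
    finally:
        os.umask(old_umask)
    os.chmod(path, 0o660)  # explicit, in case the platform ignored the umask
    server.listen()
    return server
```

On Linux, connecting to a Unix socket requires write permission on the socket file, which is why 0o660 plus group membership acts as the access gate.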
The SQLite database (TUNL_DB) stores session metadata
(IP, case number, customer name) and is created with restrictive
permissions.
The support SSH user on the relay has no
ForceCommand. Support engineers SSH in and run
relay/relay-share.py consume SESSION directly. They can
only interact with sessions that exist in the broker socket directory
(i.e., sessions the customer's daemon has registered).
Read-only mode: --ro /
tunl join --ro sends the connection header
CONSUMER_RO\n. The broker does not forward keystrokes from
a read-only support-side connection to the customer.
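The read-only rule can be sketched as a check on the connection header. Only the CONSUMER_RO header comes from the text; the Consumer class, the forward_keys function, and treating every other header as read-write are assumptions for illustration.

```python
from typing import Callable

RO_HEADER = b"CONSUMER_RO\n"  # header named in the text


class Consumer:
    """One support-side connection, classified by its first line."""

    def __init__(self, header: bytes) -> None:
        # Assumption: any header other than CONSUMER_RO means read-write.
        self.read_only = header == RO_HEADER


def forward_keys(consumer: Consumer, payload: bytes,
                 send_to_producer: Callable[[bytes], None]) -> bool:
    """Forward keystrokes unless the connection is read-only."""
    if consumer.read_only:
        return False  # dropped at the broker, never reaches the customer
    send_to_producer(payload)
    return True
```

Enforcing the drop in the broker rather than in the client means a modified client cannot turn a read-only connection into a writable one.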
The relay API (relay-api.py) listens on
127.0.0.1:8080 inside the relay pod. Access from the public
internet is mediated by nginx, which proxies only a whitelist of
customer-bootstrap endpoints to the api:
- GET /api/v1/ca.pub -- session-cert CA public key
(inherently public)
- GET /api/v1/host-ca.pub -- host-cert CA public key
(inherently public)
- POST /api/v1/sign -- customer cert issuance; carries
the token or session-id in the body, has a per-session-id cooldown and a
global signing semaphore in the application, and a per-source-IP
limit_req zone at the nginx layer (default 30 req/min with
a small burst). 500 responses are sanitised: details are logged
server-side and the body carries only the request id.
All other endpoints (session lookups, pending registry,
info, expected token validation, support-keys,
uploads, etc.) are 404'd at the nginx layer and are reachable only from
the pod's loopback. info in particular is loopback-only
because it would otherwise expose the relay's
require_pending posture to anonymous scanners. The
ForceCommand (relay-share.py) calls those endpoints over
127.0.0.1, validate-token.py calls
info and expected/{token} over
127.0.0.1, and external support access (e.g.,
tunl list) is via SSH port-forwarding through the
support user.
relay-api.py itself answers identical handlers on every
interface it listens on; it does not infer "public" vs. "private" from
the request's source. The trust boundary is the nginx allowlist and the
loopback bind. Implication: widening the allowlist in
nginx is a security decision, not just a config tweak. Any endpoint that
would be exposed to anonymous internet traffic must continue to enforce
its own auth (cooldown, token consumption, ...), and any error message
it emits must assume an attacker as the reader. The
expected/{token} GET is explicitly never exposed
because it would be a brute-force oracle for the 6-digit token if
reachable from the internet. Per-source rate-limit coverage on the
public surfaces is provided by the nginx limit_req
(/api/v1/sign) and limit_conn (stream block)
zones, both keyed on the real source IP -- the SSH-side
sshd_config.d/tunl.conf and relay/fail2ban/
caps see only loopback at this time and are not the HTTP path's
defence.
Token-based session pre-registration
(POST /api/v1/pending) stores tokens in memory with a
configurable TTL (TUNL_TOKEN_TTL_MINUTES, default 60
minutes). Tokens are 6-digit random integers (10^6 combinations).
Two-phase use:
- PAM (relay/validate-token.py) only validates
the token via GET /api/v1/expected/{token}; the token is
not consumed at this step, because the same credential is reused across
reconnects within the session.
- The first POST /api/v1/sign (or
POST /api/v1/sessions on the on-demand path) for that
session-id consumes the token (del _expected_tokens[token])
and adds the session-id to _known_session_ids, so further
API calls for that session skip the token requirement.
A used token cannot be replayed once the consume step has run. Until
then (i.e., between PAM accept and the first
/sign or /sessions call), the token is multi-use;
the brute-force defences described under "Authentication to the relay"
above prevent that window from being practically exploitable.
When the customer grants a tunnel upgrade, the sequence is:
1. UPGRADE <port> to the broker.
2. UPGRADE <port> to the producer over the command channel.
3. ssh -S ctl.sock -O forward -R <port>:localhost:22 tunl@relay.
4. LINBIT_TUNNEL_PORT <port> to the broker.
5. tunnel_port, mode=tunnel).
6. SUPPORT_KEY commands.
7. authorized_keys on TUNNEL_CONFIRMED.
8. TUNNEL_PORT <port> to the support side.
In --restricted mode, the producer drops
UPGRADE commands, so step 3 never happens.
The broker writes a v2 .cast file to
/var/log/linbit-tunl/<session>.cast. This captures
all decoded VT bytes fanned out to the support side. The file is owned
by the broker process user and is not world-readable. It does not
contain keystroke content (logged in the customer-side audit log
only).
No single-writer enforcement: multiple read-write support clients can inject keystrokes simultaneously. Coordination is social.
Broker socket on relay: The Unix socket is
group-accessible; any process in the linbit-tunl group on
the relay can connect to the broker as a support-side client. This is
intentional for the multi-support-engineer use case but requires
OS-level group membership control.
Credential brute-force surface: the four-word
session-id has ~30 bits of entropy and the 6-digit token has ~20 bits.
Neither is consumed at PAM time (consumption is deferred to the first
/sign or /sessions call), so each is multi-use
within the pending TTL. Caps from front to back:
- limit_req 30r/m per source IP on /api/v1/sign.
- limit_conn 10 per source IP at accept, covering both mux arms on real IPs.
- MaxStartups 5:50:10, Match User tunl: MaxAuthTries 3, LoginGraceTime 20.
- Per-session-id 2 s cooldown on /sign.
With these caps active, the credential search space cannot be exhausted from any single source within the pending TTL.
Consequences of a successful brute force (the cap on what an
attacker gains if they beat the math anyway). The attacker
becomes the producer for the one session-id they guessed. They do
not gain a channel into the legitimate customer's
machine: the relay only fans tmux output out to the support
side; the producer→broker direction does not contain customer-machine
access of any kind. What they do gain is what the impersonation case
under "Session-ID-only auth mode and the impersonation case" describes
-- keystroke / chat capture, the right to render arbitrary content on
the engineer's terminal, and the range-limited tunnel redirect (B5.2
caps LINBIT_TUNNEL_PORT to 30000-39999, so an engineer's
tunl ssh lands on an attacker-controlled relay-side
-R forward rather than on the customer's box).
Per-session-principal cert binding (B5.1) confines this to the one
guessed sid; the attacker cannot pivot to a different session within the
same cert.
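The range cap on LINBIT_TUNNEL_PORT can be sketched as a parse-and-check step. The function name and the choice to return None on rejection are assumptions; only the 30000-39999 range comes from the text.

```python
from typing import Optional

TUNNEL_PORT_RANGE = range(30000, 40000)  # 30000-39999 inclusive, per the text


def parse_tunnel_port(line: str) -> Optional[int]:
    """Parse 'LINBIT_TUNNEL_PORT <port>' and return the port if in range."""
    parts = line.strip().split()
    if len(parts) != 2 or parts[0] != "LINBIT_TUNNEL_PORT":
        return None
    if not parts[1].isdecimal():
        return None  # rejects signs, whitespace tricks, and non-digits
    port = int(parts[1])
    return port if port in TUNNEL_PORT_RANGE else None
```

The cap does not stop the redirect itself; it keeps an impersonating producer from pointing support at well-known or privileged ports on the relay.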
In short: the credential math is one input; the consequence math is
the other. The consequence ceiling is "social-engineering pivot against
the support engineer for one bounded session," not "compromise of the
customer's host." Mitigations stay operational: prefer
TUNL_REQUIRE_TOKEN=1, expect support to confirm out-of-band
before typing anything sensitive into a freshly-attached session, and
keep the rate-limit caps tight.
StrictHostKeyChecking=no for customer
sshd: When support SSHes to the customer machine via the
reverse tunnel, host key verification is disabled. Security is provided
by the relay authentication layer; the customer sshd is reached only via
an authenticated tunnel.
Public exposure of relay-api via
nginx: the production relay reverse-proxies a small whitelist
of relay-api endpoints to the internet on port 443 (see
"Relay API security" above and docs/deployment.html). This
was a deployment shift after the 2026-05-06 audit was scoped, so the
audit did not consider the implications. Status of each concern:
POST /api/v1/sign brute force -- not a live
threat in token-required mode. Combined entropy is ~50 bits
(~30 from the session-id, ~20 from the token). /sign
consumes the token on first success, so there is no "narrow the search"
attack; one hit closes the window. In the more interesting
leaked-session-id case (attacker knows the sid out-of-band and
brute-forces only the 20 bits of token against it), the application's
per-session-id 2 s cooldown alone pins the search to roughly 23 days vs.
a 60 min token TTL -- already a non-issue. In session-id-only mode (no
token), the residual threat is impersonation against support, not
interference with the customer system; see "Session-ID-only auth mode
and the impersonation case" above.
Per-source limit_req on
POST /api/v1/sign -- landed as defence-in-depth.
nginx enforces limit_req_zone $binary_remote_addr (30
req/min, burst 10) on /api/v1/sign; over-budget requests
get 429. Cheap and helpful against future code changes that might widen
the surface; not load-bearing against the current threat model.
/api/v1/sign 500-body sanitisation --
done. Earlier revisions echoed ssh-keygen stderr
(filesystem paths, tool fingerprints) to the caller. Now: details are
logged on the server only; the response body carries
{"error": ..., "request_id": "<nginx-request-id>"},
and a real customer report can be joined to the relevant log line by
request id.
/api/v1/info loopback-only -- done.
Was previously on the public allowlist; moved to loopback-only because
info returns a require_pending flag that
advertises the relay's auth posture. No public consumer relied on it
(linbit-tunl.py POSTs /sign; the landing page
fetches /ca.pub and /host-ca.pub;
validate-token.py reaches info from inside the
pod).
limit_conn on the 443 stream block --
done. nginx's stream block now enforces
limit_conn_zone $binary_remote_addr with a cap of 10
concurrent connections per source IP, applied at accept time before
ssl_preread decides which mux arm the connection goes to.
Covers both legs (SSH passthrough and HTTPS) on real source IPs. Legit
use (one long-lived customer SSH, a few support SSH connections,
occasional HTTPS) stays comfortably under; useful brute-force throughput
does not. Over-budget connections are closed at accept. No
success-promotion machinery: legit single-shot use fits inside the cap
with margin.
A static trusted-IP allowlist sits in front of both
nginx caps (limit_req on /api/v1/sign and
limit_conn on the stream block). Anything listed in
/etc/nginx/tunl-trusted.conf (image stub; bind-mount to
override at deploy time) gets an empty zone key and bypasses both gates.
Intended for low-cardinality use (office, support VPN exit, CI smoke
runner -- a handful of entries). See docs/deployment.html
"Trusted-IP allowlist" and the file's own header for syntax.
Token-distinguishability oracle on /sign --
not addressed. relay-api.py:_sign_key returns
distinguishable 403 messages (token required vs.
invalid or expired token). Mild compared to the
search-space arithmetic above; collapsing to one opaque "authentication
failed" would close it.
_known_session_ids unbounded -- not
addressed. relay-api.py keeps an in-process set
that grows for the lifetime of the relay; on a successful token
consumption the session-id is whitelisted to skip token checks on
subsequent /sign or /sessions calls. Memory cost
is small; the soft concern is "trust this sid forever (in-process)".
Bounding via LRU keyed to session TTL is the obvious fix.
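One possible shape for such a bound is sketched below: entries expire after a TTL, and a size cap evicts the oldest insertion. This is a simplification of a true LRU (lookups do not refresh recency), and the class name, parameters, and default cap are all assumptions.

```python
import time
from collections import OrderedDict
from typing import Callable


class KnownSessionIds:
    """Set-like store whose members expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, max_entries: int = 10_000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._entries: "OrderedDict[str, float]" = OrderedDict()

    def add(self, session_id: str) -> None:
        now = self._clock()
        # Drop everything that has already expired.
        for sid in [s for s, exp in self._entries.items() if exp <= now]:
            del self._entries[sid]
        self._entries[session_id] = now + self._ttl
        self._entries.move_to_end(session_id)
        # Enforce the size cap by evicting the oldest insertions.
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)

    def __contains__(self, session_id: object) -> bool:
        expiry = self._entries.get(session_id)  # type: ignore[arg-type]
        if expiry is None:
            return False
        if expiry <= self._clock():
            del self._entries[session_id]  # type: ignore[arg-type]
            return False
        return True

    def __len__(self) -> int:
        return len(self._entries)
```

Setting ttl_seconds to the session lifetime means a sid stops being trusted when its session could no longer be live, which addresses the "trust this sid forever" concern as well as the memory growth.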
nginx mux passes traffic to sshd without PROXY
protocol: the stream block in
relay/nginx/nginx.conf (lines 50-58, comment) intentionally
does not enable PROXY protocol toward sshd, because upstream OpenSSH
does not consume it. Consequence: sshd sees every SSH-leg connection as
coming from 127.0.0.1, the existing fail2ban PAM filter
(relay/fail2ban/filter.d/linbit-tunl-pam.conf) can only
ever "ban" the relay pod itself, and any per-IP sshd-level cap is
evaluated against a single shared loopback source. The brute-force
defences on the SSH leg therefore protect against connection-count
exhaustion (MaxStartups, MaxAuthTries) but cannot
single out a slow, persistent attacker by source IP, because sshd's own
logs never show the real address.
This is independent of the stream-block limit_conn (item
5): nginx's stream block sees the real source IP and rate-limits on it
before handing the connection to sshd, even without PROXY protocol
downstream. PROXY-to-sshd (or TPROXY) is still useful for sshd's own
audit logs and for the PAM fail2ban filter to do something meaningful
again, but it is not on the critical path for front-door rate limiting.
Falls in the same backlog tier as re-adding a fail2ban filter on the
nginx access log: do it when there is a concrete audit or forensics
reason, not before.