linbit-tunl: Security Analysis

Threat model summary

There are two principals with potentially conflicting interests:

The relay is operated by LINBIT. For the share-mode path, the relay is explicitly not trusted to enforce access limits -- limits are enforced by the customer-side producer shim, which the customer can audit.


Third-party exclusion

A third party is anyone who is neither the customer nor a LINBIT support engineer: a random internet host, another customer, a compromised CDN, etc. The following layers prevent a third party from observing or interfering with a session.

Network access to the broker

The broker socket (/run/linbit-tunl/<id>.sock) is a Unix domain socket on the relay's filesystem. It is not reachable from the internet by any path. The relay API (relay-api.py) listens on 127.0.0.1:8080 inside the relay pod, but a small set of customer-bootstrap endpoints is exposed publicly via nginx (port 443) -- see "Relay API security" below and docs/deployment.html for the exact whitelist. Everything else is loopback-only.

Customer authentication to the relay

Before the broker is even started, the customer's ssh authenticates to the relay as the tunl user. A third party cannot connect as tunl without either:

If a third party obtains the session ID but not the token, they cannot produce valid credentials to connect.

Support-side authentication to the relay

Support engineers SSH to the relay as the support user. That account's authorized_keys is managed by LINBIT; a third party has no path in. Even if they reached the relay host somehow, the broker socket requires membership in the linbit-tunl OS group, which is also LINBIT-controlled.

Relay host impersonation (MitM)

If a third party tried to interpose a fake relay between the customer and the real relay, the customer script would reject the connection: all SSH invocations use StrictHostKeyChecking=yes with a known_hosts file pinned to the relay's host CA. A fake relay cannot produce a valid host certificate without the CA private key.

Transport confidentiality

All data between the customer and the relay travels inside the customer's SSH session. All data between support and the relay travels inside the support SSH session. On the relay itself the data moves through a local Unix socket. No segment of the path crosses the network unencrypted.

Session ID secrecy

The session ID is not itself a secret -- it is spoken aloud on a support call and appears in the relay session registry. Knowing it is not sufficient to join a session: a third party still needs valid SSH credentials (either as tunl with a token-derived cert, or as support with an authorized key). The ID is a session label, not an access credential.

Session-ID-only auth mode and the impersonation case

When the relay does not enforce one-time tokens (TUNL_REQUIRE_TOKEN=0), the four-word session ID is itself the primary authentication credential the customer presents at the tunl SSH layer. A third party who learns the ID before the legitimate customer has connected can then authenticate as tunl and impersonate a customer.

The same analysis applies to any other path by which an attacker could end up authenticated as tunl for a session ID they do not legitimately own -- including the (rate-limit-and-math-bounded) successful brute force discussed in known limitation 3 below. The consequences described here are the ceiling for either path.

Two things to be clear about:


Customer-as-auditor perspective

What the customer controls

The customer starts the linbit-tunl.py process. It runs under their credentials on their machine. The producer shim in ShareSession is the sole trust enforcement point on the customer side.

Command allowlist (customer-side, in the producer shim)

_cmd_forwarder in linbit-tunl.py reads commands from the relay over SSH stdio and applies a strict allowlist:

Command            Effect                                                     Allowed
SNAPSHOT           Triggers capture-pane, injects current screen into stream  yes
KEYS <role> <hex>  Injects keystrokes into a registered pane                  yes
RESIZE <C>x<R>     Resizes the tmux window                                    yes
STATUS <text>      Writes text to the chat-pane status header                 yes
UPGRADE <port>     Requests ControlMaster port forward                        yes (restricted: no)
SUPPORT_KEY <key>  Accumulates a support public key                           yes (tunnel flow only)
TUNNEL_CONFIRMED   Installs accumulated keys in authorized_keys               yes (tunnel flow only)
anything else      Logged and dropped                                         no

KEYS is further restricted: <role> must resolve via the producer's registered pane-role map (currently main and chat; the lookup is a dict, and unknown roles are silently dropped). The relay cannot name a pane the producer did not register.

In --restricted mode, UPGRADE, SUPPORT_KEY, and TUNNEL_CONFIRMED are also rejected.
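The allowlist rules above can be sketched as a small filter function. This is a hypothetical illustration, not the real _cmd_forwarder: the names filter_command, ALWAYS_ALLOWED, TUNNEL_COMMANDS and PANE_ROLES, the pane ids, and the hex validation step are all assumptions, and the "tunnel flow only" gating from the table is not modelled.

```python
# Hypothetical sketch of a producer-side allowlist; not the real _cmd_forwarder.
ALWAYS_ALLOWED = {"SNAPSHOT", "KEYS", "RESIZE", "STATUS"}
TUNNEL_COMMANDS = {"UPGRADE", "SUPPORT_KEY", "TUNNEL_CONFIRMED"}
PANE_ROLES = {"main": "%0", "chat": "%1"}  # role -> tmux pane id (ids are made up)


def filter_command(line, restricted=False):
    """Return (verb, args) if the command may run, or None if it must be dropped."""
    parts = line.strip().split(" ", 1)
    verb = parts[0]
    args = parts[1] if len(parts) > 1 else ""
    if verb in TUNNEL_COMMANDS:
        # --restricted mode rejects the whole tunnel-upgrade family.
        return None if restricted else (verb, args)
    if verb not in ALWAYS_ALLOWED:
        return None  # anything else is dropped (the real shim also logs it)
    if verb == "KEYS":
        role, _, payload = args.partition(" ")
        if role not in PANE_ROLES:
            return None  # unknown roles drop silently
        try:
            bytes.fromhex(payload)  # assumption: malformed hex is rejected
        except ValueError:
            return None
    return (verb, args)
```

For example, filter_command("KEYS root 6c730a") returns None because root is not a registered role, and filter_command("UPGRADE 30001", restricted=True) returns None because of the restricted-mode rule.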

Audit log

Every inbound command from the relay is logged verbatim to /tmp/linbit-tunl-<hash>/ssh.log, including the timestamp and whether it was allowed or rejected. The log appends across reconnects. The customer can inspect it mid-session or after. The session ID does not appear in the path; <hash> is the first 12 hex digits of SHA-256(session_id).
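As a rough illustration of how the log path is derived, the following sketch computes the 12-digit prefix. The function names and the example session ID are hypothetical, and hashing the ID as UTF-8 text is an assumption.

```python
import hashlib


def session_hash(session_id: str) -> str:
    """First 12 hex digits of SHA-256(session_id)."""
    # Assumption: the session ID is hashed as UTF-8 encoded text.
    return hashlib.sha256(session_id.encode("utf-8")).hexdigest()[:12]


def audit_log_path(session_id: str) -> str:
    """Path of the customer-side audit log for a session."""
    return f"/tmp/linbit-tunl-{session_hash(session_id)}/ssh.log"


# The session ID below is made up for illustration.
print(audit_log_path("alpha-bravo-charlie-delta"))
```

Because only the hash prefix appears in the path, another local user who lists /tmp learns that a session exists but not its ID.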

What the relay cannot do even if compromised

A compromised relay can only:

A compromised relay cannot:

Password safety

PTY ECHO mode is a kernel terminal attribute, not a VT sequence. When ECHO is disabled (password prompts, sudo, etc.), keystrokes are not written to PTY output and therefore do not appear in %output events. They do not appear in the asciinema cast file.
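To make the "kernel terminal attribute" point concrete, here is a minimal sketch that reads and toggles the ECHO flag through termios. The helper names echo_enabled and set_echo are hypothetical and are not part of linbit-tunl; the sketch only shows where the flag lives.

```python
import termios


def echo_enabled(fd: int) -> bool:
    """True if the terminal behind fd currently has the ECHO local flag set."""
    lflag = termios.tcgetattr(fd)[3]  # index 3 is the local-modes field
    return bool(lflag & termios.ECHO)


def set_echo(fd: int, enabled: bool) -> None:
    """Set or clear ECHO, as a password prompt does before reading input."""
    attrs = termios.tcgetattr(fd)
    if enabled:
        attrs[3] |= termios.ECHO
    else:
        attrs[3] &= ~termios.ECHO
    termios.tcsetattr(fd, termios.TCSANOW, attrs)
```

A program such as sudo clears the flag on its terminal before reading the password; while it is cleared, the kernel does not copy typed bytes back to the PTY output side, so there is nothing for an output recorder to capture.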

There is no per-keystroke audit log on the customer side -- the customer's .cast is the authoritative recording, and ECHO-off content stays out of it by construction.

Tracing caveat: when TUNL_TRACE_WIRE or TUNL_TRACE_ALL=1 is set, the raw inbound CTRL frames (including support-side KEYS payloads, which carry the bytes the support engineer typed) are written verbatim to <session_dir>/wire.broker.bin for diagnostic replay. Do not enable wire tracing on a session where you want password-prompt content to stay off disk. See docs/debuggability.html and relay/tunl_trace.py for the trace controls.

Session ID visibility

The tmux session name on the customer machine uses a short SHA-256 hash of the session ID (linbit-{sha256[:12]}), not the session ID itself. The session ID is not visible in ps output to other users on the host.


Support/relay-operator perspective

Authentication to the relay

The customer script authenticates to the relay as the tunl user:

  1. CA-signed ephemeral certificate: An ephemeral ed25519 keypair is generated per session. The public key is submitted to POST /api/v1/sign with a one-time token. The relay CA signs it with a short TTL (default 60 min), a force-command critical option, restricted port-forward permissions, and a per-session principal (linbit-tunl-<session-id>). A cert expiring mid-session does not drop an established connection.

    sshd accepts a cert if its principal list intersects with AuthorizedPrincipalsCommand output; that command is relay/check-principal.py, which queries the relay API for the session-id encoded in the cert's key_id and emits the principal only when the session is in pending/active status. This binds the cert to one specific session at the SSH-auth layer (B5.1 fix, 2026-05-07): a cert signed for session A1 cannot authenticate as A2 even within its TTL. cmd_produce additionally re-checks SSH_USER_AUTH against $SSH_ORIGINAL_COMMAND for defence-in-depth.

  2. Keyboard-interactive credential: The customer types (or the script provides via SSH_ASKPASS) one of two credential forms. sshd delegates validation to relay/validate-token.py via PAM pam_exec, which calls the relay API.

    PAM/cmd_produce binding (B5.1 fix, 2026-05-07): on a successful session-id ACCEPT, validate-token.py writes the validated session-id to /run/linbit-tunl/auth/<ppid>.sid (mode 0640, group linbit-tunl). <ppid> is the OpenSSH priv-monitor pid -- the common parent of both the PAM-time validate-token.py process and the post-auth cmd_produce ForceCommand. cmd_produce reads and unlinks that file at startup and rejects the connection when its content disagrees with the session-id claimed in $SSH_ORIGINAL_COMMAND. This binds the PAM-validated id to the session-id used downstream and prevents an authenticated tunl user from claiming any other pending session. The token-auth path does not write a hand-off (tokens are not API-bound to a specific session-id); for token-auth users the cross-check is currently skipped.

    Token consumption happens at the next relay-API call: POST /api/v1/sign consumes a token by deleting it from _expected_tokens (when REQUIRE_TOKEN=1 and the session-id is not yet in _known_session_ids); POST /api/v1/sessions does the same on the on-demand path. After consumption the session-id is whitelisted in _known_session_ids and subsequent API calls for that session skip the token requirement.

    Brute-force defence (B3.3, 2026-05-06 audit, updated for the nginx-on-443 deployment):

    With these caps in place, the ~30-bit session-id keyspace is not directly brute-forceable from any single source within the pending TTL. See "Consequences of a successful brute force" in known limitation 3 for what an attacker would actually obtain if they beat the math anyway.
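The PAM/cmd_produce hand-off described in item 2 can be sketched as a write-then-read-and-unlink pair. This is a simplified illustration under stated assumptions: the function names are hypothetical, the monitor pid is passed in rather than taken from os.getppid(), group ownership is not set, and a missing file is treated as a rejection, which matches only the session-id path (the token-auth path skips the check).

```python
import os
from pathlib import Path


def write_handoff(auth_dir: Path, monitor_pid: int, session_id: str) -> Path:
    """PAM side: record the validated session ID for the post-auth command."""
    path = auth_dir / f"{monitor_pid}.sid"
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o640)
    with os.fdopen(fd, "w") as fh:
        fh.write(session_id)
    os.chmod(path, 0o640)  # make the mode independent of the process umask
    return path


def check_handoff(auth_dir: Path, monitor_pid: int, claimed_id: str) -> bool:
    """ForceCommand side: read and unlink the hand-off, then compare IDs."""
    path = auth_dir / f"{monitor_pid}.sid"
    try:
        validated = path.read_text().strip()
        path.unlink()  # single use: a second reader finds nothing
    except FileNotFoundError:
        return False  # assumption: no hand-off means reject (session-id path only)
    return validated == claimed_id
```

Keying the file on the common parent pid is what lets two otherwise unrelated processes agree on which authentication the hand-off belongs to.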

Relay host key verification

The relay's host CA public key is embedded in the customer script at build time (RELAY_CA_PUBKEY). All SSH invocations use a temp known_hosts file with a @cert-authority entry pointing at that CA. StrictHostKeyChecking=yes is in effect for all relay connections. A script that was not updated with the real CA key will fail to connect rather than silently trust any host.
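A minimal sketch of the pinning step, assuming a hypothetical relay hostname and a placeholder CA key: it writes a temporary known_hosts file containing a single @cert-authority line and returns the matching ssh options. The function names are illustrative and are not taken from the customer script.

```python
import os
import tempfile


def write_pinned_known_hosts(relay_host: str, ca_pubkey: str) -> str:
    """Write a temp known_hosts file that trusts only hosts certified by the CA."""
    fd, path = tempfile.mkstemp(prefix="tunl-known-hosts-")
    with os.fdopen(fd, "w") as fh:
        # known_hosts marker syntax: "@cert-authority <host-pattern> <public key>"
        fh.write(f"@cert-authority {relay_host} {ca_pubkey.strip()}\n")
    return path


def ssh_host_check_options(known_hosts_path: str) -> list:
    """ssh options that force verification against the pinned file only."""
    return [
        "-o", f"UserKnownHostsFile={known_hosts_path}",
        "-o", "StrictHostKeyChecking=yes",
    ]
```

With StrictHostKeyChecking=yes and no other host entry in the file, ssh refuses any relay whose host certificate was not signed by the pinned CA instead of prompting the user.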

Session isolation

Sessions are isolated by the broker socket path (TUNL_SHARE_DIR/<session-id>.sock, default /run/linbit-tunl/<id>.sock). The socket is created with permissions 0o660 and owned by the user running the broker (typically a dedicated tunl-share service account). Consumers must be in the linbit-tunl group to connect.
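The socket permissions described above could be applied roughly as follows. This is a sketch, not the broker's code: bind_session_socket is a hypothetical name, and assigning the linbit-tunl group (for example with os.chown) is left out because it needs privileges.

```python
import os
import socket


def bind_session_socket(share_dir: str, session_id: str) -> socket.socket:
    """Bind a per-session Unix socket that is accessible to owner and group only."""
    path = os.path.join(share_dir, f"{session_id}.sock")
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    old_umask = os.umask(0o117)  # 0o777 & ~0o117 == 0o660 at creation time
    try:
        sock.bind(path)
    finally:
        os.umask(old_umask)
    os.chmod(path, 0o660)  # explicit, in case the platform ignores the umask
    sock.listen()
    return sock
```

Setting the umask before bind avoids a window in which the socket exists with wider permissions than intended.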

The SQLite database (TUNL_DB) stores session metadata (IP, case number, customer name) and is created with restrictive permissions.

Support-side connections

The support SSH user on the relay has no ForceCommand. Support engineers SSH in and run relay/relay-share.py consume SESSION directly. They can only interact with sessions that exist in the broker socket directory (i.e., sessions the customer's daemon has registered).

Read-only mode: --ro / tunl join --ro sends the connection header CONSUMER_RO\n. The broker does not forward keystrokes from a read-only support-side connection to the customer.

Relay API security

The relay API (relay-api.py) listens on 127.0.0.1:8080 inside the relay pod. Access from the public internet is mediated by nginx, which proxies only a whitelist of customer-bootstrap endpoints to the api:

All other endpoints (session lookups, pending registry, info, expected token validation, support-keys, uploads, etc.) are 404'd at the nginx layer and are reachable only from the pod's loopback. info in particular is loopback-only because it would otherwise expose the relay's require_pending posture to anonymous scanners. The ForceCommand (relay-share.py) calls those endpoints over 127.0.0.1, validate-token.py calls info and expected/{token} over 127.0.0.1, and external support access (e.g., tunl list) is via SSH port-forwarding through the support user.

relay-api.py itself answers identical handlers on every interface it listens on; it does not infer "public" vs. "private" from the request's source. The trust boundary is the nginx allowlist and the loopback bind. Implication: widening the allowlist in nginx is a security decision, not just a config tweak. Any endpoint that would be exposed to anonymous internet traffic must continue to enforce its own auth (cooldown, token consumption, ...), and any error message it emits must assume an attacker as the reader. The expected/{token} GET is explicitly never exposed because it would be a brute-force oracle for the 6-digit token if reachable from the internet. Per-source rate-limit coverage on the public surfaces is provided by the nginx limit_req (/api/v1/sign) and limit_conn (stream block) zones, both keyed on the real source IP -- the SSH-side sshd_config.d/tunl.conf and relay/fail2ban/ caps see only loopback at this time and are not the HTTP path's defence.

Token-based session pre-registration (POST /api/v1/pending) stores tokens in memory with a configurable TTL (TUNL_TOKEN_TTL_MINUTES, default 60 minutes). Tokens are 6-digit random integers (10^6 combinations).

Two-phase use:

A used token cannot be replayed once the consume step has run. Until then (i.e., between PAM accept and the first /sign or /sessions call), the token is multi-use; the brute-force defences described under "Authentication to the relay" above prevent that window from being practically exploitable.
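The validate-then-consume behaviour can be sketched with a small in-memory registry. This is an illustration only: the class and method names are hypothetical (the real store is the _expected_tokens structure in relay-api.py), zero-padding the token to six digits is an assumption, and token collisions are not handled.

```python
import secrets
import time


class PendingTokens:
    """Hypothetical in-memory token registry with a TTL and two-phase use."""

    def __init__(self, ttl_minutes=60, clock=time.monotonic):
        self._ttl = ttl_minutes * 60
        self._clock = clock
        self._expected = {}  # token -> expiry timestamp

    def register(self):
        token = f"{secrets.randbelow(10**6):06d}"  # assumption: zero-padded
        self._expected[token] = self._clock() + self._ttl
        return token

    def validate(self, token):
        """Phase 1 (PAM time): check the token without consuming it."""
        expiry = self._expected.get(token)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expected[token]  # expired entries are purged lazily
            return False
        return True

    def consume(self, token):
        """Phase 2 (first /sign or /sessions call): check and delete."""
        if not self.validate(token):
            return False
        del self._expected[token]
        return True
```

validate can succeed any number of times before consume runs, which is the multi-use window described above. A space of 10^6 tokens corresponds to log2(10^6), roughly 19.9 bits.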

Tunnel upgrade (ControlMaster forward)

When the customer grants a tunnel upgrade, the sequence is:

  1. Support sends UPGRADE <port> to the broker.
  2. Broker forwards UPGRADE <port> to the producer over the command channel.
  3. Producer runs ssh -S ctl.sock -O forward -R <port>:localhost:22 tunl@relay.
  4. On success, producer writes LINBIT_TUNNEL_PORT <port> to the broker.
  5. Broker PATCHes the session record (tunnel_port, mode=tunnel).
  6. Broker sends accumulated support public keys via SUPPORT_KEY commands.
  7. Producer installs them in authorized_keys on TUNNEL_CONFIRMED.
  8. Broker sends TUNNEL_PORT <port> to the support side.

In --restricted mode, the producer drops UPGRADE commands, so step 3 never happens.
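Step 3 and the restricted-mode rule can be sketched as a function that either builds the ssh argv or declines. The function name and defaults are hypothetical. The port range check reuses the 30000-39999 range that this document attributes to the B5.2 cap on LINBIT_TUNNEL_PORT; whether the producer itself applies that range is an assumption.

```python
def build_forward_argv(port_arg, restricted,
                       control_path="ctl.sock", relay="tunl@relay"):
    """Return the ssh argv for the ControlMaster forward, or None to drop."""
    if restricted:
        return None  # --restricted: UPGRADE is dropped, no forward is requested
    if not (port_arg.isascii() and port_arg.isdigit()):
        return None  # only plain decimal digits are accepted
    port = int(port_arg)
    if not 30000 <= port <= 39999:
        return None  # assumption: producer mirrors the relay-side port range
    # Passing a list (no shell) keeps relay-supplied text out of shell parsing.
    return ["ssh", "-S", control_path, "-O", "forward",
            "-R", f"{port}:localhost:22", relay]
```

A caller would hand the resulting list to subprocess.run without shell=True, so the relay-supplied port never reaches a shell.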

Asciinema cast file

The broker writes a v2 .cast file to /var/log/linbit-tunl/<session>.cast. This captures all decoded VT bytes fanned out to the support side. The file is owned by the broker process user and is not world-readable. It does not contain keystroke content (logged in the customer-side audit log only).
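For readers unfamiliar with the format, an asciicast v2 file is a JSON header line followed by one JSON array per event. The sketch below is a minimal writer under that format; CastWriter is a hypothetical name, the 0o640 mode is an assumption chosen to satisfy "not world-readable", and only output ("o") events are written.

```python
import json
import os
import time


class CastWriter:
    """Minimal asciicast v2 writer: a JSON header line, then one event per line."""

    def __init__(self, path, width, height, clock=time.monotonic):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o640)
        os.fchmod(fd, 0o640)  # not world-readable, regardless of umask
        self._fh = os.fdopen(fd, "w", encoding="utf-8")
        self._clock = clock
        self._start = clock()
        header = {"version": 2, "width": width, "height": height}
        self._fh.write(json.dumps(header) + "\n")

    def output(self, data):
        """Record terminal output as an 'o' event: [elapsed_seconds, "o", data]."""
        elapsed = self._clock() - self._start
        self._fh.write(json.dumps([elapsed, "o", data]) + "\n")
        self._fh.flush()

    def close(self):
        self._fh.close()
```

Since only output events are recorded, input typed while ECHO is off never reaches the file.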


Known limitations

  1. No single-writer enforcement: multiple read-write support clients can inject keystrokes simultaneously. Coordination is social.

  2. Broker socket on relay: The Unix socket is group-accessible; any process in the linbit-tunl group on the relay can connect to the broker as a support-side client. This is intentional for the multi-support-engineer use case but requires OS-level group membership control.

  3. Credential brute-force surface: the four-word session-id has ~30 bits of entropy and the 6-digit token has ~20 bits. Neither is consumed at PAM time (consumption is deferred to the first /sign or /sessions call), so each is multi-use within the pending TTL. Caps from front to back:

    With these caps active, the credential search space is not reachable from any single source within the pending TTL.

    Consequences of a successful brute force (the cap on what an attacker gains if they beat the math anyway). The attacker becomes the producer for the one session-id they guessed. They do not gain a channel into the legitimate customer's machine: the relay only fans tmux output out to the support side; the producer→broker direction does not contain customer-machine access of any kind. What they do gain is what the impersonation case under "Session-ID-only auth mode and the impersonation case" describes -- keystroke / chat capture, the right to render arbitrary content on the engineer's terminal, and the range-limited tunnel redirect (B5.2 caps LINBIT_TUNNEL_PORT to 30000-39999, so an engineer's tunl ssh lands on an attacker-controlled relay-side -R forward rather than on the customer's box). Per-session-principal cert binding (B5.1) confines this to the one guessed sid; the attacker cannot pivot to a different session within the same cert.

    In short: the credential math is one input; the consequence math is the other. The consequence ceiling is "social-engineering pivot against the support engineer for one bounded session," not "compromise of the customer's host." Mitigations stay operational: prefer TUNL_REQUIRE_TOKEN=1, expect support to confirm out-of-band before typing anything sensitive into a freshly-attached session, and keep the rate-limit caps tight.

  4. StrictHostKeyChecking=no for customer sshd: When support SSHes to the customer machine via the reverse tunnel, host key verification is disabled. Security is provided by the relay authentication layer; the customer sshd is reached only via an authenticated tunnel.

  5. Public exposure of relay-api via nginx: the production relay reverse-proxies a small whitelist of relay-api endpoints to the internet on port 443 (see "Relay API security" above and docs/deployment.html). This was a deployment shift after the 2026-05-06 audit was scoped, so the audit did not consider the implications. Status of each concern:

  6. nginx mux passes traffic to sshd without PROXY protocol: the stream block in relay/nginx/nginx.conf (lines 50-58, comment) intentionally does not enable PROXY protocol toward sshd, because upstream OpenSSH does not consume it. Consequence: sshd sees every SSH-leg connection as coming from 127.0.0.1, the existing fail2ban PAM filter (relay/fail2ban/filter.d/linbit-tunl-pam.conf) can only ever "ban" the relay pod itself, and any per-IP sshd-level cap is evaluated against a single shared loopback source. The sshd-level defences on the SSH leg therefore protect against connection-count exhaustion (MaxStartups, MaxAuthTries) but not against a sufficiently slow attacker, because sshd's own logs cannot attribute attempts to a source IP.

    This is independent of the stream-block limit_conn (item 5): nginx's stream block sees the real source IP and rate-limits on it before handing the connection to sshd, even without PROXY protocol downstream. PROXY-to-sshd (or TPROXY) is still useful for sshd's own audit logs and for the PAM fail2ban filter to do something meaningful again, but it is not on the critical path for front-door rate limiting. It falls in the same backlog tier as re-adding a fail2ban filter on the nginx access log: do it when there is a concrete audit or forensics reason, not before.