2026-04-23
How we run iSCSI over the internet
iSCSI was designed for the SAN inside a rack, not the open internet. Here's the pile of small decisions that makes scsipub work anyway — Ranch 2.x listeners, a BEAM process per session, COW overlays, Caddy-terminated TLS, and a handful of open-iscsi quirks that cost us a day each.
by Tom · #architecture #iscsi #elixir
iSCSI is a protocol from the era when “the network” meant a rack-scale fibre channel replacement. Initiators and targets trusted each other, CHAP was optional theatre, and a packet from an initiator carried the implicit assumption “we’re on the same L2 segment.”
scsipub serves iSCSI targets to arbitrary clients on the public internet. That’s a different set of assumptions. This post is the decision log — the small choices that add up to “this works and doesn’t break from day one.”
The listener
Both ports are Ranch 2.x listeners — plain TCP on 3260, TLS on 3261.
Scsipub.Target.Listener returns a pair of child specs that the
application supervisor adds at boot:
def child_specs(opts) do
  certfile = opts[:tls_certfile]
  keyfile = opts[:tls_keyfile]
  tcp_spec = tcp_child_spec(opts[:port] || 3260, protocol_opts)

  if certfile && keyfile && File.exists?(certfile) && File.exists?(keyfile) do
    tls_spec = tls_child_spec(opts[:tls_port] || 3261, certfile, keyfile, protocol_opts)
    [tcp_spec, tls_spec]
  else
    [tcp_spec]
  end
end
Ranch runs a small acceptor pool in front of a :ranch_protocol
callback. When a connection arrives, Ranch spawns a fresh BEAM
process and hands it the socket. For iSCSI that’s the unit we
want: one process per TCP connection, one TCP connection per
initiator session, one initiator session per user-visible
mountable disk.
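That per-connection process is what Ranch's protocol behaviour gives you. A minimal sketch of the shape (module name and message handling are illustrative, not scsipub's actual handler):

```elixir
defmodule Scsipub.Target.Handler do
  # Illustrative handler, not the real one. Ranch spawns one of these per
  # accepted connection; :ranch.handshake/1 finishes the TCP (or TLS)
  # accept and hands us a socket this process now owns.
  # Implements the :ranch_protocol behaviour (start_link/3).

  def start_link(ref, transport, opts) do
    pid = :proc_lib.spawn_link(__MODULE__, :init, [ref, transport, opts])
    {:ok, pid}
  end

  def init(ref, transport, _opts) do
    {:ok, socket} = :ranch.handshake(ref)
    transport.setopts(socket, active: :once)
    loop(socket, transport)
  end

  defp loop(socket, transport) do
    receive do
      {:tcp, ^socket, _pdu_bytes} ->
        # parsing + dispatch would happen here; bad input crashes us
        transport.setopts(socket, active: :once)
        loop(socket, transport)

      {:tcp_closed, ^socket} ->
        :ok
    end
  end
end
```

The process exits when the socket closes, and Ranch's supervisor never restarts it; the next connection gets a fresh one.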
“One BEAM process per connection” only works because processes here aren’t OS threads. A BEAM process is ~2.5 KB of initial heap and some bookkeeping — the scheduler happily runs tens of thousands of them on a single core. iSCSI sessions sit idle waiting for SCSI PDUs most of the time, which is the ideal shape for green threads: cheap to park, cheap to wake.
Contrast with the C implementations: target_core_iblock and friends carry a thread pool and a queue, and tuning the pool size is an ongoing concern. We don’t tune anything and the BEAM happily handled 446 req/s in our web-side load test before latency started climbing — and that’s the Phoenix surface with its DB hops, not the iSCSI listener, which has smaller payloads and no SQL in the hot path at all.
One process per session
The protocol module is Scsipub.Target.Session, a plain
GenServer. Its state machine walks through three phases:
phase: :security_negotiation # csg=0, CHAP challenge/response
phase: :operational # csg=1, negotiate parameters
phase: :full_feature # csg=1 transit done, handling SCSI PDUs
Each PDU comes in on the socket, gets parsed into a struct, and routed to a handler. If a handler raises — malformed PDU, unexpected state transition, disk error — the process dies. That’s on purpose. The supervisor doesn’t restart it, because there’s no meaningful recovery; the initiator will notice the TCP close and try to log in again. State doesn’t leak between sessions because state doesn’t leave the process.
This is the standard Erlang story (“let it crash”), but it’s more than a platitude for iSCSI. The real-world alternative — carefully defending every parser branch against every attacker-shaped PDU — is how RFC 7143’s more colourful edge cases turn into CVEs in other implementations. We don’t defend; we fence. One bad PDU kills one session.
The Registry (Scsipub.Sessions.Registry, ETS-backed) is how
a session announces itself once it reaches Full Feature Phase:
Registry.set_pid(iqn, self())
The Registry monitors the pid and auto-cleans the entry on
:DOWN. The admin dashboard reads from the same ETS table to
show live connections.
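The monitor-plus-ETS pattern is small enough to sketch in full (an illustrative stand-in, not the real Scsipub.Sessions.Registry):

```elixir
defmodule Scsipub.Sessions.RegistryDemo do
  # Illustrative: set_pid/2 stores {iqn, pid} in a public ETS table and
  # monitors the pid; on :DOWN the GenServer deletes the stale entry, so
  # readers (e.g. the dashboard) never see dead sessions.
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  def set_pid(iqn, pid), do: GenServer.call(__MODULE__, {:set, iqn, pid})

  def lookup(iqn) do
    case :ets.lookup(__MODULE__, iqn) do
      [{^iqn, pid}] -> {:ok, pid}
      [] -> :error
    end
  end

  @impl true
  def init(:ok) do
    :ets.new(__MODULE__, [:named_table, :public, read_concurrency: true])
    {:ok, %{}}
  end

  @impl true
  def handle_call({:set, iqn, pid}, _from, refs) do
    ref = Process.monitor(pid)
    :ets.insert(__MODULE__, {iqn, pid})
    {:reply, :ok, Map.put(refs, ref, iqn)}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, refs) do
    {iqn, refs} = Map.pop(refs, ref)
    if iqn, do: :ets.delete(__MODULE__, iqn)
    {:noreply, refs}
  end
end
```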
COW overlays
The base image is a regular file — .img, .iso, or .qcow2
decompressed to raw on fetch. It’s read-only. Every concurrent
session gets its own overlay file, sparse-allocated to the
same size as the base:
/var/lib/scsipub/overlays/
71a61232479cc467.img ← overlay, sparse
71a61232479cc467.img.bitmap ← 1 bit per sector
The bitmap tracks which 512-byte sectors have been written. Reads check the bit: if set, the overlay has the sector; if clear, fall through to the base image. Writes set the bit and write to the overlay.
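The sector-to-bit arithmetic is the whole trick. A self-contained sketch of it, assuming an in-memory bitmap binary (the real bookkeeping lives in the .bitmap file; module name is illustrative):

```elixir
defmodule Scsipub.Overlay.Bitmap do
  # Illustrative: sector n lives at bit rem(n, 8) of byte div(n, 8).
  import Bitwise

  # Has this sector been written to the overlay?
  def set?(bitmap, sector) do
    i = div(sector, 8)
    <<_::binary-size(i), byte, _::binary>> = bitmap
    (byte &&& (1 <<< rem(sector, 8))) != 0
  end

  # Mark a sector as present in the overlay.
  def set(bitmap, sector) do
    i = div(sector, 8)
    <<head::binary-size(i), byte, tail::binary>> = bitmap
    <<head::binary, byte ||| (1 <<< rem(sector, 8)), tail::binary>>
  end
end
```

Reads call set?/2 and fall through to the base image on false; writes call set/2 and then write the overlay.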
The layout means:
- The base image is never touched. CI verifies this — we SHA-256 the base before and after an integration run.
- The overlay file is sparse. A session that only writes the MBR costs ~512 bytes on disk, not “the full virtual size of the disk.” Filesystem holes do the work.
- Disconnecting is cheap. Non-persistent tiers delete the overlay on the TCP close; persistent tiers keep it until the session’s TTL elapses or the user destroys it explicitly.
- Writes are counted. Each overlay write bumps a counter against write_limit from the user's tier config. Hit the limit and the target responds WRITE_PROTECT until the session ends.
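The accounting itself can be a pure function. A sketch under assumed field names (:writes and :write_limit are illustrative, not the real tier schema):

```elixir
defmodule Scsipub.Overlay.WriteLimit do
  # Illustrative: once the counter reaches the tier limit, further writes
  # are refused and the caller answers the initiator with WRITE_PROTECT.
  def record_write(%{writes: w, write_limit: limit} = acct) when w >= limit,
    do: {:write_protect, acct}

  def record_write(%{writes: w} = acct),
    do: {:ok, %{acct | writes: w + 1}}
end
```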
The Janitor, a GenServer on a 10-minute tick, sweeps the overlay directory and deletes files that don’t match any live session in the database. That’s how we clean up from the rare case where a process dies before its terminate callback runs.
Caddy in front, TLS everywhere
Caddy terminates HTTPS on port 443 and reverse-proxies to the
Phoenix app on port 4000. The same Let’s Encrypt certificate
also protects the iSCSI-TLS listener on port 3261 — which is
the interesting part, because the iSCSI listener isn’t behind
Caddy. It binds :ranch_ssl directly.
Caddy writes the ACME-obtained cert to its internal storage
(/var/lib/caddy/.local/share/caddy/...), which the app user
can’t read. The bridge is a tiny systemd service running
inotifywait against that directory and copying the cert into
/var/lib/scsipub/tls/ — owned by a shared group both users
can read — whenever the bytes change.
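For flavour, the bridge unit is roughly this shape (unit name and script path are hypothetical; the watched directory is Caddy's storage path from above):

```ini
# /etc/systemd/system/scsipub-certsync.service - illustrative sketch
[Unit]
Description=Mirror Caddy's ACME cert into /var/lib/scsipub/tls
After=caddy.service

[Service]
# The (hypothetical) script loops on inotifywait over Caddy's storage
# directory and copies the cert/key pair into the shared-group path
# whenever the bytes actually change.
ExecStart=/usr/local/bin/scsipub-certsync
Restart=always

[Install]
WantedBy=multi-user.target
```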
The iSCSI listener picks up rotations without a restart because
its sni_fun re-reads the PEM on every TLS handshake, with
guardrails:
# lib/scsipub/target/tls_certs.ex
def sni_opts(certfile, keyfile) do
now = System.monotonic_time(:second)
case :persistent_term.get(cache_key, nil) do
{_cert_mtime, _key_mtime, loaded_at, opts}
when now - loaded_at < @min_reload_interval ->
opts # 60s cooldown — serve cache unconditionally
{cert_mtime, key_mtime, _loaded_at, opts} ->
if stat_unchanged?(certfile, keyfile, cert_mtime, key_mtime) do
opts # mtime unchanged — still fresh
else
reload_and_cache(...) # rotation happened — re-read PEM
end
nil ->
reload_and_cache(...) # cold cache — first load
end
end
Two guards, in order: a 60-second cooldown that serves the
cached opts without any syscall (absorbs a thundering-herd
handshake burst), and an mtime check after the cooldown that
only pays for a fresh PEM read when the files have actually
changed. Both matter — sni_fun is on the hot path for every
TLS handshake, and without them a rotation every few months
would still cost two stat syscalls per mount.
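For context, sni_fun is a stock OTP :ssl option. The wiring into the Ranch listener looks roughly like this (child-spec details and names are illustrative):

```elixir
defmodule Scsipub.Target.TlsListener do
  # Illustrative wiring: passing sni_fun in the :ranch_ssl socket options
  # means every handshake consults TlsCerts.sni_opts/2 instead of a
  # certificate loaded once at boot, which is what makes rotation
  # restart-free.
  def tls_child_spec(port, certfile, keyfile, protocol_opts) do
    :ranch.child_spec(
      {:iscsi_tls, port},
      :ranch_ssl,
      %{
        socket_opts: [
          port: port,
          # OTP calls this fun with the SNI hostname on every handshake;
          # the opts it returns are merged in for that connection only.
          sni_fun: fn _hostname -> Scsipub.Target.TlsCerts.sni_opts(certfile, keyfile) end
        ]
      },
      Scsipub.Target.Session,
      protocol_opts
    )
  end
end
```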
Things open-iscsi cares about
If you’re building against the open-iscsi initiator that ships in every
Linux distro, the protocol is less “what’s on the wire” and more
“what iscsiadm does with what’s on the wire.” Three concrete
examples that each cost us a day.
/ in the IQN type-name separator
Our first cut of anonymous target names was iqn.2025-01.pub.scsipub:image/ubuntu.
That parses fine as an IQN. iscsiadm even does discovery against it
happily. What it can’t do is log in:
iscsiadm: Could not make /etc/iscsi/nodes/iqn.2025-01.pub.scsipub:image/ubuntu
open-iscsi stores its persistent state in /etc/iscsi/nodes/<iqn>/...
— it uses the IQN verbatim as a filesystem path. Any / in the name
becomes a subdirectory boundary, and the create-if-missing path walk
fails. We switched to . as the type/name separator
(iqn.2025-01.pub.scsipub:image.ubuntu), which parses the same way and
sidesteps the whole problem.
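We now reject the character outright when minting names; the check is trivial (module name is illustrative):

```elixir
defmodule Scsipub.IQN do
  # Illustrative guard: a / anywhere in the IQN becomes a directory
  # boundary under /etc/iscsi/nodes/ on the initiator, so refuse it.
  def safe_for_open_iscsi?(iqn), do: not String.contains?(iqn, "/")
end
```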
SendTargets has to advertise an address the client can reach
When an initiator does discovery, the target replies with a list of
TargetName + TargetAddress records. The initiator saves that
address as the portal for future logins — even if the discovery
request itself went through a different IP.
In our CI, the target runs inside a CI container and the initiator
inside a QEMU VM. QEMU’s user-mode networking NATs to 10.0.2.2 from
the VM’s perspective. If we let the server advertise whatever
sockname() returns — 127.0.0.1:3260 — iscsiadm dutifully saves
that as the portal, and every subsequent login attempt tries to
reach the runner’s loopback from inside the VM and fails forever.
# lib/scsipub/target/session.ex
defp advertise_address(socket, transport) do
case Application.get_env(:scsipub, :public_host) do
host when is_binary(host) -> "#{host}:#{port(socket, transport)}"
_ -> sockname_string(socket, transport)
end
end
Pin :public_host (we ship this as PHX_HOST in deploy env) and
SendTargets returns something the client can actually get back to.
The -o new dance for static logins
Once you’ve been bitten by the SendTargets-saves-the-portal behaviour
enough times, you learn to skip discovery for anything that needs a
non-default portal. For example: iSCSI-over-TLS via stunnel. The
natural flow would be “discover via the tunnel, then log in.” But the
discovery response names the server’s public portal, not
127.0.0.1:3260 where stunnel is terminating, so iscsiadm saves
the wrong portal and logs in plain instead of through the tunnel.
The fix is static login:
IQN=iqn.2025-01.pub.scsipub:blank
iscsiadm -m node -T $IQN -p 127.0.0.1:3260 -o new
iscsiadm -m node -T $IQN -p 127.0.0.1:3260 \
-o update -n node.session.auth.authmethod -v None
iscsiadm -m node -T $IQN -p 127.0.0.1:3260 --login
-o new creates a fresh node record at the portal you specify instead
of using whatever the discovery step saved. Our landing page renders
exactly that command sequence for the TLS path, because the alternative
is an infuriating 30 minutes with iscsiadm --debug=6.
Bonus: stale records retry forever
Once a node record exists under /etc/iscsi/nodes/, iscsid retries
the login indefinitely if the session drops. If the target has been
destroyed server-side, that manifests as a steady 1-every-3-second
stream of “unknown target” login attempts in our server logs. The
cure is on the client:
iscsiadm -m node -T <iqn> -o delete
On the server we throttle the log line (once per (ip, target) per 5
minutes at warning level, debug after that) so a stale initiator
doesn't bury real warnings under 28,800 lines of the same complaint
per day. See Scsipub.Target.Session.log_unknown_target/2.
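The throttle itself is a few lines of ETS. A sketch of the shape (illustrative module; the real code lives in Scsipub.Target.Session):

```elixir
defmodule Scsipub.Target.LogThrottle do
  # Illustrative: an ETS table keyed by {ip, iqn} stores the last time we
  # logged at :warning; inside the window the caller logs at :debug.
  @window_ms 5 * 60 * 1000

  def init, do: :ets.new(__MODULE__, [:named_table, :public])

  def level(ip, iqn, now_ms \\ System.monotonic_time(:millisecond)) do
    key = {ip, iqn}

    case :ets.lookup(__MODULE__, key) do
      [{^key, last}] when now_ms - last < @window_ms ->
        :debug

      _ ->
        :ets.insert(__MODULE__, {key, now_ms})
        :warning
    end
  end
end
```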
What we’re not solving
Deliberate omissions, for the record:
- Multi-region. Everything runs in a single datacenter. A multi-region story would turn per-session persistence into a distributed-systems problem; it currently isn't one, and we like that.
- S3- or NBD-backed base images. Images are local sparse files. Upload via the admin UI or an ecto run script; that's the whole ingestion story. Cloud-backed storage changes the read-path latency distribution meaningfully enough that we'd want to think about it rather than bolt it on.
- iSER / RDMA. No. scsipub is a public-internet service; RDMA is a rack-scale protocol. If you need 40 Gbit/s into a block device, the physics say you aren't on the public internet anyway.
- MPIO. Not yet. The initiator side of multipath works fine, but until we have multi-region there’s nowhere to failover to.
- Per-session encryption above TLS. The iSCSI protocol has IPsec and a few other approaches for payload secrecy; none are widely deployed, and adding our own on top of TLS would just be framing for framing’s sake.
What comes next
The immediate line of work is the hardware bridges — the ESP32 firmware that turns an ESP32-S3 into a wireless iSCSI-to-USB dongle, and the Pi4 netboot shim that lets a Raspberry Pi boot directly into an iSCSI target over TFTP. Both are separate projects, both are linked from the front page, and both will get their own write-ups shortly.
Past that, the interesting question is what happens when a Phoenix app serving iSCSI meets someone who really wants to use it — tens of thousands of sessions, sustained writes, a pathological initiator. We’ve done a load test up to a few hundred concurrent web requests; we haven’t yet found the shape of the BEAM’s failure mode under actual iSCSI load. That’s the next thing to measure.