2026-04-23
How we run iSCSI over the internet
iSCSI was designed for the SAN inside a rack, not the open internet. Here's the pile of small decisions that makes scsipub work anyway — Ranch 2.x listeners, a BEAM process per session, COW overlays, Caddy-terminated TLS, and a handful of open-iscsi quirks that cost us a day each.
by Tom · #architecture #iscsi #elixir
iSCSI is a protocol from the era when “the network” meant a rack-scale fibre channel replacement. Initiators and targets trusted each other, CHAP was optional theatre, and a packet from an initiator carried the implicit assumption “we’re on the same L2 segment.”
scsipub serves iSCSI targets to arbitrary clients on the public internet. That’s a different set of assumptions. This post is the decision log — the small choices that add up to “this works and doesn’t break from day one.”
The listener
Both ports are Ranch 2.x listeners — plain TCP on 3260, TLS on 3261.
Scsipub.Target.Listener returns a pair of child specs that the
application supervisor adds at boot:
def child_specs(opts) do
  certfile = opts[:tls_certfile]
  keyfile = opts[:tls_keyfile]
  tcp_spec = tcp_child_spec(opts[:port] || 3260, protocol_opts)

  if certfile && keyfile && File.exists?(certfile) && File.exists?(keyfile) do
    tls_spec = tls_child_spec(opts[:tls_port] || 3261, certfile, keyfile, protocol_opts)
    [tcp_spec, tls_spec]
  else
    [tcp_spec]
  end
end
Ranch runs a small acceptor pool in front of a :ranch_protocol
callback. When a connection arrives, Ranch spawns a fresh BEAM
process and hands it the socket. For iSCSI that’s the unit we
want: one process per TCP connection, one TCP connection per
initiator session, one initiator session per user-visible
mountable disk.
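That per-connection process is what Ranch's protocol behaviour gives you. A minimal sketch of the shape (module name and message handling are illustrative, not scsipub's actual handler):

```elixir
defmodule Scsipub.Target.Handler do
  # Illustrative handler, not the real one. Ranch spawns one of these per
  # accepted connection; :ranch.handshake/1 finishes the TCP (or TLS)
  # accept and hands us a socket this process now owns.
  # Implements the :ranch_protocol behaviour (start_link/3).

  def start_link(ref, transport, opts) do
    pid = :proc_lib.spawn_link(__MODULE__, :init, [ref, transport, opts])
    {:ok, pid}
  end

  def init(ref, transport, _opts) do
    {:ok, socket} = :ranch.handshake(ref)
    transport.setopts(socket, active: :once)
    loop(socket, transport)
  end

  defp loop(socket, transport) do
    receive do
      {:tcp, ^socket, _pdu_bytes} ->
        # parsing + dispatch would happen here; bad input crashes us
        transport.setopts(socket, active: :once)
        loop(socket, transport)

      {:tcp_closed, ^socket} ->
        :ok
    end
  end
end
```

The process exits when the socket closes, and Ranch's supervisor never restarts it; the next connection gets a fresh one.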
“One BEAM process per connection” only works because processes here aren’t OS threads. A BEAM process is ~2.5 KB of initial heap and some bookkeeping — the scheduler happily runs tens of thousands of them on a single core. iSCSI sessions sit idle waiting for SCSI PDUs most of the time, which is the ideal shape for green threads: cheap to park, cheap to wake.
Contrast with the C implementations: target_core_iblock and friends carry a thread pool and a queue, and tuning the pool size is an ongoing concern. We don’t tune anything and the BEAM happily handled 446 req/s in our web-side load test before latency started climbing — and that’s the Phoenix surface with its DB hops, not the iSCSI listener, which has smaller payloads and no SQL in the hot path at all.
One process per session
The protocol module is Scsipub.Target.Session, a plain
GenServer. Its state machine walks through three phases:
phase: :security_negotiation # csg=0, CHAP challenge/response
phase: :operational # csg=1, negotiate parameters
phase: :full_feature # csg=1 transit done, handling SCSI PDUs
Each PDU comes in on the socket, gets parsed into a struct, and routed to a handler. If a handler raises — malformed PDU, unexpected state transition, disk error — the process dies. That’s on purpose. The supervisor doesn’t restart it, because there’s no meaningful recovery; the initiator will notice the TCP close and try to log in again. State doesn’t leak between sessions because state doesn’t leave the process.
This is the standard Erlang story (“let it crash”), but it’s more than a platitude for iSCSI. The real-world alternative — carefully defending every parser branch against every attacker-shaped PDU — is how RFC 7143’s more colourful edge cases turn into CVEs in other implementations. We don’t defend; we fence. One bad PDU kills one session.
The Registry (Scsipub.Sessions.Registry, ETS-backed) is how
a session announces itself once it reaches Full Feature Phase:
Registry.set_pid(iqn, self())
The Registry monitors the pid and auto-cleans the entry on
:DOWN. The admin dashboard reads from the same ETS table to
show live connections.
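The monitor-plus-ETS pattern is small enough to sketch in full (an illustrative stand-in, not the real Scsipub.Sessions.Registry):

```elixir
defmodule Scsipub.Sessions.RegistryDemo do
  # Illustrative: set_pid/2 stores {iqn, pid} in a public ETS table and
  # monitors the pid; on :DOWN the GenServer deletes the stale entry, so
  # readers (e.g. the dashboard) never see dead sessions.
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  def set_pid(iqn, pid), do: GenServer.call(__MODULE__, {:set, iqn, pid})

  def lookup(iqn) do
    case :ets.lookup(__MODULE__, iqn) do
      [{^iqn, pid}] -> {:ok, pid}
      [] -> :error
    end
  end

  @impl true
  def init(:ok) do
    :ets.new(__MODULE__, [:named_table, :public, read_concurrency: true])
    {:ok, %{}}
  end

  @impl true
  def handle_call({:set, iqn, pid}, _from, refs) do
    ref = Process.monitor(pid)
    :ets.insert(__MODULE__, {iqn, pid})
    {:reply, :ok, Map.put(refs, ref, iqn)}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, refs) do
    {iqn, refs} = Map.pop(refs, ref)
    if iqn, do: :ets.delete(__MODULE__, iqn)
    {:noreply, refs}
  end
end
```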
COW overlays
The base image is a regular file — .img, .iso, or .qcow2
decompressed to raw on fetch. It’s read-only. Every concurrent
session gets its own overlay file, sparse-allocated to the
same size as the base:
/var/lib/scsipub/overlays/
71a61232479cc467.img ← overlay, sparse
71a61232479cc467.img.bitmap ← 1 bit per sector
The bitmap tracks which 512-byte sectors have been written. Reads check the bit: if set, the overlay has the sector; if clear, fall through to the base image. Writes set the bit and write to the overlay.
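The sector-to-bit arithmetic is the whole trick. A self-contained sketch of it, assuming an in-memory bitmap binary (the real bookkeeping lives in the .bitmap file; module name is illustrative):

```elixir
defmodule Scsipub.Overlay.Bitmap do
  # Illustrative: sector n lives at bit rem(n, 8) of byte div(n, 8).
  import Bitwise

  # Has this sector been written to the overlay?
  def set?(bitmap, sector) do
    i = div(sector, 8)
    <<_::binary-size(i), byte, _::binary>> = bitmap
    (byte &&& (1 <<< rem(sector, 8))) != 0
  end

  # Mark a sector as present in the overlay.
  def set(bitmap, sector) do
    i = div(sector, 8)
    <<head::binary-size(i), byte, tail::binary>> = bitmap
    <<head::binary, byte ||| (1 <<< rem(sector, 8)), tail::binary>>
  end
end
```

Reads call set?/2 and fall through to the base image on false; writes call set/2 and then write the overlay.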
The layout means:
- The base image is never touched. CI verifies this — we SHA-256 the base before and after an integration run.
- The overlay file is sparse. A session that only writes the MBR costs ~512 bytes on disk, not “the full virtual size of the disk.” Filesystem holes do the work.
- Disconnecting is cheap. Non-persistent tiers delete the overlay on the TCP close; persistent tiers keep it until the session’s TTL elapses or the user destroys it explicitly.
- Writes are counted. Each overlay write bumps a counter against write_limit from the user's tier config. Hit the limit and the target responds WRITE_PROTECT until the session ends.
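The accounting itself can be a pure function. A sketch under assumed field names (:writes and :write_limit are illustrative, not the real tier schema):

```elixir
defmodule Scsipub.Overlay.WriteLimit do
  # Illustrative: once the counter reaches the tier limit, further writes
  # are refused and the caller answers the initiator with WRITE_PROTECT.
  def record_write(%{writes: w, write_limit: limit} = acct) when w >= limit,
    do: {:write_protect, acct}

  def record_write(%{writes: w} = acct),
    do: {:ok, %{acct | writes: w + 1}}
end
```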
The Janitor, a GenServer on a 10-minute tick, sweeps the overlay directory and deletes files that don’t match any live session in the database. That’s how we clean up from the rare case where a process dies before its terminate callback runs.
Caddy in front, TLS everywhere
Caddy terminates HTTPS on port 443 and reverse-proxies to the
Phoenix app on port 4000. The same Let’s Encrypt certificate
also protects the iSCSI-TLS listener on port 3261 — which is
the interesting part, because the iSCSI listener isn’t behind
Caddy. It binds :ranch_ssl directly.
Caddy writes the ACME-obtained cert to its internal storage
(/var/lib/caddy/.local/share/caddy/...), which the app user
can’t read. The bridge is a tiny systemd service running
inotifywait against that directory and copying the cert into
/var/lib/scsipub/tls/ — owned by a shared group both users
can read — whenever the bytes change.
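For flavour, the bridge unit is roughly this shape (unit name and script path are hypothetical; the watched directory is Caddy's storage path from above):

```ini
# /etc/systemd/system/scsipub-certsync.service - illustrative sketch
[Unit]
Description=Mirror Caddy's ACME cert into /var/lib/scsipub/tls
After=caddy.service

[Service]
# The (hypothetical) script loops on inotifywait over Caddy's storage
# directory and copies the cert/key pair into the shared-group path
# whenever the bytes actually change.
ExecStart=/usr/local/bin/scsipub-certsync
Restart=always

[Install]
WantedBy=multi-user.target
```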
The iSCSI listener picks up rotations without a restart because
its sni_fun re-reads the PEM on every TLS handshake, with
guardrails:
# lib/scsipub/target/tls_certs.ex
def sni_opts(certfile, keyfile) do
now = System.monotonic_time(:second)
case :persistent_term.get(cache_key, nil) do
{_cert_mtime, _key_mtime, loaded_at, opts}
when now - loaded_at < @min_reload_interval ->
opts # 60s cooldown — serve cache unconditionally
{cert_mtime, key_mtime, _loaded_at, opts} ->
if stat_unchanged?(certfile, keyfile, cert_mtime, key_mtime) do
opts # mtime unchanged — still fresh
else
reload_and_cache(...) # rotation happened — re-read PEM
end
nil ->
reload_and_cache(...) # cold cache — first load
end
end
Two guards, in order: a 60-second cooldown that serves the
cached opts without any syscall (absorbs a thundering-herd
handshake burst), and an mtime check after the cooldown that
only pays for a fresh PEM read when the files have actually
changed. Both matter — sni_fun is on the hot path for every
TLS handshake, and without them a rotation every few months
would still cost two stat syscalls per mount.
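For context, sni_fun is a stock OTP :ssl option. The wiring into the Ranch listener looks roughly like this (child-spec details and names are illustrative):

```elixir
defmodule Scsipub.Target.TlsListener do
  # Illustrative wiring: passing sni_fun in the :ranch_ssl socket options
  # means every handshake consults TlsCerts.sni_opts/2 instead of a
  # certificate loaded once at boot, which is what makes rotation
  # restart-free.
  def tls_child_spec(port, certfile, keyfile, protocol_opts) do
    :ranch.child_spec(
      {:iscsi_tls, port},
      :ranch_ssl,
      %{
        socket_opts: [
          port: port,
          # OTP calls this fun with the SNI hostname on every handshake;
          # the opts it returns are merged in for that connection only.
          sni_fun: fn _hostname -> Scsipub.Target.TlsCerts.sni_opts(certfile, keyfile) end
        ]
      },
      Scsipub.Target.Session,
      protocol_opts
    )
  end
end
```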
Things open-iscsi cares about
If you’re building against the open-iscsi initiator that ships in every
Linux distro, the protocol is less “what’s on the wire” and more
“what iscsiadm does with what’s on the wire.” Three concrete
examples that each cost us a day.
/ in the IQN type-name separator
Our first cut of anonymous target names was iqn.2025-01.pub.scsipub:image/ubuntu.
That parses fine as an IQN. iscsiadm even does discovery against it
happily. What it can’t do is log in:
iscsiadm: Could not make /etc/iscsi/nodes/iqn.2025-01.pub.scsipub:image/ubuntu
open-iscsi stores its persistent state in /etc/iscsi/nodes/<iqn>/...
— it uses the IQN verbatim as a filesystem path. Any / in the name
becomes a subdirectory boundary, and the create-if-missing path walk
fails. We switched to . as the type/name separator
(iqn.2025-01.pub.scsipub:image.ubuntu), which parses the same way and
sidesteps the whole problem.
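We now reject the character outright when minting names; the check is trivial (module name is illustrative):

```elixir
defmodule Scsipub.IQN do
  # Illustrative guard: a / anywhere in the IQN becomes a directory
  # boundary under /etc/iscsi/nodes/ on the initiator, so refuse it.
  def safe_for_open_iscsi?(iqn), do: not String.contains?(iqn, "/")
end
```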
SendTargets has to advertise an address the client can reach
When an initiator does discovery, the target replies with a list of
TargetName + TargetAddress records. The initiator saves that
address as the portal for future logins — even if the discovery
request itself went through a different IP.
In our CI, the target runs inside a CI container and the initiator
inside a QEMU VM. QEMU’s user-mode networking NATs to 10.0.2.2 from
the VM’s perspective. If we let the server advertise whatever
sockname() returns — 127.0.0.1:3260 — iscsiadm dutifully saves
that as the portal, and every subsequent login attempt tries to
reach the runner’s loopback from inside the VM and fails forever.
# lib/scsipub/target/session.ex
defp advertise_address(socket, transport) do
case Application.get_env(:scsipub, :public_host) do
host when is_binary(host) -> "#{host}:#{port(socket, transport)}"
_ -> sockname_string(socket, transport)
end
end
Pin :public_host (we ship this as PHX_HOST in deploy env) and
SendTargets returns something the client can actually get back to.
The -o new dance for static logins
Once you’ve been bitten by the SendTargets-saves-the-portal behaviour
enough times, you learn to skip discovery for anything that needs a
non-default portal. For example: iSCSI-over-TLS via stunnel. The
natural flow would be “discover via the tunnel, then log in.” But the
discovery response names the server’s public portal, not
127.0.0.1:3260 where stunnel is terminating, so iscsiadm saves
the wrong portal and logs in plain instead of through the tunnel.
The fix is static login:
IQN=iqn.2025-01.pub.scsipub:blank
iscsiadm -m node -T $IQN -p 127.0.0.1:3260 -o new
iscsiadm -m node -T $IQN -p 127.0.0.1:3260 \
-o update -n node.session.auth.authmethod -v None
iscsiadm -m node -T $IQN -p 127.0.0.1:3260 --login
-o new creates a fresh node record at the portal you specify instead
of using whatever the discovery step saved. Our landing page renders
exactly that command sequence for the TLS path, because the alternative
is an infuriating 30 minutes with iscsiadm --debug=6.
Bonus: stale records retry forever
Once a node record exists under /etc/iscsi/nodes/, iscsid retries
the login indefinitely if the session drops. If the target has been
destroyed server-side, that manifests as a steady 1-every-3-second
stream of “unknown target” login attempts in our server logs. The
cure is on the client:
iscsiadm -m node -T <iqn> -o delete
On the server we throttle the log line (once per (ip, target) per 5
minutes at warning level, debug after that) so a stale initiator
doesn't bury real warnings under 28,800 lines of the same complaint
per day. See Scsipub.Target.Session.log_unknown_target/2.
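The throttle itself is a few lines of ETS. A sketch of the shape (illustrative module; the real code lives in Scsipub.Target.Session):

```elixir
defmodule Scsipub.Target.LogThrottle do
  # Illustrative: an ETS table keyed by {ip, iqn} stores the last time we
  # logged at :warning; inside the window the caller logs at :debug.
  @window_ms 5 * 60 * 1000

  def init, do: :ets.new(__MODULE__, [:named_table, :public])

  def level(ip, iqn, now_ms \\ System.monotonic_time(:millisecond)) do
    key = {ip, iqn}

    case :ets.lookup(__MODULE__, key) do
      [{^key, last}] when now_ms - last < @window_ms ->
        :debug

      _ ->
        :ets.insert(__MODULE__, {key, now_ms})
        :warning
    end
  end
end
```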
What we’re not solving
Deliberate omissions, for the record:
- Multi-region. Everything runs in a single datacenter. A multi-region story would turn per-session persistence into a distributed-systems problem; it currently isn't one, and we like that.
- S3- or NBD-backed base images. Images are local sparse files. Upload via the admin UI or an ecto run script; that's the whole ingestion story. Cloud-backed storage changes the read-path latency distribution meaningfully enough that we'd want to think about it rather than bolt it on.
- iSER / RDMA. No. scsipub is a public-internet service; RDMA is a rack-scale protocol. If you need 40 Gbit/s into a block device, the physics say you aren't on the public internet anyway.
- MPIO. Not yet. The initiator side of multipath works fine, but until we have multi-region there’s nowhere to failover to.
- Per-session encryption above TLS. The iSCSI protocol has IPsec and a few other approaches for payload secrecy; none are widely deployed, and adding our own on top of TLS would just be framing for framing’s sake.
What comes next
The immediate line of work is the hardware bridges — the ESP32 firmware that turns an ESP32-S3 into a wireless iSCSI-to-USB dongle, and the Pi4 netboot shim that lets a Raspberry Pi boot directly into an iSCSI target over TFTP. Both are separate projects, both are linked from the front page, and both will get their own write-ups shortly.
Past that, the interesting question is what happens when a Phoenix app serving iSCSI meets someone who really wants to use it — tens of thousands of sessions, sustained writes, a pathological initiator. We’ve done a load test up to a few hundred concurrent web requests; we haven’t yet found the shape of the BEAM’s failure mode under actual iSCSI load. That’s the next thing to measure.