Steven's Knowledge

File Handling

Uploads, presigned URLs, streaming large files, content validation, virus scanning, resumable uploads, and serving downloads without blowing up memory

Files are where naive back-end code breaks. The endpoint that works for a 2 MB avatar falls over on a 2 GB video, because the difference is not "bigger" — it is "does not fit in memory." File handling is mostly about one discipline: never load the whole thing into memory, and never trust what the client tells you it is.

This page covers accepting uploads, validating them safely, storing them, and serving them back — with the memory and security pitfalls that bite real systems called out as you go.

The Core Problem: Don't Buffer the Whole File

The first instinct — read the upload into a buffer, then write it somewhere — works in development and dies in production. A handful of concurrent large uploads each held fully in memory will OOM your server.

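import fs from 'node:fs';
import { pipeline } from 'node:stream/promises';

// ❌ Buffers the entire file in memory: fine for 2 MB, fatal for 2 GB × 100 users
app.post('/upload', async (req, res) => {
  const buffer = await readEntireBody(req);        // whole file in RAM
  await fs.promises.writeFile(destPath, buffer);
  res.sendStatus(201);
});

// ✅ Stream: constant memory regardless of file size (Express-style handler)
app.post('/upload', async (req, res, next) => {
  try {
    // pipeline() propagates backpressure and destroys both streams on error
    await pipeline(req, fs.createWriteStream(destPath));
    res.sendStatus(201);
  } catch (err) {
    next(err);   // a failed upload may leave a partial file at destPath to clean up
  }
});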

The principle holds everywhere: uploads, downloads, transforms. Stream, don't buffer. Pipe the bytes from source to destination so memory stays flat no matter how big the file is, and so a slow consumer applies backpressure instead of filling a queue.

Upload Strategies

There are two fundamentally different ways to get a file from a client into storage, and the choice shapes everything else.

Strategy                      | Path                                  | Best for
Through your server           | client → your API → object storage    | Small files, when you must inspect/transform before storing
Direct to storage (presigned) | client → object storage directly      | Large files, high volume, anything you can validate after

Through the Server (Multipart)

The classic multipart/form-data upload. Your server receives the stream, can validate it as bytes flow, and forwards it to storage.

  • Use it when you genuinely need the server in the path — server-side validation that must happen before storage, or transformation.
  • Stream it straight through to storage; don't stage the whole file on local disk unless you must.
  • The cost: every byte flows through your compute, consuming bandwidth and connection time. This does not scale for large files or high volume.

Direct to Storage with Presigned URLs

For large files and scale, take your server out of the data path entirely. The server only issues a short-lived, scoped credential; the client uploads straight to S3/GCS/R2.

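import { randomUUID } from 'node:crypto';
import { createPresignedPost } from '@aws-sdk/s3-presigned-post';   // AWS SDK v3

// 1. Client asks your API for permission to upload
app.post('/uploads/presign', async (req) => {
  // req.user.id assumes an auth middleware has already identified the caller
  const key = `uploads/${req.user.id}/${randomUUID()}/${sanitize(req.body.filename)}`;
  // A presigned POST policy is used because a presigned PUT URL cannot express a size range
  const { url, fields } = await createPresignedPost(s3, {
    Bucket: BUCKET,
    Key: key,
    Expires: 300,                                      // 5 minutes, short-lived
    Fields: { 'Content-Type': req.body.contentType },  // lock the declared type
    Conditions: [
      ['content-length-range', 1, 50_000_000],         // enforce a max size at the storage layer
    ],
  });
  return { url, fields, key };
});

// 2. Client POSTs a multipart form (the `fields`, then the file) directly to `url`; the bytes never touch your server
// 3. Client calls back with `key`; you record it and validate server-side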

Why this is the default for serious file handling:

  • Your server never touches the bytes — no bandwidth cost, no memory pressure, no connection held open for minutes.
  • It scales trivially — object storage handles the upload load, not your app servers.
  • The credential is constrained: it is scoped to one key, one content type, and a size range, and it expires in minutes.

The trade-off: you validate after the upload (on the callback), not during. For most systems that is the right deal.
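The callback in step 3 is where that deferred validation starts. The following is a minimal sketch of the first checks, assuming the `uploads/<userId>/...` key layout from the presign example and an object size that your server read back from the storage layer (for example with a HEAD request). The function `checkUploadCallback` and its parameter names are hypothetical, not part of any SDK.

```javascript
// Upper bound taken from the presign example (50 MB).
const MAX_BYTES = 50_000_000;

// Hypothetical helper: decide whether a client-reported key may be recorded.
// `objectSize` must come from the storage layer, never from the client.
function checkUploadCallback({ key, userId, objectSize }) {
  // The trailing slash matters: user "1" must not match keys of user "12".
  const ownPrefix = `uploads/${userId}/`;
  if (typeof key !== 'string' || !key.startsWith(ownPrefix)) {
    return { ok: false, reason: 'key does not belong to this user' };
  }
  if (!Number.isInteger(objectSize) || objectSize < 1 || objectSize > MAX_BYTES) {
    return { ok: false, reason: 'object size outside the allowed range' };
  }
  // Passing these checks only means the upload may be recorded as pending;
  // type sniffing and scanning still have to run before it is served.
  return { ok: true };
}
```

Checking the prefix stops one user from claiming an object that another user uploaded, which is easy to miss because the client supplies the key on the callback.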

Validating Untrusted Files

Every uploaded file is hostile until proven otherwise. The cardinal rule: never trust the client. Not the filename, not the Content-Type header, not the extension.

Size

Enforce a maximum before you accept the bytes, not after. Reject at the boundary with 413 Payload Too Large. With presigned uploads, enforce it in the upload policy (a content-length-range condition on a presigned POST) so the storage layer rejects oversized files. Without a limit, file upload is a denial-of-service vector.
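For the through-the-server path, one way to enforce the limit while streaming is a small counting transform placed between the request and the destination. This is a sketch under assumptions: `createSizeLimiter`, `PayloadTooLargeError`, and `exceedsDeclaredLimit` are hypothetical names, and the Content-Length check is only an early rejection, since the header is client-supplied and may be absent or wrong.

```javascript
const { Transform } = require('node:stream');

// Hypothetical error type carrying the HTTP status to send back.
class PayloadTooLargeError extends Error {
  constructor(maxBytes) {
    super(`Upload exceeds ${maxBytes} bytes`);
    this.name = 'PayloadTooLargeError';
    this.statusCode = 413;
  }
}

// Cheap early check on the declared size. Returns false when the header is
// missing, so the streaming limiter below is still required.
function exceedsDeclaredLimit(headers, maxBytes) {
  const declared = Number(headers['content-length']);
  return Number.isFinite(declared) && declared > maxBytes;
}

// Pass-through stream that fails as soon as more than maxBytes have flowed.
function createSizeLimiter(maxBytes) {
  let seen = 0;
  return new Transform({
    transform(chunk, _encoding, callback) {
      seen += chunk.length;
      if (seen > maxBytes) {
        callback(new PayloadTooLargeError(maxBytes)); // aborts the pipeline
        return;
      }
      callback(null, chunk); // forward the chunk unchanged
    },
  });
}
```

Used as `pipeline(req, createSizeLimiter(MAX), destination)`, the limiter stops the transfer once the budget is exceeded instead of discovering the size after the whole file has been written.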

Type — Sniff, Don't Trust

The Content-Type header and the .jpg extension are both client-controlled and trivially faked. An attacker uploads malware.exe renamed to photo.jpg with Content-Type: image/jpeg. Verify the actual content by inspecting the file's magic bytes:

import { fileTypeFromBuffer } from 'file-type';

const type = await fileTypeFromBuffer(firstChunk);   // reads magic-number header
if (!type || !ALLOWED_MIME.has(type.mime)) {
  throw new BadRequest('Unsupported file type');      // based on real content, not claims
}

Maintain an allowlist of permitted types, never a denylist — you cannot enumerate everything dangerous. And derive the stored extension from the sniffed type, not the client's filename.
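To make the mechanism concrete, here is a hand-rolled sketch of what magic-byte sniffing does, covering only three well-known signatures. It is an illustration, not a replacement for a maintained library such as file-type; `sniffType` and `classifyUpload` are hypothetical names, and the allowlist contents are an assumption.

```javascript
// Leading bytes ("magic numbers") of a few common formats.
const SIGNATURES = [
  { mime: 'image/png', ext: 'png', bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mime: 'image/jpeg', ext: 'jpg', bytes: [0xff, 0xd8, 0xff] },
  { mime: 'application/pdf', ext: 'pdf', bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
];

// Allowlist: anything not named here is rejected, including recognized types.
const ALLOWED_MIME = new Set(['image/png', 'image/jpeg']);

// Returns { mime, ext } for a recognized header, or null.
function sniffType(firstChunk) {
  for (const sig of SIGNATURES) {
    const matches =
      firstChunk.length >= sig.bytes.length &&
      sig.bytes.every((byte, i) => firstChunk[i] === byte);
    if (matches) return { mime: sig.mime, ext: sig.ext };
  }
  return null;
}

// The stored extension comes from the sniffed type, never the client filename.
function classifyUpload(firstChunk) {
  const type = sniffType(firstChunk);
  if (!type || !ALLOWED_MIME.has(type.mime)) {
    return { ok: false, reason: 'Unsupported file type' };
  }
  return { ok: true, mime: type.mime, ext: type.ext };
}
```

A Windows executable renamed to photo.jpg starts with the bytes `4D 5A`, matches no allowed signature, and is rejected regardless of its declared Content-Type.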

Filename — Sanitize Aggressively

A client-supplied filename such as ../../etc/passwd, or one containing null bytes, is a path-traversal attempt. Never use the raw filename as a storage path:

  • Generate your own key (a UUID), and store the original filename as metadata only if you need it.
  • If you must preserve the name, sanitize hard: strip path separators, null bytes, and leading dots.
  • Never reflect a user filename into a filesystem path or a shell command.
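The rules above can be sketched as a pair of helpers. This is one possible interpretation under assumptions: the allowed character set, the 100-character cap, and the `file` fallback are arbitrary choices, and `sanitizeFilename` and `buildStorageKey` are hypothetical names. The key layout mirrors the one used in the presign example.

```javascript
const { randomUUID } = require('node:crypto');

// Reduce a client-supplied name to a harmless display name.
function sanitizeFilename(raw) {
  const cleaned = String(raw)
    .replace(/\0/g, '')                 // drop null bytes
    .split(/[\\/]/)                     // split on both path separator styles
    .pop()                              // keep only the last segment
    .replace(/[^A-Za-z0-9._-]/g, '_')   // conservative character allowlist
    .replace(/^\.+/, '')                // strip leading dots (hidden files, "..")
    .slice(0, 100);                     // cap the length
  return cleaned || 'file';             // never return an empty name
}

// The UUID segment is server-generated, so two uploads can never collide
// and the client cannot steer where the object lands.
function buildStorageKey(userId, rawFilename) {
  return `uploads/${userId}/${randomUUID()}/${sanitizeFilename(rawFilename)}`;
}
```

With this sketch, `sanitizeFilename('../../etc/passwd')` yields `passwd`, and a name consisting only of dots falls back to `file`.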

Content Scanning

For anything users will download or that you'll process, scan for malware (e.g. ClamAV) before making it available. Run scanning asynchronously after upload — mark the file pending until it clears, then available. For images that other users will view, consider re-encoding them server-side, which strips embedded payloads and normalizes the format.
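The pending-then-available gate can be modeled as a tiny state machine. The sketch below assumes the scanner result has already been reduced to an object with an `infected` flag; that shape, the `quarantined` status, and the function names are all hypothetical and not taken from ClamAV or any particular scanner.

```javascript
// Statuses a stored file can be in. "quarantined" is an assumed third state
// for files that fail the scan.
const STATUS = Object.freeze({
  PENDING: 'pending',
  AVAILABLE: 'available',
  QUARANTINED: 'quarantined',
});

// Every new upload starts as pending, regardless of who uploaded it.
function newUpload(key) {
  return { key, status: STATUS.PENDING };
}

// Apply an asynchronous scan verdict. Only pending files may transition,
// so a late or duplicate verdict cannot flip an already-decided file.
function applyScanResult(file, scan) {
  if (file.status !== STATUS.PENDING) {
    throw new Error(`File ${file.key} has already been scanned`);
  }
  const status = scan.infected ? STATUS.QUARANTINED : STATUS.AVAILABLE;
  return { ...file, status };
}

// Download handlers check this before issuing any URL.
function canDownload(file) {
  return file.status === STATUS.AVAILABLE;
}
```

The important property is the default: a file that has not been scanned yet is treated the same as one that failed.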

Resumable Uploads

For large files over flaky networks (mobile, video), a single failed PUT means starting over — unacceptable at gigabyte scale. Resumable protocols (the tus protocol, or S3 multipart upload) let the client upload in chunks and resume from the last successful one.

File split into chunks ────┐
  chunk 1 ✓ ───────────────┤
  chunk 2 ✓ ───────────────┤── server tracks which chunks landed
  chunk 3 ✗ (network drop) ┤
  chunk 3 ↻ (resume here) ─┘── only re-send the failed chunk
  ... assemble when all chunks received

  • S3 multipart upload handles this natively for direct-to-storage: initiate, upload parts (each retryable independently), then complete to assemble. Parts can upload in parallel for speed.
  • The tus protocol is an open standard for resumable uploads through your server when you need that path.
  • Either way, you need a cleanup job to abort and reclaim storage from multipart uploads that were started but never completed.
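The bookkeeping behind "resume from the last successful chunk" is small. The sketch below plans byte ranges for a file and works out which parts still need sending. `planParts` and `remainingParts` are hypothetical helpers; the 5 MiB default reflects the S3 multipart minimum for every part except the last, and the actual upload calls are left out.

```javascript
// S3 requires every part except the last to be at least 5 MiB.
const MIN_PART_SIZE = 5 * 1024 * 1024;

// Split a file of totalBytes into numbered parts with inclusive byte ranges.
// Part numbers start at 1, matching S3 multipart numbering.
function planParts(totalBytes, partSize = MIN_PART_SIZE) {
  const parts = [];
  for (let start = 0, partNumber = 1; start < totalBytes; start += partSize, partNumber += 1) {
    const end = Math.min(start + partSize, totalBytes) - 1;
    parts.push({ partNumber, start, end });
  }
  return parts;
}

// Given the part numbers the server has confirmed, return what is left.
// After a network drop the client re-sends only these.
function remainingParts(plan, completedPartNumbers) {
  const done = new Set(completedPartNumbers);
  return plan.filter((part) => !done.has(part.partNumber));
}
```

Because each part is addressed by number and byte range, parts are independent: they can be retried individually or uploaded in parallel, and the final assemble step only runs once `remainingParts` returns an empty list.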

Serving Downloads

Getting files back out has its own pitfalls — memory, security, and performance.

  • Stream, don't buffer (again). Pipe from storage to the response; never read the whole file into memory to send it.
  • Redirect to a presigned download URL for large or private files. Don't proxy the bytes through your server — issue a short-lived signed GET URL and redirect the client to it. Storage serves the bytes; your server stays out of the path.
  • Authorize before issuing the URL. A presigned URL is a bearer token: anyone with the link can fetch the file until it expires. Check the requesting user is allowed this file, keep expiry short, and never make private buckets public.
  • Set headers deliberately. Content-Disposition: attachment forces a download rather than in-browser rendering — important for user-uploaded content, because serving an uploaded HTML or SVG file inline can execute scripts in your origin (stored XSS). Serve user content from a separate domain where you can.
  • Support range requests (Accept-Ranges: bytes) for media so clients can seek and resume. Object storage does this for you when you redirect to it.
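If you do have to proxy the bytes yourself, range support means parsing the Range header and answering with 206 and a Content-Range. A minimal sketch for the single-range case follows; `parseRange` and `partialHeaders` are hypothetical names, multi-range requests are deliberately treated as a full response, and the file size is assumed to be known from storage metadata.

```javascript
// Interpret a Range header against a file of `size` bytes.
// Returns { kind: 'full' }          -> ignore the header, send 200
//         { kind: 'unsatisfiable' } -> send 416
//         { kind: 'partial', start, end } with inclusive offsets -> send 206
function parseRange(header, size) {
  const match = /^bytes=(\d*)-(\d*)$/.exec(header ?? '');
  if (!match || (match[1] === '' && match[2] === '')) return { kind: 'full' };

  let start;
  let end;
  if (match[1] === '') {
    // Suffix form "bytes=-N": the last N bytes of the file.
    const suffixLength = Number(match[2]);
    if (suffixLength === 0) return { kind: 'unsatisfiable' };
    start = Math.max(size - suffixLength, 0);
    end = size - 1;
  } else {
    start = Number(match[1]);
    // An open end or an end past the file is clamped to the last byte.
    end = match[2] === '' ? size - 1 : Math.min(Number(match[2]), size - 1);
  }
  if (start >= size || start > end) return { kind: 'unsatisfiable' };
  return { kind: 'partial', start, end };
}

// Headers for a 206 Partial Content response.
function partialHeaders({ start, end }, size) {
  return {
    'Accept-Ranges': 'bytes',
    'Content-Range': `bytes ${start}-${end}/${size}`,
    'Content-Length': String(end - start + 1),
  };
}
```

The inclusive `start` and `end` values can be passed straight to `fs.createReadStream(path, { start, end })`, which also treats both as inclusive, so only the requested slice is read and streamed.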

Decision Tree

File larger than a few MB, or high upload volume?
  → Presigned direct-to-storage upload. Keep your server out of the data path.

Must inspect or transform the bytes before they're stored?
  → Stream through your server, validating as bytes flow. Never buffer the whole file.

Large files over unreliable networks (video, mobile)?
  → Resumable: S3 multipart, or the tus protocol.

Serving files back?
  → Authorize, then redirect to a short-lived presigned URL. Stream if you must proxy.

Accepting user content others will view?
  → Sniff the real type, scan for malware, re-encode images, and serve from a separate
    domain with Content-Disposition: attachment.

Checklist

  • Files stream end to end — never buffered fully in memory, on upload or download.
  • A maximum size is enforced at the boundary (or in the presigned policy), not after the fact.
  • File type is validated by sniffing magic bytes against an allowlist — never the Content-Type header or extension.
  • Storage keys are server-generated (UUIDs); client filenames are sanitized and never used as paths.
  • Large/high-volume uploads go direct to object storage via short-lived presigned URLs.
  • User-downloadable content is malware-scanned; images are re-encoded; downloads authorize before issuing a signed URL.
  • User content is served with Content-Disposition: attachment, ideally from a separate domain.
  • Incomplete multipart/resumable uploads are cleaned up by a background job.
