Steven's Knowledge

Notebooks & Experimentation

Where prompt iteration and model exploration actually happen — and where notebooks stop being the right tool

Notebooks are the de facto environment for AI experimentation. They sit in a sweet spot: interactive, REPL-like feedback, but with the structure of code and the readability of a document. Almost every model evaluation, prompt iteration, and dataset exploration starts in a notebook. The trick is knowing when to stay in one and when to graduate out.

Why Notebooks Fit AI Work

  • Stateful exploration. Load a model once, ask many questions of it.
  • Mixed media. Markdown explanation, code, output, and plots, all inline.
  • Cheap iteration. Tweak a prompt, rerun the cell, see the diff.
  • Shareable artifacts. A notebook is both the experiment and the report.

For prompt engineering, dataset inspection, eval analysis, and model comparison, the notebook is genuinely the right shape.

The Major Environments

Jupyter — the open standard. Local with jupyter lab, anywhere with jupyter server. Works offline, full filesystem access, runs whatever Python environment you give it. The default for serious work.

Google Colab — Jupyter in a managed environment with free and paid GPUs. Best path to "I want to fine-tune something" without standing up infrastructure. Less great for long-lived projects (sessions time out, environment is ephemeral).

Kaggle Notebooks — similar to Colab, with built-in dataset and competition integration. Heavily used in the ML community.

VS Code / Cursor notebooks — .ipynb files opened in your editor. Better diff support, integrates with the rest of your codebase, weaker plotting and rich-output story.

Marimo — a newer reactive notebook for Python. Cells re-run automatically when their inputs change; the file is a regular Python script, not JSON. Pleasant for exploration; better git story than Jupyter.

Databricks / Hex / Deepnote / Observable — managed notebook platforms with collaboration, scheduling, and warehouse integration. Mostly aimed at data teams; relevant to AI when your work crosses into analytics.

What Notebooks Are Good For

  • Prompt iteration on a small set of examples.
  • Eval result analysis — slicing, plotting, drilling into failures.
  • Dataset exploration — looking at the actual data before you train or fine-tune.
  • Model comparison — same input across providers or versions, side by side.
  • Tutorials and reproducible experiments.
  • One-off data prep that doesn't justify a script.
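As an illustration of the eval-analysis item above, here is a minimal sketch of slicing results by category and drilling into failures. The record fields (`id`, `category`, `passed`) and the helper names are assumptions made for this example, not a standard schema.

```python
from collections import defaultdict

# Hypothetical eval records; in practice these would come from your own eval run.
results = [
    {"id": 1, "category": "math", "passed": True},
    {"id": 2, "category": "math", "passed": False},
    {"id": 3, "category": "summarization", "passed": True},
    {"id": 4, "category": "summarization", "passed": True},
    {"id": 5, "category": "extraction", "passed": False},
]


def pass_rate_by(records, key):
    """Group records by `key` and return {group: fraction that passed}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        passes[group] += int(record["passed"])
    return {group: passes[group] / totals[group] for group in totals}


def failures(records, **filters):
    """Return failing records that match every key=value filter."""
    return [
        r for r in records
        if not r["passed"] and all(r.get(k) == v for k, v in filters.items())
    ]


rates = pass_rate_by(results, "category")

# Worst categories first, so the next cell can drill into them.
for category, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{category:15s} {rate:.0%}")

print(failures(results, category="math"))
```

In a notebook, each of these steps would typically live in its own cell so a slice can be re-run without reloading the results.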

Where Notebooks Get Dangerous

  • Production code. A notebook is not a deployable artifact. Things that need to run on a schedule or in response to events belong in scripts or services.
  • Long-running training jobs. Running anything that takes hours in a notebook is asking for the kernel to die at 90% complete.
  • Reproducibility. Out-of-order cell execution, hidden state, and unmanaged dependencies make notebooks notoriously hard to reproduce. Restart-and-run-all should always work; if it doesn't, the notebook is broken even when the cells appear green.
  • Source control. JSON-formatted notebooks diff badly. Output cells balloon the file size.
  • Team collaboration on a single file. Last-write-wins doesn't work for two people editing simultaneously.
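The reproducibility problem can be partly detected from the notebook file itself. The sketch below is a heuristic, not a replacement for restart-and-run-all: it reads the `execution_count` field that the .ipynb JSON format stores on each code cell and flags cells that were never run or were run out of top-to-bottom order. The function name and the sample notebook are made up for illustration.

```python
def execution_order_problems(notebook_json):
    """Return human-readable problems found in a parsed .ipynb dict.

    Heuristic: after a clean top-to-bottom run, execution counts of
    non-empty code cells should strictly increase down the file.
    """
    problems = []
    last_count = 0
    for index, cell in enumerate(notebook_json.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be a string or a list of lines
            source = "".join(source)
        if not source.strip():
            continue  # empty cells are not executed, so skip them
        count = cell.get("execution_count")
        if count is None:
            problems.append(f"cell {index}: never executed")
        elif count <= last_count:
            problems.append(
                f"cell {index}: executed out of order (count {count} after {last_count})"
            )
        else:
            last_count = count
    return problems


# Hypothetical in-memory notebook; for a real file, use json.load on the .ipynb.
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "execution_count": 1, "source": ["x = 1"], "outputs": []},
        {"cell_type": "code", "execution_count": 3, "source": ["y = x + 1"], "outputs": []},
        {"cell_type": "code", "execution_count": 2, "source": ["x = 10"], "outputs": []},
        {"cell_type": "code", "execution_count": None, "source": ["print(y)"], "outputs": []},
    ]
}

problems = execution_order_problems(notebook)
for problem in problems:
    print(problem)
```

A clean result here does not prove the notebook reproduces, since hidden state from deleted cells leaves no trace in the file. It only catches the most visible symptom.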

Healthy Notebook Hygiene

  • Restart the kernel and run-all before declaring a notebook done.
  • Strip outputs before committing (nbstripout), or keep them out of git entirely by pairing the notebook with a text file via jupytext or moving to Marimo.
  • Pin dependencies at the top of the notebook or in a requirements.txt.
  • Promote stable code out of the notebook into a module the notebook imports. The notebook becomes the experiment harness; the module is the real code.
  • Clear narrative. A good notebook reads top-to-bottom. If a colleague can't follow the story, refactor.
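To make the output-stripping step concrete, here is a minimal sketch of what it amounts to at the file level. This is not how nbstripout is implemented; it is a hand-rolled illustration that assumes the standard .ipynb layout (`cells`, `cell_type`, `outputs`, `execution_count`), and the function name is hypothetical.

```python
import copy
import json


def strip_outputs(notebook_json):
    """Return a copy of a parsed .ipynb dict with code-cell outputs removed."""
    stripped = copy.deepcopy(notebook_json)  # leave the caller's dict untouched
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Hypothetical in-memory notebook standing in for a file loaded with json.load.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["print(1 + 1)"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["Some notes."]},
    ],
}

clean = strip_outputs(notebook)
print(json.dumps(clean, indent=1))
```

For day-to-day use a maintained tool wired into git is the better choice. The sketch only shows why stripped notebooks diff more cleanly: the volatile fields are gone.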

Graduating Out

When notebook experimentation produces something worth productionizing:

  1. Extract reusable pieces into a Python module.
  2. Write tests against that module. The notebook is not a test suite.
  3. Wrap in a CLI or service. Notebooks are bad cron jobs.
  4. Keep a "demo notebook" that imports the module and shows usage. That stays useful.

The mistake is leaving production logic embedded in the notebook because "it works there." It works there until it doesn't.
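The first three steps can be sketched in one small file. Everything here is hypothetical: the template, the `build_prompt` function, and the CLI flags stand in for whatever logic your notebook actually produced.

```python
import argparse

# Hypothetical prompt template that started life in a notebook cell.
PROMPT_TEMPLATE = "Summarize the following text in {max_words} words or fewer:\n\n{text}"


def build_prompt(text, max_words=50):
    """Step 1: a pure function extracted from the notebook, easy to import."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    return PROMPT_TEMPLATE.format(text=text.strip(), max_words=max_words)


def test_build_prompt():
    """Step 2: a pytest-style test; it also works as a plain function call."""
    prompt = build_prompt("  hello world  ", max_words=10)
    assert prompt.endswith("hello world")
    assert "10 words" in prompt


def main(argv=None):
    """Step 3: a thin CLI wrapper around the same function."""
    parser = argparse.ArgumentParser(description="Build a summarization prompt.")
    parser.add_argument("text")
    parser.add_argument("--max-words", type=int, default=50)
    args = parser.parse_args(argv)
    print(build_prompt(args.text, args.max_words))
    return 0


# In a real module this call would sit behind: if __name__ == "__main__":
main(["Notebooks are bad cron jobs.", "--max-words", "20"])
```

The demo notebook from step 4 would then import `build_prompt` from the module instead of redefining it, so the notebook and the service cannot drift apart.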

Notebooks in the AI Loop

For LLM work specifically, notebooks shine at:

  • Eval set curation — load examples, manually grade outputs, save labels.
  • Failure mode analysis — pull production traces, cluster, characterize.
  • Prompt versioning — compare V1 vs V2 outputs on the same inputs in a single view.
  • Embedding visualization — UMAP or t-SNE plots for retrieval debugging.

That work, done well in notebooks, makes everything downstream easier — including the production system that doesn't run in a notebook.
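A minimal sketch of the prompt-versioning comparison follows. The two templates and `fake_model` are placeholders invented for this example; `fake_model` only echoes the prompt length so the block runs without any provider SDK, and you would swap in your real model call.

```python
# Hypothetical prompt versions under comparison.
PROMPT_V1 = "Answer briefly: {question}"
PROMPT_V2 = "Answer in one sentence and cite the source document: {question}"


def fake_model(prompt):
    """Stand-in for a real model call; returns a placeholder string."""
    return f"<response to {len(prompt)}-char prompt>"


def compare_prompts(inputs, templates, call_model):
    """Run every template on every input and return one row per input."""
    rows = []
    for question in inputs:
        row = {"input": question}
        for name, template in templates.items():
            row[name] = call_model(template.format(question=question))
        rows.append(row)
    return rows


inputs = ["What is the refund window?", "Who approves travel?"]
rows = compare_prompts(inputs, {"v1": PROMPT_V1, "v2": PROMPT_V2}, fake_model)

# One block per input, with both versions next to each other.
for row in rows:
    print(row["input"])
    print("  v1:", row["v1"])
    print("  v2:", row["v2"])
```

Because each row is a plain dict, the list can be handed to a table library such as pandas for a side-by-side view, or written to JSONL alongside manual grades.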
