This page was generated from doc/motivation.jupyter.

Motivation

Status Quo

The original format for Jupyter notebooks uses JSON as underlying storage format. This has the great advantage that such files are very easy to handle programmatically in many different environments, because JSON parsers are readily available for many programming languages.

One disadvantage, however, is that the format is only semi-human-readable and awkward to edit by hand. All textual content (e.g. text in Markdown cells and source code in code cells) is stored in lists of JSON strings – one string for each line. This means that each line is surrounded by quotes (") and strings are separated by commas (,), while lists of strings are surrounded by brackets ([ and ]). On top of that, several common characters are not allowed in JSON strings, which means that they have to be escaped by backslashes, e.g. \" and \n. And since a backslash is used for escaping, a literal backslash occurring in the text (which is quite common in programming languages and markup languages) has to be escaped itself (\\).

As an example, let’s create a notebook containing the previous two sentences:

[1]:
import nbformat
nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_markdown_cell(
r"""On top of that,
several common characters are not allowed in JSON strings,
which means that they have to be escaped by backslashes,
e.g. `\"` and `\n`.
And since a backslash is used for escaping,
a literal backslash occurring in the text
(which is quite common in programming languages and markup languages)
has to be escaped itself (`\\`)."""))

The JSON-based storage of this minimal notebook looks like this:

[2]:
print(nbformat.writes(nb))
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On top of that,\n",
    "several common characters are not allowed in JSON strings,\n",
    "which means that they have to be escaped by backslashes,\n",
    "e.g. `\\\"` and `\\n`.\n",
    "And since a backslash is used for escaping,\n",
    "a literal backslash occurring in the text\n",
    "(which is quite common in programming languages and markup languages)\n",
    "has to be escaped itself (`\\\\`)."
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}

Escaped characters and JSON syntax elements make this harder than necessary to read, and even harder to modify with a text editor. When editing this by hand, it is easy to mess up the JSON representation by e.g. forgetting a comma.
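To illustrate how fragile hand-edited JSON is, here is a minimal sketch using only Python's standard `json` module. The helper name `is_valid_json` and the two sample strings are made up for this illustration; they are not part of any notebook tooling.

```python
import json


def is_valid_json(text):
    """Return True if the text can be parsed as JSON (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Two lines of a "source" list, as they might appear in a notebook file.
valid = '{"source": ["first line\\n", "second line"]}'
# The same content after a careless edit: the comma between the strings is gone.
broken = '{"source": ["first line\\n" "second line"]}'

print(is_valid_json(valid))
print(is_valid_json(broken))
```

A single missing character makes the whole file unreadable for a JSON parser, not just the cell that was edited.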

As a comparison, the same notebook is stored like this in the proposed new format:

[3]:
import jupyter_format
print(jupyter_format.serialize(nb))
nbformat 4
nbformat_minor 4
markdown
    On top of that,
    several common characters are not allowed in JSON strings,
    which means that they have to be escaped by backslashes,
    e.g. `\"` and `\n`.
    And since a backslash is used for escaping,
    a literal backslash occurring in the text
    (which is quite common in programming languages and markup languages)
    has to be escaped itself (`\\`).

This is exactly the same as the original Markdown content, except that it is indented by 4 spaces.
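The idea of storing text by indentation alone can be sketched in a few lines of plain Python. This is only an illustration and not the implementation used by `jupyter_format`; the function names `indent_block` and `unindent_block` are hypothetical, and leaving empty lines unindented is an assumption.

```python
def indent_block(text, prefix="    "):
    """Prefix every non-empty line; no character of the text is changed."""
    return "\n".join(prefix + line if line else line
                     for line in text.split("\n"))


def unindent_block(block, prefix="    "):
    """Undo indent_block() by removing exactly one prefix per line."""
    lines = []
    for line in block.split("\n"):
        if line.startswith(prefix):
            lines.append(line[len(prefix):])
        elif not line.strip():
            lines.append("")  # blank lines carry no indentation
        else:
            raise ValueError(f"line is not indented: {line!r}")
    return "\n".join(lines)


text = r"""e.g. `\"` and `\n`.

has to be escaped itself (`\\`)."""

stored = indent_block(text)
print(stored)
print(unindent_block(stored) == text)
```

Quotes and backslashes pass through unchanged in both directions, so no escaping rules are needed.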

YAML as Almost-Solution

It has been known for a long time (probably since the inception of Jupyter/IPython notebooks) that lists of JSON strings are not nicely readable for humans. An obvious alternative would be to use YAML, which provides multiple ways to store text content. One of those ways is the so-called literal style, which doesn’t require any escaping, making the text much more readable.

This was already suggested in several blog posts:

And there are even some implementations available:

This is an example of what a YAML-based storage format could look like:

[4]:
yaml_content = """
nbformat: 4
nbformat_minor: 2
cells:
- cell_type: markdown
  source: |+2
    # A Jupyter Notebook

    This is a code cell:
  metadata: {}
- cell_type: code
  source: |+2
    print('Hello, world!')
  outputs:
  - output_type: stream
    name: stdout
    text: |+2
      Hello, world!
  execution_count: 1
  metadata: {}
metadata: {}
"""

This is valid YAML, compatible with both version 1.1 and 1.2.

Let’s use PyYAML to read this:

[5]:
import yaml
nb_dict = yaml.safe_load(yaml_content)
nb_dict
[5]:
{'nbformat': 4,
 'nbformat_minor': 2,
 'cells': [{'cell_type': 'markdown',
   'source': '# A Jupyter Notebook\n\nThis is a code cell:\n',
   'metadata': {}},
  {'cell_type': 'code',
   'source': "print('Hello, world!')\n",
   'outputs': [{'output_type': 'stream',
     'name': 'stdout',
     'text': 'Hello, world!\n'}],
   'execution_count': 1,
   'metadata': {}}],
 'metadata': {}}

This Python dictionary can easily be converted to a notebook node:

[6]:
nb = nbformat.from_dict(nb_dict)

And we can use nbconvert to convert this to HTML:

[7]:
from nbconvert.exporters import HTMLExporter
html_content, resources = HTMLExporter().from_notebook_node(nb)
[8]:
import urllib.parse
data_uri = 'data:text/html;charset=utf-8,' + urllib.parse.quote(html_content)
[9]:
from IPython.display import IFrame
IFrame(data_uri, width='100%', height='250')
[9]:

This looks promising, doesn’t it?

The problem is, as so often, in the details. It’s great that we can use literal style without littering the text with quotes and escape characters, but sadly, YAML only allows printable characters in unescaped text. This means that some control characters that might occur in cell outputs cannot be stored this way, for example the ESC character that introduces ANSI escape sequences.
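To make the restriction concrete, the following sketch checks a string against the set of printable characters as we understand it from the YAML 1.2 specification (tab, line feed, carriage return, the printable ASCII range, NEL and most of the rest of Unicode). The regular expression is our transcription of that rule and the function name `unprintable_chars` is hypothetical; please check the specification before relying on it.

```python
import re

# Anything outside YAML's set of printable characters
# (assumption: transcribed from the YAML 1.2 "c-printable" production).
NON_PRINTABLE = re.compile(
    r"[^\t\n\r\x20-\x7e\x85\xa0-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def unprintable_chars(text):
    """Return the distinct characters that YAML would not allow unescaped."""
    return sorted(set(NON_PRINTABLE.findall(text)))


plain = "Hello, world!\n"
colored = "\x1b[31mError\x1b[0m\n"  # red text, as a terminal program might emit it

print(unprintable_chars(plain))    # []
print(unprintable_chars(colored))  # ['\x1b']
```

The ESC character (`\x1b`) at the start of each ANSI sequence is the culprit: a stream output containing it cannot be written in literal style.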

There are two options here:

  • Go back to escaped strings, at least in some circumstances. But that’s exactly what we wanted to avoid by using YAML!
  • Don’t use YAML after all.

Other Partial Solutions

Some alternative notebook formats are supported by the very popular projects https://github.com/aaren/notedown (Markdown) and https://github.com/mwouts/jupytext (Markdown, Rmd, Julia/Python/R-scripts etc.).

Those can be very useful, but none of them can store cell outputs, so they cannot fully replace the current storage format.

The Need for a Custom Format

It looks like none of the existing formats is sufficient. Perhaps we can achieve our goals with a custom format.

Having to implement a custom parser for such a custom format is of course a disadvantage, but if we keep it really simple, maybe we can get away with it?

Remember the YAML example from above?

[10]:
print(yaml_content)

nbformat: 4
nbformat_minor: 2
cells:
- cell_type: markdown
  source: |+2
    # A Jupyter Notebook

    This is a code cell:
  metadata: {}
- cell_type: code
  source: |+2
    print('Hello, world!')
  outputs:
  - output_type: stream
    name: stdout
    text: |+2
      Hello, world!
  execution_count: 1
  metadata: {}
metadata: {}

The contained text (Markdown and Python source code) is quite readable, but it is still surrounded by many distracting things in between.

Since we are not limited by YAML anymore, we can aggressively reduce this to only contain the absolutely necessary information:

[11]:
content = """nbformat 4
nbformat_minor 2
markdown
    # A Jupyter Notebook

    This is a code cell:
code 1
    print('Hello, world!')
 stream stdout
    Hello, world!
"""

And that’s the proposed new format!

It can be converted to a notebook node (which will look the same as in the YAML example above, except that the text no longer ends with a trailing newline):

[12]:
jupyter_format.deserialize(content)
[12]:
{'nbformat': 4,
 'nbformat_minor': 2,
 'metadata': {},
 'cells': [{'cell_type': 'markdown',
   'source': '# A Jupyter Notebook\n\nThis is a code cell:',
   'metadata': {}},
  {'cell_type': 'code',
   'metadata': {},
   'execution_count': 1,
   'source': "print('Hello, world!')",
   'outputs': [{'output_type': 'stream',
     'name': 'stdout',
     'text': 'Hello, world!'}]}]}

Just to make sure it is a valid Jupyter notebook node:

[13]:
nbformat.validate(_)
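To get a feeling for how little parsing such a format needs, here is a toy parser written from scratch. It is not the `jupyter_format` parser: it understands only the constructs visible in the example above (the two header lines, `markdown`, `code N` and a one-space-indented `stream NAME`, each followed by a block indented by 4 spaces), and the function name `parse_toy_notebook` as well as the handling of blank lines are assumptions made for this sketch.

```python
def parse_toy_notebook(text):
    """Parse the small subset of the format shown in the example (toy code)."""
    nb = {"metadata": {}, "cells": []}
    block_lines = None  # lines of the indented block being collected
    block_owner = None  # (dict, key) that receives the block when it ends

    def start_block(owner, key):
        nonlocal block_lines, block_owner
        block_lines, block_owner = [], (owner, key)

    def finish_block():
        nonlocal block_lines, block_owner
        if block_owner is not None:
            while block_lines and block_lines[-1] == "":
                block_lines.pop()  # assumption: ignore blank lines after a block
            owner, key = block_owner
            owner[key] = "\n".join(block_lines)
        block_lines, block_owner = None, None

    for line in text.splitlines():
        if line.startswith("    ") or not line.strip():
            # Indented or blank line: belongs to the current text block.
            if block_lines is not None:
                block_lines.append(line[4:])
            elif line.strip():
                raise ValueError(f"unexpected indented line: {line!r}")
            continue
        finish_block()
        words = line.split()
        if line.startswith(" "):
            # One space of indentation: an output of the previous code cell.
            cells = nb["cells"]
            if (len(words) != 2 or words[0] != "stream" or not cells
                    or cells[-1]["cell_type"] != "code"):
                raise ValueError(f"unsupported output line: {line!r}")
            output = {"output_type": "stream", "name": words[1], "text": ""}
            cells[-1]["outputs"].append(output)
            start_block(output, "text")
        elif words[0] in ("nbformat", "nbformat_minor"):
            nb[words[0]] = int(words[1])
        elif words[0] == "markdown":
            cell = {"cell_type": "markdown", "metadata": {}, "source": ""}
            nb["cells"].append(cell)
            start_block(cell, "source")
        elif words[0] == "code":
            count = int(words[1]) if len(words) > 1 else None
            cell = {"cell_type": "code", "metadata": {},
                    "execution_count": count, "source": "", "outputs": []}
            nb["cells"].append(cell)
            start_block(cell, "source")
        else:
            raise ValueError(f"unsupported line: {line!r}")
    finish_block()
    return nb


example = """nbformat 4
nbformat_minor 2
markdown
    # A Jupyter Notebook

    This is a code cell:
code 1
    print('Hello, world!')
 stream stdout
    Hello, world!
"""
toy_nb = parse_toy_notebook(example)
print(toy_nb["cells"][1]["outputs"])
```

For this one example the result matches the dictionary shown above. A real parser has to cover much more (other output types, metadata, error reporting), but the line-oriented structure keeps even this sketch short.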

Complementary Tools

Oftentimes cell outputs (e.g. plots) stored in notebooks make it hard to read and manipulate the text representation of such notebooks. They also make notebooks hard to use with version control systems (e.g. Git).

The proposed new format has the same problem: outputs are still stored in the notebook file, right next to the code cells that generated them.

It is recommended to remove all outputs from a notebook before storing it in version control or before doing any manipulations with a text editor.

Outputs can be removed manually in the Jupyter user interface, but there are also tools to remove outputs programmatically:
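Whatever tool is used, the operation itself is small. As a rough sketch (not one of the tools referred to above), outputs can be cleared from a notebook that has been loaded as a plain Python dictionary; the function name `strip_outputs` is hypothetical and the sample notebook is the one from the YAML example.

```python
import copy


def strip_outputs(nb_dict):
    """Return a copy of a notebook dict without outputs and execution counts."""
    stripped = copy.deepcopy(nb_dict)  # leave the caller's notebook untouched
    for cell in stripped["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": "# A Jupyter Notebook\n\nThis is a code cell:"},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": "print('Hello, world!')",
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": "Hello, world!\n"}]},
    ],
}

clean = strip_outputs(notebook)
print(clean["cells"][1])
```

Markdown cells are left alone because they have no outputs; only code cells are reset.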

If you want to present your notebooks publicly, you often want to show the outputs to your audience, without them having to run the notebooks themselves. So do you have to store your outputs after all?

No! You can still store your notebooks without outputs and run your notebooks on a server that will re-create the outputs. One tool to do this is:

This is a Sphinx extension that can convert a bunch of Jupyter notebooks (and other source files) to HTML and PDF pages (and other output formats). This way you have the best of both worlds: No outputs in your (version controlled) notebook files, but full outputs in the public HTML (or PDF) version.

There are still some cases where you do want to store the outputs for some reason. Because of the outputs, it is hard to see the changes to the text/code content of the notebook with traditional tools like diff. But luckily, there is a tool that can make meaningful “diffs” for Jupyter notebooks: