#confusing things about markdown

This document is a list of confusing, ambiguous, or complex things in the markdown spec. It does not cover anything I think is missing from the spec. None of this relates directly to Gruber’s initial design (markdown was much more problematic before commonmark appeared), but it isn’t wholly unrelated either.

Most of these problems are present in all versions of markdown, although how the various ambiguities are handled differs from parser to parser. That is partly due to the complexity of markdown and partly due to how it developed over time. Flavours of markdown that add to the core markdown syntax inherit almost all of these problems; many add new ones on top.

While commonmark hasn’t really achieved what I hoped it would (very few parsers correctly implement the full commonmark spec), it was always an impossible task. What commonmark has done is bring some order to the chaos and provide a precise reference point on which we can centre discussion of markdown, even if that reference point is often confusing and difficult to understand.

HTML handling has been explicitly ignored in this document because my use-case requires comprehensive HTML parsing, which is another matter entirely. There are significant limitations with HTML handling in the spec that do not offer the flexibility necessary for many use-cases that are common today. However, this is understandable due to the complexity of HTML itself.

Discussion of possible solutions to these issues is out of scope for this document.

##indentation

The use of indentation to denote code blocks leads to many confusing features.

It can be hard to work out what a tab or space character is going to do. It could be of semantic significance (code blocks) or ignored (leading/trailing tab or space in headings, leading whitespace in a paragraph).

###indented code-blocks

Indented code blocks themselves are confusing and do not match the semantics of fenced code blocks.

They cannot interrupt a paragraph:

hello
····hi

<p>hello
hi</p>

A paragraph can appear immediately after an indented code block:

····hi

hello

<pre><code>hi
</code></pre>
<p>hello</p>

The ambiguity between indented lists and indented code-blocks requires explicit precedence rules:

- foo

··bar

<ul>
	<li>
		<p>foo</p>
		<p>bar</p>
	</li>
</ul>

###fenced code-blocks

Leading indentation of up to three spaces is valid for a fenced code block. The width of the opening fence’s indentation is removed from each line of the code block’s contents, or as much of it as the line has when it has less leading space than the fence.

As soon as you hit four spaces, that no longer holds: the whole thing, fence syntax included, is treated as an indented code block:

····```
····aaa
····```

<pre><code>```
aaa
```
</code></pre>
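The indentation rule for fences can be sketched in a few lines of Python. This is only an illustration, not how any particular parser does it: the function name `fenced_block_content` is made up, and it assumes the block arrives as a list of lines whose first and last entries are the opening and closing fences.

```python
def fenced_block_content(lines):
    """Return the content lines of a fenced block, or None when the opening
    fence is indented four or more spaces (so it is not a fence at all).

    Assumption: lines[0] is the opening fence and lines[-1] the closing one.
    """
    opener = lines[0]
    fence_indent = len(opener) - len(opener.lstrip(" "))
    if fence_indent >= 4:
        # Four spaces of indentation starts an indented code block instead.
        return None

    content = []
    for line in lines[1:-1]:
        line_indent = len(line) - len(line.lstrip(" "))
        # Remove the fence's indentation, or as much of it as the line has.
        content.append(line[min(fence_indent, line_indent):])
    return content
```

With a fence indented two spaces, a content line indented two spaces loses both, a line indented one space loses one, and a line indented four spaces keeps two.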

###almost everything

Leading spaces up to 3 are valid for many constructs, but 4 gives confusing results:

····---

<pre><code>---
</code></pre>

This holds for headings, thematic breaks, link reference definitions, and most other block-level constructs.
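As a rough sketch of the three-versus-four rule, here is how a single line might be classified in Python. The names `THEMATIC_BREAK` and `classify_line` are hypothetical, and the sketch assumes the line follows a blank line, so it ignores the paragraph-continuation and setext cases.

```python
import re

# Three or more matching -, * or _ characters, optionally separated by spaces.
THEMATIC_BREAK = re.compile(r"^([-*_])( *\1){2,} *$")

def classify_line(line):
    """Classify one line, assuming it follows a blank line (an assumption
    made to keep this sketch small)."""
    stripped = line.lstrip(" ")
    indent = len(line) - len(stripped)
    if indent >= 4:
        # Four or more leading spaces wins over everything else.
        return "indented code"
    if THEMATIC_BREAK.match(stripped):
        return "thematic break"
    return "other"
```

Here `classify_line("   ---")` reports a thematic break, while one more leading space turns the same characters into code.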

##link references

Links have several rules but aren’t usually an issue, either for parsing or understanding. Reference links, however, pose some challenges.

Reference links require parsing the whole document before working out whether something is a reference link or just plain text.

[title][link][text]

[link]: google.com

<p>
	<a href="google.com">title</a>
	[text]
</p>

In this case, [link] (defined later in the document) supplies the URL for the title text, and [text] is literal text content. We don’t know any of this until we finally get to the definition of [link] at the bottom (or wherever it appears in the document). Without the [link] definition, this would result in:

<p>[title][link][text]</p>

This complicates parsing and makes syntax highlighting more difficult.
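A minimal two-pass sketch in Python shows why the definitions have to be collected first. It is deliberately incomplete: it handles only the full `[text][label]` form with non-overlapping matches, ignores shortcut and collapsed references, and the names `DEFINITION`, `FULL_REFERENCE`, and `resolve_references` are invented for this example.

```python
import re

# Simplified patterns; the real grammar for labels and destinations is richer.
DEFINITION = re.compile(r"^ {0,3}\[([^\]]+)\]:\s*(\S+)\s*$")
FULL_REFERENCE = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")

def resolve_references(source):
    # Pass 1: collect every definition, wherever it appears in the document.
    definitions = {}
    body = []
    for line in source.splitlines():
        match = DEFINITION.match(line)
        if match:
            definitions[match.group(1).lower()] = match.group(2)
        else:
            body.append(line)

    # Pass 2: only now can links be told apart from literal brackets.
    def replace(match):
        text, label = match.group(1), match.group(2)
        url = definitions.get(label.lower())
        if url is None:
            return match.group(0)  # no definition: leave the text untouched
        return f'<a href="{url}">{text}</a>'

    return "\n".join(FULL_REFERENCE.sub(replace, line) for line in body)
```

Run on the example above, the first pass finds the `[link]` definition and the second turns `[title][link]` into an anchor while leaving `[text]` alone. Without the definition the input comes back unchanged.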

##blockquotes

They are just confusing.

##setext headings

Thematic breaks and setext headings collide, which requires explicit precedence rules:

Foo
---
bar

<h2>Foo</h2>
<p>bar</p>

##lists

Loose and tight lists are a little confusing but allow for nice flexibility.

However, nested lists are very confusing; I rarely get them right. Most of the complexity comes from handling indented code blocks.

The commonmark spec states that nested lists must be indented to the level of the first non-space character after the list marker; this is an improvement over the initial markdown ‘spec’, which suggested four spaces of indentation:

- a
··- b

<ul>
	<li>
		a
		<ul>
			<li>b</li>
		</ul>
	</li>
</ul>

It is somewhat confusing that this is not a nested list, however:

- a
·- b

<ul>
	<li>a</li>
	<li>b</li>
</ul>

This is a ‘loose’ nested list:

- a

·····- b

<ul>
	<li>
		<p>a</p>
		<ul>
			<li>b</li>
		</ul>
	</li>
</ul>

But this is a single list item containing a paragraph and a code block; the only difference from the previous example is one additional space character:

- a

······- b

<ul>
	<li>
		<p>a</p>
		<pre><code>- b
		</code></pre>
	</li>
</ul>
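The four examples above come down to comparing the second line’s indentation with the column where the first item’s content starts. A sketch in Python, assuming the first line is `- a` (content starts at column 2) and the second line begins with a list marker; `classify_second_line` is a made-up name and the rule is simplified to fit these examples only.

```python
def classify_second_line(indent, after_blank_line, content_offset=2):
    """Classify a '- b' line that follows '- a'.

    indent: number of leading spaces before '- b'.
    after_blank_line: whether a blank line separates the two lines.
    content_offset: column where the content of '- a' starts (2 here).
    """
    relative = indent - content_offset
    if relative < 0:
        # Not indented far enough to belong to the first item.
        return "sibling item"
    if relative <= 3:
        # Up to three spaces past the content column: still a list marker.
        return "nested list"
    # Four or more spaces past the content column is code, but an indented
    # code block cannot interrupt a paragraph.
    return "indented code block" if after_blank_line else "paragraph continuation"
```

Indents of 2 and 5 both give a nested list, 1 gives a sibling item, and 6 after a blank line gives an indented code block.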

##emphasis

Nothing in markdown is more complex or more confusing than emphasis and strong emphasis. The commonmark spec defines 17 rules for emphasis and strong emphasis.

This kind of thing is confusing:

**hello* world**

<p>
	<em>
		<em>hello</em> 
		world
	</em>*
</p>

but this also feels correct:

<p>
	<strong>hello* world</strong>
</p>

The use of double emphasis characters confuses matters.

Intra-word emphasis is even more confusing. What should this produce?

hello***a*friendsss

<p>hello**<em>a</em>friendsss</p>

This is what the commonmark spec requires, as best I can make out. Actual implementations across the ecosystem vary widely when it comes to these cases.
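Much of the difficulty starts with working out whether a run of `*` characters may open emphasis, close it, or both. The commonmark spec calls these runs left-flanking and right-flanking. The Python sketch below is a simplified reading of those two definitions: it only recognises ASCII punctuation (the spec also counts Unicode punctuation), and the helper names are hypothetical.

```python
import string

def _is_space(ch):
    # The start and end of the text count as whitespace.
    return ch == "" or ch.isspace()

def _is_punct(ch):
    # Simplification: ASCII punctuation only.
    return ch != "" and ch in string.punctuation

def flanking(text, start, end):
    """Return (left_flanking, right_flanking) for the run text[start:end]."""
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""
    left = not _is_space(after) and (
        not _is_punct(after) or _is_space(before) or _is_punct(before)
    )
    right = not _is_space(before) and (
        not _is_punct(before) or _is_space(after) or _is_punct(after)
    )
    return left, right
```

For `*`, a left-flanking run can open emphasis and a right-flanking run can close it. In `hello***a*friendsss` both runs come out as left- and right-flanking, so either could open or close, which is part of why intra-word cases are so hard to predict. In `**hello* world**` the first run can only open and the other two can only close.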