Markdown: Generating Heading IDs
15 / Nov 2014
In late October, I submitted a pull request (#125) to blackfriday, a Go package for processing Markdown, in order to satisfy a feature request deemed useful for Hugo (GitHub-style header generation).
The change itself wasn’t very difficult: only about 25 lines of code, including code that Dmitri Shuralyov (@shurcooL) had written to create a sanitized anchor name[1].
The change to enable this in Hugo is stuck because there’s no good way to prevent or handle duplicate heading IDs.
There are two problems here:
- Single document duplicate IDs; and
- Multiple document duplicate IDs.
Single Document Duplicate IDs
It is possible to generate the same heading ID more than once in a single Markdown document. This is a problem for blackfriday. Markdown like this:
# Header
# Header
# Header
produces HTML like this:
<h1 id="header">Header</h1>
<h1 id="header">Header</h1>
<h1 id="header">Header</h1>
I implemented a naïve approach in a second pull request (#126) that is the wrong way to implement this (I will be modifying it at my next opportunity). It uses an incrementing counter and suffix to prevent header collisions (the heading IDs for the above example would be `header`, `header-1`, and `header-2` under this model). This doesn’t prevent a simple name collision like the following, where the second and third headings both end up with the ID `header-1`:
# Header
# Header 1
# Header
It also does not prevent a collision like this (resulting in `header`, `header`, and `header-1`):
# Header
# Header {#header}
# Header
Both collisions are undesirable, but the second example (where `header` is provided by the explicit desire of the user) is a worse collision than the first[2].
In a slightly less-naïve approach, we can detect header collisions and append a suffix (like `-1`) to each header that collides (resulting in `header`, `header-1`, and `header-1-1`), but that feels wrong, unintuitive, and unnecessarily complex:
- `parser.headers` would be changed from `map[string]int` to `map[string]bool`, and would grow larger for each header that collides, because each of the forms `header`, `header-1`, and `header-1-1` would be put into `parser.headers`.
- `parser.createSanitizedAnchorName` would need to be modified to be something like what follows. (The code here is untested.)
import "unicode"

// createSanitizedAnchorName lowercases letters and digits, turns spaces into
// hyphens, drops everything else, and then makes the result unique.
func (p *parser) createSanitizedAnchorName(text string) string {
	var anchorName []rune
	for _, r := range text {
		switch {
		case r == ' ':
			anchorName = append(anchorName, '-')
		case unicode.IsLetter(r) || unicode.IsNumber(r):
			anchorName = append(anchorName, unicode.ToLower(r))
		}
	}
	return p.ensureUniqueAnchor(string(anchorName))
}

// ensureUniqueAnchor appends "-1" until the anchor is unused in this
// document, then records it in p.headers (a map[string]bool).
func (p *parser) ensureUniqueAnchor(anchor string) string {
	for p.headers[anchor] {
		anchor += "-1"
	}
	p.headers[anchor] = true
	return anchor
}
While I don’t like the collision-with-append approach, it will solve all but the most pathological cases where a user actively tries to sabotage header ID generation. Most of that can be solved by running `ensureUniqueAnchor` over any provided ID, whether it was generated by `createSanitizedAnchorName` or not. The header IDs may not match what a user has given, but they will at least be guaranteed not to collide within a single document[3]. This is the modified approach I will be submitting to blackfriday soon.
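As a rough, standalone illustration of running the uniqueness check over every ID, generated or explicit, the sketch below replays the two problem documents from earlier. `anchorSet` and `uniqueAll` are hypothetical names standing in for the parser state; this is not the code submitted to blackfriday.

```go
package main

import "fmt"

// anchorSet is a hypothetical stand-in for the parser's headers map.
type anchorSet map[string]bool

// ensureUnique appends "-1" until the anchor is unused, then records it.
func (s anchorSet) ensureUnique(anchor string) string {
	for s[anchor] {
		anchor += "-1"
	}
	s[anchor] = true
	return anchor
}

// uniqueAll runs every ID of one document through the same set, in document
// order, regardless of whether the ID was generated or written explicitly.
func uniqueAll(ids []string) []string {
	s := anchorSet{}
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		out = append(out, s.ensureUnique(id))
	}
	return out
}

func main() {
	// "# Header", "# Header {#header}", "# Header": all three arrive as "header".
	fmt.Println(uniqueAll([]string{"header", "header", "header"}))
	// prints [header header-1 header-1-1]

	// "# Header", "# Header 1", "# Header"
	fmt.Println(uniqueAll([]string{"header", "header-1", "header"}))
	// prints [header header-1 header-1-1]
}
```

In the first document the explicit `{#header}` is rewritten to `header-1`, which is exactly the trade-off described above: no collisions, but the ID no longer matches what the author asked for.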
Multiple Document Duplicate IDs
This is a problem for Hugo, and blackfriday can’t solve it. It is likely that two or more rendered documents will have identical headings (consider a site documenting a web API, where two endpoint pages both have a heading `## Endpoint`). If these documents are then rendered into a list page (such as the default Hugo index page), that list page will have multiple headings with identical fragment IDs.
Most Hugo themes that render page `{{ .Content }}` into a list page (including Hyde, the more-or-less default theme, and my own theme, Cabaret) do not include the `{{ .TableOfContents }}`, so this won’t usually be a problem, except for HTML validation. If possible, we don’t want to generate invalid HTML.
We’ve already fixed this once in Hugo, by adding a prefix based on the page ID[4]. This technique can be reused to fix cross-document IDs so that the two headings would be rendered with `#endpoint-deadbeef` and `#endpoint-beefdead`[5].
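A minimal sketch of the idea, assuming the page ID is derived from an MD5 of the source file's logical name: `pageSuffix`, `qualifiedAnchor`, and the example file names are hypothetical, and truncating the digest to eight hex characters is an assumption made only to match the shape of the IDs shown above. It does not reproduce Hugo's actual implementation.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// pageSuffix derives a short per-page token from the page's logical name.
// Truncating the MD5 hex digest to 8 characters is an assumption made here.
func pageSuffix(logicalName string) string {
	sum := md5.Sum([]byte(logicalName))
	return hex.EncodeToString(sum[:])[:8]
}

// qualifiedAnchor appends the per-page token to an anchor that is already
// unique within its own document.
func qualifiedAnchor(anchor, logicalName string) string {
	return anchor + "-" + pageSuffix(logicalName)
}

func main() {
	// Two hypothetical pages that both contain "## Endpoint".
	fmt.Println(qualifiedAnchor("endpoint", "api/users.md"))
	fmt.Println(qualifiedAnchor("endpoint", "api/orders.md"))
}
```

Because the token depends only on the logical name, the same page always gets the same fragment IDs, so links into it stay stable across builds as long as the file is not renamed.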
A different alternative[6] would be to strip all header IDs from list pages. Not impossible, but equally unpredictable (especially if there were a theme that rendered the `{{ .TableOfContents }}`).
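For illustration only, stripping could be done as a post-processing step over the rendered HTML. The sketch below assumes the `id` attribute directly follows the tag name, as in the blackfriday output shown earlier; `stripHeadingIDs` is a hypothetical helper, not something Hugo provides, and a regular expression is a shortcut rather than a robust HTML transformation.

```go
package main

import (
	"fmt"
	"regexp"
)

// headingID matches an id attribute that directly follows the tag name of an
// h1..h6 opening tag. That placement is an assumption about the rendered HTML.
var headingID = regexp.MustCompile(`(<h[1-6])\s+id="[^"]*"`)

// stripHeadingIDs removes those id attributes and leaves everything else.
func stripHeadingIDs(html string) string {
	return headingID.ReplaceAllString(html, "${1}")
}

func main() {
	fmt.Println(stripHeadingIDs(`<h2 id="endpoint">Endpoint</h2><p id="keep">text</p>`))
	// prints <h2>Endpoint</h2><p id="keep">text</p>
}
```

Any table of contents that links to those fragments would then point at nothing, which is the unpredictability mentioned above.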
To make this reliable to use, it would be necessary to introduce a new method that would work on both `Node` and `Page` objects to render a URL or URL-fragment with the appropriate page ID appended. Pending feedback on this and a couple of other link helper methods, this is how I will be proposing that this feature be fully enabled in Hugo.
[1] If you want similar functionality, you should instead use his extracted package for this through `import "github.com/shurcooL/go/github_flavored_markdown/sanitized_anchor_name"`, rather than copying the code as I did.
[2] This type of collision cannot be fixed with the “smarter” approach described, either, because the explicit header ID (`#header`) would be changed to something different. Blackfriday can’t be modified to detect this and parse it differently because it parses and renders documents in a single pass.
[3] This could also be helped if blackfriday provided any logging facility whatsoever. When header ID collisions are detected, it should be logged and reported to the user at a minimum. Ideally, the document should be rejected (e.g., crash-first behaviour), but that would not match user expectations.
[4] An MD5 of the logical name for the source file.
[5] This does require an additional change to blackfriday such that the desired suffix can be passed as part of the renderer, but I do not expect much resistance to that change.
[6] This would be difficult in the current implementation of Hugo because the renderer does not know the difference between rendering in a `Page` context as opposed to a `List` or `Node` context.