Markdown: Generating Heading IDs

15 Nov 2014

In late October, I submitted a pull request (#125) to blackfriday, a Go package for processing Markdown, in order to satisfy a feature request deemed useful for Hugo (GitHub-style header generation).

The change itself wasn’t very difficult—only about 25 lines of code, including code that Dmitri Shuralyov (@shurcooL) had written to create a sanitized anchor name¹.

The change to enable this in Hugo is stuck because there’s no good way to prevent or handle duplicate heading IDs.

There are two problems here:

1. Single document duplicate IDs; and
2. Multiple document duplicate IDs.

Single Document Duplicate IDs

It is possible to generate the same heading ID more than once in a single Markdown document. This is a problem for blackfriday. Markdown like this:

# Header
# Header
# Header

produces HTML like this:

<h1 id="header">Header</h1>
<h1 id="header">Header</h1>
<h1 id="header">Header</h1>

I implemented a naïve approach in a second pull request (#126), but it is the wrong way to solve this (I will be modifying it at my next opportunity). It uses an incrementing counter and suffix to prevent header collisions (the heading IDs for the above example would be header, header-1, and header-2 under this model). This doesn’t prevent a simple name collision, where the second heading already sanitizes to header-1 and the third heading is then assigned the same ID:

# Header
# Header 1
# Header

It also does not prevent a collision like this (resulting in header, header, and header-1):

# Header
# Header {#header}
# Header

Both collisions are undesirable, but the second example (where header is provided by the explicit desire of the user) is a worse collision than the first².
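To make the two collisions concrete, here is a minimal sketch of the counter-and-suffix model. It is not the code from the pull request: the heading type, sanitize, and naiveIDs are hypothetical names, and the sketch assumes that explicit IDs bypass the counter entirely, which is what the second example implies.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// heading is a hypothetical stand-in for a parsed Markdown heading.
type heading struct {
	text       string
	explicitID string // empty when the heading has no {#id}
}

// sanitize lowercases letters and digits, maps spaces to hyphens,
// and drops every other character.
func sanitize(text string) string {
	var b strings.Builder
	for _, r := range text {
		switch {
		case r == ' ':
			b.WriteRune('-')
		case unicode.IsLetter(r) || unicode.IsNumber(r):
			b.WriteRune(unicode.ToLower(r))
		}
	}
	return b.String()
}

// naiveIDs keeps one counter per sanitized name and appends "-N" on repeats.
// Assumption: explicit IDs are emitted as given and never touch the counters.
func naiveIDs(headings []heading) []string {
	counts := map[string]int{}
	ids := make([]string, 0, len(headings))
	for _, h := range headings {
		if h.explicitID != "" {
			ids = append(ids, h.explicitID)
			continue
		}
		id := sanitize(h.text)
		n := counts[id]
		counts[id]++
		if n > 0 {
			id = fmt.Sprintf("%s-%d", id, n)
		}
		ids = append(ids, id)
	}
	return ids
}

func main() {
	// "# Header", "# Header 1", "# Header"
	fmt.Println(naiveIDs([]heading{{text: "Header"}, {text: "Header 1"}, {text: "Header"}}))
	// prints [header header-1 header-1]

	// "# Header", "# Header {#header}", "# Header"
	fmt.Println(naiveIDs([]heading{{text: "Header"}, {text: "Header", explicitID: "header"}, {text: "Header"}}))
	// prints [header header header-1]
}
```

The counter only knows how many times it has seen a sanitized name. It never checks whether the ID it is about to emit has already been used, which is why both examples slip through.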

In a slightly less naïve approach, we can detect header collisions and append a suffix (like -1) to each header that collides (turning three identical Header headings into header, header-1, and header-1-1), but that feels wrong, unintuitive, and unnecessarily complex:

- parser.headers would be changed from map[string]int to map[string]bool, and would grow larger for each header that collides, because each of the forms header, header-1, and header-1-1 would be put into parser.headers.
- parser.createSanitizedAnchorName would need to be modified to be something like what follows. (The code here is untested.)

func (p *parser) createSanitizedAnchorName(text string) string {
  var anchorName []rune
  for _, r := range text {
    switch {
    case r == ' ':
      // Spaces become hyphens.
      anchorName = append(anchorName, '-')
    case unicode.IsLetter(r) || unicode.IsNumber(r):
      // Letters and digits are kept, lowercased; everything else is dropped.
      anchorName = append(anchorName, unicode.ToLower(r))
    }
  }

  // Make the generated name unique within this document.
  return p.ensureUniqueAnchor(string(anchorName))
}

// ensureUniqueAnchor appends "-1" until the anchor is unused, then records it.
func (p *parser) ensureUniqueAnchor(anchor string) string {
  for p.headers[anchor] {
    anchor += "-1"
  }

  p.headers[anchor] = true

  return anchor
}

While I don’t like the collision-with-append approach, it will solve all but the most pathological cases where a user actively tries to sabotage header ID generation. Most of those can be solved by running ensureUniqueAnchor over every ID, whether it was generated by createSanitizedAnchorName or provided explicitly. The header IDs may not match what the user has given, but they will at least be guaranteed not to collide within a single document³. This is the modified approach I will be submitting to blackfriday soon.
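As a rough illustration of running the uniqueness check over every ID, generated or explicit, here is a standalone sketch. The anchorSet type and the uniqueIDs helper are hypothetical names used only for this example; blackfriday’s parser is not structured this way.

```go
package main

import "fmt"

// anchorSet records every ID already emitted for one document.
type anchorSet map[string]bool

// unique appends "-1" until id is unused, records the result, and returns it.
func (s anchorSet) unique(id string) string {
	for s[id] {
		id += "-1"
	}
	s[id] = true
	return id
}

// uniqueIDs passes every candidate ID through the same set, in document
// order, regardless of whether it was generated or written by the user.
func uniqueIDs(candidates []string) []string {
	seen := anchorSet{}
	out := make([]string, 0, len(candidates))
	for _, id := range candidates {
		out = append(out, seen.unique(id))
	}
	return out
}

func main() {
	// "# Header", "# Header {#header}", "# Header"
	fmt.Println(uniqueIDs([]string{"header", "header", "header"}))
	// prints [header header-1 header-1-1]

	// "# Header", "# Header 1", "# Header"
	fmt.Println(uniqueIDs([]string{"header", "header-1", "header"}))
	// prints [header header-1 header-1-1]
}
```

In the first call the explicit {#header} is rewritten to header-1. That is exactly the trade-off described above: no duplicates within the document, at the cost of not always honouring the ID the user asked for.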

Multiple Document Duplicate IDs

This is a problem for Hugo, and blackfriday can’t solve it. It is likely that two or more rendered documents will have identical headings (consider a site documenting a web API, where two pages each have a heading ## Endpoint). If these documents are then rendered into a list page (such as the default Hugo index page), that list page will have multiple headings with identical fragment IDs.

Most Hugo themes that render page {{ .Content }} into a list page (including Hyde, the more-or-less default theme, and my own theme, Cabaret) do not include the {{ .TableOfContents }}, so this won’t usually be a problem, except for HTML validation. If possible, we don’t want to generate invalid HTML.

We’ve already fixed this once in Hugo, by adding a prefix based on the page ID⁴. This technique can be reused to fix cross-document IDs so that the two headings would be rendered with #endpoint-deadbeef and #endpoint-beefdead⁵.
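The suffixing idea can be sketched roughly as follows. This is not Hugo’s code: pageSuffix and scopedID are hypothetical helpers, the file names are made up, and truncating the MD5 to eight hex characters is an assumption suggested only by the length of the example suffixes above.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// pageSuffix derives a short, stable token from a page's logical name.
// Assumption: the first eight hex characters of the MD5 digest are enough
// to tell pages apart on a single list page.
func pageSuffix(logicalName string) string {
	sum := md5.Sum([]byte(logicalName))
	return hex.EncodeToString(sum[:])[:8]
}

// scopedID appends the page token to a heading ID, so that identical
// headings from different source files get different fragment IDs.
func scopedID(anchor, logicalName string) string {
	return anchor + "-" + pageSuffix(logicalName)
}

func main() {
	// Two hypothetical API pages that both contain "## Endpoint".
	fmt.Println(scopedID("endpoint", "api/users.md"))
	fmt.Println(scopedID("endpoint", "api/orders.md"))
}
```

Because the suffix depends only on the source file’s logical name, the same heading gets the same fragment ID on every render, while headings from different pages stay distinct even when they share a list page.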

An alternative⁶ would be to strip all header IDs from list pages. Not impossible, but equally unpredictable (especially if there were a theme that rendered the {{ .TableOfContents }}).

To make this reliable to use, it would be necessary to introduce a new method that would work on both Node and Page objects to render a URL or URL-fragment with the appropriate page ID appended. Pending feedback on this and a couple of other link helper methods, this is how I will be proposing that this feature be fully enabled in Hugo.

¹ If you want similar functionality, you should instead use Dmitri’s extracted package for this through import "github.com/shurcooL/go/github_flavored_markdown/sanitized_anchor_name", rather than copying the code as I did.
² This type of collision cannot be fixed with the “smarter” approach described, either, because the explicit header ID (#header) would be changed to something different. Blackfriday can’t be modified to detect this and parse it differently because it parses and renders documents in a single pass.
³ This could also be helped if blackfriday provided any logging facility whatsoever. When header ID collisions are detected, they should be logged and reported to the user at a minimum. Ideally, the document should be rejected (e.g., crash-first behaviour), but that would not match user expectations.
⁴ An MD5 of the logical name for the source file.
⁵ This does require an additional change to blackfriday such that the desired suffix can be passed as part of the renderer, but I do not expect much resistance to that change.
⁶ This would be difficult in the current implementation of Hugo because the renderer does not know the difference between rendering in a Page context as opposed to a List or Node context.

Tags: Markdown, go, golang

halo • statue

technology, opinions, and recipes
