DIYgod

How to elegantly compile a Markdown document

Markdown is a widely used lightweight markup language that allows people to write documents in an easy-to-read and easy-to-write plain text format. It is also the main article format used by xLog. This article will illustrate how to elegantly parse a Markdown document using xLog Flavored Markdown as an example.

Architecture#

The parsing process can be represented by the following architecture:

[Figure: architecture diagram (Mermaid)]

Key concepts:

  • unified: A library for parsing, checking, transforming, and serializing content through syntax trees and plugins.
  • remark: One of the unified ecosystem projects, a plugin-driven Markdown processing library.
  • rehype: One of the unified ecosystem projects, a plugin-driven HTML processing library.
  • mdast: The abstract syntax tree specification used by remark to represent Markdown.
  • hast: The abstract syntax tree specification used by rehype to represent HTML.

In simple terms, we hand a Markdown document to a parser from the unified ecosystem, which turns it into a syntax tree that unified understands. A series of unified plugins then transforms the tree into the content we need, and finally a set of unified tools outputs it in the desired format. Below, we explain the three steps in turn: parse, transform, and stringify.
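To make the three phases concrete, here is a toy model of the flow. It is a minimal sketch and not the unified API: every name in it (ToyTree, toyParse, toyRun, toyStringify, and the two transforms) is hypothetical, and the "tree" is just a list of lines.

```typescript
// Toy model of parse -> transform -> stringify. NOT the unified API;
// all names are hypothetical and the "tree" is only a list of lines.
type ToyTree = { type: "root"; lines: string[] }
type ToyTransform = (tree: ToyTree) => ToyTree

// Parse: turn raw text into a tree-like structure.
function toyParse(source: string): ToyTree {
  return { type: "root", lines: source.split("\n") }
}

// Transform: run every transform in order; each one receives the previous result.
function toyRun(tree: ToyTree, transforms: ToyTransform[]): ToyTree {
  return transforms.reduce((current, transform) => transform(current), tree)
}

// Stringify: serialize the tree into the output format.
function toyStringify(tree: ToyTree): string {
  return tree.lines.join("\n")
}

const trimLines: ToyTransform = (tree) => ({
  ...tree,
  lines: tree.lines.map((line) => line.trim()),
})

const dropEmptyLines: ToyTransform = (tree) => ({
  ...tree,
  lines: tree.lines.filter((line) => line.length > 0),
})

const toyOutput = toyStringify(
  toyRun(toyParse("  hello  \n\n world "), [trimLines, dropEmptyLines]),
)
console.log(JSON.stringify(toyOutput))
```

The order of the transforms matters here just as it does for real plugins: dropEmptyLines only sees lines that trimLines has already cleaned up.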

Parse#

Regardless of whether the input is Markdown, HTML, or plain text, it needs to be parsed into an operable format. This format is called a syntax tree. Specifications (like mdast) define what such a syntax tree looks like. Processors (like remark for mdast) are responsible for creating them.
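As an illustration of what such a syntax tree looks like, the sketch below writes out by hand roughly the tree a Markdown parser would produce for a heading followed by a paragraph. The node types are a simplified subset of mdast (real trees also carry position information and many more node types), and the names MdText, MdHeading, MdParagraph, MdRoot, and sampleMdast are only for this example.

```typescript
// Simplified mdast-like node types. The real specification defines many
// more node types and optional fields such as `position`.
type MdText = { type: "text"; value: string }
type MdHeading = { type: "heading"; depth: number; children: MdText[] }
type MdParagraph = { type: "paragraph"; children: MdText[] }
type MdRoot = { type: "root"; children: Array<MdHeading | MdParagraph> }

// Roughly the tree for this Markdown source:
//
//   # Hello
//
//   World
const sampleMdast: MdRoot = {
  type: "root",
  children: [
    { type: "heading", depth: 1, children: [{ type: "text", value: "Hello" }] },
    { type: "paragraph", children: [{ type: "text", value: "World" }] },
  ],
}

// Because the tree is plain data, ordinary code can inspect it.
const nodeTypes = sampleMdast.children.map((node) => node.type)
console.log(nodeTypes)
```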

The first step is the simplest: since the input is Markdown, we use remark-parse to parse the Markdown document into an mdast syntax tree.

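In xLog Flavored Markdown, this step looks like:

import { unified } from "unified"
import remarkParse from "remark-parse"
import { VFile } from "vfile"

const processor = unified().use(remarkParse)

const file = new VFile(content) // content: the raw Markdown string
const mdastTree = processor.parse(file) // mdast syntax tree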

Transform#

This is where the magic happens. Users choose which plugins to combine and the order in which they run. Plugins are attached at this stage to transform and inspect the trees they receive.

This step is crucial: it covers not only the conversion from Markdown to HTML but also any extra features we want to add during compilation, such as non-standard syntactic sugar, HTML sanitization to prevent XSS, syntax highlighting, and embedded custom components.

The unified ecosystem has many plugins, and they are actively maintained, so almost all basic needs are covered. For specific needs that no plugin meets, it is also easy to write a custom transform.
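To give a feel for how small a custom transform can be, here is a self-contained sketch that demotes every h1 to h2 on a simplified HTML tree. It does not use the unified or unist-util-visit APIs and is not xLog's actual implementation: the types HText, HElement, and HChild and the functions walkElements and demoteH1 are assumptions made up for this example.

```typescript
// Simplified hast-like types. Real hast properties may also hold
// booleans, numbers, and arrays.
type HText = { type: "text"; value: string }
type HElement = {
  type: "element"
  tagName: string
  properties: Record<string, string>
  children: HChild[]
}
type HChild = HElement | HText

// Walk the tree depth-first and call `visitor` for every element node.
function walkElements(node: HChild, visitor: (element: HElement) => void): void {
  if (node.type !== "element") return
  visitor(node)
  node.children.forEach((child) => walkElements(child, visitor))
}

// Hypothetical transform: demote every h1 to h2, mutating the tree in place.
function demoteH1(tree: HElement): HElement {
  walkElements(tree, (element) => {
    if (element.tagName === "h1") element.tagName = "h2"
  })
  return tree
}

const article: HElement = {
  type: "element",
  tagName: "div",
  properties: {},
  children: [
    { type: "element", tagName: "h1", properties: {}, children: [{ type: "text", value: "Title" }] },
    { type: "element", tagName: "p", properties: {}, children: [{ type: "text", value: "Body" }] },
  ],
}

demoteH1(article)

// Collect the tag names after the transform to see its effect.
const tagNames: string[] = []
walkElements(article, (element) => {
  tagNames.push(element.tagName)
})
console.log(tagNames)
```

A real plugin would typically wrap a function like demoteH1 so that it can be passed to .use(), and would rely on a tree-walking utility instead of a hand-written walker.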

One special plugin is remark-rehype, which converts the mdast syntax tree into a hast syntax tree. Therefore, remark plugins, which operate on Markdown, must come before it, and rehype plugins, which operate on HTML, must come after it.
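The sketch below shows the idea behind this bridge on a tiny scale: each Markdown node is mapped to the HTML element that represents it. It handles only headings and paragraphs, uses simplified node types, and is in no way the remark-rehype implementation; toyMdastToHast and all type names are hypothetical.

```typescript
// Simplified mdast-like input types.
type MdText = { type: "text"; value: string }
type MdBlock =
  | { type: "heading"; depth: number; children: MdText[] }
  | { type: "paragraph"; children: MdText[] }
type MdRoot = { type: "root"; children: MdBlock[] }

// Simplified hast-like output types.
type HText = { type: "text"; value: string }
type HElement = {
  type: "element"
  tagName: string
  properties: Record<string, string>
  children: HText[]
}
type HRoot = { type: "root"; children: HElement[] }

// Map each Markdown block node to the HTML element that represents it.
function toyMdastToHast(tree: MdRoot): HRoot {
  return {
    type: "root",
    children: tree.children.map(
      (block): HElement => ({
        type: "element",
        tagName: block.type === "heading" ? `h${block.depth}` : "p",
        properties: {},
        children: block.children.map((text): HText => ({ type: "text", value: text.value })),
      }),
    ),
  }
}

const bridged = toyMdastToHast({
  type: "root",
  children: [
    { type: "heading", depth: 2, children: [{ type: "text", value: "Parse" }] },
    { type: "paragraph", children: [{ type: "text", value: "Some text." }] },
  ],
})

const bridgedTags = bridged.children.map((element) => element.tagName)
console.log(bridgedTags)
```

Before the bridge the tree speaks in Markdown terms (heading, paragraph); after it the tree speaks in HTML terms (h2, p), which is why plugins on either side cannot be swapped.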

xLog Flavored Markdown includes many transformation plugins:

const processor = unified()
  .use(remarkParse)
  .use(remarkGithubAlerts)
  .use(remarkBreaks)
  .use(remarkFrontmatter, ["yaml"])
  .use(remarkGfm, {
    singleTilde: false,
  })
  .use(remarkDirective)
  .use(remarkDirectiveRehype)
  .use(remarkCalloutDirectives)
  .use(remarkYoutube)
  .use(remarkMath, {
    singleDollarTextMath: false,
  })
  .use(remarkPangu)
  .use(emoji)
  .use(remarkRehype, { allowDangerousHtml: true })
  .use(rehypeRaw)
  .use(rehypeIpfs)
  .use(rehypeSlug)
  .use(rehypeAutolinkHeadings, {
    behavior: "append",
    properties: {
      className: "xlog-anchor",
      ariaHidden: true,
      tabIndex: -1,
    },
    content(node) {
      return [
        {
          type: "text",
          value: "#",
        },
      ]
    },
  })
  .use(rehypeSanitize, strictMode ? undefined : sanitizeScheme)
  .use(rehypeTable)
  .use(rehypeExternalLink)
  .use(rehypeMermaid)
  .use(rehypeWrapCode)
  .use(rehypeInferDescriptionMeta)
  .use(rehypeEmbed, {
    transformers,
  })
  .use(rehypeRemoveH1)
  .use(rehypePrism, {
    ignoreMissing: true,
    showLineNumbers: true,
  })
  .use(rehypeKatex, {
    strict: false,
  })
  .use(rehypeMention)

const hastTree = processor.runSync(mdastTree, file)

Here are some of the plugins used:

  • remarkGithubAlerts: Adds GitHub-style alerts syntax (demo).
  • remarkBreaks: Turns a single line break into a hard break, so a blank line is no longer needed to start a new line.
  • remarkFrontmatter: Supports front matter (YAML, TOML, etc.).
  • remarkGfm: Supports GitHub Flavored Markdown, a set of non-standard extensions to the original Markdown specification that has effectively become a de facto standard.
  • remarkDirective, remarkDirectiveRehype: Support the non-standard generic directives proposal for Markdown.
  • remarkMath, rehypeKatex: Support complex mathematical formulas (demo).
  • rehypeRaw: Supports custom HTML mixed in Markdown.
  • rehypeIpfs: Custom plugin to support ipfs:// protocol addresses for images, audio, and video.
  • rehypeSlug: Adds IDs to headings.
  • rehypeAutolinkHeadings: Adds a link to each heading that points to the heading itself.
  • rehypeSanitize: Cleans HTML to ensure safety and avoid XSS attacks.
  • rehypeExternalLink: Custom plugin to add target="_blank" and rel="noopener noreferrer" to external links.
  • rehypeMermaid: Custom plugin to render diagrams and charts using Mermaid. The architecture diagram in this article is rendered using Mermaid.
  • rehypeInferDescriptionMeta: Used to automatically generate document descriptions.
  • rehypeEmbed: Custom plugin to automatically embed YouTube, Twitter, GitHub, etc., cards based on links.
  • rehypeRemoveH1: Custom plugin to convert h1 to h2.
  • rehypePrism: Supports syntax highlighting.
  • rehypeMention: Custom plugin to support mentioning other xLog users like @DIYgod.
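As an example of what one of these custom plugins might do internally, here is a sketch of the external-link behavior described above. It is not the actual rehypeExternalLink code: the tree types are simplified (properties are plain strings), and SITE_HOST, isExternal, and markExternalLinks are hypothetical names.

```typescript
// Simplified hast-like types; only the properties this sketch needs.
type HText = { type: "text"; value: string }
type HElement = {
  type: "element"
  tagName: string
  properties: { href?: string; target?: string; rel?: string }
  children: Array<HElement | HText>
}

// Hypothetical host of our own site, used to decide what counts as external.
const SITE_HOST = "example.com"

function isExternal(href: string): boolean {
  try {
    // Without a base, `new URL` throws for relative URLs such as "/about" or "#top".
    const url = new URL(href)
    return /^https?:$/.test(url.protocol) && url.host !== SITE_HOST
  } catch {
    return false
  }
}

// Recursively mark external links so they open in a new tab safely.
function markExternalLinks(node: HElement | HText): void {
  if (node.type !== "element") return
  const href = node.properties.href
  if (node.tagName === "a" && href !== undefined && isExternal(href)) {
    node.properties.target = "_blank"
    node.properties.rel = "noopener noreferrer"
  }
  node.children.forEach((child) => markExternalLinks(child))
}

const internalLink: HElement = {
  type: "element",
  tagName: "a",
  properties: { href: "/about" },
  children: [{ type: "text", value: "About" }],
}
const externalLink: HElement = {
  type: "element",
  tagName: "a",
  properties: { href: "https://github.com/DIYgod" },
  children: [{ type: "text", value: "GitHub" }],
}

markExternalLinks({
  type: "element",
  tagName: "p",
  properties: {},
  children: [internalLink, externalLink],
})
console.log(internalLink.properties, externalLink.properties)
```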

Stringify#

The final step is to serialize the (transformed) syntax tree into Markdown, HTML, or plain text, which may differ from the input format.

There are also many tools in the unified ecosystem that can output various formats we need.
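The following sketch shows how one tree can yield two different outputs, an HTML string and plain text. It is a deliberately naive stand-in for tools like hast-util-to-html and hast-util-to-text, not their implementation: it ignores void elements such as img and br, and escapeHtml, toyToHtml, and toyToText are hypothetical names.

```typescript
// Simplified hast-like types.
type HText = { type: "text"; value: string }
type HElement = {
  type: "element"
  tagName: string
  properties: Record<string, string>
  children: HChild[]
}
type HChild = HElement | HText

function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
}

// Serialize to HTML. Naive: every element gets a closing tag, so void
// elements such as <img> or <br> are not handled correctly.
function toyToHtml(node: HChild): string {
  if (node.type === "text") return escapeHtml(node.value)
  const attributes = Object.entries(node.properties)
    .map(([key, value]) => ` ${key}="${escapeHtml(value)}"`)
    .join("")
  const inner = node.children.map((child) => toyToHtml(child)).join("")
  return `<${node.tagName}${attributes}>${inner}</${node.tagName}>`
}

// Serialize to plain text by concatenating all text nodes.
function toyToText(node: HChild): string {
  return node.type === "text"
    ? node.value
    : node.children.map((child) => toyToText(child)).join("")
}

const paragraph: HElement = {
  type: "element",
  tagName: "p",
  properties: {},
  children: [
    { type: "text", value: "Hello " },
    {
      type: "element",
      tagName: "a",
      properties: { href: "/about" },
      children: [{ type: "text", value: "xLog" }],
    },
    { type: "text", value: " & friends" },
  ],
}

const htmlOutput = toyToHtml(paragraph)
const textOutput = toyToText(paragraph)
console.log(htmlOutput)
console.log(textOutput)
```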

For example, xLog needs several outputs from the same article: an automatically generated table of contents shown to the right of the article, plain text for estimating reading time and generating AI summaries, HTML for RSS, React elements for rendering the page, and extracted images and descriptions for article cards. These are produced with tools such as mdast-util-toc, hast-util-to-text, hast-util-to-html, hast-util-to-jsx-runtime, and unist-util-visit.
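Two of these outputs are simple enough to sketch by hand. The example below collects headings from a simplified Markdown tree into a flat table of contents and estimates reading time from plain text. It does not reproduce what mdast-util-toc returns, the 200 words-per-minute rate is an assumption rather than xLog's actual setting, and collectHeadings and estimateReadingMinutes are hypothetical names.

```typescript
// Simplified mdast-like types.
type MdText = { type: "text"; value: string }
type MdHeading = { type: "heading"; depth: number; children: MdText[] }
type MdParagraph = { type: "paragraph"; children: MdText[] }
type MdRoot = { type: "root"; children: Array<MdHeading | MdParagraph> }

type TocEntry = { depth: number; title: string }

// Collect top-level headings into a flat list; a real table of contents
// would usually nest the entries by depth.
function collectHeadings(tree: MdRoot): TocEntry[] {
  const entries: TocEntry[] = []
  for (const node of tree.children) {
    if (node.type === "heading") {
      entries.push({
        depth: node.depth,
        title: node.children.map((text) => text.value).join(""),
      })
    }
  }
  return entries
}

// Assumed reading speed; not a value taken from xLog.
const WORDS_PER_MINUTE = 200

// Estimate reading time in whole minutes, with a minimum of one minute.
function estimateReadingMinutes(plainText: string): number {
  const words = plainText.trim().split(/\s+/).filter((word) => word.length > 0)
  return Math.max(1, Math.ceil(words.length / WORDS_PER_MINUTE))
}

const post: MdRoot = {
  type: "root",
  children: [
    { type: "heading", depth: 2, children: [{ type: "text", value: "Parse" }] },
    { type: "paragraph", children: [{ type: "text", value: "First step." }] },
    { type: "heading", depth: 2, children: [{ type: "text", value: "Transform" }] },
  ],
}

const tocEntries = collectHeadings(post)
const shortReadMinutes = estimateReadingMinutes("just a few words")
console.log(tocEntries, shortReadMinutes)
```

Note that the table of contents is derived from the Markdown tree while the reading time works on plain text, so different outputs can start from different stages of the same pipeline.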

In xLog Flavored Markdown, this step looks like:

{
  toToc: () =>
    mdastTree &&
    toc(mdastTree, {
      tight: true,
      ordered: true,
    }),
  toHTML: () => hastTree && toHtml(hastTree),
  toElement: () =>
    hastTree &&
    toJsxRuntime(hastTree, {
      Fragment,
      components: {
        // @ts-expect-error
        img: AdvancedImage,
        mention: Mention,
        mermaid: Mermaid,
        // @ts-expect-error
        audio: APlayer,
        // @ts-expect-error
        video: DPlayer,
        tweet: Tweet,
        "github-repo": GithubRepo,
        "xlog-post": XLogPost,
        // @ts-expect-error
        style: Style,
      },
      ignoreInvalidStyle: true,
      jsx,
      jsxs,
      passNode: true,
    }),
  toMetadata: () => {
    const metadata = {
      frontMatter: undefined,
      images: [],
      audio: undefined,
      excerpt: undefined,
    } as {
      frontMatter?: Record<string, any>
      images: string[]
      audio?: string
      excerpt?: string
    }

    metadata.excerpt = file.data.meta?.description || undefined

    if (mdastTree) {
      visit(mdastTree, (node, index, parent) => {
        if (node.type === "yaml") {
          metadata.frontMatter = jsYaml.load(node.value) as Record<
            string,
            any
          >
        }
      })
    }
    if (hastTree) {
      visit(hastTree, (node, index, parent) => {
        if (node.type === "element") {
          if (
            node.tagName === "img" &&
            typeof node.properties.src === "string"
          ) {
            metadata.images.push(node.properties.src)
          }
          if (node.tagName === "audio") {
            if (typeof node.properties.cover === "string") {
              metadata.images.push(node.properties.cover)
            }
            if (!metadata.audio && typeof node.properties.src === "string") {
              metadata.audio = node.properties.src
            }
          }
        }
      })
    }

    return metadata
  },
}

In this way, we start from the original Markdown document and elegantly obtain every output format we need.

In addition, the syntax tree produced by unified can be used to build a Markdown editor with synchronized scrolling and live preview, such as xLog's two-column Markdown editor (code). We can discuss this next time.
