How to Elegantly Compile a Markdown Document

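Markdown is a widely used lightweight markup language that lets people write documents in an easy-to-read, easy-to-write plain-text format. It is also the main article format used by xLog. This article uses xLog Flavored Markdown as an example to show how to parse a Markdown document elegantly.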

Architecture

The process can be represented by the following architecture:

[Figure: architecture diagram, rendered with Mermaid]

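Key concepts:

unified: a library for parsing, inspecting, transforming, and serializing content through syntax trees and plugins
remark: part of the unified ecosystem, a plugin-driven Markdown processor
rehype: part of the unified ecosystem, a plugin-driven HTML processor
mdast: the abstract syntax tree specification remark uses to represent Markdown
hast: the abstract syntax tree specification rehype uses to represent HTML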

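In short, the Markdown document is handed to a parser from the unified ecosystem, which turns it into a syntax tree that unified understands. A series of unified plugins then transforms the tree into the content we need, and a set of unified utilities finally outputs it in the required formats. The sections below walk through these three steps: parse, transform, and stringify.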
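To make the three stages concrete, here is a deliberately tiny, dependency-free sketch. It is not how unified or remark work internally: the types `MdNode` and `HtmlNode` and the functions `parse`, `transform`, `stringify`, and `escapeHtml` are hypothetical stand-ins that only understand headings and paragraphs.

```typescript
// Toy stand-ins for mdast and hast nodes (hypothetical, far simpler than the real specs).
type MdNode =
  | { type: "heading"; depth: number; text: string }
  | { type: "paragraph"; text: string }
type HtmlNode = { tagName: string; text: string }

// Parse: turn source text into a Markdown tree (here just a flat list of blocks).
function parse(source: string): MdNode[] {
  return source
    .split(/\n{2,}/)
    .map((block) => block.trim())
    .filter((block) => block.length > 0)
    .map((block): MdNode => {
      const match = /^(#{1,6})\s+(.*)$/.exec(block)
      return match
        ? { type: "heading", depth: match[1].length, text: match[2] }
        : { type: "paragraph", text: block }
    })
}

// Transform: convert the Markdown tree into an HTML tree.
function transform(nodes: MdNode[]): HtmlNode[] {
  return nodes.map((node) =>
    node.type === "heading"
      ? { tagName: `h${node.depth}`, text: node.text }
      : { tagName: "p", text: node.text },
  )
}

// Escape characters that would otherwise be interpreted as markup.
function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;")
}

// Stringify: serialize the HTML tree into the final output format.
function stringify(nodes: HtmlNode[]): string {
  return nodes
    .map((n) => `<${n.tagName}>${escapeHtml(n.text)}</${n.tagName}>`)
    .join("\n")
}

const html = stringify(transform(parse("# Title\n\nHello <world>")))
// html === "<h1>Title</h1>\n<p>Hello &lt;world&gt;</p>"
```

Keeping the stages separate is what lets the same parsed tree feed several different outputs later on.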

Parse

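Whether the input is Markdown, HTML, or plain text, it needs to be parsed into a workable format. Such a format is called a syntax tree. Specifications (such as mdast) define what such a syntax tree looks like. Processors (such as remark for mdast) are responsible for creating them.

This is the simplest step. Since the input is Markdown, we use remark-parse to compile the Markdown document into a syntax tree in mdast format.

The corresponding code in xLog Flavored Markdown: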

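import { unified } from "unified"
import remarkParse from "remark-parse"
import { VFile } from "vfile"

// `content` is the raw Markdown source string
const processor = unified().use(remarkParse)
const file = new VFile(content)
const mdastTree = processor.parse(file)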
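For orientation, the following sketch shows roughly what an mdast tree for the input `# Hello *world*` looks like, with position information omitted. The `MdastNode` type and the `nodeTypes` helper are simplified assumptions for illustration; real code would use the types from the mdast package.

```typescript
// Minimal structural type covering the nodes used below (the real spec defines many more).
type MdastNode = {
  type: string
  depth?: number
  value?: string
  children?: MdastNode[]
}

// Approximate tree for "# Hello *world*" (position fields left out).
const exampleTree: MdastNode = {
  type: "root",
  children: [
    {
      type: "heading",
      depth: 1,
      children: [
        { type: "text", value: "Hello " },
        { type: "emphasis", children: [{ type: "text", value: "world" }] },
      ],
    },
  ],
}

// Collect node types depth-first, parent before children.
function nodeTypes(node: MdastNode, out: string[] = []): string[] {
  out.push(node.type)
  for (const child of node.children ?? []) nodeTypes(child, out)
  return out
}

const types = nodeTypes(exampleTree)
// types: ["root", "heading", "text", "emphasis", "text"]
```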

Transform

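This is where the magic happens. Users compose plugins and choose the order they run in. Plugins plug in at this stage and transform and inspect the format they are given.

This step is the most important. It covers not only the conversion from Markdown to HTML but also anything extra we want to slip into the compilation process, such as adding non-standard syntactic sugar, sanitizing HTML to prevent XSS, adding syntax highlighting, and embedding custom components.

unified has a large number of plugins and they are updated fairly promptly, so most common needs are already covered. For specific needs that are not, writing your own transform is also easy.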
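As a rough illustration of how small a custom transform can be, here is a sketch of a plugin-shaped function that turns every `h1` into an `h2`. The name `rehypeDemoteH1` and the `HastNode` type are hypothetical, and this is not xLog's actual implementation; it only follows the general unified convention that a plugin is a function returning a transformer that receives the tree.

```typescript
// Minimal hast-like shape (assumption: real code would import types from the hast package).
type HastNode = {
  type: "root" | "element" | "text"
  tagName?: string
  value?: string
  children?: HastNode[]
}

// A plugin returns a transformer; the transformer may mutate the tree in place.
function rehypeDemoteH1() {
  return (tree: HastNode): void => {
    const walk = (node: HastNode): void => {
      if (node.type === "element" && node.tagName === "h1") {
        node.tagName = "h2"
      }
      for (const child of node.children ?? []) walk(child)
    }
    walk(tree)
  }
}

// Standalone usage, without a unified processor:
const demoTree: HastNode = {
  type: "root",
  children: [
    { type: "element", tagName: "h1", children: [{ type: "text", value: "Title" }] },
  ],
}
rehypeDemoteH1()(demoTree)
// The heading element in demoTree now has tagName "h2".
```

In a real pipeline a function of this shape would be registered with `.use(...)` like any other plugin, and a utility such as unist-util-visit would usually replace the hand-written `walk`.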

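One special plugin is remark-rehype, which converts the mdast syntax tree into a hast syntax tree. Plugins before it must therefore be remark plugins that operate on Markdown, and plugins after it must be rehype plugins that operate on HTML.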
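The bridging idea can be sketched as a recursive mapping from Markdown node types to HTML elements. The function `toHastSketch` and both node types below are simplified assumptions covering only four node kinds; the real remark-rehype handles far more cases (properties, raw HTML, footnotes, and so on).

```typescript
type MdastNode = {
  type: string
  depth?: number
  value?: string
  children?: MdastNode[]
}
type HastNode = {
  type: "root" | "element" | "text"
  tagName?: string
  value?: string
  children?: HastNode[]
}

// Map each Markdown node to an HTML node, converting children first.
function toHastSketch(node: MdastNode): HastNode {
  const children = (node.children ?? []).map((child) => toHastSketch(child))
  switch (node.type) {
    case "root":
      return { type: "root", children }
    case "heading":
      return { type: "element", tagName: `h${node.depth ?? 1}`, children }
    case "paragraph":
      return { type: "element", tagName: "p", children }
    case "emphasis":
      return { type: "element", tagName: "em", children }
    case "text":
      return { type: "text", value: node.value ?? "" }
    default:
      // Toy fallback: wrap unknown nodes in a div so their children are kept.
      return { type: "element", tagName: "div", children }
  }
}

const converted = toHastSketch({
  type: "root",
  children: [
    {
      type: "heading",
      depth: 1,
      children: [
        { type: "text", value: "Hello " },
        { type: "emphasis", children: [{ type: "text", value: "world" }] },
      ],
    },
  ],
})
// converted is a root containing an h1 whose children are a text node and an em element.
```

Because the node vocabulary changes at this point, a plugin written against mdast node types would find nothing to match if it ran after the conversion, which is why the ordering matters.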

xLog Flavored Markdown uses a large number of transform plugins:

const processor = unified()
  .use(remarkParse)
  .use(remarkGithubAlerts)
  .use(remarkBreaks)
  .use(remarkFrontmatter, ["yaml"])
  .use(remarkGfm, {
    singleTilde: false,
  })
  .use(remarkDirective)
  .use(remarkDirectiveRehype)
  .use(remarkCalloutDirectives)
  .use(remarkYoutube)
  .use(remarkMath, {
    singleDollarTextMath: false,
  })
  .use(remarkPangu)
  .use(emoji)
  .use(remarkRehype, { allowDangerousHtml: true })
  .use(rehypeRaw)
  .use(rehypeIpfs)
  .use(rehypeSlug)
  .use(rehypeAutolinkHeadings, {
    behavior: "append",
    properties: {
      className: "xlog-anchor",
      ariaHidden: true,
      tabIndex: -1,
    },
    content(node) {
      return [
        {
          type: "text",
          value: "#",
        },
      ]
    },
  })
  .use(rehypeSanitize, strictMode ? undefined : sanitizeScheme)
  .use(rehypeTable)
  .use(rehypeExternalLink)
  .use(rehypeMermaid)
  .use(rehypeWrapCode)
  .use(rehypeInferDescriptionMeta)
  .use(rehypeEmbed, {
    transformers,
  })
  .use(rehypeRemoveH1)
  .use(rehypePrism, {
    ignoreMissing: true,
    showLineNumbers: true,
  })
  .use(rehypeKatex, {
    strict: false,
  })
  .use(rehypeMention)

// Run all registered transform plugins on the parsed tree
const hastTree = processor.runSync(mdastTree, file)

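Some of the plugins used are described below:

remarkGithubAlerts: adds GitHub-style Alerts syntax
remarkBreaks: treats a single line break as a hard break, so a blank line is no longer needed to start a new line
remarkFrontmatter: supports front matter (YAML, TOML, etc.)
remarkGfm: supports the set of non-standard extensions GitHub added on top of the original Markdown syntax (these extensions are now so widely used that they have become a de facto standard)
remarkDirective, remarkDirectiveRehype: support the non-standard generic directives proposal for Markdown
remarkMath, rehypeKatex: support complex math formulas
rehypeRaw: supports custom HTML mixed into Markdown
rehypeIpfs: custom plugin that lets images, audio, and video use ipfs:// addresses
rehypeSlug: adds an id to headings
rehypeAutolinkHeadings: adds to each heading a link that points to itself
rehypeSanitize: sanitizes HTML to keep it safe and prevent XSS attacks
rehypeExternalLink: custom plugin that adds target="_blank" and rel="noopener noreferrer" to external links
rehypeMermaid: custom plugin that renders Mermaid, a diagramming and charting tool; the architecture diagram in this article is rendered with Mermaid
rehypeInferDescriptionMeta: automatically generates a description of the document
rehypeEmbed: custom plugin that automatically embeds YouTube, Twitter, GitHub, and other cards based on links
rehypeRemoveH1: custom plugin that converts h1 to h2
rehypePrism: supports syntax highlighting
rehypeMention: custom plugin that supports mentioning other xLog users, as in @DIYgod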
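To give a feel for what one of the custom plugins in this list might do, the sketch below adds `target` and `rel` attributes to external links in a hast-like tree. It is an assumption-laden illustration, not xLog's rehypeExternalLink: the `El` type and `addExternalLinkAttrs` function are hypothetical, and "external" is naively defined here as any `href` starting with `http://` or `https://`.

```typescript
// Minimal element shape with string-only properties (a simplification of hast).
type El = {
  type: string
  tagName?: string
  properties?: Record<string, string>
  children?: El[]
}

// Walk the tree and mark absolute http(s) links as external.
function addExternalLinkAttrs(node: El): void {
  const href = node.properties?.href
  if (
    node.type === "element" &&
    node.tagName === "a" &&
    href !== undefined &&
    /^https?:\/\//.test(href)
  ) {
    node.properties = {
      ...node.properties,
      target: "_blank",
      rel: "noopener noreferrer",
    }
  }
  for (const child of node.children ?? []) addExternalLinkAttrs(child)
}

const linkTree: El = {
  type: "root",
  children: [
    { type: "element", tagName: "a", properties: { href: "https://example.com" }, children: [] },
    { type: "element", tagName: "a", properties: { href: "/about" }, children: [] },
  ],
}
addExternalLinkAttrs(linkTree)
// Only the first link receives target="_blank" and rel="noopener noreferrer".
```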

Stringify

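The final step is to take the (adjusted) format and stringify it to Markdown, HTML, or plain text (which may differ from the input format!).

unified also has many utilities that can output whatever formats we need.

For example, xLog uses the following tools for its different outputs:

mdast-util-toc: generates the table of contents shown to the right of the article
hast-util-to-text: outputs plain text for estimating reading time and generating AI summaries
hast-util-to-html: generates HTML for RSS
hast-util-to-jsx-runtime: generates React elements to render on the page
unist-util-visit: extracts the article's images and description for article cards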
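As one illustration of a derived output, the sketch below extracts plain text from a hast-like tree and estimates reading time from it. `toPlainText` and `readingMinutes` are hypothetical helpers, the 200 words-per-minute rate is an assumed default rather than xLog's setting, and whitespace-based word counting does not suit CJK text. The real hast-util-to-text also handles line breaks between block elements, which this sketch ignores.

```typescript
type TNode = { type: string; value?: string; children?: TNode[] }

// Concatenate the values of all text nodes, depth-first.
function toPlainText(node: TNode): string {
  if (node.type === "text") return node.value ?? ""
  return (node.children ?? []).map((child) => toPlainText(child)).join("")
}

// Estimate reading time in whole minutes, never less than one.
function readingMinutes(text: string, wordsPerMinute = 200): number {
  const words = text.split(/\s+/).filter((w) => w.length > 0).length
  return Math.max(1, Math.ceil(words / wordsPerMinute))
}

const plain = toPlainText({
  type: "root",
  children: [
    {
      type: "element",
      children: [
        { type: "text", value: "Hello " },
        { type: "element", children: [{ type: "text", value: "world" }] },
      ],
    },
  ],
})
// plain === "Hello world"
const minutes = readingMinutes(plain)
// minutes === 1
```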

The corresponding code in xLog Flavored Markdown:

{
  toToc: () =>
    mdastTree &&
    toc(mdastTree, {
      tight: true,
      ordered: true,
    }),
  toHTML: () => hastTree && toHtml(hastTree),
  toElement: () =>
    hastTree &&
    toJsxRuntime(hastTree, {
      Fragment,
      components: {
        // @ts-expect-error
        img: AdvancedImage,
        mention: Mention,
        mermaid: Mermaid,
        // @ts-expect-error
        audio: APlayer,
        // @ts-expect-error
        video: DPlayer,
        tweet: Tweet,
        "github-repo": GithubRepo,
        "xlog-post": XLogPost,
        // @ts-expect-error
        style: Style,
      },
      ignoreInvalidStyle: true,
      jsx,
      jsxs,
      passNode: true,
    }),
  toMetadata: () => {
    let metadata = {
      frontMatter: undefined,
      images: [],
      audio: undefined,
      excerpt: undefined,
    } as {
      frontMatter?: Record<string, any>
      images: string[]
      audio?: string
      excerpt?: string
    }

    metadata.excerpt = file.data.meta?.description || undefined

    if (mdastTree) {
      visit(mdastTree, (node, index, parent) => {
        if (node.type === "yaml") {
          metadata.frontMatter = jsYaml.load(node.value) as Record<
            string,
            any
          >
        }
      })
    }
    if (hastTree) {
      visit(hastTree, (node, index, parent) => {
        if (node.type === "element") {
          if (
            node.tagName === "img" &&
            typeof node.properties.src === "string"
          ) {
            metadata.images.push(node.properties.src)
          }
          if (node.tagName === "audio") {
            if (typeof node.properties.cover === "string") {
              metadata.images.push(node.properties.cover)
            }
            if (!metadata.audio && typeof node.properties.src === "string") {
              metadata.audio = node.properties.src
            }
          }
        }
      })
    }

    return metadata
  },
}

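In this way, starting from the original Markdown document, we elegantly obtain every output format we need.

Beyond this, the parsed unified syntax tree can also be used to build a Markdown editor with synchronized scrolling between the two panes and live preview; see xLog's two-column Markdown editor for reference. We can talk about that next time if there is a chance.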