
Scraping Pipeline

DocShark processes documentation through a six-phase pipeline:

URL → Discover → Fetch → Extract → Convert → Chunk → Index

Architecture

Pipeline principle

Every stage emits a cleaner intermediate representation than the last. This keeps indexing deterministic and makes retrieval outputs easier for agents to consume.
| Phase | What | Tool | Output |
| --- | --- | --- | --- |
| 1. Discover | Find all page URLs | sitemap.xml / link crawl | URL list |
| 2. Fetch | HTTP GET with caching | fetch / Puppeteer | Raw HTML |
| 3. Extract | Strip nav/footer/ads | Readability + linkedom | Clean HTML |
| 4. Convert | HTML → Markdown | Turndown | Markdown |
| 5. Chunk | Split by headings | Custom semantic splitter | Chunks |
| 6. Index | Store + FTS5 | SQLite | Search-ready |
Cascading discovery

Discovery starts with sitemap and navigation extraction before falling back to breadth-first crawling.

Structure-preserving chunks

Heading-aware split logic retains context and keeps chunk retrieval grounded.

1. Discovery

DocShark uses a cascading strategy to find all pages in a documentation site:

  1. Sitemap.xml (preferred) — Parse /sitemap.xml for page URLs
  2. Navigation-aware extraction — Extract URLs from the site's navigation structure
  3. Link crawl (fallback) — BFS from the root URL, following internal links

Across all strategies, disallowed paths in robots.txt are respected via robots-parser.
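The cascade can be sketched as a list of strategies tried in order, where the first non-empty result wins. This is an illustrative shape only; `Strategy` and `discover` are hypothetical names, not DocShark's actual API:

```typescript
// A discovery strategy returns a URL list, or null if it found nothing
// (e.g. no sitemap.xml, no recognizable nav structure).
type Strategy = (root: string) => Promise<string[] | null>;

// Try each strategy in priority order; fall through on empty results.
async function discover(root: string, strategies: Strategy[]): Promise<string[]> {
  for (const strategy of strategies) {
    const urls = await strategy(root);
    if (urls && urls.length > 0) return urls;
  }
  return [root]; // nothing found: index the root page alone
}
```

In this shape, the sitemap parser, navigation extractor, and BFS crawler would each be one `Strategy`, registered in that order.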

2. Fetching

  • Default: Native fetch with proper User-Agent header
  • Incremental: Sends If-None-Match / If-Modified-Since to skip unchanged pages (HTTP 304)
  • Rate limiting: Configurable delay (default 500ms between requests)
  • Retry: Exponential backoff on failure (3 attempts)
  • Timeout: 30 seconds per request
  • JS rendering: Auto-detects SPA sites and upgrades to puppeteer-core when needed
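The retry policy above (3 attempts, exponential backoff) can be sketched as a small wrapper. `withRetry` is a hypothetical helper, not DocShark's API, and the real fetcher also sets the User-Agent, conditional headers, and timeout described above:

```typescript
// Retry an async operation with exponential backoff:
// delays of baseDelayMs, 2*baseDelayMs, 4*baseDelayMs, ...
async function withRetry<T>(
  op: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastErr; // all attempts exhausted
}
```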

3. Content Extraction

Uses @mozilla/readability with linkedom (lighter than jsdom):

  • Strips navigation, sidebars, footers, and ads
  • Extracts article title and main content
  • Pre-processes the DOM to rescue complex elements like <pre>, <table>, and <details> before Readability processing
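The "rescue" idea is essentially: swap fragile elements out for placeholder tokens before extraction, then swap them back into the cleaned output. The sketch below is illustrative only — the real pre-processing would operate on a DOM (e.g. via linkedom) rather than regex, and the `DOCSHARK_KEEP` marker is a made-up name:

```typescript
// Replace every <tag>…</tag> block with a placeholder comment,
// stashing the original markup in `store` for later restoration.
function protect(html: string, tag: string, store: string[]): string {
  const re = new RegExp(`<${tag}[\\s\\S]*?</${tag}>`, "gi");
  return html.replace(re, (match) => {
    store.push(match);
    return `<!--DOCSHARK_KEEP_${store.length - 1}-->`;
  });
}

// Swap the stashed markup back in after extraction has run.
function restore(html: string, store: string[]): string {
  return html.replace(/<!--DOCSHARK_KEEP_(\d+)-->/g, (_, i) => store[Number(i)]);
}
```

Because the placeholder is an HTML comment, Readability passes it through untouched, so `<pre>`, `<table>`, and `<details>` content survives even when the extractor would otherwise prune it.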

4. HTML → Markdown

turndown is configured for documentation-friendly output:

  • ATX-style headings (# h1, ## h2)
  • Fenced code blocks with language tags
  • Preserved language-* classes on code blocks
  • Table support via GFM plugin
  • <details> and <summary> elements preserved
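The first three options map onto Turndown's documented configuration surface roughly as follows (shown as a plain object so the sketch stands alone; table support and `<details>` pass-through would come from the GFM plugin and additional rules layered on top, not from these options):

```typescript
// Turndown options producing the output style listed above.
const turndownOptions = {
  headingStyle: "atx" as const,      // "# h1", "## h2" instead of underlined headings
  codeBlockStyle: "fenced" as const, // ``` fences, which can carry language tags
  fence: "```" as const,             // fence marker used for code blocks
};
```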

5. Chunking

Recursive heading-based splitting preserves document structure:

  1. Split on # h1 → major sections
  2. Within h1, split on ## h2 → subsections
  3. Within h2, split on ### h3 → fine-grained sections
  4. If section exceeds max tokens → split on paragraphs
  5. Never split mid-code-block (atomic units)
  6. Heading hierarchy preserved as breadcrumbs: "Getting Started > Installation"
  7. Target: 500–1,500 tokens per chunk
  8. Minimum: 50 tokens (skip tiny fragments)
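The heading-split-with-breadcrumbs idea (steps 1–3 and 6) can be sketched as below. This is a simplified illustration, not DocShark's splitter: it splits on `#`–`###` headings only and ignores the token-budget and code-block rules:

```typescript
interface Chunk {
  breadcrumbs: string; // e.g. "Getting Started > Installation"
  text: string;
}

function chunkByHeadings(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  const trail: string[] = []; // current heading path, one entry per level
  let buffer: string[] = [];

  const flush = () => {
    const text = buffer.join("\n").trim();
    if (text) chunks.push({ breadcrumbs: trail.join(" > "), text });
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    const m = /^(#{1,3})\s+(.*)$/.exec(line);
    if (m) {
      flush();                          // close the previous section
      trail.length = m[1].length - 1;   // drop breadcrumb entries deeper than this heading
      trail.push(m[2].trim());
    }
    buffer.push(line);
  }
  flush();
  return chunks;
}
```

The real splitter would additionally recurse into oversized sections (paragraph splits), treat fenced code blocks as atomic, and drop fragments under the 50-token minimum.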

6. Indexing

  • Insert chunks into SQLite chunks table
  • FTS5 index synced via database triggers
  • Library stats updated (page count, chunk count)
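Trigger-synced FTS5 usually follows SQLite's external-content pattern. The schema below is a plausible sketch of that pattern, not DocShark's actual schema (table and column names are assumptions):

```typescript
// Hypothetical schema: a content table plus an external-content FTS5
// index kept in sync by triggers, per the standard SQLite pattern.
const schema = `
CREATE TABLE IF NOT EXISTS chunks (
  id INTEGER PRIMARY KEY,
  url TEXT NOT NULL,
  breadcrumbs TEXT,
  text TEXT NOT NULL
);

-- FTS5 index that reads its content from the chunks table
CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts
  USING fts5(text, content='chunks', content_rowid='id');

-- Keep the index in sync on insert and delete
CREATE TRIGGER IF NOT EXISTS chunks_ai AFTER INSERT ON chunks BEGIN
  INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER IF NOT EXISTS chunks_ad AFTER DELETE ON chunks BEGIN
  INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES ('delete', old.id, old.text);
END;
`;
```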

Dependencies

| Package | Purpose | Size |
| --- | --- | --- |
| @mozilla/readability | Content extraction | ~40KB |
| linkedom | DOM environment | ~100KB |
| turndown | HTML → Markdown | ~30KB |
| robots-parser | robots.txt parsing | ~5KB |
| puppeteer-core | JS sites (optional) | ~2MB |

Core footprint: ~375KB — No LangChain, no heavy browser runtimes by default.
