# Scraping Pipeline
DocShark processes documentation through a six-phase pipeline:
## Architecture

**Pipeline principle:** Every stage emits a cleaner intermediate representation than the last. This keeps indexing deterministic and makes retrieval outputs easier for agents to consume.
URL → Discover → Fetch → Extract → Convert → Chunk → Index

| Phase | What | Tool | Output |
|---|---|---|---|
| 1. Discover | Find all page URLs | sitemap.xml / link crawl | URL list |
| 2. Fetch | HTTP GET with caching | fetch / Puppeteer | Raw HTML |
| 3. Extract | Strip nav/footer/ads | Readability + linkedom | Clean HTML |
| 4. Convert | HTML → Markdown | Turndown | Markdown |
| 5. Chunk | Split by headings | Custom semantic splitter | Chunks |
| 6. Index | Store + FTS5 | SQLite | Search-ready |
- **Cascading discovery:** Starts with sitemap and navigation extraction before falling back to breadth-first crawling.
- **Structure-preserving chunks:** Heading-aware splitting retains context and keeps chunk retrieval grounded.
## 1. Discovery

DocShark uses a cascading strategy to find all pages in a documentation site:

- **Sitemap.xml (preferred)** — Parse `/sitemap.xml` for page URLs
- **Navigation-aware extraction** — Extract URLs from the site's navigation structure
- **Link crawl (fallback)** — BFS from the root URL, following internal links
- **robots.txt** — Respect disallowed paths via `robots-parser`
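The cascade above can be sketched as a list of URL sources tried in order, where the first source that yields anything wins. This is a minimal synchronous sketch; the function names and signatures are illustrative, not DocShark's actual API.

```typescript
// Hypothetical source signature: returns the URLs it could find for a
// documentation root, or an empty array when that strategy yields nothing.
type UrlSource = (root: string) => string[];

function discoverUrls(root: string, sources: UrlSource[]): string[] {
  for (const source of sources) {
    // Keep only internal links; external links are never crawled.
    const urls = source(root).filter((u) => u.startsWith(root));
    if (urls.length > 0) return Array.from(new Set(urls)); // first hit wins
  }
  return [root]; // nothing found: index at least the root page itself
}

// Example: sitemap parsing fails, navigation extraction succeeds.
const fromSitemap: UrlSource = () => [];
const fromNav: UrlSource = (root) => [`${root}/intro`, `${root}/api`, `${root}/intro`];
const found = discoverUrls("https://docs.example.com", [fromSitemap, fromNav]);
// found = ["https://docs.example.com/intro", "https://docs.example.com/api"]
```

The real implementation would also consult robots.txt before emitting any URL.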
## 2. Fetching

- **Default:** Native `fetch` with a proper `User-Agent` header
- **Incremental:** Sends `If-None-Match` / `If-Modified-Since` to skip unchanged pages (HTTP 304)
- **Rate limiting:** Configurable delay (default 500ms between requests)
- **Retry:** Exponential backoff on failure (3 attempts)
- **Timeout:** 30 seconds per request
- **JS rendering:** Auto-detects SPA sites and upgrades to `puppeteer-core` when needed
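A fetch loop combining these behaviors might look like the sketch below. Only the "3 attempts, exponential backoff, 30 s timeout, conditional headers" shape comes from the list above; the base delay, the `User-Agent` string, and the function names are assumptions.

```typescript
// Retry schedule: attempt 1 runs immediately, each retry doubles the wait.
// The 500 ms base is an assumption, not a documented value.
function backoffDelays(attempts: number, baseMs = 500): number[] {
  return Array.from({ length: attempts - 1 }, (_, i) => baseMs * Math.pow(2, i));
}

async function fetchWithRetry(url: string, attempts = 3): Promise<Response> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url, {
        headers: {
          "User-Agent": "docshark-bot",                   // assumed UA string
          "If-Modified-Since": new Date(0).toUTCString(), // incremental re-fetch
        },
        signal: AbortSignal.timeout(30_000),              // 30 s per request
      });
      if (res.ok || res.status === 304) return res;       // 304 → page unchanged
      throw new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err;
      const delay = backoffDelays(attempts)[i];
      if (delay === undefined) break;                     // retries exhausted
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

With three attempts the waits between tries are 500 ms and then 1000 ms; a 304 response is treated as success so the cached copy can be reused.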
## 3. Content Extraction

Uses `@mozilla/readability` with `linkedom` (lighter than `jsdom`):

- Strips navigation, sidebars, footers, and ads
- Extracts the article title and main content
- Pre-processes the DOM to rescue complex elements like `<pre>`, `<table>`, and `<details>` before Readability runs
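The "rescue" idea is to swap fragile elements for inert placeholders before the cleaner runs, then restore them afterwards. The sketch below demonstrates that pattern on a plain string for `<pre>` blocks only; the real pipeline operates on a linkedom DOM and covers `<table>` and `<details>` as well, and these function names are illustrative.

```typescript
// Replace each <pre>…</pre> with a placeholder comment so an aggressive
// content cleaner cannot drop or mangle it.
function protectBlocks(html: string): { html: string; saved: string[] } {
  const saved: string[] = [];
  const out = html.replace(/<pre[\s\S]*?<\/pre>/gi, (match) => {
    saved.push(match);
    return `<!--docshark-block-${saved.length - 1}-->`;
  });
  return { html: out, saved };
}

// After cleaning, swap the placeholders back for the original markup.
function restoreBlocks(html: string, saved: string[]): string {
  return html.replace(/<!--docshark-block-(\d+)-->/g, (_, i) => saved[Number(i)]);
}

const page = "<p>intro</p><pre>npm install docshark</pre>";
const { html, saved } = protectBlocks(page);
// …Readability-style cleaning would run here on `html`…
const final = restoreBlocks(html, saved);
// final === page
```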
## 4. HTML → Markdown

`turndown` configured for documentation-friendly output:

- ATX-style headings (`# h1`, `## h2`)
- Fenced code blocks with language tags
- `language-*` classes on code blocks preserved
- Table support via the GFM plugin
- `<details>` and `<summary>` elements preserved
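A plausible Turndown setup matching that list, assuming the `turndown` and `turndown-plugin-gfm` packages (this is a sketch of the configuration, not DocShark's actual source):

```typescript
import TurndownService from "turndown";
import { gfm } from "turndown-plugin-gfm";

const turndown = new TurndownService({
  headingStyle: "atx",      // "# h1", "## h2" instead of setext underlines
  codeBlockStyle: "fenced", // ``` fences; language-* classes become info strings
});

turndown.use(gfm);                     // tables, strikethrough, task lists
turndown.keep(["details", "summary"]); // pass collapsible sections through as HTML

const markdown = turndown.turndown(
  '<h2>Install</h2><pre><code class="language-sh">npm i</code></pre>'
);
```

Turndown's built-in fenced-code rule reads the `language-*` class off the inner `<code>` element, so fenced output gets the language tag for free.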
## 5. Chunking

Recursive heading-based splitting preserves document structure:

- Split on `# h1` → major sections
- Within an h1, split on `## h2` → subsections
- Within an h2, split on `### h3` → fine-grained sections
- If a section exceeds the max token budget → split on paragraphs
- Never split mid-code-block (code blocks are atomic units)
- Heading hierarchy preserved as breadcrumbs: `"Getting Started > Installation"`
- Target: 500–1,500 tokens per chunk
- Minimum: 50 tokens (tiny fragments are skipped)
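The heading walk and breadcrumb trail can be sketched as below. This is a minimal illustration: it splits on every h1–h3 and tracks ancestry, but omits the token budgets, paragraph fallback, and code-block atomicity described above.

```typescript
interface Chunk {
  breadcrumb: string; // e.g. "Docs > Getting Started > Installation"
  text: string;
}

function chunkByHeadings(markdown: string, title: string): Chunk[] {
  const chunks: Chunk[] = [];
  let trail: string[] = [title]; // heading ancestry for the current section
  let buf: string[] = [];

  const flush = () => {
    const text = buf.join("\n").trim();
    if (text) chunks.push({ breadcrumb: trail.join(" > "), text });
    buf = [];
  };

  for (const line of markdown.split("\n")) {
    const m = /^(#{1,3})\s+(.*)/.exec(line);
    if (m) {
      flush(); // close the previous section before starting a new one
      // Keep ancestors above this depth, replace the entry at this depth.
      trail = [...trail.slice(0, m[1].length), m[2]];
    }
    buf.push(line);
  }
  flush();
  return chunks;
}

const doc = "# Getting Started\nintro\n## Installation\nnpm install";
const out = chunkByHeadings(doc, "Docs");
// out[1].breadcrumb === "Docs > Getting Started > Installation"
```

Carrying the breadcrumb with each chunk is what keeps retrieval grounded: a hit on "npm install" arrives with its full section path attached.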
## 6. Indexing

- Insert chunks into the SQLite `chunks` table
- FTS5 index kept in sync via database triggers
- Library stats updated (page count, chunk count)
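The trigger-synced FTS5 setup is the standard SQLite external-content pattern. The schema below is an illustrative sketch — table and column names are assumptions, not DocShark's actual DDL:

```sql
-- Source-of-truth table for chunks.
CREATE TABLE chunks (
  id INTEGER PRIMARY KEY,
  breadcrumb TEXT,
  content TEXT
);

-- External-content FTS5 index: stores no copy of the text, reads it
-- from the chunks table by rowid.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
  breadcrumb, content,
  content='chunks', content_rowid='id'
);

-- Triggers keep the index in sync with inserts and deletes.
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
  INSERT INTO chunks_fts(rowid, breadcrumb, content)
  VALUES (new.id, new.breadcrumb, new.content);
END;

CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
  INSERT INTO chunks_fts(chunks_fts, rowid, breadcrumb, content)
  VALUES ('delete', old.id, old.breadcrumb, old.content);
END;
```

With this in place, every `INSERT` into `chunks` is immediately searchable via `SELECT … FROM chunks_fts WHERE chunks_fts MATCH ?`.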
## Dependencies

| Package | Purpose | Size |
|---|---|---|
| `@mozilla/readability` | Content extraction | ~40KB |
| `linkedom` | DOM environment | ~100KB |
| `turndown` | HTML → Markdown | ~30KB |
| `robots-parser` | robots.txt parsing | ~5KB |
| `puppeteer-core` | JS sites (optional) | ~2MB |
**Core footprint:** ~375KB. No LangChain, no heavy browser runtimes by default.