For RAG engineers · Preview · in development

Clean web data for your RAG pipeline

Strip boilerplate, skip unchanged pages, and emit LLM-ready markdown with metadata that you can chunk and embed with your preferred splitter.

This is a teaser of an upcoming integration. The API shape, SDKs, and docs below reflect our in-development direction. Join the waitlist to get early access and shape the final API.

What you get

Key capabilities

Clean markdown, not HTML soup

Raw HTML typically carries 3-5x more tokens than the readable content: navigation, scripts, inline styles, tracking. We strip boilerplate and emit semantic markdown that preserves headings, lists, and tables.
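The gap is easy to see on a toy page. A minimal sketch with an invented sample, approximating tokens at roughly four characters each rather than running a real tokenizer:

```typescript
// Invented sample page: the same content as raw HTML and as the clean
// markdown we aim to emit. Token counts are approximated at ~4 chars/token.
const rawHtml = `<nav class="top"><ul><li><a href="/">Home</a></li></ul></nav>
<div class="article" data-track="view"><h1 style="font-size:2em">Pricing</h1>
<p class="lead">Plans start at $9/mo.</p><script>track("view")</script></div>`

const markdown = `# Pricing

Plans start at $9/mo.`

const approxTokens = (s: string) => Math.ceil(s.length / 4)
console.log(approxTokens(rawHtml) / approxTokens(markdown)) // several-fold overhead
```

Even on this tiny example, the markup carries several times the token count of the content itself; on real pages with full `<head>` sections and inline scripts, the ratio is typically worse.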

Metadata and change detection

Every page returns URL, fetch timestamp, language, and a content hash over normalized markdown. Skip unchanged pages on re-crawl and avoid re-embedding work you already did.
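On the client side, the same idea can be sketched with Node's standard crypto module. The `normalize` and `contentHash` helpers here are illustrative, not the exact normalization we apply server-side:

```typescript
import { createHash } from "node:crypto"

// Hypothetical normalization: collapse whitespace so cosmetic changes
// (re-wrapped lines, trailing spaces) do not alter the hash.
function normalize(markdown: string): string {
  return markdown.replace(/\s+/g, " ").trim()
}

// Hash the normalized markdown; identical content yields an identical hash.
function contentHash(markdown: string): string {
  return createHash("sha256").update(normalize(markdown)).digest("hex")
}

// Skip re-embedding when the content has not meaningfully changed.
function isUnchanged(markdown: string, lastHash: string): boolean {
  return contentHash(markdown) === lastHash
}
```

Storing the last hash per URL lets a re-crawl short-circuit before any chunking or embedding work happens.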

Structured extraction, not blind chunking

Define the fields you care about (title, section, body, price) and we return them as discrete records with stable ids. Chunk with your preferred splitter; we do not force a chunking strategy on you.
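The record shape below is a sketch: field names mirror whatever schema you define in the request, and `chunkIds` is a hypothetical helper showing how stable record ids let every embedded chunk trace back to its source field:

```typescript
// Illustrative record shape; ids are assumed stable across re-crawls
// of the same page element.
interface ExtractedRecord {
  id: string
  title: string
  section: string
  body: string
}

// Derive chunk ids from record ids so a chunk in your vector store can
// always be traced back to the record it came from.
function chunkIds(records: ExtractedRecord[], chunksPerRecord: number[]): string[] {
  return records.flatMap((r, i) =>
    Array.from({ length: chunksPerRecord[i] }, (_, j) => `${r.id}#${j}`),
  )
}
```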

Cost-aware routing across 8 providers

Route each URL through the cheapest provider that works, chosen across Firecrawl, Jina, Bright Data, Zyte, ScrapingBee, Oxylabs, ScraperAPI, and Apify, with automatic failover. One API, one bill, no per-provider rewrites when one breaks.
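The routing logic amounts to cheapest-first with failover. A minimal sketch, with an invented `Provider` type and `routeScrape` function standing in for the real router:

```typescript
// Hypothetical provider interface: each provider exposes a cost and a scrape call.
type Provider = {
  name: string
  costPer1k: number
  scrape: (url: string) => Promise<string>
}

// Try providers cheapest-first; on failure, fail over to the next-cheapest.
async function routeScrape(
  url: string,
  providers: Provider[],
): Promise<{ provider: string; html: string }> {
  const ordered = [...providers].sort((a, b) => a.costPer1k - b.costPer1k)
  let lastError: unknown
  for (const p of ordered) {
    try {
      return { provider: p.name, html: await p.scrape(url) }
    } catch (err) {
      lastError = err // automatic failover: move on to the next provider
    }
  }
  throw lastError
}
```

The real service layers retries, per-provider capabilities, and billing on top, but the cheapest-that-works ordering is the core idea.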

API preview

Scrape, chunk, embed, upsert

Illustrative shape of the upcoming API. Endpoints and field names may change before GA.

import { OpenAIEmbeddings } from "@langchain/openai"
import { Pinecone } from "@pinecone-database/pinecone"
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter"

const embeddings = new OpenAIEmbeddings()
const index = new Pinecone().Index("kb")
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 800 })

export async function ingest(url: string) {
  const res = await fetch("https://api.webscraping.app/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.WEBSCRAPING_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      schema: { title: "string", body: "string" },
      skipIfUnchanged: true, // compares normalized markdown hash to last fetch
    }),
  })
  const { skipped, records, metadata } = await res.json()
  if (skipped) return

  // You own the chunking strategy
  // splitText is async, so resolve each record's chunks before flattening
  const docs = (
    await Promise.all(records.map((r: { body: string }) => splitter.splitText(r.body)))
  ).flat()
  const vectors = await Promise.all(
    docs.map(async (text, i) => ({
      id: `${metadata.url}#${i}`,
      values: await embeddings.embedQuery(text),
      metadata: { url: metadata.url, fetched_at: metadata.fetched_at, text },
    })),
  )
  await index.upsert(vectors)
}

Integrations

Works with any vector DB

Pinecone
Weaviate
Qdrant
pgvector
Chroma
Milvus

Shape the API with us

Join the waitlist. Early adopters get direct input on the API surface before GA.