This is a teaser of an upcoming integration. The API shape, SDKs, and docs below reflect our in-development direction. Join the waitlist to get early access and shape the final API.
What you get
Raw HTML typically carries 3-5x more tokens than the readable content: navigation, scripts, inline styles, tracking. We strip boilerplate and emit semantic markdown that preserves headings, lists, and tables.
Every page returns URL, fetch timestamp, language, and a content hash over normalized markdown. Skip unchanged pages on re-crawl and avoid re-embedding work you already did.
Define the fields you care about (title, section, body, price) and we return them as discrete records with stable ids. Chunk with your preferred splitter, we do not force a chunking strategy on you.
Route each URL through the cheapest provider that works across Firecrawl, Jina, Brightdata, Zyte, Scrapingbee, Oxylabs, ScraperAPI, and Apify, with automatic failover. One API, one bill, no per-provider rewrites when one breaks.
API preview
import { OpenAIEmbeddings } from "@langchain/openai"
import { Pinecone } from "@pinecone-database/pinecone"
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter"
const embeddings = new OpenAIEmbeddings()
const index = new Pinecone().Index("kb")
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 800 })
export async function ingest(url: string) {
const res = await fetch("https://api.webscraping.app/v1/scrape", {
method: "POST",
headers: { Authorization: `Bearer ${process.env.WEBSCRAPING_API_KEY}` },
body: JSON.stringify({
url,
schema: { title: "string", body: "string" },
skipIfUnchanged: true, // compares normalized markdown hash to last fetch
}),
})
const { skipped, records, metadata } = await res.json()
if (skipped) return
// You own the chunking strategy
const docs = records.flatMap((r: { body: string }) => splitter.splitText(r.body))
const vectors = await Promise.all(
docs.map(async (text, i) => ({
id: `${metadata.url}#${i}`,
values: await embeddings.embedQuery(text),
metadata: { url: metadata.url, fetched_at: metadata.fetched_at, text },
})),
)
await index.upsert(vectors)
}
Integrations