This is a teaser of an upcoming integration. The API shape, SDKs, and docs below reflect our in-development direction. Join the waitlist to get early access and shape the final API.
What you get
Route across 8 extraction providers with fallback on failure. Rows conform to a user-defined schema; malformed extractions are quarantined, not silently dropped, so eval sets stay intact.
Every row carries source_url, fetched_at, language, provider, http_status, content_hash, robots_decision, user_agent, and request_id. Dedup on content_hash, filter soft-404s on http_status, and audit any row back to its original fetch.
Language detection, encoding validation, and content-length stats on every row. Filter before ingestion, then feed your dedup and quality-scoring stages in Datatrove or NeMo Curator.
Per-domain robots.txt fetch + cache, User-agent matching, and configurable per-host QPS with backoff on 429/503. Disallowed paths are dropped before they reach your corpus.
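The pre-ingestion filtering described above can be sketched as a single pass over emitted rows. The field names follow the row metadata listed here; the language default and content-length floor are illustrative assumptions, not service defaults.

```typescript
// Sketch: post-fetch filtering before ingestion. Row fields follow the
// metadata above; the thresholds are assumptions for illustration.
interface Row {
  text: string
  language: string
  http_status: number
  content_hash: string
}

function filterRows(rows: Row[], lang = "en", minChars = 200): Row[] {
  const seen = new Set<string>()
  return rows.filter((r) => {
    if (r.http_status !== 200) return false    // drop soft-404s and errors
    if (r.language !== lang) return false      // language filter
    if (r.text.length < minChars) return false // content-length floor
    if (seen.has(r.content_hash)) return false // exact dedup on content_hash
    seen.add(r.content_hash)
    return true
  })
}
```

Because quarantined and filtered rows are dropped before ingestion rather than silently mid-pipeline, the surviving set stays reproducible for eval splits.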
API preview
import { createWriteStream } from "node:fs"
const urls: string[] = [
// batch size: tune per plan limits
]
const res = await fetch("https://api.webscraping.app/v1/bulk", {
method: "POST",
headers: {
Authorization: `Bearer ${process.env.WEBSCRAPING_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
urls,
schema: { text: "string", title: "string" },
respectRobotsTxt: true,
perDomainRateLimitMs: 1000,
output: "jsonl",
}),
})
/**
* Rows emit as:
* {
* text, title,
* source_url, fetched_at, language, provider,
* http_status, content_hash, robots_decision, user_agent, request_id
* }
*/
const out = createWriteStream("./training.jsonl")
const reader = res.body!.getReader()
const decoder = new TextDecoder()
while (true) {
const { done, value } = await reader.read()
if (done) break
out.write(decoder.decode(value))
}
out.end()
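The loop above writes raw chunks straight to disk, which is safe for passthrough. If you parse rows in-flight instead, remember that chunk boundaries rarely align with line breaks; a minimal splitter (a sketch, not part of the SDK) buffers the trailing partial line between reads:

```typescript
// Sketch: split a streamed NDJSON body into complete rows, buffering the
// incomplete tail line between chunks. Row shape follows the comment above.
function makeLineSplitter(onRow: (row: unknown) => void) {
  let buf = ""
  return {
    push(chunk: string) {
      buf += chunk
      const lines = buf.split("\n")
      buf = lines.pop() ?? "" // keep the incomplete tail for the next chunk
      for (const line of lines) {
        if (line.trim()) onRow(JSON.parse(line))
      }
    },
    flush() {
      if (buf.trim()) onRow(JSON.parse(buf))
      buf = ""
    },
  }
}
```

Call `push(decoder.decode(value, { stream: true }))` inside the read loop and `flush()` once `done` is true.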
Integrations
Streaming to training loops
Columnar, cloud-optimized
Direct push to the Hub
Compliance
robots.txt is fetched, cached per host, and matched against the configured User-agent before each request; disallowed URLs are skipped and logged. Per-domain rate limits are enforced with token-bucket backoff on 429/503. Every emitted row carries source_url, fetched_at, provider, and the robots.txt decision, so downstream audits can trace any row back to the fetch and its compliance verdict.
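The per-host pacing described above can be pictured as a small scheduler: each domain owns a next-allowed timestamp, and 429/503 responses push that timestamp out exponentially. This is a sketch of the idea, not the service's actual internals; class and parameter names are illustrative.

```typescript
// Sketch: per-host request pacing with exponential backoff on 429/503.
// Names and the 1s default interval are assumptions for illustration.
class HostThrottle {
  private nextAt = new Map<string, number>()

  constructor(private intervalMs = 1000) {}

  // Milliseconds to wait before the next request to `host` may be sent;
  // also reserves the following slot.
  delayFor(host: string, now: number): number {
    const at = this.nextAt.get(host) ?? now
    const wait = Math.max(0, at - now)
    this.nextAt.set(host, Math.max(at, now) + this.intervalMs)
    return wait
  }

  // On 429/503, push the host's next slot out exponentially with the
  // retry attempt count.
  backoff(host: string, now: number, attempt: number) {
    this.nextAt.set(host, now + this.intervalMs * 2 ** attempt)
  }
}
```

Keeping the throttle keyed by host means one slow or rate-limiting domain never stalls fetches to the rest of the batch.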