For ML data engineers
Preview · in development

Web-to-JSONL training data pipelines

Domain-specific dataset curation with schema-aligned rows, robots.txt compliance, and provenance metadata baked into every row.

This is a teaser of an upcoming integration. The API shape, SDKs, and docs below reflect our in-development direction. Join the waitlist to get early access and shape the final API.

What you get

Key capabilities

Schema-aligned rows with quarantine

Route across 8 extraction providers with fallback on failure. Rows conform to a user-defined schema; malformed extractions are quarantined, not silently dropped, so eval sets stay intact.
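
A minimal sketch of the quarantine idea, assuming rows are validated against a flat field-to-type schema like the one in the API preview below. The `partitionRows` helper and its field names are illustrative, not part of the product API.

```typescript
// Split extracted rows into schema-conforming vs. quarantined,
// so malformed extractions are kept for inspection rather than dropped.
type Schema = Record<string, "string" | "number">

function partitionRows(
  rows: Record<string, unknown>[],
  schema: Schema,
): { valid: Record<string, unknown>[]; quarantined: Record<string, unknown>[] } {
  const valid: Record<string, unknown>[] = []
  const quarantined: Record<string, unknown>[] = []
  for (const row of rows) {
    // A row conforms only if every schema field exists with the declared type.
    const ok = Object.entries(schema).every(
      ([field, type]) => typeof row[field] === type,
    )
    ;(ok ? valid : quarantined).push(row)
  }
  return { valid, quarantined }
}

const { valid, quarantined } = partitionRows(
  [
    { text: "hello world", title: "Greeting" },
    { text: null, title: "Broken extraction" }, // typeof null is "object" → quarantined
  ],
  { text: "string", title: "string" },
)
```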

Provenance on every row

source_url, fetched_at, language, provider, http_status, content_hash, robots_decision, user_agent, request_id. Dedup on content_hash, filter soft-404s on http_status, audit any row back to its fetch.
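
A sketch of the downstream cleanup pass this enables, assuming each row already carries the provenance fields above. The `dedupeAndFilter` helper is illustrative; only `content_hash` and `http_status` come from the row spec.

```typescript
// One pass over emitted rows: drop non-200 fetches, then keep only the
// first row per content_hash (exact-duplicate removal).
interface Row {
  text: string
  content_hash: string
  http_status: number
}

function dedupeAndFilter(rows: Row[]): Row[] {
  const seen = new Set<string>()
  const out: Row[] = []
  for (const row of rows) {
    if (row.http_status !== 200) continue // failed or soft-404 fetch
    if (seen.has(row.content_hash)) continue // already emitted this content
    seen.add(row.content_hash)
    out.push(row)
  }
  return out
}

const cleaned = dedupeAndFilter([
  { text: "a", content_hash: "h1", http_status: 200 },
  { text: "a", content_hash: "h1", http_status: 200 }, // duplicate content
  { text: "b", content_hash: "h2", http_status: 404 }, // bad status
])
```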

Quality signals exposed

Language detection, encoding validation, and content-length stats per row. Filter before ingestion, then plug into your dedup and quality-scoring stage in Datatrove or NeMo Curator.
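
A pre-ingestion filter over those signals might look like the sketch below. The thresholds and the `qualityFilter` name are illustrative defaults, not product behavior.

```typescript
// Keep rows in the target language within a character-length band,
// as a cheap gate before heavier dedup/quality scoring stages.
interface QualityRow {
  text: string
  language: string
}

function qualityFilter(
  rows: QualityRow[],
  lang = "en",
  minChars = 200,      // drop boilerplate stubs
  maxChars = 100_000,  // drop pathological pages
): QualityRow[] {
  return rows.filter(
    (r) =>
      r.language === lang &&
      r.text.length >= minChars &&
      r.text.length <= maxChars,
  )
}

const kept = qualityFilter([
  { text: "x".repeat(500), language: "en" }, // passes
  { text: "too short", language: "en" },     // under minChars
  { text: "y".repeat(500), language: "de" }, // wrong language
])
```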

Compliance with a real mechanism

Per-domain robots.txt fetch + cache, User-agent matching, and configurable per-host QPS with backoff on 429/503. Disallowed paths are dropped before they reach your corpus.
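
For intuition, here is a stripped-down robots.txt Disallow check, assuming the file is already fetched and cached per host. Real matchers also handle Allow rules, wildcards, and rule precedence; this sketch applies only prefix Disallow rules from the matching User-agent group.

```typescript
// Return true if `path` is disallowed for `userAgent` by the given robots.txt.
function isDisallowed(robotsTxt: string, path: string, userAgent: string): boolean {
  let applies = false // are we inside a group that matches our agent?
  const disallows: string[] = []
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim() // strip comments
    const [key = "", ...rest] = line.split(":")
    const value = rest.join(":").trim()
    if (/^user-agent$/i.test(key.trim())) {
      applies = value === "*" || value.toLowerCase() === userAgent.toLowerCase()
    } else if (applies && /^disallow$/i.test(key.trim()) && value) {
      disallows.push(value)
    }
  }
  return disallows.some((prefix) => path.startsWith(prefix))
}

const robots = "User-agent: *\nDisallow: /private/"
```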

API preview

Bulk scrape to JSONL for fine-tuning

Illustrative shape of the upcoming API. Endpoints and field names may change before GA.

import { createWriteStream } from "node:fs"

const urls: string[] = [
  // batch size: tune per plan limits
]

const res = await fetch("https://api.webscraping.app/v1/bulk", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.WEBSCRAPING_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    urls,
    schema: { text: "string", title: "string" },
    respectRobotsTxt: true,
    perDomainRateLimitMs: 1000,
    output: "jsonl",
  }),
})

if (!res.ok) throw new Error(`bulk request failed: ${res.status}`)

/**
 * Rows emit as:
 * {
 *   text, title,
 *   source_url, fetched_at, language, provider,
 *   http_status, content_hash, robots_decision, user_agent, request_id
 * }
 */
const out = createWriteStream("./training.jsonl")
const reader = res.body!.getReader()
const decoder = new TextDecoder()

while (true) {
  const { done, value } = await reader.read()
  if (done) break
  // stream: true keeps multi-byte characters intact across chunk boundaries
  out.write(decoder.decode(value, { stream: true }))
}
out.end()

Integrations

Output formats

JSONL

Streaming to training loops

Parquet

Columnar, cloud-optimized

HuggingFace Datasets

Direct push to the Hub

Compliance

Built for reviewers who care

robots.txt is fetched, cached per host, and matched against the configured User-agent before each request; disallowed URLs are skipped and logged. Per-domain rate limits are enforced with token-bucket backoff on 429/503. Every emitted row carries source_url, fetched_at, provider, and the robots.txt decision, so downstream audits can trace any row back to the fetch and its compliance verdict.
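
The pacing-and-backoff behavior described above can be sketched as follows. `politeFetch`, its parameters, and the retry bounds are illustrative assumptions, not the product's actual implementation.

```typescript
// Retry a fetch with exponential backoff when the host answers 429 or 503,
// starting from the per-host interval and doubling on each retry.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms))

async function politeFetch(
  url: string,
  fetchFn: (u: string) => Promise<{ status: number }>, // injected for testability
  intervalMs = 1000,
  maxRetries = 4,
): Promise<{ status: number }> {
  let wait = intervalMs
  for (let attempt = 0; ; attempt++) {
    const res = await fetchFn(url)
    const throttled = res.status === 429 || res.status === 503
    if (!throttled || attempt >= maxRetries) return res
    await sleep(wait) // back off before retrying this host
    wait *= 2         // exponential backoff
  }
}
```

In a full client the same `wait` state would live in a per-host token bucket so concurrent requests to one domain share the budget.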

Shape the API with us

Join the waitlist. Early adopters get direct input on the API surface before GA.