Introducing @purepageio/fetch-engines: reliable web fetching

Extracting content from websites is unreliable. Plain HTTP requests miss content rendered by JavaScript, and bot detection can block automated traffic. Developers often rebuild the same glue code for retries, proxies, and headless browsers.

@purepageio/fetch-engines packages these patterns into a robust API. It provides a lightweight FetchEngine for simple pages and a smart HybridEngine that starts with a fast request and automatically escalates to a full browser when needed. It simplifies fetching HTML, Markdown, or even raw files like PDFs.

@purepageio/fetch-engines on npm

Features

  • Smart Engine Selection: Use FetchEngine for speed on static sites or HybridEngine for reliability on complex, JavaScript-heavy pages.
  • Unified API: Fetch processed web pages with fetchHTML() or raw files with fetchContent().
  • Automatic Escalation: The HybridEngine tries a simple fetch first and only falls back to a full browser (Playwright) if the request fails or the response looks like an empty SPA shell.
  • Built-in Stealth & Retries: The browser-based engine integrates stealth measures to avoid common bot detection, and all engines support configurable retries (see the configuration sketch after this list).
  • Content Conversion: fetchHTML() can be configured to return clean Markdown instead of HTML.
  • Structured Content Extraction: Supply a Zod schema to fetchStructuredContent() or the StructuredContentEngine and receive typed JSON generated via OpenAI.
  • Raw File Handling: fetchContent() retrieves any type of resource (PDFs, images, API responses) and returns the raw content as a Buffer or string.
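
The retry behaviour mentioned above is set when an engine is constructed. A minimal configuration sketch, with the caveat that the option names shown (maxRetries, retryDelay) are illustrative assumptions rather than confirmed keys; check the package README for the options your installed version accepts.

import { HybridEngine } from "@purepageio/fetch-engines";

// Hypothetical option names, shown only to illustrate the pattern of
// passing retry configuration to the engine constructor.
const engine = new HybridEngine({
  maxRetries: 3, // assumed: attempts per URL before the fetch is reported as failed
  retryDelay: 1000, // assumed: delay between attempts, in milliseconds
});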

Quick start

First, install the package and its browser dependencies.

pnpm add @purepageio/fetch-engines
pnpm exec playwright install

This example uses the HybridEngine to reliably fetch a potentially complex page.

import { HybridEngine, FetchError } from "@purepageio/fetch-engines";

// Initialise the engine. HybridEngine is best for general use.
const engine = new HybridEngine();

async function main() {
  try {
    const url = "https://quotes.toscrape.com/"; // A JS-heavy site
    const result = await engine.fetchHTML(url);

    console.log(`Fetched ${result.url}`);
    console.log(`Title: ${result.title}`);
    console.log(`HTML (excerpt): ${result.content.substring(0, 150)}...`);
  } catch (error) {
    if (error instanceof FetchError) {
      console.error(`Fetch failed: ${error.message} (Code: ${error.code})`);
    }
  } finally {
    // Shut down the browser instance managed by the engine.
    await engine.cleanup();
  }
}

main();

Structured content extraction

Some crawls need more than HTML: they need typed entities that can flow straight into a database or workflow. @purepageio/fetch-engines ships with fetchStructuredContent() and a StructuredContentEngine that combine Playwright-grade fetching with OpenAI-powered extraction. You describe the shape of the data with Zod, and the helper ensures the response matches that schema before handing it back.

import { fetchStructuredContent } from "@purepageio/fetch-engines";
import { z } from "zod";

const articleSchema = z.object({
  title: z.string(),
  summary: z.string(),
  author: z.string().optional(),
});

async function fetchArticleSummary() {
  const result = await fetchStructuredContent(
    "https://example.com/press-release",
    articleSchema,
    { model: "gpt-4.1-mini" }
  );

  console.log(result.data.summary);
}

fetchArticleSummary();

Behind the scenes the helper:

  • Runs the same HTTP-first workflow as the other engines, promoting tricky pages to Playwright automatically.
  • Sends the cleaned content and your schema to OpenAI, so you get structured data without juggling prompts.
  • Validates the response with Zod before returning it, which keeps downstream pipelines predictable.
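
That last step is ordinary Zod validation. The snippet below is an illustrative sketch of the pattern using Zod directly, not the library's internal code: model output is treated as untrusted input, and only schema-conforming data is passed on.

import { z } from "zod";

const articleSchema = z.object({
  title: z.string(),
  summary: z.string(),
  author: z.string().optional(),
});

// Model output is untrusted until it has been checked against the schema.
const candidate: unknown = JSON.parse('{"title": "Launch", "summary": "A short recap."}');

const parsed = articleSchema.safeParse(candidate);
if (parsed.success) {
  console.log(parsed.data.title); // fully typed from here on
} else {
  console.error(parsed.error.issues); // rejected before it reaches your pipeline
}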

Set OPENAI_API_KEY in the environment before using structured extraction, and call await engine.cleanup() if you instantiate the long-lived StructuredContentEngine.
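
For long-running crawls, the engine form keeps one browser alive across many URLs. The sketch below assumes the engine can be constructed without arguments and exposes an extraction method mirroring the standalone helper; both are assumptions, so check the README for the exact constructor options and method signature.

import { StructuredContentEngine } from "@purepageio/fetch-engines";
import { z } from "zod";

// Structured extraction calls OpenAI, so fail fast if the key is missing.
if (!process.env.OPENAI_API_KEY) {
  throw new Error("OPENAI_API_KEY is not set");
}

const productSchema = z.object({
  name: z.string(),
  price: z.string(),
});

// Assumption: a no-argument constructor; your version may accept options
// such as the model to use.
const engine = new StructuredContentEngine();

async function crawl(urls: string[]) {
  try {
    for (const url of urls) {
      // Assumption: the engine mirrors the standalone fetchStructuredContent helper.
      const result = await engine.fetchStructuredContent(url, productSchema);
      console.log(result.data);
    }
  } finally {
    // Release the browser the long-lived engine manages.
    await engine.cleanup();
  }
}

crawl(["https://example.com/product/1", "https://example.com/product/2"]);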

Fetching Markdown and raw files (like PDFs)

To get clean prose from an article, configure the engine to return Markdown. To download a PDF, use fetchContent() to get the raw file buffer.

import { HybridEngine } from "@purepageio/fetch-engines";
import { writeFileSync } from "fs";

const engine = new HybridEngine();

async function fetchDocuments() {
  // 1. Fetch an article and convert it to Markdown
  const article = await engine.fetchHTML("https://example.com/blog/post", {
    markdown: true,
  });
  if (article.content) {
    console.log(article.content);
  }

  // 2. Fetch a raw PDF file
  const pdf = await engine.fetchContent("https://example.com/report.pdf");
  if (pdf.content instanceof Buffer) {
    // The library returns the raw file; parsing it is up to you
    writeFileSync("report.pdf", pdf.content);
    console.log("Downloaded report.pdf");
  }

  await engine.cleanup();
}

fetchDocuments();
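
fetchContent() is not limited to binary files. For text-based resources such as a JSON API response, the content arrives as a string rather than a Buffer, and parsing is again left to you. A minimal sketch, assuming a hypothetical endpoint that returns JSON:

import { HybridEngine } from "@purepageio/fetch-engines";

const engine = new HybridEngine();

async function fetchApiJson() {
  // Hypothetical JSON endpoint used for illustration.
  const result = await engine.fetchContent("https://example.com/api/items.json");

  // The raw content is a string for text responses and a Buffer for binary ones.
  const text =
    typeof result.content === "string" ? result.content : result.content.toString("utf-8");

  console.log(JSON.parse(text));

  await engine.cleanup();
}

fetchApiJson();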

Choosing an engine

  • FetchEngine: Best for speed with trusted, static sites or APIs that return HTML (see the example after this list).
  • HybridEngine: The recommended default. It offers the speed of a simple fetch with the reliability of a full browser fallback for dynamic sites.
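
For the first case, a minimal FetchEngine sketch. It reuses the fetchHTML() API shown earlier; the result fields (title, content) are assumed to match the HybridEngine example above.

import { FetchEngine } from "@purepageio/fetch-engines";

// No browser is launched: FetchEngine issues a plain HTTP request.
const engine = new FetchEngine();

async function fetchStaticPage() {
  const result = await engine.fetchHTML("https://example.com/");
  console.log(`Title: ${result.title}`);
  console.log(result.content.substring(0, 150));
}

fetchStaticPage();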

This project is open source. If you use it, please report issues and share ideas on the GitHub repository to help guide its development.