GroktoCrawl v0.9.0: You Could Rent a Web Scraper. Or You Could Own One.

Magnus Hedemark
The Groktopus kaiju in a 19th-century scientific etching, tentacles holding the six pillars of the v0.9.0 release: crawl engine, deep search, find-similar, monitors, observability, and security.

You Could Rent a Web Scraper. Or You Could Own One.

The full stack runs in ~12 GB of disk and ~2.7 GB of RAM: less disk than a single AAA game title, less memory than two Chrome windows with a hundred open tabs. Yet the SaaS alternative charges per page, caps your throughput, and processes everything through someone else's infrastructure. The pricing starts out reasonable. Then your usage grows. Then you get the email about your new plan tier.

We hit that wall with Groktopus. We needed a web research stack that could crawl deeply, search broadly, and answer questions with grounded citations, all without bleeding budget on per-API-call pricing. So we built one. MIT licensed. Runs entirely on your hardware. One docker compose up and you have a full web research platform.

Today we're shipping GroktoCrawl v0.9.0, the biggest release since the project started. It's also a good moment to reintroduce what GroktoCrawl is and why you'd run it yourself.


What GroktoCrawl Is

GroktoCrawl is a self-hosted alternative to Firecrawl with significant extensions. It implements the Firecrawl v2 API surface (/v2/scrape, /v2/search, /v2/map, /v2/crawl, /v2/extract, browser sessions, and monitors). Then it adds capabilities Firecrawl doesn't offer:

  • Persistent semantic search: a Qdrant vector index that remembers everything you've crawled, enabling similarity search across your entire archive
  • Grounded Q&A: the /v2/answer endpoint that searches, scrapes, and synthesizes with citations in a single round-trip
  • Site adapters: specialized extraction for GitHub, Substack, Reddit, YouTube, Bluesky, Gutenberg, Greenhouse, AshbyHQ, and Shopify stores
  • SlopSearX: a multi-engine meta-search aggregator (48 engines) that lifts the single-backend search limitation. This is another tool we built in-house at Groktopus, and we are open-sourcing it alongside GroktoCrawl. Same philosophy: the tools we use internally are the tools we give to the world.
  • Intelligent scrape cache: ETag/Last-Modified revalidation that avoids re-downloading unchanged pages
  • An autonomous research agent: kick off a multi-source research task and let it work through results
  • A web portal: served on port 8082 for human users who prefer a UI

Every service runs in Docker on your hardware.
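
The scrape cache bullet above relies on standard HTTP revalidation. Here is a minimal sketch of that mechanism, not GroktoCrawl's actual code: the CachedPage type and both helper functions are hypothetical names introduced for illustration. Only the If-None-Match / If-Modified-Since request headers and the 304 Not Modified status come from the HTTP specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedPage:
    """Hypothetical cache entry; not GroktoCrawl's actual schema."""
    body: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry: Optional[CachedPage]) -> dict:
    """Build the validators to send when re-fetching a cached URL."""
    headers = {}
    if entry is None:
        return headers  # nothing cached yet: plain unconditional fetch
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def apply_response(entry, status, body, response_headers):
    """Return the cache entry to keep after a (conditional) fetch."""
    if status == 304 and entry is not None:
        # Not Modified: the server sent no body, so reuse the cached copy.
        return entry
    # Anything else is treated as fresh content with fresh validators.
    return CachedPage(
        body=body,
        etag=response_headers.get("ETag"),
        last_modified=response_headers.get("Last-Modified"),
    )
```

When the server answers 304, only headers cross the wire, which is where the saving on unchanged pages comes from.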


What v0.9.0 Ships

This release centers on a production-grade Crawl Engine: BFS crawl with configurable concurrency, domain scope controls, glob and regex path filtering, a three-mode sitemap parser, and per-page webhooks with HMAC signatures. Crawl progress streams over SSE so you can watch pages come in live. A Valkey-backed cache with maxAge/minAge semantics prevents redundant work.
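
To make the crawl controls concrete, here is a minimal single-threaded sketch of a breadth-first crawl with domain scoping and glob path filtering. It is an illustration under stated assumptions, not the GroktoCrawl engine: the function name, its parameters, and the get_links callback are all hypothetical, and concurrency, regex filters, sitemaps, and webhooks are left out.

```python
from collections import deque
from fnmatch import fnmatch
from urllib.parse import urlparse


def bfs_crawl(start_url, get_links, include_globs=("*",), max_pages=100):
    """Breadth-first crawl limited to the start URL's host.

    get_links is a caller-supplied function: URL -> iterable of absolute URLs.
    The start URL is always visited, even if its path matches no glob.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()  # FIFO order is what makes this breadth-first
        visited.append(url)
        for link in get_links(url):
            parsed = urlparse(link)
            if parsed.netloc != host:
                continue  # domain scope: stay on the starting host
            if not any(fnmatch(parsed.path or "/", g) for g in include_globs):
                continue  # path filter: skip paths that match no glob
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the link extractor in as a callback keeps the traversal logic separate from fetching, so the same loop can be exercised against an in-memory link graph before it touches the network.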

Beyond the crawl engine, v0.9.0 adds five major capabilities:

  • Deep Search: multi-pass agentic search across all 48 SlopSearX engines, with auto-correction, intent classification, and grounded summarization. Set search_type=deep on /v2/search and get synthesized answers with citations instead of raw link lists.
  • Find Similar: POST /v2/find-similar takes a URL and returns semantically related pages from your crawled archive. Content-based discovery for when keyword search misses the connection.
  • Search Monitors: the /v2/monitor endpoint lets you watch pages for changes. Cron-triggered re-crawl with diff detection. Alert on content shifts, new keywords, or structural changes.
  • Observability: Prometheus counters per endpoint, Grafana dashboards for agent-svc and scraper-svc, structured logging with request-ID tracing, and alerting rules with per-alert runbooks. You can see what the system is doing.
  • Security: a SensitiveDataFilter that redacts secrets from log output, a reusable CircuitBreaker for resilience, Gitleaks secret scanning in CI, and pip-audit + mypy enforcement on every commit.
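
The idea behind redacting secrets from log output can be sketched with Python's standard logging module. This is an assumption-laden stand-in, not the project's SensitiveDataFilter: the RedactingFilter class and the SECRET_PATTERNS list are hypothetical, and the two patterns only catch simple key=value pairs and bearer tokens.

```python
import logging
import re

# Hypothetical patterns; a real deployment would tune these to its own secrets.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|token|password|secret)\b(\s*[=:]\s*)(\S+)"),
    re.compile(r"(?i)\b(bearer)(\s+)([A-Za-z0-9._~+/-]+=*)"),
]


class RedactingFilter(logging.Filter):
    """Mask secret-looking values in a record before any handler formats it."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merge args into the final text first
        for pattern in SECRET_PATTERNS:
            # Keep the key name and separator, replace only the value.
            message = pattern.sub(r"\1\2[REDACTED]", message)
        record.msg = message
        record.args = None  # args are already merged; avoid re-formatting
        return True  # never drop the record, only rewrite it
```

Attaching the filter to a handler with handler.addFilter(RedactingFilter()) applies it to every record that handler emits, whichever logger produced it.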

What It Costs to Run

The full stack fits on an engineer-grade laptop. These numbers come from a production deployment measured on an Intel i7 with 32 GB RAM.

Total disk footprint is ~12 GB, with the BGE-M3 embedding model alone taking 6 GB. Everything else (Playwright, Chromium, Qdrant, Valkey, SlopSearX, the document parser) adds up to the other 6 GB.

Running memory sits at ~2.7 GB RSS. The embedding model is 65% of that (1.77 GB). Everything else (the crawl engine, search backend, browser sessions, cache) runs in under 1 GB combined.

Run docker compose up on a machine with 16 GB RAM and 20 GB free disk, and the entire web research stack is yours.
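
The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the constant and function names are made up for illustration.

```python
# Figures quoted in this post, all in GB.
TOTAL_DISK, EMBEDDING_DISK = 12.0, 6.0
TOTAL_RSS, EMBEDDING_RSS = 2.7, 1.77
HOST_RAM, HOST_FREE_DISK = 16.0, 20.0  # the suggested machine


def footprint_summary():
    """Derive shares and headroom from the quoted figures."""
    return {
        "embedding_share_of_rss": EMBEDDING_RSS / TOTAL_RSS,
        "other_services_rss_gb": TOTAL_RSS - EMBEDDING_RSS,
        "other_components_disk_gb": TOTAL_DISK - EMBEDDING_DISK,
        "ram_headroom_gb": HOST_RAM - TOTAL_RSS,
        "disk_headroom_gb": HOST_FREE_DISK - TOTAL_DISK,
    }


if __name__ == "__main__":
    for name, value in footprint_summary().items():
        print(f"{name}: {value:.2f}")
```

The embedding model works out to roughly two thirds of resident memory, and the remaining 0.93 GB is consistent with the "under 1 GB combined" figure for everything else.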


The Architecture in Brief

GroktoCrawl runs as ten Docker services orchestrated through a single docker-compose.yml. Valkey handles queues and cache. Qdrant provides the vector index. SlopSearX aggregates 48 search engines. The scraper uses a three-tier fetch strategy: /llms.txt for agent-friendly sites, Accept: text/markdown for standards-compliant ones, and Playwright rendering for JavaScript-heavy pages. Every response includes a post-extraction quality assessment.
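
One way to picture the three-tier fetch strategy is as a fallback chain that tries the cheapest source first. The sketch below assumes that ordering and assumes each tier simply falls through on a miss; the post does not spell out how the real scraper decides. The fetch_tiered function, the Fetcher type, and the injected http_get and render callables are hypothetical.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# A fetcher takes (url, headers) and returns text, or None on a miss.
# It is the fetcher's job to treat a non-Markdown reply as a miss.
Fetcher = Callable[[str, dict], Optional[str]]


def fetch_tiered(url: str, http_get: Fetcher,
                 render: Callable[[str], str]) -> Tuple[str, str]:
    """Try the cheapest source first; return (tier_name, content)."""
    # Tier 1: a site-level /llms.txt published for agents.
    llms = http_get(urljoin(url, "/llms.txt"), {})
    if llms:
        return "llms.txt", llms
    # Tier 2: content negotiation, asking the server for Markdown.
    markdown = http_get(url, {"Accept": "text/markdown"})
    if markdown:
        return "markdown", markdown
    # Tier 3: full browser rendering (e.g. Playwright) as the last resort.
    return "rendered", render(url)
```

Injecting the fetcher and the renderer keeps the ordering logic testable without a network or a browser, and the returned tier name gives downstream code a place to record how each page was obtained.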

An agent service orchestrates the flow. A semantic service handles embedding and near-duplicate detection. A parse service extracts text from PDFs, EPUBs, and Office documents. A portal provides the human interface. An Ofelia cron scheduler drives the monitor system. Each service is independently scalable and restartable.


Why Self-Host?

The argument for self-hosting a web research stack isn't purely about cost, though the economics of per-API-call pricing become punishing at scale. It's about control over your data pipeline. Every page you scrape through a third-party service passes through their infrastructure. Every query you send becomes a candidate for their training data. Every crawl you run is subject to their rate limits, their content policies, their definition of fair use.

With GroktoCrawl, your data stays on your hardware. The embedding model runs locally. The search cache is yours. The crawl queue is yours. There's no tier upgrade email waiting for you when you start using the tool the way it was meant to be used.


v0.9.0 and Beyond

This release also includes 49 resolved type errors across the codebase, Grafana dashboards with 20+ panels, Prometheus alerting with full runbooks, integration tests across all seven services, and a complete set of Architecture Decision Records documenting every architectural choice.

The crawl engine, the deep search, the monitors, the grounded answers: it all runs on your hardware, under your control. That's the point.

GroktoCrawl is at github.com/groktopus/groktocrawl. SlopSearX is at github.com/magnus919/SlopSearX. Both are MIT licensed. Both are the same tools we use internally at Groktopus, given to the world as we built them. Contributions welcome. docker compose up and you're running.