Inside the Scan Agent: How We Monitor Billions of Web Pages
A technical look at how Brandog's autonomous scan agents crawl, index, and analyze billions of web pages across 500+ marketplaces, social networks, the open web, and domain registries.

The Scale of the Monitoring Challenge
The modern counterfeit economy operates across a fragmented digital landscape. A single counterfeiting operation might list products on Amazon, eBay, AliExpress, and Wish simultaneously while running Instagram storefronts, Facebook Marketplace listings, and standalone websites. To protect a brand effectively, you need to monitor all of these channels — not once a week, not once a day, but continuously.
At Brandog, our scan agents continuously crawl billions of web pages and monitor over 500 structured marketplaces across 50 countries — and that number grows every quarter as new platforms emerge. Beyond marketplaces, our agents index the broader open web: standalone e-commerce sites, social storefronts, forum listings, and even dark web marketplaces. On an average day, our agents process over 200 million web pages, analyze 15 million product images, and evaluate 8 million seller profiles. This article is a technical deep-dive into how we built a scanning infrastructure that operates at this scale while maintaining the accuracy that brand protection demands.
Architecture: The Agent Mesh
Our scanning infrastructure is organized as a mesh of specialized agents, each responsible for a specific aspect of the monitoring workflow. This architecture allows us to scale each function independently and deploy platform-specific adaptations without affecting the broader system.
The primary agent types are:
- Crawler agents: Platform-specific agents that navigate marketplace interfaces, extract listing data, and normalize it into a common schema. Each crawler is adapted to its target platform's structure, pagination logic, and rate-limiting policies.
- Visual analysis agents: Agents that process product images through our computer vision pipeline, performing image fingerprinting, logo detection, and anomaly scoring.
- Text analysis agents: Natural language processing agents that analyze listing titles, descriptions, and seller profiles for trademark usage, suspicious keywords, and deceptive claims.
- Correlation agents: Agents that link related findings across platforms, connecting seller accounts, shared images, and overlapping inventory to map counterfeiting networks.
- Prioritization agents: Agents that score each finding based on infringement severity, commercial impact, and enforcement feasibility, determining which cases should be actioned first.
Each agent type operates as an independent service, communicating through an event-driven message bus. This decoupled architecture means a surge in crawling activity on one platform does not bottleneck image analysis across the system, and a new platform can be added by deploying a new crawler agent without modifying any other component.
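The decoupling described above can be illustrated with a toy in-process publish/subscribe bus. The topic name and payload shape here are illustrative assumptions, and the production bus would be a distributed messaging system rather than an in-memory dictionary:

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Minimal in-process stand-in for the event-driven message bus."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # Deliver the event to every agent subscribed to this topic.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
analyzed = []

# A visual-analysis agent reacts to normalized listings. A newly deployed
# crawler only needs to publish to the same topic -- no other component changes.
bus.subscribe("listing.normalized", lambda e: analyzed.append(e["listing_id"]))
bus.publish("listing.normalized", {"listing_id": "ebay-123", "image_urls": []})
```

Because publishers and subscribers only share a topic name, a crawling surge on one platform queues events without blocking the consumers downstream.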
Crawling at Scale: Respecting Limits While Maximizing Coverage
Web crawling for brand protection is fundamentally different from general-purpose search engine crawling. Search engines crawl broadly, indexing everything; brand protection crawlers must crawl deeply within specific product categories while respecting each platform's terms of service and rate-limiting infrastructure.
Our crawler agents use adaptive rate limiting that calibrates request frequency based on real-time platform signals. If response latency increases, the crawler automatically reduces its request rate. If a platform returns rate-limiting headers, the crawler honors them exactly. This allows us to maximize data collection without triggering blocks.
"Aggressive crawling is counterproductive. If a platform blocks your IP range, you lose visibility entirely. The goal is sustainable monitoring — consistent coverage over months and years, not a one-time data dump."
For platforms that offer official APIs, such as Amazon's Product Advertising API and eBay's Browse API, we use API access wherever possible. API-first crawling provides more reliable data, explicit authorization, and structured output that requires less normalization.
Data Normalization: The Common Schema
With data flowing in from billions of web pages and 500+ structured marketplaces, each with its own data format and metadata conventions, normalization is one of the most underappreciated challenges in the system.
Our common listing schema captures:
- Product title and description (normalized to UTF-8, language-tagged)
- Image URLs and perceptual hashes
- Price (converted to USD using daily exchange rates)
- Seller identifier (platform-specific ID, display name, location if available)
- Category (mapped to a unified taxonomy)
- Listing URL and platform identifier
- Temporal metadata (first seen, last seen, last modified)
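A simplified rendering of the schema and the price-conversion step might look like the following. The field names and types are illustrative assumptions, not the production schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class NormalizedListing:
    """Illustrative version of the common listing schema."""
    platform: str
    listing_url: str
    title: str                 # normalized to UTF-8 upstream
    language: str              # language tag for the title/description
    description: str
    image_hashes: list[str]    # perceptual hashes of listing images
    price_usd: float           # converted using a daily exchange-rate table
    seller_id: str
    seller_name: str
    category: str              # node in the unified taxonomy
    first_seen: datetime
    last_seen: datetime

def normalize_price(amount: float, currency: str, usd_rates: dict[str, float]) -> float:
    """Convert a platform-native price to USD using the day's rate table."""
    return round(amount * usd_rates[currency], 2)
```

Every crawler emits this one shape, which is what lets downstream agents stay platform-agnostic.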
This normalization allows downstream agents to operate on a uniform data structure regardless of source platform. When we add a new platform, integration work is confined to the crawler; every other component works immediately with the normalized output.
Visual Analysis Pipeline
The visual analysis pipeline is the system's most computationally intensive component and the one most critical to detection accuracy. Each product image passes through a three-stage pipeline:
Stage 1: Relevance filtering. A lightweight classification model determines whether the image contains a product in the brand's category. Images of unrelated products (or non-product images like lifestyle photos) are discarded, reducing the load on subsequent stages by approximately 85%.
Stage 2: Feature extraction. A deep convolutional neural network generates a high-dimensional feature vector — the image's fingerprint. This is compared against the authenticated image database using approximate nearest neighbor search, producing a similarity score.
Stage 3: Anomaly detection. For images with high product similarity but no authenticated match, a specialized model examines fine-grained details: logo geometry, label text rendering, hardware finish, and packaging construction — catching counterfeits that use original photography.
The full pipeline processes an image in under 200 milliseconds on our inference infrastructure, enabling a throughput of over 15 million images per day per GPU cluster.
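The Stage 2 comparison can be illustrated with a brute-force stand-in. The vectors and image IDs are toy values, and a real deployment would use an approximate nearest neighbor index over a far larger authenticated database rather than the linear scan shown here:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity between two image fingerprints, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(fingerprint: list[float],
               authenticated_db: dict[str, list[float]]) -> tuple[str, float]:
    """Find the closest authenticated image (brute force; production uses ANN search)."""
    return max(
        ((img_id, cosine_similarity(fingerprint, vec))
         for img_id, vec in authenticated_db.items()),
        key=lambda pair: pair[1],
    )

db = {
    "official-photo-1": [0.9, 0.1, 0.0],
    "official-photo-2": [0.0, 1.0, 0.2],
}
match_id, score = best_match([0.88, 0.12, 0.01], db)
```

A high similarity score routes the image toward a verified match; high product similarity with no strong match is what triggers the Stage 3 anomaly model.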
Correlation: Mapping Networks, Not Just Listings
Individual listing detection is table stakes. The real value in brand protection intelligence comes from connecting individual listings into a coherent picture of counterfeiting operations.
Our correlation agents maintain a persistent graph database that links entities across platforms. When two seller accounts on different platforms use the same product image (identified by perceptual hash), the correlation agent creates an edge between those accounts. When a cluster of accounts shares similar naming patterns or coordinated listing timing, the agent groups them into a single counterfeiting operation.
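The image-hash linking step can be sketched as follows. The dictionaries are toy listings, and in production the resulting edges would be written into the persistent graph database rather than returned as a set:

```python
from collections import defaultdict

def link_accounts_by_image(listings: list[dict]) -> set:
    """Create edges between seller accounts that reuse the same perceptual hash."""
    accounts_by_hash: dict[str, set] = defaultdict(set)
    for listing in listings:
        for phash in listing["image_hashes"]:
            accounts_by_hash[phash].add((listing["platform"], listing["seller_id"]))

    edges = set()
    for accounts in accounts_by_hash.values():
        ordered = sorted(accounts)
        # Every pair of accounts sharing a hash gets one (deduplicated) edge.
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                edges.add((ordered[i], ordered[j]))
    return edges

listings = [
    {"platform": "ebay", "seller_id": "s1", "image_hashes": ["abc123"]},
    {"platform": "amazon", "seller_id": "s9", "image_hashes": ["abc123", "def456"]},
]
edges = link_accounts_by_image(listings)
```

Clustering over these edges (together with naming-pattern and timing signals) is what groups accounts into a single operation.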
This network intelligence transforms enforcement strategy. Instead of filing individual takedowns, the enforcement team can target an entire operation — requesting account-level suspensions across multiple platforms simultaneously. This approach reduces recidivism by an average of 50% compared to listing-level enforcement.
Prioritization: Not All Infringements Are Equal
With thousands of potential infringements detected daily, prioritization is essential. Our prioritization agents score each case across four dimensions:
- Confidence: How certain is the system that this listing is genuinely infringing (vs. a false positive or authorized use)?
- Impact: What is the estimated revenue impact, based on the listing's price, sales volume indicators, and platform traffic?
- Enforceability: How likely is a takedown to succeed on this platform, based on historical removal rates and the strength of the evidence?
- Strategic value: Is this listing part of a larger network that, if disrupted, would have cascading effects?
The composite score determines whether a case is routed for automatic enforcement, human review, or watchlist monitoring. This triage ensures enforcement resources are concentrated on the cases with the greatest effect on reducing counterfeit availability.
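A sketch of the composite scoring and triage routing. The weights and thresholds here are illustrative assumptions, not the production values, and all four dimensions are presumed normalized to [0, 1]:

```python
def composite_score(confidence: float, impact: float,
                    enforceability: float, strategic_value: float,
                    weights: tuple = (0.35, 0.30, 0.20, 0.15)) -> float:
    """Weighted sum of the four prioritization dimensions (weights assumed)."""
    dims = (confidence, impact, enforceability, strategic_value)
    return sum(w * d for w, d in zip(weights, dims))

def route(score: float, auto_threshold: float = 0.8,
          review_threshold: float = 0.5) -> str:
    """Triage a case by composite score (thresholds are illustrative)."""
    if score >= auto_threshold:
        return "automatic_enforcement"
    if score >= review_threshold:
        return "human_review"
    return "watchlist"
```

A high-confidence, high-impact case clears the automatic-enforcement threshold, while a weak signal stays on the watchlist until further evidence accumulates.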
Continuous Improvement: The Feedback Loop
The scan agent system is not static. Every enforcement outcome — a successful takedown, a rejected notice, a false positive identified by a human reviewer — feeds back into the system's models, improving accuracy over time.
This feedback loop is particularly important for visual analysis, where the definition of "suspicious" evolves as counterfeiters adjust tactics. Each confirmed counterfeit becomes a training example; each false positive becomes a negative example. The models retrain weekly, incorporating the latest data from all monitored platforms.
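The loop can be illustrated with a toy stand-in: enforcement outcomes accumulate as labeled examples, and a periodic retrain consumes them. A perceptron is used here purely for illustration; the actual models are deep networks, and the outcome labels and feature vectors are assumptions:

```python
class FeedbackStore:
    """Sketch: enforcement outcomes become labeled training examples."""
    def __init__(self) -> None:
        self.examples: list[tuple[list[float], int]] = []

    def record_outcome(self, features: list[float], outcome: str) -> None:
        # Confirmed counterfeits become positives; reviewer-flagged false
        # positives become hard negatives for the next retrain.
        label = 1 if outcome == "confirmed_counterfeit" else 0
        self.examples.append((features, label))

def retrain(examples, epochs: int = 10, lr: float = 0.1):
    """Toy perceptron retrain over accumulated feedback."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b
```

Each retraining cycle thus folds the week's confirmed counterfeits and corrected false positives directly into the next model.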
The result is a system that improves every day — not through software updates, but because the agents learn from every interaction with the counterfeiting ecosystem they are designed to combat.
Ready to deploy your agent workforce?
Join the waitlist for early access to Brandog's autonomous IP management platform.