The short answer

Google indexes only the first 2MB of raw, uncompressed HTML. Everything beyond that is silently ignored for ranking. Bing flags pages over 125KB with an indexing warning. AI retrieval crawlers – GPTBot, PerplexityBot, ClaudeBot – do not render JavaScript at all, so content loaded via JS is invisible to them regardless of page size.

A bloated page is not just a performance problem. It is a visibility problem. If your content is not in the first 2MB, it does not rank. If your content is rendered by JavaScript, AI engines cannot read it. If your HTML is over 125KB, Bing may not cache it fully. The crawlers that control your traffic have hard limits – and most developers have never been told what they are.

The crawl limits your developer probably never mentioned

Every search engine and AI engine that sends traffic to your site has a limit on how much of your page it will actually read. Exceed that limit and your content – however good it is – becomes invisible to the crawler. No ranking signal. No AI citation. No traffic.

Most developers focus on page load speed, Core Web Vitals, and mobile responsiveness. These matter. But the page size limits set by crawlers are a parallel issue that sits upstream of all of them. A fast page that exceeds Google’s indexing limit will still have its content truncated.

Here is the full picture across every major crawler that sends traffic in 2026.

Crawler / Engine      | Operator                 | Page size limit                 | Type           | Renders JS?
Googlebot             | Google Search            | 2MB HTML (indexing), 15MB fetch | Hard limit     | Yes (delayed)
Google-InspectionTool | Google Search Console    | 15MB fetch only                 | Fetch only     | Yes
Bingbot               | Bing / Microsoft Copilot | 125KB soft limit (HTML)         | Soft limit     | Partial
GPTBot                | OpenAI (training)        | No published limit              | Context window | No
ChatGPT-User          | OpenAI (retrieval)       | No published limit              | Context window | No
PerplexityBot         | Perplexity AI            | No published limit              | Context window | No
ClaudeBot             | Anthropic                | No published limit              | Context window | No
Applebot              | Apple (Siri, Spotlight)  | No published limit              | Context window | No

The key insight from this table: the crawlers that matter most in 2026 – Google, Bing, and every AI retrieval bot – all have constraints. They differ in type, but the result is the same. Exceed those constraints and your content does not get seen.

Google’s 2MB rule – what it actually means

On February 3, 2026, Google reorganised its crawler documentation and made explicit what had been implied for years: Googlebot indexes only the first 2MB of raw, uncompressed HTML. This is the hard limit for what gets sent to Google’s indexing pipeline. Content beyond that threshold is silently cut off.

Important distinction

The 2MB limit is applied to uncompressed HTML. Even if your server delivers a 200KB gzipped file, Googlebot truncates everything beyond 2MB once the HTML is decompressed. Compression reduces transfer size but not the indexing cut-off.
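To see why transfer size is misleading, here is a short Python sketch. The HTML content is invented for illustration – the point is that highly repetitive markup compresses dramatically, so a small gzipped transfer can hide a multi-megabyte uncompressed document:

```python
import gzip

# Build a deliberately repetitive ~3MB HTML document (hypothetical content).
row = ('<div class="product"><h3>Widget</h3><p>'
       + 'Lorem ipsum dolor sit amet. ' * 10
       + '</p></div>\n')
html = '<html><body>\n' + row * (3_000_000 // len(row)) + '\n</body></html>'

compressed = gzip.compress(html.encode('utf-8'))

print(f'Uncompressed HTML: {len(html) / 1_000_000:.1f} MB')
print(f'Gzipped transfer:  {len(compressed) / 1_000:.0f} KB')
# The 2MB indexing limit applies to the uncompressed size, not the
# much smaller gzipped transfer size shown in the network tab.
print('Over the 2MB indexing limit:', len(html) > 2_000_000)
```

A page like this would look tiny in a browser's network tab yet still lose everything past the 2MB mark at indexing time.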

The Google Search Console trap

There is a common misunderstanding here that catches developers out. When you use Google Search Console’s URL Inspection tool and run a Live Test, it shows you the complete source code – even for a 3MB page. This leads developers to believe the 2MB limit does not apply to their site.

It does. The URL Inspection tool uses Google-InspectionTool, which operates under the general 15MB fetch limit – not the 2MB indexing limit. What you see in Search Console is not what Googlebot sends to the indexer. The indexer only sees the first 2MB.

What causes pages to breach 2MB?

The median HTML page is around 30-33KB. At the 90th percentile, pages reach roughly 151KB. That is well within the 2MB threshold. But specific patterns can push pages dramatically over the limit.

E-commerce category pages

Hundreds of inline product descriptions, attributes, and reviews. Each product with embedded structured data. 1MB+ of pure HTML is common on large catalogue pages.

JavaScript-heavy frameworks

Next.js, Nuxt, and SvelteKit inject large JSON hydration payloads (such as Next.js's __NEXT_DATA__ script) directly into the HTML. On data-rich pages, these alone can run to several hundred KB.
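A quick way to gauge how much of a page is hydration JSON rather than indexable content is to measure the __NEXT_DATA__ script directly. This is a regex-based sketch on an invented page – a real audit would run it against your actual server-rendered HTML:

```python
import re

# Hypothetical server-rendered page carrying a large Next.js hydration payload.
products = ','.join(f'{{"sku":"A{i}","desc":"..."}}' for i in range(500))
html = (
    '<html><body><main><h1>Catalogue</h1><p>Visible content.</p></main>'
    '<script id="__NEXT_DATA__" type="application/json">'
    f'{{"props":{{"pageProps":{{"products":[{products}]}}}}}}'
    '</script></body></html>'
)

# Measure the hydration payload's share of the whole document.
match = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
                  html, re.DOTALL)
payload = match.group(1) if match else ''
share = len(payload) / len(html)
print(f'__NEXT_DATA__ payload: {len(payload):,} bytes ({share:.0%} of the HTML)')
```

When the payload dominates the document like this, the visible content competes with serialized state for the crawler's 2MB window.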

Inline CSS and JavaScript

Stylesheets and scripts embedded directly in the HTML source inflate page size without adding content. Apart from a small critical-rendering block, these should be external files.

Excessive schema markup

JSON-LD structured data is valuable for SEO and GEO. But adding exhaustive schema for dozens of items inline adds significant weight to the HTML document.

Bing’s 125KB soft limit – stricter than you think

Bing operates with a soft limit of 125KB for HTML page size. Pages that exceed this threshold trigger an “HTML size is too long” error in Bing Webmaster Tools, with the following warning: the page “risks not being fully cached” and content may not be fully acquired by the crawler.

125KB is a significantly tighter threshold than Google’s 2MB. A page that is comfortably within Google’s limits can still fail Bing’s soft cap. Given that Bing powers Microsoft Copilot – one of the most-used AI assistants for enterprise and B2B audiences – treating Bing visibility as secondary is a mistake for anyone targeting professional clients.

Why this matters for B2B

Microsoft Copilot – which powers AI answers in Bing, Windows, and Microsoft 365 – uses Bingbot as its primary retrieval crawler. If your pages exceed 125KB and are not fully cached, they may not appear in Copilot’s answers. For B2B service providers targeting enterprise or SMB clients, Copilot is an increasingly important visibility channel.

Unlike Google’s hard 2MB cut-off, Bing’s 125KB is a soft limit – the crawler may still attempt to index beyond it. But the error is a documented signal that your page is at risk of incomplete caching, and it shows up explicitly in Bing Webmaster Tools, unlike Google’s silent truncation.

AI crawlers in 2026 – a different kind of limit

In 2026, AI retrieval crawlers have overtaken traditional search crawlers in total request volume. Data from Cloudflare’s January 2026 analysis confirmed that AI-related crawlers are making 3.6x more requests than traditional search crawlers across their network. GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, and Amazonbot are now significant drivers of crawl activity.

  • 3.6x – AI crawler requests vs traditional search crawlers (2026)
  • 42% – of all AI bot requests come from OpenAI alone
  • 0 – major AI crawlers that render JavaScript

The critical difference with AI crawlers is not a documented MB limit – it is the context window constraint. Each AI retrieval bot can only process a certain amount of content per page request. Bloated pages filled with inline scripts, verbose HTML attributes, and redundant markup compete with your actual content for that limited processing window.

None of them render JavaScript

This is the most significant technical constraint for AI visibility, and most developers are unaware of it. Vercel’s analysis of nextjs.org confirmed that none of the major AI crawlers – including GPTBot, ClaudeBot, ChatGPT-User, and PerplexityBot – currently render JavaScript. If your content is loaded via JavaScript, it does not exist for AI engines.

High-risk patterns for AI visibility

Content loaded via React, Vue, or Angular client-side rendering. Lazy-loaded text and descriptions. Product details populated via AJAX. Any content that relies on JavaScript execution to appear in the DOM – all of this is invisible to every AI retrieval crawler. If you want to appear in ChatGPT answers, Perplexity citations, or Google AI Overviews, your content must be in the raw server-rendered HTML.

What AI crawlers actually want

AI retrieval crawlers optimise for clean, parseable, content-dense HTML. The less they have to wade through – in terms of inline scripts, redundant markup, and non-content code – the more of your actual content fits into their processing window. A lean, well-structured HTML document is not just good for Google. It is good for every AI engine that decides whether to cite your content.

The business cost of page bloat

Page size is not just a technical metric. It is a business metric with a direct line to revenue. Here is what happens when pages get heavy.

Search rankings drop

Content beyond Google’s 2MB is not indexed. If your primary service description, key headings, or FAQ content appears late in a bloated document, it may not rank at all.

AI engine invisibility

GPTBot and PerplexityBot cannot read JavaScript-loaded content. If key facts about your service are client-rendered, no AI engine will cite you – regardless of how authoritative your content is.

Conversion rate falls

Amazon found that every 100ms increase in load time reduced sales by 1%. Google data shows 53% of mobile users abandon a site if it takes over 3 seconds to load. Page size is the starting point for that delay.

Paid media ROI drops

Google Ads uses landing page experience as a quality score factor. Slow, bloated landing pages receive lower quality scores – which means higher cost-per-click for the same position. You pay more for less traffic.

How to keep your pages within limits

These are not abstract optimisation suggestions. They are direct, practical steps that bring measurable improvement to crawl coverage, indexing completeness, and AI engine visibility.

Move CSS and JavaScript to external files

This is the single highest-impact change on most sites. Inline styles and scripts inflate raw HTML size without adding content. Move every stylesheet to an external .css file and every script to an external .js file. For WordPress sites, this is standard practice under WordPress Coding Standards – but many page builders and theme frameworks inject significant amounts of inline code that needs to be audited and externalised.
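To find out how much inline code your pages are carrying, a rough byte count is enough to prioritise the work. The following is a regex sketch, not a full HTML parser – the sample page and its contents are hypothetical:

```python
import re

def inline_code_bytes(html: str) -> dict:
    """Rough estimate of bytes spent on inline <style>/<script> blocks.
    External scripts (src=...) have empty bodies and contribute nothing."""
    styles = re.findall(r'<style[^>]*>(.*?)</style>', html,
                        re.DOTALL | re.IGNORECASE)
    scripts = re.findall(r'<script[^>]*>(.*?)</script>', html,
                         re.DOTALL | re.IGNORECASE)
    return {
        'style_bytes': sum(len(s.encode('utf-8')) for s in styles),
        'script_bytes': sum(len(s.encode('utf-8')) for s in scripts),
    }

# Hypothetical page: a bloated inline stylesheet plus one external script.
page = (
    '<html><head><style>body { margin: 0; }' + '.x { color: red; }' * 200
    + '</style><script src="/app.js"></script></head>'
    '<body><script>console.log("hi");</script><p>Content</p></body></html>'
)
report = inline_code_bytes(page)
print(report)
```

Run against real pages, numbers in the tens of KB usually point straight at a page builder or theme framework injecting styles per element.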

Audit and reduce JSON-LD schema payload

Structured data is valuable for both SEO and GEO. But bulky schema blocks – particularly those with exhaustive product or FAQ arrays – add meaningful weight to your HTML. Audit your schema markup for efficiency. Use concise property values. Avoid duplicating content that already exists in the visible HTML.
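Schema weight is easy to measure before deciding what to trim. This sketch extracts each JSON-LD block and reports its type and size; the oversized FAQPage example is invented to show the pattern:

```python
import json
import re

def jsonld_sizes(html: str) -> list:
    """Return (type, size-in-bytes) for each JSON-LD block in the HTML."""
    blocks = re.findall(
        r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL)
    out = []
    for raw in blocks:
        data = json.loads(raw)
        out.append((data.get('@type', 'unknown'), len(raw.encode('utf-8'))))
    return out

# Hypothetical page with an exhaustive FAQPage array inlined into the HTML.
faq = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {"@type": "Question", "name": f"Question {i}?",
         "acceptedAnswer": {"@type": "Answer", "text": "A long answer. " * 20}}
        for i in range(40)
    ],
}
html = ('<html><head><script type="application/ld+json">'
        + json.dumps(faq) + '</script></head><body>...</body></html>')

sizes = jsonld_sizes(html)
for schema_type, size in sizes:
    print(f'{schema_type}: {size / 1000:.1f} KB')
```

A single block in the tens of KB is a candidate for trimming answer text or splitting the FAQ across subpages.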

Paginate or split long content pages

Category pages with hundreds of products, long-form content pages, and documentation hubs are the most common sources of 2MB+ HTML. Split them into logical subpages. Pagination and content hubs are not just better for crawlability – they allow you to target more specific keyword clusters and provide more relevant landing experiences.

Test actual uncompressed HTML size

Your server probably delivers gzip-compressed responses – so the 50KB transfer size shown in your browser’s network tab is not the number that matters. What matters is the uncompressed HTML size. Use tools like DebugBear, Screaming Frog, or your server logs to check the raw, uncompressed size of your HTML files. That is the number Googlebot and Bingbot measure against their limits.
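The same check can be scripted. This helper decompresses a response body and compares the uncompressed size against the two thresholds discussed above; it handles gzip only, and the sample page is hypothetical:

```python
import gzip

BING_SOFT_LIMIT = 125 * 1024          # 125KB Bingbot soft limit
GOOGLE_INDEX_LIMIT = 2 * 1024 * 1024  # 2MB Googlebot indexing limit

def check_crawl_limits(body: bytes, content_encoding: str = '') -> dict:
    """Measure the *uncompressed* HTML size and compare it against the
    crawler limits. Handles gzip only; real servers may also use br/zstd."""
    raw = gzip.decompress(body) if content_encoding == 'gzip' else body
    return {
        'uncompressed_bytes': len(raw),
        'over_bing_soft_limit': len(raw) > BING_SOFT_LIMIT,
        'over_google_index_limit': len(raw) > GOOGLE_INDEX_LIMIT,
    }

# A ~210KB page served gzip-compressed (hypothetical): the transfer is small,
# but the uncompressed size already breaches Bing's soft limit.
page = b'<html><body>' + b'<p>content</p>' * 15_000 + b'</body></html>'
report = check_crawl_limits(gzip.compress(page), content_encoding='gzip')
print(report)
```

Feed it the raw bytes and Content-Encoding header from any fetch of your own pages to get the number the crawlers actually measure.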

Ensure critical content is server-rendered

For AI engine visibility, server-side rendering (SSR) is not optional. Any content that must appear in ChatGPT answers, Perplexity citations, or Google AI Overviews needs to exist in the server-rendered HTML before JavaScript executes. Review your site architecture for content that relies on client-side rendering and move it to SSR.
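A simple sanity check is to verify that your must-be-cited phrases exist in the raw HTML, which is all a non-JS-rendering AI crawler ever sees. The page and phrases below are invented; point the same check at your own server-rendered output:

```python
def phrases_in_raw_html(raw_html: str, required_phrases: list) -> dict:
    """Check whether each key phrase exists in the raw, server-rendered
    HTML -- the only thing a crawler that does not execute JS can read."""
    return {p: (p in raw_html) for p in required_phrases}

# Hypothetical client-rendered app: the pricing copy only appears after
# JavaScript runs, so it is missing from the raw HTML entirely.
raw_html = ('<html><body><div id="root"></div>'
            '<h1>Acme Consulting</h1></body></html>')
required = ['Acme Consulting', 'Fixed-fee audits from $2,000']

result = phrases_in_raw_html(raw_html, required)
for phrase, present in result.items():
    print(f"{'OK  ' if present else 'MISS'} {phrase}")
```

Anything flagged MISS is invisible to GPTBot, PerplexityBot, and ClaudeBot until it is moved into the server-rendered output.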

Quick audit checklist

  • Check uncompressed HTML size – target under 125KB for Bing, keep key content in the first 2MB for Google
  • Move all CSS into external files – no inline <style> blocks in the HTML source
  • Move all JavaScript into external files – no inline <script> blocks except minimal critical JS
  • Verify critical content is visible in server-rendered HTML before JS executes
  • Audit JSON-LD schema for size – remove redundant properties, tighten descriptions
  • Review JavaScript framework hydration payloads (__NEXT_DATA__, etc.) – move data to API calls where possible
  • Paginate category pages with more than 50 inline product descriptions
  • Enable Bing Webmaster Tools and check for “HTML size is too long” errors
  • Confirm robots.txt allows access for ChatGPT-User, PerplexityBot, ClaudeBot – these are the retrieval bots, not the training bots
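The last checklist item can be verified with Python's standard-library robots.txt parser. The robots.txt below is a hypothetical example that blocks the training bot but allows the retrieval bots:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block OpenAI's training crawler, allow the
# retrieval bots that fetch pages to answer live user questions.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

for bot in ('GPTBot', 'ChatGPT-User', 'PerplexityBot', 'ClaudeBot'):
    verdict = 'allowed' if rp.can_fetch(bot, '/services/') else 'blocked'
    print(f'{bot}: {verdict}')
```

Running this against your live robots.txt (via RobotFileParser's set_url/read) catches the common mistake of blocking retrieval bots while trying to block only training bots.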