Ahrefs Study Reveals ChatGPT Uses Reddit for Context but Rarely Provides Citations

Generative artificial intelligence is beginning to supplement, and in some cases replace, traditional search engine interactions, and that shift is changing how information is retrieved and credited. An analysis conducted by Ahrefs, a search engine optimization (SEO) toolset provider, has uncovered a significant discrepancy in how OpenAI’s ChatGPT uses source material. After examining approximately 1.4 million ChatGPT prompts, researchers found that while Reddit is a primary source the AI draws on to understand human sentiment and topical consensus, it is almost never credited in the final output. This phenomenon, now being termed the "Reddit Gap," points to a complex relationship between AI training data, real-time search retrieval, and the eventual attribution of information to content creators.
The Scope of the Ahrefs Analysis
The study, released in early 2025, represents one of the most significant independent audits of ChatGPT’s citation behavior to date. Ahrefs analyzed 1.4 million prompts specifically processed through ChatGPT 5.2. The methodology involved tracking the entire lifecycle of a query: from the moment a user enters a prompt to the background retrieval of web pages, and finally, the generation of the response containing citations.
The researchers found that ChatGPT’s internal search mechanism is highly active. When a user submits a complex query, the AI does not simply perform a single search; instead, it often breaks the prompt down into a series of narrower sub-questions. According to the data, roughly 50% of the pages retrieved during this background search process eventually make it into the final response as a visible citation. However, the distribution of these citations is far from uniform across different types of web sources.
Unpacking the Reddit Gap
One of the most striking findings of the report involves Reddit, the social news aggregation and discussion website. Despite its massive influence on internet culture and its role as a repository for human experience, Reddit content faces a steep uphill battle for citation within the ChatGPT interface.
Ahrefs identified a specific "Reddit source" within the ChatGPT retrieval data—distinct from general web searches—and found that pages from this source were cited only 1.93% of the time. This is particularly notable because Reddit content was found to be "retrieved" at a very high frequency. In fact, of all the pages that ChatGPT looked at but chose not to cite, a staggering 67.8% originated from this dedicated Reddit source.
Ahrefs’ researchers concluded that ChatGPT uses Reddit extensively "upstream." The AI appears to leverage the platform to gauge public opinion, understand the nuances of a topic, and build a contextual framework for its answer. However, when it comes to "downstream" output—the final text shown to the user—the AI tends to favor more traditional web pages for its formal citations. This suggests that while Reddit is essential for the AI’s "understanding," it is not viewed as a primary authoritative source for "attribution."
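The two figures behind the "Reddit Gap" (the citation rate for a source, and that source's share of all uncited pages) are simple ratios over a retrieval log. The following is a minimal sketch of that arithmetic, not Ahrefs' actual pipeline: the `(source, was_cited)` record format and the sample rows are hypothetical and are not data from the study.

```python
from collections import Counter

# Hypothetical retrieval log: (source_type, was_cited).
# Illustrative values only; these are NOT figures from the Ahrefs study.
retrieved_pages = [
    ("reddit", False),
    ("reddit", False),
    ("reddit", False),
    ("reddit", True),
    ("web", True),
    ("web", True),
    ("web", False),
]

def citation_rates(pages):
    """Return the fraction of retrieved pages that were cited, per source."""
    retrieved = Counter(source for source, _ in pages)
    cited = Counter(source for source, was_cited in pages if was_cited)
    return {source: cited[source] / retrieved[source] for source in retrieved}

def uncited_share(pages, source):
    """Return the share of all uncited pages that came from `source`."""
    uncited = [s for s, was_cited in pages if not was_cited]
    return uncited.count(source) / len(uncited) if uncited else 0.0

rates = citation_rates(retrieved_pages)
reddit_uncited_share = uncited_share(retrieved_pages, "reddit")
```

The two measures answer different questions: the first asks how often a Reddit-source page earns a citation once retrieved, while the second asks how much of the uncited pile Reddit accounts for. The study reports 1.93% and 67.8% for these, respectively.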
The OpenAI and Reddit Partnership Context
To understand why this gap is significant, one must look back at the evolving relationship between OpenAI and Reddit. In May 2024, the two companies announced a landmark data partnership. This deal granted OpenAI access to Reddit’s Data API, allowing the AI firm to integrate Reddit content into ChatGPT and other products in a more structured, real-time manner.
The partnership was intended to help ChatGPT "surface" Reddit content more effectively. However, the Ahrefs data suggests that "surfacing" does not necessarily equate to "citing." While the partnership likely facilitated the high retrieval rates observed in the study, the AI’s internal logic seems programmed to prioritize standard web search results—such as news articles, blogs, and official company sites—for its footnotes. It is important to note that Reddit threads can still be cited if they appear in a standard web search result, but the specific, direct-access source provided by the partnership remains largely invisible to the end-user.
Factors Influencing Citation: The Role of Sub-Queries
Beyond the Reddit findings, the Ahrefs report sheds light on the technical mechanics of how ChatGPT decides which pages are worthy of a link. The study indicates that the "broadness" of a webpage is often a disadvantage.
When ChatGPT processes a prompt, it generates internal sub-queries to find specific pieces of information. Ahrefs used open-source tools to calculate "similarity scores" between these sub-queries and the titles and URLs of the retrieved pages. The results were clear: pages with titles and URLs that closely matched the specific sub-questions were significantly more likely to be cited.
For example, if a user asks "How do I fix a leaking faucet in an old house?", ChatGPT might internally search for "types of old faucet valves" or "tools for vintage plumbing." A comprehensive guide titled "The Ultimate Home Repair Manual" might be retrieved, but a specific page titled "Identifying and Repairing Compression Valves in 1950s Faucets" is far more likely to receive the actual citation because it aligns more precisely with the AI’s narrow sub-query.
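The faucet example can be made concrete with a toy similarity score. The report does not name the open-source tools Ahrefs used, so the sketch below swaps in a plain token-overlap (Jaccard) measure as a stand-in; the `tokens` and `jaccard` helpers and the naive plural stripping are assumptions made for illustration only.

```python
import re

def tokens(text):
    """Lowercase word tokens with a naive plural strip (illustrative only)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words}

def jaccard(a, b):
    """Token-set overlap between two strings, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# Sub-query and page titles taken from the example above.
sub_query = "types of old faucet valves"
generic_title = "The Ultimate Home Repair Manual"
specific_title = "Identifying and Repairing Compression Valves in 1950s Faucets"

generic_score = jaccard(sub_query, generic_title)
specific_score = jaccard(sub_query, specific_title)
```

Under this crude measure the narrowly titled page shares the terms "faucet" and "valve" with the sub-query, while the broad manual shares none. An embedding-based similarity would capture more than exact word overlap, but the ordering in this example would likely be the same.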
The Importance of URL Structure and Metadata
The data also highlighted the enduring importance of technical SEO, specifically regarding URL slugs. Descriptive, human-readable URLs continue to perform better in the age of AI search. According to Ahrefs, pages with clear and descriptive URL slugs were cited 89.78% of the time they appeared in search results. In contrast, pages with non-descriptive or cluttered URLs (containing strings of random numbers or symbols) saw their citation rate drop to 81.11%, a gap of roughly 8.7 percentage points.
This finding is broadly consistent with previous research from SE Ranking, which noted that ChatGPT tends to favor URLs that describe broader topics or clear categories over those that are hyper-focused on a single keyword or are structurally incoherent. For digital publishers, this underscores a practical point: the same "best practices" that helped pages rank in Google, such as clear titles and logical URL structures, now appear to be among the signals AI agents use to judge source credibility and relevance.
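The difference between a descriptive slug and a cluttered one can be approximated with a simple check. This is a hypothetical heuristic, not the classification Ahrefs applied: the `looks_descriptive` function, its `min_words` threshold, and the example URLs are all assumptions chosen for illustration.

```python
import re
from urllib.parse import urlparse

def slug_words(url):
    """Split the last path segment of a URL into lowercase parts."""
    path = urlparse(url).path.rstrip("/")
    slug = path.rsplit("/", 1)[-1]
    return [p for p in re.split(r"[-_]+", slug.lower()) if p]

def looks_descriptive(url, min_words=2):
    """Hypothetical heuristic: the slug is mostly alphabetic words."""
    parts = slug_words(url)
    wordlike = [p for p in parts if p.isalpha() and len(p) > 2]
    return len(wordlike) >= min_words and len(wordlike) >= len(parts) / 2

# Example URLs are invented for illustration.
clean = "https://example.com/plumbing/repair-compression-valve-faucet"
cluttered = "https://example.com/p/83921?id=7f3a9c"

clean_is_descriptive = looks_descriptive(clean)
cluttered_is_descriptive = looks_descriptive(cluttered)
```

A check like this only looks at the final path segment and ignores query strings, so it is best treated as a quick audit aid for spotting ID-style URLs rather than a measure of how any AI system scores a page.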
Chronology of AI Search Evolution
How ChatGPT retrieves and cites sources has changed quickly since mid-2024:
- May 2024: OpenAI and Reddit announce a partnership for data access, aiming to enhance the "search" capabilities of ChatGPT.
- Late 2024: OpenAI begins rolling out "ChatGPT Search" (formerly SearchGPT) to a wider audience, moving the model toward a hybrid of a chatbot and a search engine.
- February 2025: Ahrefs collects data for its 1.4 million prompt study, focusing on the ChatGPT 5.2 model.
- March 2025: OpenAI begins the transition to GPT-5.3 "Instant." Early analysis of this update by firms like Resoneo suggests a 20% decrease in the number of unique domains cited per response, indicating a trend toward more concise and perhaps more "closed" AI responses.
Broader Implications for Digital Publishers and SEO
The "Reddit Gap" and the emphasis on sub-query matching have profound implications for the future of the web. If AI models continue to use high-quality community data (like Reddit) to learn and synthesize answers without providing traffic-driving citations, it creates a "value extraction" problem. Platforms and creators provide the context, but the AI provides the answer, potentially cutting the original creator out of the loop.
For businesses and SEO professionals, the Ahrefs report suggests a shift in strategy. The traditional "keyword-first" approach is becoming less effective than a "sub-query" approach. To improve the odds of being cited by ChatGPT, publishers should:
- Anticipate the AI’s Search Plan: Creators should look beyond the main topic and address the specific, granular questions an AI might ask to verify a broader claim.
- Optimize for Semantic Alignment: Page titles should be literal and descriptive rather than clever or clickbait-oriented, as the AI’s similarity scoring favors clarity.
- Maintain Technical Excellence: Clean URL structures are not just for humans or Google bots; they are a key trust signal for AI retrieval systems.
Conclusion and Future Outlook
The Ahrefs study provides a rare glimpse into the "black box" of AI search. It reveals a system that is sophisticated in its retrieval, extensively mining platforms like Reddit for context, but highly selective in its attribution. As OpenAI continues to update its models, such as the transition to GPT-5.3, early analysis suggests that the number of unique domains cited per response is shrinking.
This trend suggests that the window for being cited by an AI is narrowing. Only the most relevant, structurally clear, and authoritative sources will earn a place in the final response. For Reddit and its millions of contributors, the study serves as a reminder that being "useful" to an AI does not necessarily guarantee being "visible" to the AI’s users. As the digital ecosystem evolves, the balance between AI utility and creator compensation remains one of the most critical unresolved issues in the technology sector.







