Users of ChatGPT have become accustomed to the numbered blue links that accompany its responses, serving as citations to external information. However, new research indicates that while ChatGPT retrieves information from dozens of pages for a single query, it ultimately cites only about 50% of them. This finding, derived from an analysis of 1.4 million ChatGPT prompts from February 2025, sheds light on the complex decision-making process behind the AI’s source selection.

According to studies by AI expert Dan Petrovic, when ChatGPT retrieves information, each result includes the page title, a brief snippet, the URL, and an ID number. This metadata is crucial for ChatGPT’s initial decision about which pages to open and cite: the title, snippet, and URL play a significant role before the AI even accesses the full content of a webpage. The researchers set out to understand what influences this selection, specifically whether semantic similarity between a page’s retrieval metadata and the user’s query increases citation likelihood, which fields matter most, and whether human-readable URLs perform better.

The research, conducted with the assistance of Ahrefs data scientist Xibeijia Guan, categorized retrieved sources into five internal categories, or ref_types: search, news, reddit, youtube, and academia. The citation rates across these categories vary dramatically. The general "search" index is the dominant source, accounting for the vast majority of both volume and citations, with 88.46% of search results being cited. This underscores the importance of ranking in search results for a page to be considered by ChatGPT.

In contrast, specialized verticals like YouTube and Academia, despite being retrieved in large volumes, are cited at significantly lower rates (0.51% and 0.40%, respectively). News sources are cited at 12.01%, while Reddit content, though retrieved extensively, is cited at a mere 1.93%.

A particularly striking finding is that 67.8% of all non-cited URLs originate from Reddit. This suggests that ChatGPT extensively utilizes Reddit for understanding topics and gauging consensus but rarely attributes information directly to it. The AI appears to learn from the crowd on Reddit but then cites other, more established sources.

The study also examined the data fields associated with retrieved URLs. Initially, it appeared that non-cited pages had more populated fields, such as snippets and publication dates, than cited pages. On closer inspection, however, this discrepancy was largely a compositional artifact of the Reddit data. Reddit content retrieved via API often includes pub_date metadata, and because Reddit makes up most of the non-cited pool, it inflates the pub_date percentage for non-cited pages.

Furthermore, research by David McSweeney on ChatGPT’s retrieval process indicates that the AI abandons the snippet field once it decides to cite a URL, opting instead to access the full page. Cited pages therefore show a lower percentage of snippets not because snippets count against a page, but as a byproduct of the AI’s workflow.

When the analysis was isolated to the "search" ref_type, a clearer picture emerged. Snippet data proved to be largely absent for both cited and non-cited search results, rendering it an unusable signal. Publication date percentages were closer, though non-cited search pages were slightly more likely to carry this metadata. Ultimately, the researchers concluded that the data does not support strong conclusions about the influence of snippet or publication date fields on citation likelihood. The episode highlights a risk of aggregate comparisons: without accounting for source type, data quirks can be mistaken for real patterns.
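The effect of segmenting by ref_type can be illustrated with a small sketch. The records below are invented toy data, not figures from the study, and the helper names (`fill_rate`, `rates_by_citation`) are hypothetical. The sketch only shows how a large aggregate gap in pub_date coverage can vanish once the comparison is restricted to one source type.

```python
from collections import defaultdict

# Hypothetical toy records: (ref_type, cited, has_pub_date).
# These counts are made up for illustration and are NOT the study's data.
rows = (
    [("search", True, False)] * 80
    + [("search", True, True)] * 20
    + [("search", False, False)] * 8
    + [("search", False, True)] * 2
    + [("reddit", True, True)] * 2
    + [("reddit", False, True)] * 98
)


def fill_rate(records):
    """Fraction of records that carry a pub_date."""
    records = list(records)
    return sum(has_date for _, _, has_date in records) / len(records)


def rates_by_citation(records):
    """Return {cited_flag: pub_date fill rate} for the given records."""
    groups = defaultdict(list)
    for row in records:
        groups[row[1]].append(row)
    return {cited: fill_rate(group) for cited, group in groups.items()}


# Aggregate comparison mixes source types together.
aggregate = rates_by_citation(rows)

# Segmented comparison looks at the "search" ref_type only.
search_only = rates_by_citation(r for r in rows if r[0] == "search")

print("aggregate:", aggregate)
print("search only:", search_only)
```

In this toy set, roughly 93% of non-cited rows carry a pub_date against about 22% of cited rows in aggregate, yet within "search" both groups sit at 20%. The whole gap comes from the Reddit rows, which are almost all non-cited and always dated.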

Relevance to user queries emerged as a critical factor. ChatGPT estimates relevance through a process akin to "semantic scoring." By approximating how ChatGPT might work using cosine similarity with embeddings from open-source models, the study found that titles semantically aligned with the AI’s internal "fanout queries" (the sub-questions it generates to find specific facts) significantly increase citation likelihood. Cited URLs consistently showed higher similarity between their titles and the original prompt, and the effect was even stronger when titles were compared against fanout queries. This reinforces that content relevant to ChatGPT’s internal sub-questions is key to selection.
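As a rough illustration of the scoring idea, the sketch below computes cosine similarity between a page title and a set of fanout queries. It is a minimal sketch under stated assumptions: the `embed` function is a word-count stand-in, not the open-source embedding models the study used, and the function names, titles, and queries are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in 'embedding': a sparse word-count vector.

    Assumption for illustration only. A real pipeline would call an
    embedding model here and get back a dense vector.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_fanout_match(title, fanout_queries):
    """Highest similarity between a title and any of the fanout queries."""
    title_vec = embed(title)
    return max(cosine_similarity(title_vec, embed(q)) for q in fanout_queries)


# Hypothetical fanout queries and page titles.
fanout = [
    "best running shoes for flat feet",
    "running shoes arch support reviews",
]
aligned = best_fanout_match("Best Running Shoes for Flat Feet in 2025", fanout)
unrelated = best_fanout_match("Our Company History", fanout)

print(f"aligned title:   {aligned:.3f}")
print(f"unrelated title: {unrelated:.3f}")
```

Taking the maximum over fanout queries is one possible design choice: it reflects the idea that a title only needs to match one sub-question well. Whether ChatGPT aggregates scores this way is not stated in the research.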

For search results, pages with natural language URL slugs also showed a higher citation rate (89.78%) compared to those without (81.11%). The core takeaway is that if a URL and its title do not semantically align with the AI’s internal inquiries, the page is less likely to be cited.
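The study's exact definition of a natural language URL slug is not given here, so the following is only one plausible heuristic: treat the last path segment as natural language if it contains at least a few alphabetic words separated by hyphens or underscores. The function name and the `min_words` threshold are assumptions made for this sketch.

```python
import re
from urllib.parse import urlparse


def has_natural_language_slug(url, min_words=3):
    """Heuristic check (an assumption, not the study's rule).

    Returns True when the final path segment contains at least
    `min_words` purely alphabetic words split by '-' or '_'.
    """
    path = urlparse(url).path.rstrip("/")
    slug = path.rsplit("/", 1)[-1]
    # Drop a trailing file extension such as ".html".
    slug = re.sub(r"\.[A-Za-z0-9]+$", "", slug)
    words = [w for w in re.split(r"[-_]+", slug) if w.isalpha()]
    return len(words) >= min_words


# Hypothetical URLs for illustration.
examples = [
    "https://example.com/blog/how-to-train-a-puppy",
    "https://example.com/article/48213.html",
    "https://example.com/p?id=48213",
]
for url in examples:
    print(url, has_natural_language_slug(url))
```

A rule like this is deliberately crude: it ignores the domain and query string, and it would miss slugs in scripts that `str.isalpha` handles differently or that do not use hyphens between words.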

The age of content also plays a role, though a nuanced one. While previous research indicated a strong preference for fresher content in AI citations, this study found that cited pages in the search index span a wide range of ages, with a median of around 500 days. Intriguingly, non-cited pages were overwhelmingly very young. This suggests that within a given retrieval set, while freshness is a factor, relevance to fanout queries remains the primary driver. A new page that matches these queries well will be cited, while a new page that does not, despite being retrieved, will be ignored. Because the pool of non-cited search pages is small, definitive conclusions about the age gap are limited.

Freshness becomes a more decisive factor in the "news" category. Here, title relevance scores for cited and non-cited pages were nearly identical. In such cases, the AI appears to fall back on page age as a tie-breaker, with cited news pages skewing younger (median age around 200 days) than non-cited news pages (median around 300 days).

In summary, ChatGPT operates as a discerning editor, favoring its general search index and employing semantic relevance to select sources. Reddit is heavily utilized for information gathering but rarely cited. The research also highlights the importance of careful analysis, particularly the need to segment data by source type (ref_type) to avoid misleading conclusions. Ultimately, pages that are semantically aligned with ChatGPT’s internal queries and sourced through appropriate retrieval channels are most likely to be cited.