An Anonymous Source Shared Thousands of Leaked Google Search API Documents with Me; Everyone in SEO Should See Them

Update: We are diving deep into the Google API leak and what it means for marketers in our next episode of SparkToro Office Hours on June 27. Join Michael King, Founder & CEO, iPullRank, and Rand Fishkin, Co-founder & CEO, SparkToro, to learn more. RSVP for the free webinar.

On Sunday, May 5th, I received an email from a person claiming to have access to a massive leak of API documentation from inside Google’s Search division. The email further claimed that these leaked documents were confirmed as authentic by ex-Google employees, and that those ex-employees and others had shared additional, private information about Google’s search operations.

Many of their claims directly contradict public statements made by Googlers over the years, in particular the company’s repeated denial that click-centric user signals are employed, denial that subdomains are considered separately in rankings, denials of a sandbox for newer websites, denials that a domain’s age is collected or considered, and more. 

Naturally, I was skeptical. The claims made by this source (who asked to remain anonymous) seemed extraordinary–claims like:

  • In their early years, Google’s search team recognized a need for full clickstream data (every URL visited by a browser) for a large percent of web users to improve their search engine’s result quality.
  • A system called “NavBoost” (cited by VP of Search, Pandu Nayak, in his DOJ case testimony) initially gathered data from Google’s Toolbar PageRank, and desire for more clickstream data served as the key motivation for creation of the Chrome browser (launched in 2008).
  • NavBoost uses the number of searches for a given keyword to identify trending search demand, the number of clicks on a search result (I ran several experiments on this from 2013-2015), and long clicks versus short clicks (which I presented theories about in this 2015 video).
  • Google utilizes cookie history, logged-in Chrome data, and pattern detection (referred to in the leak as “unsquashed” clicks versus “squashed” clicks) as effective means for fighting manual & automated click spam.
  • NavBoost also scores queries for user intent. For example, certain thresholds of attention and clicks on videos or images will trigger video or image features for that query and related, NavBoost-associated queries.
  • Google examines clicks and engagement on searches both during and after the main query (referred to as a “NavBoost query”). For instance, if many users search for “Rand Fishkin,” don’t find SparkToro, and immediately change their query to “SparkToro” and click SparkToro.com in the search result, SparkToro.com (and websites mentioning “SparkToro”) will receive a boost in the search results for the “Rand Fishkin” keyword.
  • NavBoost’s data is used at the host level for evaluating a site’s overall quality (my anonymous source speculated that this could be what Google and SEOs called “Panda”). This evaluation can result in a boost or a demotion.
  • Other minor factors such as penalties for domain names that exactly match unbranded search queries (e.g. mens-luxury-watches.com or milwaukee-homes-for-sale.net), a newer “BabyPanda” score, and spam signals are also considered during the quality evaluation process.
  • NavBoost geo-fences click data, taking into account country and state/province levels, as well as mobile versus desktop usage. However, if Google lacks data for certain regions or user-agents, they may apply the process universally to the query results.
  • During the Covid-19 pandemic, Google employed whitelists for websites that could appear high in the results for Covid-related searches.
  • Similarly, during democratic elections, Google employed whitelists for sites that should be shown (or demoted) for election-related information.

And these are only the tip of the iceberg.

Extraordinary claims require extraordinary evidence. And while some of these overlap with information revealed during the Google/DOJ case (some of which you can read about on this thread from 2020), many are novel and suggest insider knowledge.

So, this past Friday, May 24th (following several emails), I had a video call with the anonymous source.

Update (5/28 at 10:00am Pacific): The anonymous source has decided to come forward. This video announces their identity, Erfan Azimi, an SEO practitioner and the founder of EA Eagle Digital.

Prior to the email and call, I had neither met nor heard of Erfan. He asked that his identity remain veiled, and that I merely include the quote below:

An eagle uses the storm to reach unimaginable heights.
– Matshona Dhliwayo

After the call, I was able to confirm details of Erfan’s work history, people we both know from the marketing world, and several of their claims about being at particular events with industry insiders (including Googlers), though I cannot confirm details of the meetings nor the content of discussions they claim to have had.

During our call, Erfan showed me the leak itself: more than 2,500 pages of API documentation containing 14,014 attributes (API features) that appear to come from Google’s internal “Content API Warehouse.” Based on the document’s commit history, this code was uploaded to GitHub on Mar 27, 2024 and not removed until May 7, 2024. (Note: because this piece was, post-publishing, edited to reflect Erfan’s identity, he’s referred to below as “the anonymous source”).

This documentation doesn’t show things like the weight of particular elements in the search ranking algorithm, nor does it prove which elements are used in the ranking systems. But, it does show incredible details about data Google collects. Here’s an example of the document format:

After walking me through a handful of these API modules, the source explained their motivations (around transparency, holding Google to account, etc.) and their hope: that I would publish an article sharing this leak, revealing some of the many interesting pieces of data it contained, and refuting some “lies” Googlers “had been spreading for years.”

Is this API Leak Authentic? Can We Trust It?

A critical next step in the process was verifying the authenticity of the API Content Warehouse documents. So, I reached out to some ex-Googler friends, shared the leaked docs, and asked for their thoughts. Three ex-Googlers wrote back: one said they didn’t feel comfortable looking at or commenting on it. The other two shared the following (off the record and anonymously):

  • “I didn’t have access to this code when I worked there. But this certainly looks legit.”
  • “It has all the hallmarks of an internal Google API.”
  • “It’s a Java-based API. And someone spent a lot of time adhering to Google’s own internal standards for documentation and naming.”
  • “I’d need more time to be sure, but this matches internal documentation I’m familiar with.”
  • “Nothing I saw in a brief review suggests this is anything but legit.”

Next, I needed help analyzing and deciphering the naming conventions and more technical aspects of the documentation. I’ve worked with APIs a bit, but it’s been 20 years since I wrote code and 6 years since I practiced SEO professionally. So, I reached out to one of the world’s foremost technical SEOs: Mike King, founder of iPullRank.

During a 40-minute phone call on Friday afternoon, Mike reviewed the leak and confirmed my suspicions: this appears to be a legitimate set of documents from inside Google’s Search division, and contains an extraordinary amount of previously-unconfirmed information about Google’s inner workings.

2,500 technical documents is an unreasonable amount of material to ask one man (a dad, husband, and entrepreneur, no less) to review in a single weekend. But, that didn’t stop Mike from doing his best.
He’s put together an exceptionally detailed initial review of the Google API leak here, which I’ll reference more in the findings below. And he’s also agreed to join us at SparkTogether 2024 in Seattle, WA on Oct. 8, where he’ll present the fully transparent story of this leak in far greater detail, and with the benefit of the next few months of analysis.


Qualifications and Motivations for this Post

Before we go further, a few disclaimers: I no longer work in the SEO field. My knowledge of and experience with SEO is 6+ years out of date. I don’t have the technical expertise or knowledge of Google’s internal operations to analyze an API documentation leak and confirm with certainty whether it’s authentic (hence getting Mike’s help and the input of ex-Googlers).

So why publish on this topic?

Because when I spoke to the party that sent me this information, I found them credible, thoughtful, and deeply knowledgeable. Despite going into the conversation deeply skeptical, I could identify no red flags, nor any malicious motivation. This person’s sole aim appeared quite aligned with my own: to hold Google accountable for public statements that conflict with private conversations and leaked documentation, and to bring greater transparency to the field of search marketing. And they believed that, despite my years removed from SEO, I was the best person to share this publicly.

These are goals I cared about deeply for almost two decades. And while my professional life has moved on (I now run two companies: SparkToro, which makes audience research software and Snackbar Studio, an indie video game developer), my interest in and connections to the world of Search Engine Optimization remain strong. I feel a deep obligation to share information about how the world’s dominant search engine works, especially information Google would prefer to keep quiet. And sadly, I’m not sure where else to send something this potentially groundbreaking.

Years ago, before he left journalism to become Google’s Search Liaison, Danny Sullivan would have been my go-to source for a leak of this magnitude. He had the gravitas, resume, knowledge, and experience to examine a claim like this and present it fairly in the court of public opinion. There have been so many times in the last few years I’ve wished for Danny’s calm, even-handed, tough-but-fair-on-Google approach to newsworthy pieces like this–pieces that could reach as far as the company’s statements on the witness stand (e.g. his eloquent writing on Google’s indefensible privacy claims about organic keyword data).

Whatever Google’s paying him, it isn’t nearly enough.

Apologies that instead of Danny, dear reader, you’re stuck with me. But since you are, I’m going to assume you may not be familiar with my background or credentials, and briefly share those.

OK. Back to the Google leak.


What is the Google API Content Warehouse?

When looking through the massive trove of API documentation, the first reasonable set of questions might be: “What is this? What is it used for? Why does it exist in the first place?”

The leak appears to come from GitHub, and the most credible explanation for its exposure matches what my anonymous source told me on our call: these documents were inadvertently and briefly made public (many links in the documentation point to private GitHub repositories and internal pages on Google’s corporate site that require specific, Google-credentialed logins). During this probably-accidental, public period between March and May of 2024, the API documentation was spread to Hexdocs (which indexes public GitHub repos) and found/circulated by other sources (I’m certain that others have a copy, though it’s odd that I could find no public discourse until now).

According to my ex-Googler sources, documentation like this exists on almost every Google team, explaining various API attributes and modules to help familiarize those working on a project with the data elements available. This leak matches others in public GitHub repositories and on Google’s Cloud API documentation, using the same notation style, formatting, and even process/module/feature names and references.

If that all sounds like a technical mouthful, think of this as instructions for members of Google’s search engine team. It’s like an inventory of books in a library, a card catalogue of sorts, telling those employees who need to know what’s available and how they can get it.

But, whereas libraries are public, Google search is one of the most secretive, closely-guarded black boxes in the world. In the last quarter century, no leak of this magnitude or detail has ever been reported from Google’s search division.

How certain can we be that Google’s search engine uses everything detailed in these API docs?

That’s open to interpretation. Google could have retired some of these, used others exclusively for testing or internal projects, or may even have made API features available that were never employed.

However, there are references in the documentation to deprecated features and specific notes on others indicating they should no longer be used. That strongly suggests those not marked with such details were still in active use as of the March, 2024 leak.

We also can’t say for certain whether the March leak is of the most recent version of this documentation. The most recent date I can find referenced in the API docs is August of 2023:

The relevant text reads:

“The domain-level display name of the website, such as “Google” for google.com. See go/site-display-name for more details. As of Aug 2023, this field is being deprecated in favor of info.[AlternativeTitlesResponse].site_display_name_response field, which also contains host-level site display names with additional information.”

A reasonable reader would conclude that the documentation was up-to-date as of last summer (references to other changes in 2023 and earlier years, all the way back to 2005, are also present), and possibly even up-to-date as of the March 2024 date of disclosure.

Google search obviously changes massively from year to year, and recent introductions like their much-maligned AI Overviews do not make an appearance in this leak. Which of the items mentioned are actively used today in Google’s ranking systems? That’s open to speculation. This trove contains fascinating references, many that will be entirely new to non-Google-search-engineers.

But, I would urge readers not to point to a particular API feature in this leak and say: “SEE! That’s proof Google uses XYZ in their rankings.” It’s not quite proof. It’s a strong indication, stronger than patent applications or public statements from Googlers, but still no guarantee.

That said, it’s as close to a smoking gun as anything since Google’s execs testified in the DOJ trial last year. And, speaking of that testimony, much of it is corroborated and expanded on in the document leak, as Mike details in his post. 👀

What can we learn from the Data Warehouse Leak?

I expect that interesting and marketing-applicable insights will be mined from this massive file set for years to come. It’s simply too big and too dense to think that a weekend of browsing could unearth a comprehensive set of takeaways, or even come close.

However, I will share five of the most interesting, early discoveries in my perusal, some that shed new light on things Google has long been assumed to be doing, and others that suggest the company’s public statements (especially those on what they “collect”) have been erroneous. Because doing so would be tedious and could be perceived as personal grievances (given Google’s historic attacks on my work), I won’t bother showing side-by-sides of what Googlers said vs. what this document insinuates. Besides, Mike did a great job of that in his post.

Instead, I’ll focus on interesting and/or useful takeaways, and my conclusions from the whole of the modules I’ve been able to review, Mike’s piece on the leak, and how this combines with other things we know to be true of Google.

#1: Navboost and the use of clicks, CTR, long vs. short clicks, and user data

A handful of modules in the documentation make reference to features like “goodClicks,” “badClicks,” “lastLongestClicks,” impressions, squashed, unsquashed, and unicorn clicks. These are tied to Navboost and Glue, two words that may be familiar to folks who reviewed Google’s DOJ testimony. Here’s a relevant excerpt from DOJ attorney Kenneth Dintzer’s cross-examination of Pandu Nayak, VP of Search on the Search Quality team:

Q. So remind me, is navboost all the way back to 2005?
A. It’s somewhere in that range. It might even be before that.

Q. And it’s been updated. It’s not the same old navboost that it was back then?
A. No.

Q. And another one is glue, right?
A. Glue is just another name for navboost that includes all of the other features on the page.

Q. Right. I was going to get there later, but we can do that now. Navboost does web results, just like we discussed, right?
A. Yes.

Q. And glue does everything else that’s on the page that’s not web results, right?
A. That is correct.

Q. Together they help find the stuff and rank the stuff that ultimately shows up on our SERP?
A. That is true. They’re both signals into that, yes.

A savvy reader of these API documents would find they support Mr. Nayak’s testimony (and align with Google’s patent on site quality):

Google appears to have ways to filter out clicks they don’t want to count in their ranking systems, and include ones they do. They also seem to measure length of clicks (i.e. pogo-sticking – when a searcher clicks a result and then quickly clicks the back button, unsatisfied by the answer they found) and impressions.

Plenty has already been written about Google’s use of click data, so I won’t belabor the point. What matters is that Google has named and described features for that measurement, adding even more evidence to the pile.
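
For readers who think in code, here is a minimal sketch of what bucketing clicks by length could look like. The counter names goodClicks, badClicks, and lastLongestClicks appear in the leak; everything else (the 30-second cutoff, the Click record, the summarize_clicks function, the sample URLs) is a hypothetical assumption for illustration. The leak does not reveal how Google actually computes or weights these values.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the leak does not say what separates a good click from a bad one.
SHORT_CLICK_SECONDS = 30.0


@dataclass
class Click:
    url: str
    dwell_seconds: float  # time spent on the page before returning to the results


def summarize_clicks(clicks):
    """Tally clicks per URL and credit the longest-dwell click in the session."""
    summary = {}
    for click in clicks:
        counts = summary.setdefault(
            click.url, {"goodClicks": 0, "badClicks": 0, "lastLongestClicks": 0}
        )
        if click.dwell_seconds < SHORT_CLICK_SECONDS:
            counts["badClicks"] += 1  # quick return to the results, i.e. pogo-sticking
        else:
            counts["goodClicks"] += 1
    if clicks:
        longest = max(clicks, key=lambda c: c.dwell_seconds)
        summary[longest.url]["lastLongestClicks"] += 1
    return summary


# Made-up session: one short click and two longer ones.
session = [
    Click("example.com/a", 4.0),
    Click("example.com/b", 95.0),
    Click("example.com/a", 40.0),
]
print(summarize_clicks(session))
```

In this toy session, example.com/a collects one good and one bad click, while example.com/b gets one good click plus the "longest click" credit.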

#2: Use of Chrome browser clickstreams to power Google Search

My anonymous source claimed that way back in 2005, Google wanted the full clickstream of billions of Internet users, and with Chrome, they’ve now got it. The API documents suggest Google calculates several types of metrics that can be called using Chrome views related to both individual pages and entire domains.

This document, describing the features around how Google creates Sitelinks, is particularly interesting. It showcases a call named topUrl, which is “A list of top urls with highest two_level_score, i.e., chrome_trans_clicks.” My read is that Google likely uses the number of clicks on pages in Chrome browsers to determine the most popular/important URLs on a site, which then feeds the calculation of which URLs to include in the sitelinks feature.

E.g., in the above screenshot from Google’s results, pages like “Pricing,” the “Blog,” and the “Login” pages are our most-visited, and Google knows this through their tracking of billions of Chrome users’ clickstreams.
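
A minimal sketch of that reading, assuming topUrl is simply the handful of URLs with the most Chrome clicks. The name chrome_trans_clicks is taken from the leaked description; the top_urls function, the limit of three, and the click counts are invented for illustration and say nothing about how two_level_score is really computed.

```python
def top_urls(click_counts, limit=3):
    """Return the `limit` URLs with the most clicks, highest first."""
    ranked = sorted(click_counts.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked[:limit]]


# Made-up Chrome click counts for a hypothetical site.
chrome_trans_clicks = {
    "/pricing": 900,
    "/blog": 1200,
    "/login": 700,
    "/careers": 40,
}
print(top_urls(chrome_trans_clicks))  # ['/blog', '/pricing', '/login']
```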

#3: Whitelists in Travel, Covid, and Politics

A module on “Good Quality Travel Sites” would lead reasonable readers to conclude that a whitelist exists for Google in the travel sector (unclear if this is exclusively for Google’s “Travel” search tab, or web search more broadly). References in several places to flags for “isCovidLocalAuthority” and “isElectionAuthority” further suggest that Google is whitelisting particular domains that are appropriate to show for highly controversial or potentially problematic queries.

For example, following the 2020 US Presidential election, one candidate claimed (without evidence) that the election had been stolen, and encouraged their followers to storm the Capitol and take potentially violent action against lawmakers, i.e. commit an insurrection.

Google would almost certainly be one of the first places people turned to for information about this event, and if their search engine returned propaganda websites that inaccurately portrayed the election evidence, that could directly lead to more contention, violence, or even the end of US democracy. Those of us who want free and fair elections to continue should be very grateful Google’s engineers are employing whitelists in this case.

#4: Employing Quality Rater Feedback

Google has long had a quality rating platform called EWOK (Cyrus Shepard, a notable leader in the SEO space, spent several years contributing to this and wrote about it here). We now have evidence that some elements from the quality raters are used in the search systems.

How influential these rater-based signals are, and what precisely they’re used for is unclear to me in an initial read, but I suspect some thoughtful SEO detectives will dig into the leak, learn, and publish more about it. What I find fascinating is that scores and data generated by EWOK’s quality raters may be directly involved in Google’s search system, rather than simply a training set for experiments. Of course, it’s possible these are “just for testing,” but as you browse through the leaked documents, you’ll find that when that’s true, it’s specifically called out in the notes and module details.

This one calls out a “per document relevance rating” sourced from evaluations done via EWOK. There’s no detailed notation, but it’s not much of a logic-leap to imagine how important those human evaluations of websites really are.

This one calls out “Human Ratings (e.g. ratings from EWOK)” and notes that they’re “typically only populated in the evaluation pipelines,” which suggests they may be primarily training data in this module (I’d argue that’s still a hugely important role, and marketers shouldn’t dismiss how important it is that quality raters perceive and rate their websites well).

#5: Google Uses Click Data to Determine How to Weight Links in Rankings

This one’s fascinating, and comes directly from the anonymous source who first shared the leak. In their words: “Google has three buckets/tiers for classifying their link indexes (low, medium, high quality). Click data is used to determine which link graph index tier a document belongs to. See SourceType here, and TotalClicks here.” In summary:

  • If Forbes.com/Cats/ has no clicks it goes into the low-quality index and the link is ignored
  • If Forbes.com/Dogs/ has a high volume of clicks from verifiable devices (all the Chrome-related data discussed previously), it goes into the high-quality index and the link passes ranking signals

Once the link becomes “trusted” because it belongs to a higher tier index, it can flow PageRank and anchors, or be filtered/demoted by link spam systems. Links from the low-quality link index won’t hurt a site’s ranking; they are merely ignored.
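
The source’s description can be sketched in a few lines. This is an illustration of the claim as stated, not Google’s implementation: the three tiers come from the source’s quote, but the click thresholds, the function names, and the choice to treat the medium tier as eligible to pass signals are all hypothetical assumptions.

```python
from enum import Enum


class LinkTier(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical cutoffs: the leak does not disclose any real thresholds.
MEDIUM_MIN_CLICKS = 1
HIGH_MIN_CLICKS = 1000


def classify_source(total_clicks):
    """Assign the linking page to a link index tier based on its click volume."""
    if total_clicks >= HIGH_MIN_CLICKS:
        return LinkTier.HIGH
    if total_clicks >= MEDIUM_MIN_CLICKS:
        return LinkTier.MEDIUM
    return LinkTier.LOW


def link_counts(total_clicks):
    """Links from low-tier pages are ignored rather than penalized.

    Assumption: medium-tier links are treated as eligible, like high-tier ones.
    """
    return classify_source(total_clicks) is not LinkTier.LOW


# Made-up click totals for the two example pages above.
for page, clicks in {"Forbes.com/Cats/": 0, "Forbes.com/Dogs/": 25000}.items():
    print(page, classify_source(clicks).value, link_counts(clicks))
```

Note that an eligible link is only a candidate: per the source, it can still be filtered or demoted by link spam systems afterward.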


Big Picture Takeaways for Marketers who Care About Organic Search Traffic

If you care strategically about the value of organic search traffic, but don’t have much use for the technical details of how Google works, this section’s for you. It’s my attempt to sum up much of Google’s evolution from the period this leak covers: 2005 – 2023, and I won’t limit myself exclusively to confirmed elements of the leak.

  1. Brand matters more than anything else
    Google has numerous ways to identify entities, sort, rank, filter, and employ them. Entities include brands (brand names, their official websites, associated social accounts, etc.), and as we’ve seen in our clickstream research with Datos, they’ve been on an inexorable path toward exclusively ranking and sending traffic to big, powerful brands that dominate the web, rather than small, independent sites and businesses.

    If there was one universal piece of advice I had for marketers seeking to broadly improve their organic search rankings and traffic, it would be: “Build a notable, popular, well-recognized brand in your space, outside of Google search.”
  2. Experience, expertise, authoritativeness, and trustworthiness (“E-E-A-T”) might not matter as directly as some SEOs think.
    The only mention of topical expertise in the leak we’ve found so far is a brief notation about Google Maps review contributions. The other aspects of E-E-A-T are either buried, indirect, labeled in hard-to-identify ways, or, more likely (in my opinion) correlated with things Google uses and cares about, but not specific elements of the ranking systems.

    As Mike noted in his article, there is documentation in the leak suggesting Google can identify authors and treats them as entities in the system. Building up one’s influence as an author online may indeed lead to ranking benefits in Google. But what exactly in the ranking systems makes up “E-E-A-T” and how powerful those elements are is an open question. I’m a bit worried that E-E-A-T is 80% propaganda, 20% substance. There are plenty of powerful brands that rank remarkably well in Google and have very little experience, expertise, authoritativeness, or trustworthiness, as HouseFresh’s recent, viral article details in depth.
  3. Content and links are secondary when user intention around navigation (and the patterns that intent creates) is present.
    Let’s say, for example, that many people in the Seattle area search for “Lehman Brothers” and scroll to page 2, 3, or 4 of the search results until they find the theatre listing for the Lehman Brothers stage production, then click that result. Fairly quickly, Google will learn that’s what searchers for those words in that area want.

    Even if the Wikipedia article about Lehman Brothers’ role in the financial crisis of 2008 were to invest heavily in link building and content optimization, it’s unlikely they could outrank the user-intent signals (calculated from queries and clicks) of Seattle’s theatre-goers.

    Extending this example to the broader web and search as a whole, if you can create demand for your website among enough likely searchers in the regions you’re targeting, you may be able to end-around the need for classic on-and-off-page SEO signals like links, anchor text, optimized content, and the like. The power of Navboost and the intent of users is likely the most powerful ranking factor in Google’s systems. As Google VP Alexander Grushetsky put it in a 2019 email to other Google execs (including Danny Sullivan and Pandu Nayak):

    We already know, one signal could be more powerful than the whole big system on a given metric. For example, I’m pretty sure that NavBoost alone was / is more positive on clicks (and likely even on precision / utility metrics) by itself than the rest of ranking (BTW, engineers outside of Navboost team used to be also not happy about the power of Navboost, and the fact it was “stealing wins”)

    Those seeking even more confirmation could review Google engineer Paul Haahr’s detailed resume, which states:

    “I’m the manager for logs-based ranking projects. The team’s efforts are currently split among four areas: 1) Navboost. This is already one of Google’s strongest ranking signals. Current work is on automation in building new navboost data;”
  4. Classic ranking factors: PageRank, anchors (topical PageRank based on the anchor text of the link), and text-matching have been waning in importance for years. But Page Titles are still quite important.
    This is a finding from Mike’s excellent analysis that I’d be foolish not to call out here. PageRank still appears to have a place in search indexing and rankings, but it’s almost certainly evolved from the original 1998 paper. The document leak insinuates multiple versions of PageRank (rawPagerank, a deprecated PageRank referencing “nearest seeds,” firstCoveragePageRank from when the document was first served, etc.) have been created and discarded over the years. And anchor text links, while present in the leak, don’t seem to be as crucial or omnipresent as I’d have expected from my earlier years in SEO.
  5. For most small and medium businesses and newer creators/publishers, SEO is likely to show poor returns until you’ve established credibility, navigational demand, and a strong reputation among a sizable audience.
    SEO is a big brand, popular domain’s game. As an entrepreneur, I’m not ignoring SEO, but I strongly expect that for the years ahead, until/unless SparkToro becomes a much larger, more popular, more searched-for and clicked-on brand in its industry, this website will continue to be outranked, even for its original content, by aggregators and publishers who’ve existed for 10+ years.

    This is almost certainly true for other creators, publishers, and SMBs. The content you create is unlikely to perform well in Google if competition from big, popular websites with well-known brands exists. Google no longer rewards scrappy, clever, SEO-savvy operators who know all the right tricks. They reward established brands, search-measurable forms of popularity, and established domains that searchers already know and click. From 1998 – 2018 (or so), one could reasonably start a powerful marketing flywheel with SEO for Google. In 2024, I don’t think that’s realistic, at least, not on the English-language web in competitive sectors.

Next Steps for the Search Industry

I’m excited to see how practitioners with more recent experience and deeper technical knowledge go about analyzing this leak. I encourage anyone curious to dig into the documentation, attempt to connect it to other public documents, statements, testimony, and ranking experiments, then publish their findings.

Historically, some of the search industry’s loudest voices and most prolific publishers have been happy to uncritically repeat Google’s public statements. They write headlines like “Google says XYZ is true,” rather than “Google Claims XYZ; Evidence Suggests Otherwise.”

The SEO industry doesn’t benefit from these kinds of headlines.

Please, do better. If this leak and the DOJ trial can create just one change, I hope this is it.

When those new to the field read Search Engine Roundtable, Search Engine Land, SE Journal, and the many agency blogs and websites that cover the SEO field’s news, they don’t necessarily know how seriously to take Google’s statements. Journalists and authors should not presume that readers are savvy enough to know that dozens or hundreds of past public comments by Google’s official representatives were later proven wrong.

This obligation isn’t just about helping the search industry—it’s about helping the whole world. Google is one of the most powerful, influential forces for the spread of information and commerce on this planet. Only recently have they been held to some account by governments and reporters. The work of journalists and writers in the search marketing field carries weight in the courts of public opinion, in the halls of elected officials, and in the hearts of Google employees, all of whom have the power to change things for the better or ignore them at our collective peril.


Thank you to Mike King for his invaluable help on this document leak story, to Amanda Natividad for editing help, and to the anonymous source who shared this leak with me. I expect that updates to this piece may arrive over the next few days and weeks as it reaches more eyeballs. If you have findings that support or contradict statements I’ve made here, please feel free to share them in the comments below.