In May, Google rolled out what it calls the “Knowledge Graph”, a collection of information culled from a number of outside sources that appears in the right frame of the search page for many queries.
A major informational source for Google’s Knowledge Graph is Wikipedia. Wikipedia was already shown to be highly visible in the SERPs before Knowledge Graph, appearing on page one for 6 out of 10 informational keywords. With the launch of Knowledge Graph, its content now appears even more prominently at the top of the SERP for many informational queries, so we were curious to see how in sync Knowledge Graph results are with the actual Wikipedia entries.
Specifically, given that Wikipedia is user edited and therefore a fluid, ongoing source of information, we wanted to know what percentage of Knowledge Graph entries differed from their most recent Wikipedia entry. And when the two do differ, how far behind the most recent entry is Google?
We hypothesized that ‘active’ queries—those queries that have experienced a recent spike in search activity—would show a higher mismatch rate than ‘normal’ keywords since a spike in search volume would stem from recent real world events around the subject, resulting in both more frequent and more recent Wikipedia editing.
For example, LeBron James’s Wikipedia entry would have recently been edited to reflect his winning the NBA championship. The mismatch and lag between Knowledge Graph and Wikipedia therefore directly affect users: substantial lag means searchers will not see information in Google’s Knowledge Graph that reflects recent events about the subject.
Two Distinct Groups Evaluated
To measure the mismatch rate between the Knowledge Graph and Wikipedia for trending and ‘normal’ keywords, we built two keyword lists of 50 queries each. Although a sample size of several hundred would have been ideal, a portion of the analysis was fairly manual and the data was therefore time consuming to collect. To ensure uniformity in the analysis and to select samples that are likely to have both a Wikipedia and a Knowledge Graph entry, we focused on ‘people’ keywords:
- High Activity: The Top 50 ‘people’ in Google Trends and Google Insights
- Low Activity: The Top 50 on Forbes’ Celebrity 100 list
For each query, we compared the Knowledge Graph result on the SERP to its Wikipedia entry and noted whether or not it was an exact match.
When they did not match, we measured the lag distribution of the mismatched queries by using WikiBlame to determine when the change occurred and, subsequently, the number of days the Knowledge Graph was behind.
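The bookkeeping behind this comparison can be sketched in a few lines of Python. This is only an illustration of the method described above, not the tooling we actually used: the `Observation` record, its field names, the helper functions, and the sample rows (placeholder queries and made-up dates) are all assumptions introduced for the example. The edit date is assumed to have been looked up by hand, for instance with WikiBlame.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Observation:
    """One manually checked query (hypothetical record layout)."""
    query: str
    group: str                        # "high" or "low" activity
    checked_on: date                  # day the SERP was inspected
    edit_date: Optional[date] = None  # date of the Wikipedia edit missing from
                                      # the Knowledge Graph; None = exact match

    @property
    def matches(self) -> bool:
        return self.edit_date is None

    @property
    def lag_days(self) -> Optional[int]:
        """Days the Knowledge Graph is behind, or None for an exact match."""
        if self.edit_date is None:
            return None
        return (self.checked_on - self.edit_date).days


def mismatch_rate(observations: List[Observation], group: str) -> float:
    """Share of queries in a group whose Knowledge Graph text differs."""
    in_group = [o for o in observations if o.group == group]
    if not in_group:
        return 0.0
    return sum(not o.matches for o in in_group) / len(in_group)


def share_lagging(observations: List[Observation], min_days: int) -> float:
    """Among mismatched queries, the share that lag by at least min_days."""
    lags = [o.lag_days for o in observations if not o.matches]
    if not lags:
        return 0.0
    return sum(lag >= min_days for lag in lags) / len(lags)


# Placeholder rows with made-up dates, for illustration only.
sample = [
    Observation("person a", "high", date(2012, 7, 10)),
    Observation("person b", "high", date(2012, 7, 10), edit_date=date(2012, 7, 7)),
    Observation("person c", "low", date(2012, 7, 10)),
    Observation("person d", "high", date(2012, 7, 10), edit_date=date(2012, 7, 9)),
]

high_rate = mismatch_rate(sample, "high")
two_day_share = share_lagging(sample, min_days=2)
```

Storing only the edit date and the check date keeps the record close to what was collected by hand; the match flag and the lag are both derived from those two fields.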
1 out of 5 High Activity Queries Do Not Match
Looking at the results, we see that our hypothesis seems to hold up. High activity queries, whose Wikipedia entries are likely to change more frequently, are mismatched far more often than low activity queries: one out of five (20%) did not match, compared to 4% of low activity queries.
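To put those percentages in terms of actual queries, the short Python sketch below converts them back into counts. It assumes the reported rates are exact over the 50 queries in each list; the variable names are ours, introduced only for this illustration.

```python
SAMPLE_SIZE = 50  # queries per list, as described above

# Mismatch rates reported in this section.
reported_rate = {"high": 0.20, "low": 0.04}

# Number of mismatched queries each rate implies for a list of 50.
implied_count = {
    group: round(rate * SAMPLE_SIZE) for group, rate in reported_rate.items()
}

# How many times more often high activity queries were mismatched.
times_more_often = implied_count["high"] / implied_count["low"]

# With 50 queries per list, one query moves the rate by this many points.
points_per_query = 100 / SAMPLE_SIZE
```

With lists of this size, each individual query is worth two percentage points, which is worth keeping in mind when reading the gap between the two groups.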
Half of Mismatched Queries Lag By Two Days or More
When we dig deeper into the size of the lag between Knowledge Graph and Wikipedia for mismatched queries, we see that half of the mismatched queries are two or more days behind. This finding may surprise readers even more than the percentage of mismatches and may ultimately say something about Google’s Knowledge Graph infrastructure (e.g. the frequency with which they can refresh data from Wikipedia).
Google Can Do Better
Our analysis of both low and high activity queries tells us that Google and Wikipedia are mismatched for a substantial one out of five high activity queries. And when they are mismatched, Google is behind by two or more days half the time. The implication is that searchers may not be seeing the most relevant information for their query. For some context, in our LeBron James example, this means his Knowledge Graph entry could have been without reference to his recent championship for up to four days.
While a real time Wikipedia update may ultimately not be practical, if Google is indeed positioning Knowledge Graph as the future of search, we have to believe that they can do better than the 2-4 day lag many of their mismatched keywords currently reflect.