Thursday, July 10, 2014

How does Google Search really work?

Find out the innards of how Google Search functions and how your business appears in the results.

I recently wrote an article about change management in the data center and needed to provide some examples of related software, so I consulted Google to see what I could find. Evolven was one of the top listed results so I examined their products and included a reference in the article.

That got me thinking later: "How exactly was Evolven near the top of the search results? Is it because they're most closely identified with change management software? Do they have the highest customer satisfaction? Are more people searching for them than other companies? Did they recently update their website? How does this Google Search stuff really work, anyhow? Why am I asking myself this when I can go find out on Google?" And so I did.

Google provides what is inarguably the most popular and effective search engine (the only one which has become a verb, such as when people say they will "Google" a topic to find out more). Many of us take it for granted that Google Search just works and the results are valid without bothering to wonder what's going on behind the curtain or how things are arranged. However, since you're reading this on TechRepublic, chances are you're the type of person who likes to see what's behind that curtain, especially if you own or work in a business on which Google returns search results for the public. Let's pull it back, shall we?

Figure A


Note there are related elements to Google Search, such as Instant Search and the "I'm Feeling Lucky" function, but for the purpose of this article I'll focus on the underlying structure and how results are returned.

Let's start with the basics. According to Wikipedia, Google Search was created in 1997 by Larry Page and Sergey Brin and it now performs more than three billion daily searches for users. These searches are conducted across 60 trillion web pages using an index (a directory of data) which Google reports is 95 petabytes in size - approximately 100 million gigabytes!

Figure B


Google provides an interesting explanation of the search process which states that they use special software (known as "Googlebot") running on a large number of computers to crawl the web, following links "from page to page" (this reminds me of the eerily efficient spider robots from the 2002 film "Minority Report" - perhaps it's more comforting to think of the Minions from "Despicable Me"). Googlebot starts from its last crawl status and busily looks for new sites, changes to current sites and invalid links.
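To make the "page to page" idea concrete, here's a toy sketch in Python of a breadth-first crawl. The four-page in-memory "web" and all the names here are my own illustration, not anything from Google's actual implementation, but the loop captures the basic motion of a crawler: visit a page, extract its links, and queue up any pages it hasn't seen yet.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# A tiny made-up "web": page name -> HTML content.
WEB = {
    "home": '<a href="about">About</a> <a href="blog">Blog</a>',
    "about": '<a href="home">Home</a>',
    "blog": '<a href="home">Home</a> <a href="post1">Post</a>',
    "post1": "No links here.",
}

def crawl(start):
    """Breadth-first crawl: follow links page to page, skipping pages already seen."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        page = queue.popleft()
        visited.append(page)
        parser = LinkExtractor()
        parser.feed(WEB.get(page, ""))
        for link in parser.links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

print(crawl("home"))  # → ['home', 'about', 'blog', 'post1']
```

A real crawler adds politeness delays, robots.txt checks, and change detection on top of this skeleton, but the core traversal is the same.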

Google stresses they do not accept money to favor one page or another by crawling it more often, but site owners can specify to some extent how their sites are crawled (or whether they are crawled at all). For instance, they can prevent summaries from appearing in results or keep their sites from being cached on Google servers.
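Those controls use the standard robots exclusion mechanisms. As a sketch (the `/private/` path is just an example), a site owner might combine a robots.txt rule with per-page meta tags:

```text
# robots.txt at the site root: keep Googlebot out of /private/
User-agent: Googlebot
Disallow: /private/

<!-- Per-page robots meta tags placed in the HTML <head>: -->
<meta name="robots" content="nosnippet">   <!-- no summary shown in results -->
<meta name="robots" content="noarchive">   <!-- no cached copy on Google -->
```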

Now that they've scouted out the territory, Google's web crawlers report on the pages they visited, and that ginormous 95 petabyte index is updated. To put things in perspective, this is 95,000,000,000,000,000 bytes in size - almost twice the amount of information ever written by mankind. That's not even all they could potentially index, however - Google says "Googlebot can process many, but not all, content types. For example, we cannot process the content of some rich media files or dynamic pages."

A Google search doesn't just dive into this index and fish around for what it needs. That would take a long time and return a lot of garbage. Several factors are used to present the most relevant search results, and this is where the "Coca Cola recipe" lies. Some of these factors are known and others are kept confidential to thwart malcontents who might try to unfairly rig the system (read: spammers and other scum and villainy).

What are the known factors?

- Type of content (how relevant the data on the site is to the search terms)
- Quality of content (spell check is used to separate professional sites from sloppy wannabes)
- Freshness of content (sites from 1996 are less likely to be returned before sites from 2013)
- The user's region (no sense returning webpages in another language)
- Legitimacy of the site (whether the page is deemed likely to be spam-related)
- Name and address of the website
- Search word synonyms
- Social media promotions
- How many links point to a particular web page
- The value of those links

These last two involve a crucial process called "PageRank." PageRank assigns each web page a score based on the links pointing to it: links from important or "higher authority" sites (high-traffic, well-established pages) count for more. Pages with higher scores are then presented higher in the search result list, helping the searcher hit the right target.

Figure C


Note that PageRank is intelligent enough to weigh quality as well as quantity. If one site has 5 high-quality links to it from important sites and another has 10 low-quality links from unimportant sites, the first site will end up with a higher PageRank score.

For instance, if the New York Times contains several links to my writing blog, my blog will be given a higher PageRank than if one person, Joe-Bob Taylor of Huckaloosa, Arkansas, links to my blog from his nearly extinct fishing webpage (sorry, Joe-Bob!). This means if someone searches for writing topics and my blog contains relevant information that meets their criteria, they're more likely to see my page in the results thanks to those New York Times links.

Similarly, in the case of Evolven, I would guess their site came up among the top search results for "change management software" because of the relevance and freshness of their content, the links pointing to their site, and the importance of the sites providing those links.

It sounds simple, but there's still way more behind that curtain. Wikipedia's entry on PageRank presents a "simplified" algorithm:

Figure D

PR(u) = sum of PR(v) / L(v) over every page v that links to u,
where L(v) is the number of outbound links on page v.

If you like equations this page has them a'plenty!
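As a rough illustration of how that formula plays out in practice, here is a toy Python implementation using the common damped, iterative form of PageRank. The four-page link graph is made up by me (a "times" page, my "blog," a "news" page, and Joe-Bob's "fishing" page), and real PageRank involves far more machinery, but even this sketch shows the blog with authoritative inbound links outranking the fishing page.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute simplified PageRank.

    links maps each page to the list of pages it links to.
    Each iteration, every page shares its current score
    equally among its outbound links: PR(v) / L(v).
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score...
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        # ...and receives a share of the score of each page linking to it.
        for page, outgoing in links.items():
            if not outgoing:
                continue
            share = damping * rank[page] / len(outgoing)  # PR(v) / L(v)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

# A made-up four-page web: who links to whom.
links = {
    "times": ["blog", "news"],
    "news": ["times"],
    "fishing": ["blog"],
    "blog": ["times"],
}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))  # highest-scored pages first
```

Because "blog" is linked from the well-connected "times" page while "fishing" has no inbound links at all, "blog" ends up with a much higher score, exactly the quality-over-quantity behavior described above.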

Naturally, there have been attempts to raise website PageRank scores through sneaky tactics such as Google bombs and link farms. Getting Joe-Bob and all his siblings and cousins to link to your site won't help much either, as we've seen. However, there are legitimate ways to improve your website's PageRank score. That's where search engine optimization (SEO) comes in. In my next article I'll discuss how SEO works and cover principles you can apply to your business.
