Thursday, April 26, 2012

How Google Search Works, In a Nutshell

Unless you're an SEO expert, it's no surprise if you're mystified by the inner workings of Google search. How on Earth does Google decide how to rank the pages on your website? If this question has ever crossed your mind, keep reading. This post should simplify things for you.

Yesterday, Google's head of web spam, Matt Cutts, published a video on the GoogleWebmasterHelp YouTube channel called "How does Google search work?" In the video, Cutts addresses the following question he received in the Google Webmaster Help Forum:

Hi Matt, could you please explain how Google's ranking and website evaluation process works starting with the crawling and analysis of a site, crawling timelines, frequencies, priorities, indexing and filtering processes within the databases etc. - RobertvH, Munich

"So that’s basically just like, tell me everything about Google. Right?" Cutts chuckles.

All kidding aside, this isn't an unreasonable question -- but it's not an easy one to answer, either. The Google search ranking algorithm is a big, hairy beast, taking into account a variety of factors (over 200, in fact) to deliver the best results to Google searchers. But sometimes the most helpful explanation is the simplest one. As Cutts states in the video, he could spend hours and hours talking about how Google search works, but he was nice enough to parse it into the following 8-minute video. Without further ado, here's the video, accompanied by a written breakdown of what Cutts says in the video (via the transcript provided by Search Engine Land), from crawling to indexing to ranking by Google search:

How Google Crawls the Web

Crawling Then

In the video, Cutts explains how Google used to crawl the web, which was a long, drawn-out process. Google would crawl for 30 days -- that's right, a full month! Afterward, Google would take about a week to index what it found, and then it'd take another week to push that data out through the search engine. "And so that was what the Google dance was," says Cutts.

Sometimes Google would find a data center that had old data, and sometimes it'd hit one containing new data. To make this more efficient, after Google crawled for 30 days, it'd recrawl pages with a high PageRank (a Google ranking factor that scores web pages by how many other pages link to them, and how reputable those linking pages are), such as the CNN homepage, to see if anything new or important had been published. But Cutts admits that, for the most part, this was not a great process, since search results would quickly become out of date given the 30-day crawl time.
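
To make the PageRank idea above concrete, here's a minimal sketch of the classic power-iteration computation on a made-up four-page link graph. The graph, the 0.85 damping factor, and the iteration count are illustrative assumptions, not anything from Cutts' video:

```python
# Toy PageRank: a page's score comes from the scores of the pages linking to it.
# Hypothetical 4-page link graph; 0.85 is the damping factor from the original
# PageRank formulation.
links = {
    "A": ["B", "C"],   # A links out to B and C
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85

for _ in range(50):  # power iteration until the scores settle
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

# C is linked to by three pages, so it ends up with the highest score.
best = max(rank, key=rank.get)
print(best)  # C
```

Notice that page C wins not just because it has the most inbound links, but because its score also feeds back through the pages linking to it -- the "how reputable those linking pages are" part of the definition.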

Crawling Now

Today, things are a bit different. Cutts notes that Google still uses PageRank as the primary factor in deciding what to crawl, and when. The higher your web page's PageRank, the more likely Google will discover that page relatively early in the crawl. For example, crawling in strict PageRank order, Google would find the CNNs and New York Times of the world, along with other very high PageRank websites, first.

In 2003, Cutts remarks, as part of an update called Update Fritz, Google switched to crawling a significant chunk of the web every day. Google broke the web into segments and crawled one segment at a time, refreshing each of them nightly. In other words, at any given point, Google's main base index would only be so out of date, because Google would loop back around and refresh it with the newly crawled pages. This was a much more efficient way to crawl since, rather than waiting for everything to finish, Google was incrementally updating its index. "And we’ve gotten even better over time," says Cutts. At this point, Google search has gotten very fresh; whenever there are updates, Google can usually find them very quickly.

As a comparison, Cutts talks about how in the early days of Google, it'd have a supplemental index in addition to the main/base index. The supplemental index was something Google wouldn't crawl and refresh quite as often, but it consisted of a lot more web pages. So essentially, Google had a really fresh layer -- the main index -- plus a much larger set of pages in the supplemental index that weren't refreshed nearly as often.

How Google Indexes Web Pages

After Google crawls the web, it indexes the pages it finds. So say Google has crawled a large fraction of the web, and within that portion of the web, it's looking at each web page. To explain how indexing is done, Cutts uses the search term 'Katy Perry' as an example:

"In a document, Katy Perry appears right next to each other. But what you want in an index is which documents does the word Katy appear in, and which documents does the word Perry appear in? So you might say Katy appears in documents 1, and 2, and 89, and 555, and 789. And Perry might appear in documents number 2, and 8, and 73, and 555, and 1,000. And so the whole process of doing the index is reversing, so that instead of having the documents in word order, you have the words, and they have it in document order."

In other words, what indexing says is, "Okay, these are all the web pages a search term appears in."
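
The "reversing" Cutts describes can be sketched in a few lines of code. The tiny document collection below is made up for illustration (reusing some of the doc IDs from his example); the technique -- building an inverted index that maps each word to the documents containing it -- is exactly what the quote describes:

```python
# A minimal sketch of the inverted index Cutts describes: instead of storing
# each document's words in order, store each word with the list of documents
# it appears in.
docs = {
    1: "katy sings tonight",
    2: "katy perry on tour",
    89: "katy in the news",
    555: "new katy perry single",
}

inverted_index = {}
for doc_id, text in docs.items():
    for word in text.split():
        postings = inverted_index.setdefault(word, [])
        if doc_id not in postings:  # one entry per document, even if a word repeats
            postings.append(doc_id)

print(inverted_index["katy"])   # [1, 2, 89, 555]
print(inverted_index["perry"])  # [2, 555]
```

Because the documents are processed in ID order, each word's posting list comes out sorted by document number -- the "words ... in document order" layout from the quote.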

How Google Ranks Web Pages in Search Results

The last piece of the puzzle is how Google ranks which pages appear for the search terms someone types into Google. In the video, Cutts continues his Katy Perry example.

So for instance, if someone visits Google.com and types 'Katy Perry,' Google thinks, "Okay, which web pages match 'Katy Perry'?" If web page 1 has 'Katy' but not 'Perry,' it's out of the running, and so is web page 8, which has 'Perry' but not 'Katy.' Web page 2 has both 'Katy' and 'Perry,' so it's a possibility. Web pages 89, 789, 73, and 1,000 would also be out because they don't have the right combination of words, while web page 555, which has both, is still in as well.

So when searchers visit Google.com and type in whatever their search term is, whether it's 'Chicken Little,' 'Britney Spears,' 'Matt Cutts,' 'Katy Perry,' or something else, Google finds the pages it believes contain those words, either on the page itself, in backlinks to the page, or in anchor text pointing to the page. Once Google has completed what is called 'Document Selection,' it tries to figure out how to rank those pages. It's a tricky thing, says Cutts, since Google takes into consideration PageRank as well as over 200 other factors when deciding whether a web page is really authoritative, and how to rank it.
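
Document Selection over an inverted index boils down to intersecting posting lists. A sketch using the doc IDs from Cutts' example (the two-pointer merge is a standard technique, not something the video spells out):

```python
# Document selection as a posting-list intersection: only documents that
# contain every query word survive. Doc IDs are from Cutts' Katy Perry example.
katy_docs  = [1, 2, 89, 555, 789]
perry_docs = [2, 8, 73, 555, 1000]

def intersect(a, b):
    """Intersect two sorted posting lists with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer at the smaller doc ID
        else:
            j += 1
    return out

print(intersect(katy_docs, perry_docs))  # [2, 555]
```

Because both lists are already sorted by document number, the merge runs in linear time without building any temporary sets -- which matters when the posting lists span billions of pages.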

For example, one page may have a good reputation because it has a high PageRank, but it also may only have the word 'Perry' in it once, and it might just happen to have the word 'Katy' somewhere else on the page. On the other hand, there might be a page that has the words 'Katy' and 'Perry' right next to each other (so it has proximity), and the page also has a good reputation with a lot of links pointing to it.
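
The trade-off Cutts describes could be sketched as a weighted blend of an authority signal and a relevance signal. Everything here -- the pages, PageRank values, 50/50 weights, and the simple adjacency check -- is a hypothetical illustration, not Google's actual formula:

```python
# Hypothetical scoring sketch: blend authority (PageRank) with a crude
# relevance signal (whether the query words appear right next to each other).
def score(page, query_terms):
    words = page["text"].lower().split()
    relevance = 0.5  # both terms present somewhere on the page
    for i in range(len(words) - 1):
        if words[i] == query_terms[0] and words[i + 1] == query_terms[1]:
            relevance = 1.0  # terms are adjacent: proximity bonus
            break
    return 0.5 * page["pagerank"] + 0.5 * relevance

high_pr_scattered = {"text": "perry writes about katy elsewhere", "pagerank": 0.9}
adjacent_terms    = {"text": "katy perry announces a new album", "pagerank": 0.6}

q = ["katy", "perry"]
# The page with the terms side by side outscores the higher-PageRank page.
print(score(adjacent_terms, q) > score(high_pr_scattered, q))  # True
```

With these particular weights, proximity wins out over raw reputation -- which is the balance the next paragraph describes, even though the real algorithm weighs far more than two signals.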

Therefore, Google tries to balance that -- relevancy and authority -- to surface reputable pages that are also about what the searcher is looking for. But it's not as simple as that, considering Google is taking those 200-plus factors into account.

