Interested in a specific aspect of how search engines work? Use the links below to skip to a specific section within the article. If you want to know specifically about keywords check out this article on how search engines use keywords.
In this video, Matt Cutts from Google explains the basics of how Google works. We're going to go into a bit more detail than the video does, but it's a great primer for the content.
As mentioned in the video, Google crawls the web using a bit of code called a 'spider'. This is a small program that follows links from one page to the next; each page it lands on is copied and passed on to the servers. The web (hence spider) is huge, and if Google were to keep a record of all the content it found, it would be unmanageable. This is why Google only records the page code and will dump pages it doesn't think are useful (duplicates, low value, etc.).
Spiders work in a very specific way, hopping from link to link to discover new pages. This is why content that is not linked to won't get indexed. When a new domain is encountered, the spider will first look for the robots.txt file at the root of the domain.
Any messages you have for the spider, such as what content you want to be indexed or where to find your sitemap, can be left on this page. The spider should then follow these instructions. However, it doesn't have to. Google's spiders are generally well behaved, though, and will respect the commands left here.
You can find out more about how robots.txt works here, where we cover some of the more technical aspects of SEO.
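To make the idea concrete, here is a minimal sketch of how a well-behaved spider might read those instructions, using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent and the example.com URLs are invented for illustration; this is not how Google's own crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite spider checks each URL before fetching it.
allowed = parser.can_fetch("ExampleBot", "https://www.example.com/recipes/chocolate-cake")
blocked = parser.can_fetch("ExampleBot", "https://www.example.com/private/notes")

# Sitemap locations listed in the file (site_maps() needs Python 3.8+).
sitemaps = parser.site_maps()
```

Nothing forces a spider to call `can_fetch` at all, which is why the instructions are a request rather than a guarantee.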
The spider itself is a small, simple program. There are lots of open source versions which you can download and let loose on the web yourself for free. As vital as it is to Google, finding the content is not the clever bit. That comes next.
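As a rough illustration of how simple the link-following part can be, the sketch below crawls a tiny in-memory "web" rather than real sites. The `FAKE_WEB` pages, the `LinkExtractor` class and the `crawl` function are all made up for this example, and a real spider would also need politeness delays, error handling and URL normalisation.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical site: path -> HTML. Nothing links to "/orphan".
FAKE_WEB = {
    "/": '<a href="/recipes">Recipes</a> <a href="/about">About</a>',
    "/recipes": '<a href="/recipes/chocolate-cake">Chocolate cake</a>',
    "/recipes/chocolate-cake": '<a href="/">Home</a>',
    "/about": "No links here.",
    "/orphan": "Nothing links to this page.",
}


def crawl(start, fetch):
    """Breadth-first crawl: follow links from page to page, visiting each page once."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page does not exist
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


crawled = crawl("/", FAKE_WEB.get)
```

Because no page links to `/orphan`, the crawl never reaches it, which is the same reason unlinked content doesn't get indexed.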
When you have a large amount of content you need a way to shortcut to that content. Google can't just have one big database containing all the pages, which they sort through every time a query is entered. It would be way too slow. Instead, they create an index which essentially shortcuts this process. Search engines use technology such as Hadoop to manage and query large amounts of data very quickly. Searching the index is far quicker than searching the entire database each time.
Common words such as 'and', 'the', 'if' are not stored. These are known as stop words. They don’t generally add to the search engine's interpretation of the content (although there are exceptions: “To be or not to be” is made up of stop words) so they are removed to save space. It might be a very small amount of space per page, but when dealing with billions of pages it becomes an important consideration. This kind of thinking is worth bearing in mind when trying to understand Google and the decisions it makes. A small per page change can be very different at scale.
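The index and stop word ideas can be sketched together as a toy inverted index: a mapping from each word to the pages that contain it. The stop word list, the `PAGES` data and the function names here are assumptions chosen for illustration, and real search indexes are far more sophisticated.

```python
import re
from collections import defaultdict

# A tiny, hypothetical stop word list; real lists are longer.
STOP_WORDS = {"and", "the", "if", "a", "of", "for", "is"}


def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def build_index(pages):
    """Map each remaining word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every non-stop word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Made-up page contents.
PAGES = {
    "/chocolate-cake": "The best chocolate cake recipes for beginners",
    "/carrot-cake": "Carrot cake recipes and tips",
    "/brownies": "Chocolate brownies if you are short of time",
}

index = build_index(PAGES)
hits = search(index, "chocolate cake recipes")
```

A query only touches the entries for its own words instead of scanning every page, and stop words never take up space in the index at all.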
The content has now been indexed: Google has taken a copy of it and placed a shortcut to the page in the index. Great, it can now be found and displayed when it matches a relevant search query. Each search you make in Google will likely have thousands of results, so now Google needs to decide what order it's going to display the results in. This is really at the heart of SEO - adjusting factors to manipulate the order of results.
Google decides which result goes where through the algorithm. An algorithm is a generic term which means a process or rule-set that's followed in order to solve a problem. In reference to Google, this is the set of weighted metrics which determines the order in which it ranks pages.
The Google algorithm is not the mystery it once was and the individual factors and metrics which it is made up of are fairly well documented. We know what all the major on-page and off-page metrics are. The tricky bit is in understanding the weighting or correlation between them.
If you search for 'chocolate cake recipes', the algorithm will weight each page against that search term.
Let's take a simplified look at two metrics and how they might influence each other.
Metric 1 is the URL. The keywords might appear in the URL, such as: www.recipes.com/chocolate-cake
Google can see the keywords 'chocolate cake' and 'recipes' in the URL so it can apply a weighting accordingly.
Now on to Metric 2, the backlinks for the page. Lots of these might have the keywords 'chocolate cake' and 'recipes' in them. However Google would then down-weight this metric because if the keywords appear in the URL you would expect them to appear in the backlinks, relevant or not. Conversely, Google might choose to apply more weight to Metric 2 if the keywords didn't appear anywhere in the URL.
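The interaction between the two metrics might be sketched like this. Every number and name in the function is invented purely to show one weight depending on another metric; none of them reflect real Google values.

```python
def score_page(url_match: bool, anchor_match_ratio: float) -> float:
    """Toy score combining a URL metric and a backlink anchor-text metric.

    url_match: whether the search keywords appear in the URL.
    anchor_match_ratio: fraction of backlinks (0.0 to 1.0) whose anchor
        text contains the keywords.
    All weights below are hypothetical.
    """
    url_weight = 1.0
    # Keyword-rich anchors are expected when the URL already contains
    # the keywords, so they count for less in that case.
    anchor_weight = 0.5 if url_match else 1.5

    url_score = url_weight if url_match else 0.0
    return url_score + anchor_weight * anchor_match_ratio


# The same backlink profile is worth less when the URL already matches.
with_url = score_page(url_match=True, anchor_match_ratio=0.8)
without_url = score_page(url_match=False, anchor_match_ratio=0.8)
anchor_part_with_url = with_url - 1.0   # contribution from anchors only
anchor_part_without_url = without_url   # no URL contribution here
```

The point is only that the weight of Metric 2 is not fixed: it moves depending on what Metric 1 found.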
All the different factors Google looks at affect each other. Each one may be worth more or less (in the weighting) and the relationship between them is constantly shifting. Google issues hundreds of updates every year, constantly tweaking this. Most of these updates change the relationships and weightings rather than the metrics themselves. When the metrics themselves do change, it is usually in a more major update, such as Penguin or Panda.
The different metrics can be broken down into four key sections: relevance, authority, trust and usability.
How relevant is the content to the query? The indexer is the first test of this, determining if a page should appear in the results at all. However, this is taken a step further in order to rank the results. It makes sense that when searching for something, you want to see the most relevant results possible.
Relevance is determined by a mix of on-page and off-page factors. Both of these focus on the placement of keywords, such as in page titles and anchor text. Some metrics are a combination of these. For example, if the domain as a whole is seen to be relevant to the search term, this is going to boost the relevancy score of the individual page being scored. If you want to find out more about this I recommend reading my article 'How search engines use keywords'.
Authority has its roots in PageRank, invented by Larry Page (hence the name). It’s the backbone of how Google ranks content. Understanding PageRank is part of the key to understanding how Google works, but it’s worth remembering that there are hundreds of additional factors which can also affect ranking, and PageRank is less important than it was in the past.
PageRank is often explained in terms of votes. Each link to a page is a vote, and the more votes a page has, the better it should rank. If a page with a lot of votes links to another page, then some of that voting power is also passed on. So even if a page only has one link, if that link is from a page which has a lot of votes, it may still rank well, and the pages it links to will also benefit from that. The value passed from page to page via links is known as link juice or page juice.
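The voting idea can be sketched with the simplified textbook form of PageRank, computed by repeatedly passing each page's score along its links. The page names, the link graph and the damping factor of 0.85 are assumptions for illustration, and this is not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated vote passing.

    links maps each page to the list of pages it links to. Every
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its voting power between its links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three fans vote for "popular", which casts
# its single vote for "lucky". "plain" gets one link from an ordinary page.
LINKS = {
    "fan1": ["popular", "plain"],
    "fan2": ["popular"],
    "fan3": ["popular"],
    "popular": ["lucky"],
    "lucky": [],
    "plain": [],
}

ranks = pagerank(LINKS)
```

Both `lucky` and `plain` have exactly one inbound link, but `lucky` ends up with the higher score because its single vote comes from a page that itself has a lot of votes.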
Relevance is also important in the context of authority. A link from a relevant site with relevant anchor text may pass on more weight than a link from an unrelated site with unrelated anchor text, which Google is more likely to disregard in the context of that search result.
Trust is, in effect, an anti-spam algorithm, focused on making it harder to artificially manipulate the search results. Google has a love-hate relationship with SEO and the trust mechanism is part of it. On the one hand, lots of SEO is about creating great content and user experience. On the other, it's also about trying to artificially manipulate what Google has determined as the natural order of the results.
Trust metrics are very hard to manipulate and they give Google greater confidence in the other metrics. Things like the age of the content or the domain are trust metrics. If you have lots of links from 'bad neighbourhoods' (think red light district), these links are not only going to be worthless but will also make Google think twice about ranking your site for that 'chocolate cake recipe' search. In the same way, if the page or domain links out to bad neighbourhoods, it's going to damage those trust metrics.
Google is actually a domain registrar, meaning they can see all the whois data for different domains. This allows them to incorporate information, such as how often a domain has changed hands or how long until the registration expires, into those trust metrics. These are much more difficult to manipulate.
Trust is also determined by the type of domain or page, and by the type of site that links to you. With the opposite effect to a bad neighbourhood, academic sites such as .edu domains carry high trust. Other domain types may also have a high trust score, making links from them more valuable.
Google wants the content it displays in its search results to be attractive to humans as well as search engine robots. There is a set of metrics which is dedicated just to these factors. Having great content but then, for instance, covering it in ads is not going to make for a great user experience. This is why Google will down-weight a page where the ad placement is overly prominent.
Page speed is another important factor; pages that load too slowly are an annoyance to searchers, causing people to click back to the search results and pick another page. Google wants people to keep using Google and so it's in their interest that the results they show load quickly. They measure page speed from the HTML but may also use Chrome user data.
If you're searching on a mobile phone, you will see a different set of results than if you were searching on a desktop computer. The actual results returned from the indexer (so at a low level) will be different. It's not just device type which affects the results you see, though. Google may choose to show results in an entirely different format depending on the search terms you use.
Localised searches are weighted differently and show in a different results page format to, for instance, product searches. You also have mixed media searches where Google may return results including videos and images. Some searches have dedicated results pages for a very narrow set of terms. These are commonly related to current events such as sports games or elections.
Another factor is personalisation. What you have previously searched for will influence the results that Google returns. There is a degree of machine learning at play here: where someone searches for one type of result consistently, Google will assume that future similar searches will be of the same nature. This is especially prominent for ambiguous searches, where one word has multiple meanings.