6.2: Search Engines



A search engine is a piece of software that helps you find things. As internet access has become near ubiquitous in the industrialized world, we have come to take for granted that any time we want to know something we can just ask the magic box. Understanding how to use the tool and understanding how the tool works are different things, though.

How They Work

A search engine does not go out and search the web each time you type in a word or phrase. It is a service that indexes, or stores, a huge amount of information about the contents of many websites. This information is stored in a database, which contains a list of all the words in all the web pages that the engine knows about. If you type in a keyword or phrase, e.g., ‘Mexican drug cartel,’ the search engine will consult its database and give you a list of links to sites that contain information about Mexican drug cartels.

These indexes are built and updated by web crawlers, programs that copy webpages and repeatedly check them for changes. Crawlers find new pages by following every link in the pages they have already copied, and the process just keeps on going. The end result is that a lot of webpages get indexed. In 2014, Google, the largest search engine in the world, estimated that it had indexed 35 trillion webpages. While this enormous number might make you feel like we’ve got the internet completely indexed, it represents only around 4% of the information that exists on the internet. When you hear people talk about the ‘deep web’ or the ‘dark web,’ they are talking about the unindexed portions of the internet (the difference being that the deep web is the part crawlers have not reached, and the dark web is the part that is being intentionally hidden).
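The crawl-then-index process described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the five-page `TOY_WEB`, its `.example` addresses, and the function names `crawl`, `build_index`, and `search` are all hypothetical, and a real search engine is vastly more elaborate. The sketch shows two ideas from the text: a crawler discovers pages only by following links, and a search consults the stored index rather than the live web.

```python
from collections import defaultdict, deque

# Hypothetical miniature "web": address -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("mexican drug cartel news", ["b.example", "c.example"]),
    "b.example": ("cartel history in mexico", ["a.example"]),
    "c.example": ("borscht recipes", ["d.example"]),
    "d.example": ("mexican food and recipes", []),
    "e.example": ("unlinked page about cartel trials", []),
}


def crawl(start_url, web):
    """Copy every page reachable from start_url by following links."""
    seen = set()
    queue = deque([start_url])
    copies = {}
    while queue:
        url = queue.popleft()
        if url in seen or url not in web:
            continue
        seen.add(url)
        text, links = web[url]
        copies[url] = text    # keep a copy of the page
        queue.extend(links)   # follow its links to find more pages
    return copies


def build_index(copies):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in copies.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


copies = crawl("a.example", TOY_WEB)
index = build_index(copies)
print(sorted(search(index, "cartel")))  # ['a.example', 'b.example']
```

Notice that `e.example` mentions cartels but never shows up in the results: no crawled page links to it, so the crawler never copies it and it stays out of the index. That is the toy version of an unindexed page.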

The list of sites you are given by the search engine is ordered by the engine’s ranking algorithm. The point of the algorithm is to organize the results of the search in an effort to get you to the information you want as quickly as possible, rather than handing you a random list of pages that contain the terms in any order. These algorithms are proprietary, so we don’t know all of the factors that go into the ranking or the way these factors are prioritized. It makes sense that these businesses want to keep the full details of their programs a secret from their competitors, although the major ones have shared some insights into the process. Microsoft’s Bing, for instance, includes click-through rates as a part of its algorithm (pages move up and down the results page based on how often they get clicked), while Google does not. Conversely, Google relies heavily on what it calls ‘clean backlinks’ (pages move up the ranking the more they are linked to by sites already trusted by Google, and down if they are linked to by disreputable sites), and there is no evidence that Bing cares about this.
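To make the backlink idea concrete, here is a minimal sketch of ranking by inbound links. It is not Google’s actual algorithm, which is proprietary; the scoring rule (plus one for a link from a trusted site, minus one for a link from a disreputable site), the function names, and the `.example` sites are all assumptions made up for illustration.

```python
def backlink_score(page, links, trusted, disreputable):
    """Toy score: +1 per link from a trusted site, -1 per link from a disreputable one."""
    score = 0
    for source, target in links:
        if target != page:
            continue
        if source in trusted:
            score += 1
        elif source in disreputable:
            score -= 1
    return score


def rank(pages, links, trusted, disreputable):
    """Order pages from highest toy score to lowest."""
    return sorted(
        pages,
        key=lambda page: backlink_score(page, links, trusted, disreputable),
        reverse=True,
    )


# Hypothetical link data: (linking site, linked-to page).
links = [
    ("news.example", "x.example"),
    ("uni.example", "x.example"),
    ("spam.example", "y.example"),
    ("news.example", "z.example"),
]
trusted = {"news.example", "uni.example"}
disreputable = {"spam.example"}

print(rank(["y.example", "z.example", "x.example"], links, trusted, disreputable))
# ['x.example', 'z.example', 'y.example']
```

Even in this toy version you can see why the details matter: change which sites count as trusted, or how much a click or a link is worth, and the same pages come back in a different order.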

Now all of this was likely more information than you wanted, but it’s important to understand how search engines work so you can make informed decisions about which search engine to use. Most of you likely default to using Google or Bing to conduct your searches. Do you have principled reasons for using the search engine that you do? Have you even thought about it? Well, now you know each search engine has its own index and its own algorithm, and that these can seriously impact the results you get. In the last chapter we discussed how, when an issue is important enough to you, the best safeguard you can employ is becoming an expert yourself. Unfortunately, this is going to be one of those issues. You might find it boring, but you need to learn a bit more about the ways the various algorithms work, or you won’t have good reasons to trust the results of your searches. We’re not here to tell you which search engine to use, but you should be making a considered decision on what to use, rather than just operating on default.

Some Additional Concerns

Search Engine Optimization

Another reason it’s important to know how various search engines work is that attempting to manipulate the results has become a big business. Most people won’t investigate below the first handful of listings in a search. Given this, it becomes very important for businesses to be as near the top of the list as possible (especially in large fields). This has given rise to search engine optimization, or SEO. SEO is the process whereby a website attempts to improve its ranking in searches. The way this is done is by leveraging what we know about the various algorithms. So, rather than creating their content and letting the algorithms work as intended, businesses are paying consulting firms to increase click-through rates, build clean backlinks, and perform all sorts of other maneuvers to trick the algorithm (and, as a result, you as well) into thinking they are the best place to obtain the information.

Privacy

You should also spend some time thinking about your online privacy. Remember that most search engines are businesses. You are not being charged to use the service they provide, which should tell you the search engine isn’t the product: you are. If you are logged into Google or Bing, then they are recording your search history. These companies like to say they don’t sell your personal information, and that’s true, but it isn’t the whole story. They don’t want to sell your information because what they are selling is the services that they provide with your data. The primary form of this is targeted ads. By understanding your viewing habits, these search companies can provide targeted ad services. You end up seeing more ads for things you are likely to be interested in, and as a result you are more likely to click on them and spend your money.

Your search history is by no means the only data these companies keep on you. Your whereabouts are tracked, as is your YouTube history (by Google), your video game habits (by Microsoft), the apps you use on your phone, and a myriad of other things you may not have realized. If you go to https://account.microsoft.com/account/privacy you can check all the data Microsoft has been collecting about you through Bing and other means. You can do the same for Google at https://myaccount.google.com/data-and-personalization. Both sites also offer you options to limit the data they collect and ways to delete already obtained information.

If any of this makes you a bit squeamish, then you might want to look into Startpage (https://www.startpage.com). Startpage is an alternative search engine that literally runs on Google’s results. Instead of offering users its own algorithm, what it offers is privacy. Startpage doesn’t record your IP address and it doesn’t use tracking cookies. So, if you like how Google’s algorithm works, but would like to avoid advertisements in your results and having your search history tracked and stored, you have a simple alternative. Qwant (https://www.qwant.com) does much the same thing, using Bing’s algorithm.

Global Perspectives

One final thing worth considering is that your understanding of search engines is going to be largely shaped by your background. As mentioned above, most Americans default to using Google or Bing. While these search engines exist all over the world, they make up a relatively small share of the market in many countries, and in some places, like China, they are actually banned. Understanding the other options out there and how they work can help you make better sense of how people in other places are getting their information. At times this might also help you find more accurate results. If you are looking to find the best borscht while visiting Moscow, you should probably be using Yandex and not Google.
