Librarian's Ultimate Guide to Search Engines
Published on Friday December 8th , 2006
Librarians were the ultimate search guides before search was re-invented with the web. They are trusted, credible sources for historical information, and pioneers and innovators in the taxonomy of information. Librarians witness, search for, find, organize and catalog knowledge. Online research and the power of the web have put information at everyone's fingertips, but the taxonomies and standards used for search will shape how people learn online and off for years to come. Below are some of the things librarians understand about search, and that anyone doing online research can benefit from.
Brief Recent History Of Search Engines
While there are many search engines, about 80-90% of the search market belongs to just a few: Google, Yahoo, and MSN, approximately in that order of decreasing use. There are a few other engines that are relatively popular, but some are white-labelled versions of the above. If you want to see a chart of approximate web traffic figures for these engines, use alexaholic.com. Alexaholic uses Alexa.com but will let you view multiple traffic charts simultaneously. These will give you a relative comparison of which is more popular.
More in depth: history of search engines, and search glossary.
For example, traffic to Google, Yahoo, and MSN has been relatively equal over the past 3 months. However, if you plot Alexa traffic figures over the past 5 years, you'll see how incredibly fast Google's popularity has increased, right up through early 2006. The Ask.com traffic line is at the very bottom of the chart.
Web2.0 Search Engines
These are the new breed, some labelled as web2.0 applications. The definition of web 2.0 is still fairly broad, but it's easy to see the award winning stuff. They're the tip of the iceberg of advanced search applications for what is known as the semantic web. They literally add another dimension to searching. Some offer visual search using an initial image that you select or even draw. Others let you search by color or meta tags of audio files. A few of these engines include Like, Princeton Shape, SystemOne Retrievr, Mnemomap, Casual, KWMap, Ujiko, and Webbrain. [See Information Aesthetics for writeups about most of these.]
Most of these new engines are works in progress that need a few generations of revisions. A few are truly brilliant, all of them innovative. Some use meta level concepts such as synonym matching, color or shape similarity, thematic concepts, semantics. There's even a device that carves rivers, canyons and valleys into foam based on search engine queries.
All of them appear to improve the search experience, but mostly for advanced users who are familiar with unusual search paradigms. If you're interested, visit some to get a sense of them. The rest of this article focuses on traditional text-based search engines.
Text-Based Search Engines - Overview
Text-based search engines are the mainstay of the web. They've come and gone, and will continue to do so. Many left the public web and focused on corporate Intranets (private webs; also part of the "invisible web"). It took Google, however, to successfully monetize a public search engine.
Because Google and other engines store a list of all the search queries that users perform, there is a vast pool of information that can be data-mined from the queries. For example, CNBC TV ran a segment about how a murderer was convicted of killing his wife, based on the overwhelming evidence of gruesome search queries. Investigators were able to trace the queries back to his personal computer at home. Pattern analysis, together with other evidence, led to his conviction.
Information is essentially cheap on the Internet. It's what you make of it, data-mining for patterns, that can be valuable. The average person does not care about that, though. Most people are looking for something specific, and usually they get more search results than they care to look at. That's where power search techniques come in. They are not very complicated, and use a fairly simple syntax to give you the power to cull the search results down to what you are really looking for - most of the time, anyway.
Glossary: Search Engine + Related
Before discussing ways to refine search queries, let's have a look at a few terms either specifically related to search engines, or related to topics in this article.
Anchor Text
Whenever you see a hyperlink on a web page, the actual words used to specify the link are referred to as the anchor text.
Blog/ Weblog
A blog (aka weblog) is a special website that has been structured with articles (blog posts) in reverse chronological order. Blog posts are also organized into page groups and monthly archives. They have a structural advantage in search engines, though they often result in false search results.
Bot/ Spider
A search engine bot or spider is a special automated web application that indexes web pages for a search engine.
Cache
Some engines store the full-text of an indexed web page. Whenever the page is updated, the engine's cache will also be updated, eventually. So you can view a cached page from another site by using the "cache:" operator, without leaving the search engine.
Invisible Web
The Invisible Web consists of web sites that are difficult or impossible to find, either because they are not indexed in a search engine or because they require a password.
Query Strings
This simply means the actual text that you enter in a search query, including letters, digits, punctuation, and any special operator characters.
SEM/ SEO
Search engine marketing/ Search engine optimization
Semantic Web
The Semantic Web is a project to derive consistent meaning from websites through advanced search engines. Most web content is designed for humans. When you search for something, you don't always get what you were thinking of. The semantic web will improve that by allowing search bots to extract meaning from semantically organized information. Sir Tim Berners-Lee, inventor of the World Wide Web, gives his road map for the semantic web (Sept 1998).
SERPs
SERP means Search Engine Results Page - those pages that result when you do a search query.
Stop Words
Stop words are any words, such as "the", "and", "a", "or", that add little value in being part of a search query string. Most engines do not store these when indexing web pages.
Tags
Tags refer to a topic category classification, primarily for weblog sites. So if you write a blog post about food, it might have tags such as "recipe", "italian", "mushrooms", "pasta". Tags are applied by the author of a post.
TLD
TLD means Top Level Domain and refers to the final part of the name of a web domain. For example, http://www.msn.com/ is an URL. The TLD is the ".com" part. The "msn" part is known as the second-level domain.
URL
URL means Uniform Resource Locator and essentially means the web address of a specific web page.
Web Feeds
Web feeds are a special form of web content that organizes new content from a website or blog into the form of headlines and excerpts. Web feeds make it easy to syndicate content online, as well as to subscribe to such content for frequent browsing using a "web feed reader". (See "Bloglines" in the final section of this article.)
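Two of the glossary terms above, stop words and TLDs, can be made concrete with a few lines of Python. This is only a minimal sketch: the stop-word list is a tiny illustrative one (real engines keep their own, larger lists), and the helper names strip_stop_words and split_domain are made up for this example. The domain splitter assumes a single-label TLD such as ".com" and does not handle suffixes such as ".co.uk".

```python
from urllib.parse import urlparse

# Illustrative stop-word list only; each engine maintains its own.
STOP_WORDS = {"the", "and", "a", "or", "of"}


def strip_stop_words(query: str) -> list[str]:
    """Lowercase a query string and drop the stop words from it."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


def split_domain(url: str) -> tuple[str, str]:
    """Return (second_level_domain, tld) for a simple host such as www.msn.com.

    Assumption: the TLD is the last dot-separated label of the host name.
    """
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2:
        raise ValueError(f"cannot split host: {host!r}")
    return labels[-2], "." + labels[-1]


print(strip_stop_words("the history of the library"))  # ['history', 'library']
print(split_domain("http://www.msn.com/"))             # ('msn', '.com')
```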
Refining Search Queries
All text-based search engines work on a query string supplied by the user. But most of the time, the results returned number in the hundreds or even millions of pages, making it difficult to find what you want. To reduce the number of results, we need to refine our search strings. To do that, we need to use special query operators that are derived mostly from Boolean logic, plus a few specialized operators.
All search engines use a fairly common set of advanced query operators (AQOs). However, not all engines process AQOs the same way. So if you do use advanced operators, you will want to play around with them in your favorite search engine to learn how they're handled. The operator descriptions below are generalized; not all engines will support them in exactly the way described.
General Query Operators
These include using double quotes to force results that include a specific text string, brackets "()", Booleans (AND, OR, NOT), and "+" or "-" (plus/ minus). Plus typically means include a term, and minus means exclude a term. For example:
- library taxonomy is usually the same as +library +taxonomy, which is the same as library AND taxonomy. Both words have to appear in the results, but order and proximity may vary. If you want adjacent words (i.e., that specific string), use double quotes: "library taxonomy". Some engines offer a near operator as well, which controls proximity within a certain number of words, say ten.
- Plural forms are usually automatically offered, as are some verb forms of a root word, unless double quotes are used.
- The OR operator might work on exclusion rules or on supplemental rules. For example, library OR taxonomy usually means either/both, but could mean one or the other only (exclusive or), which would make it the same as the next form.
- library NOT taxonomy means return only those web pages with just the word library, never with taxonomy. This is the same, in most engines, as +library -taxonomy.
- Brackets help arrange processing order in complex queries. For example: (EMF OR "electro magnetic fields") AND health means that the SERPs must have the word health and either of the terms EMF or "electro magnetic fields". You can add a bit more leeway in some engines by using (EMF OR (electro magnetic fields)) AND health. This lets the order and proximity of the words electro magnetic fields be more flexible in the results.
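To see how these operators narrow a result set, here is a rough Python sketch that applies them to a handful of made-up page titles. It is not how any real engine is implemented: the function names contains and matches and the sample documents are invented for illustration, OR is treated as inclusive, and there is no handling of plurals or verb forms.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains(text: str, term: str) -> bool:
    """True if the term (one word or a quoted-style phrase) appears as whole,
    adjacent words in the text."""
    haystack = " " + " ".join(tokenize(text)) + " "
    needle = " " + " ".join(tokenize(term)) + " "
    return needle in haystack


def matches(text, required=(), excluded=(), any_of=()):
    """required ~ AND / plus, excluded ~ NOT / minus, any_of ~ (a OR b)."""
    if not all(contains(text, term) for term in required):
        return False
    if any(contains(text, term) for term in excluded):
        return False
    if any_of and not any(contains(text, term) for term in any_of):
        return False
    return True


# Made-up sample documents.
docs = [
    "Library taxonomy for public collections",
    "A taxonomy of insects",
    "The library opens at nine",
    "EMF exposure and health",
    "Electro magnetic fields near power lines",
]

# library AND taxonomy
print([d for d in docs if matches(d, required=["library", "taxonomy"])])
# library NOT taxonomy
print([d for d in docs if matches(d, required=["library"], excluded=["taxonomy"])])
# (EMF OR "electro magnetic fields") AND health
print([d for d in docs
       if matches(d, required=["health"], any_of=["EMF", "electro magnetic fields"])])
```

Only the fourth document satisfies the last query: the fifth contains the phrase but not the word health, which is exactly the filtering the brackets and AND are meant to express.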
Site Operators
These are powerful operators that most engines have but which are not always well-known. While there is a common set of operators, a few engines have their own variations. Here is an amalgamated list. A few references are included after this section, if you are interested in finding out more. All of them consist of a predefined keyword and a colon, ":", character, which are then followed by a word or URL or domain name, etc. There should be no spaces on either side of the colon.
allinanchor:, inanchor: - Use allinanchor: to specify one or more words that must all be in anchor text. (See definition of anchor text in Glossary above.) Use inanchor: to specify one word in anchor text and one or more words in the rest of the document body.
Example: allinanchor:librarian
allintitle:, intitle: - Use allintitle: to specify one or more words that must all be in the title of a web page. Use intitle: to check for a single word in the title, and one or more words in the document body.
Example: allintitle:librarians
allinurl:, inurl: - Use allinurl: to specify one or more words to be checked in the URL of a web page. Use inurl: to check one word in the URL and one or more words in the document body.
Example: allinurl:librarians
cache: - Varies by engine, but it typically shows the last cached version of a page.
Example: cache:http://lii.org
define: - Returns definitions of a specific word, from various sources.
Example: define:librarian
domain:, site: - Use with a domain name to limit searches to pages on that site.
Example: site:stanford.edu
filetype: - Use with a media file type (e.g., PDF) to limit SERPs to that type of document.
Example: library filetype:xls
info: - Provides engine-specific info about a particular URL or its parent site.
Example: info:becomealibrarian.org
link:, linkto: - Use this to find websites linking to a specific URL or domain.
Example: link:www.librarian.net
related: - Engines determine topic similarity of web pages on different sites. This operator, when used with an URL, will return pages from other sites that are similar.
Example: related:lii.org
There are actually many more specialized operators, some of which are covered in the references below. They are not absolutely necessary, but are useful for power users.
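If you build query strings in a script, a small helper can enforce the keyword-plus-colon form with no surrounding spaces. The sketch below is only an illustration: operator_term and build_query are hypothetical names, and it merely assembles text. Whether an engine honors a given operator still depends on that engine.

```python
def operator_term(keyword: str, value: str) -> str:
    """Join an operator keyword and its value with a colon and no spaces."""
    keyword = keyword.strip().rstrip(":")
    value = value.strip()
    if not keyword or not value:
        raise ValueError("both keyword and value are required")
    if " " in value:
        raise ValueError("operator values may not contain spaces")
    return f"{keyword}:{value}"


def build_query(words: str = "", **operators: str) -> str:
    """Combine plain search words with any number of keyword:value terms."""
    parts = [words.strip()] if words.strip() else []
    parts += [operator_term(key, val) for key, val in operators.items()]
    return " ".join(parts)


print(build_query("library", filetype="xls"))  # library filetype:xls
print(build_query(site="stanford.edu"))        # site:stanford.edu
```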
Miscellaneous Operators
Some engines offer additional operator functionality by allowing you to click on a checkbox. Some such features include domain exclusion, choice of site TLDs, and date published range. In some engines, you can specify year range by using something like 2000..2006.
Additional References
Here are a few links to pages about advanced queries.
Google
While Google is not the oldest existing engine around, it is the most popular, especially among web-savvy users. They have a whole host of features, and they're always adding more. They just added support for 17 new languages. While Google is more selective about what websites and web pages they index, advanced users tend to favor this engine over others. Here is Google's full list of advanced operators for search queries.
Yahoo
Yahoo was originally a human-approved directory that you paid to have your site listed in. They still do that, but they added YahooSearch to compete with Google, which ousted many other engines that are either no longer around or have shifted their focus away from the public web.
Yahoo may show far fewer results for some search keywords than Google, though more complex phrases often show significantly different results. Their advanced features include the ability to search specific TLDs (e.g., .gov, .edu, .org, .com) and specific content (Creative Commons, adult, non-adult, subscription content, language results).
MSN
MSN Search (now called Live Search) is Microsoft's baby. They long dominated desktop computer software but have lagged behind in the Internet race, if their flat stock share price is any indication. For some reason, the average query in MSN tends to produce more SERPs than in Google, though this conclusion is based on a very small sample of queries over a year. It is very likely that MSN uses different criteria to index web pages, and Google is selective, as mentioned earlier. MSN's advanced features include language choice of search interface, language results, and safe search, amongst others. They also allow you to search for images, video and maps, within news or academic sites only, and in web feeds. There is a new QnA (Questions and Answers) beta feature, at the time of this writing, which lets you ask a question that a member of the community may answer for you, in some cases better than a search engine would.
Other Textual Search Engines
Below are some of the other text-based search engines, each of which enjoys a varying degree of mild popularity. Not every engine is included below (alphabetical order), but the following should give you a light overview of your options. Defunct engines are not mentioned, and this list is by no means comprehensive.
AllTheWeb
AllTheWeb is a search engine and information portal. Content is divided into web, news, pictures, video, and audio. For example, a search for "trees" under the video category produces SERPs of only video files that have the text string "trees" in the file name. If you were looking for MP3 files of legendary blues master Robert Johnson, you could use the audio tab to get all 2,666 results. (Is that a joke? Who knows. Johnson was said to have made a deal with the devil to be the best blues guitarist ever, and is reputed to have 3 graves.) AllTheWeb has a number of advanced operators in their query language. A look under the hood reveals that AllTheWeb is just YahooSearch white-labelled.
Note: AllTheWeb utilizes the Yahoo database.
AltaVista
AltaVista was one of the early challengers for the search engine throne, appearing in 1995. They were at one time one of the fastest search engines around, based on pure computing horsepower, and were somewhat popular, briefly, until Google appeared. They appear at present to be a white-labelled YahooSearch, with the standard set of advanced features.
Note: AltaVista utilizes the Yahoo database.
Ask + Excite
Ask.com is part of a group of engines and "information retrieval products" owned by IAC Search & Media. This group includes excite.com, which was once extremely popular when it debuted around 1995, a few years before Google. Ask.com was once called AskJeeves, and was white-labelled by a number of web portals that regular readers visited daily.
Ask also offers some non-standard search functionality (added courtesy of Gary Price):
1) Definition of non-alphanumeric searches. We have started to slowly offer non-alphanumeric searches.
2) Zip Code search. Notice the box to help you select the proper state and to see all the Zips for a specific city.
3) Blog and feed search. Note the pull-down boxes to subscribe to a feed (even using a competitor's reader) or post the item with one click using digg, Reddit, etc.
4) New: event listings. Part of the new AskCity service.
Blogsearch (Google)
Google Blogsearch works in the same way as regular Google, except that the results are dedicated to weblog sites only. That does not mean blogs are not included in regular Google, but they don't show as prominently there. This way, if you are specifically looking for topics discussed in blogs only, it's easier to find them since many millions of web pages from regular websites have been pre-filtered.
Hotbot (Lycos)
Lycos has enjoyed some popularity and even a loyal following. You can find a summary of advanced features there.
Indeed
Indeed is a job search engine, in case you are looking for a new job in the Library Sciences field. They also have a version for Canadian jobs.
Information
Information appears to be a specialized engine that also categorizes content into the groups web search, encyclopedia, blogs, articles, groups.
Librarian's Internet Index
The publicly-funded LII, or Librarian's Internet Index, is more of an information portal than a search engine. Each week, hand-selected websites adhering to some current theme are added to the Index, and their content can be searched in the LII. There's also a free newsletter that you can subscribe to. New entries can be subscribed to via the web feed.
Northern Light
Northern Light focuses on offering searches of a wide variety of business content as well as industry journals.
Technorati
Like Google Blogsearch, Technorati is dedicated to weblogs only. However, it's far more than just a search engine and includes many features specifically of use to bloggers. Technorati, amongst other features, lets you know what is popular in a number of blog categories and content types (text, video), as well as in topics. It's also easy to determine what other weblogs are linked to a specific weblog. Finally, in addition to searching indexed blog posts, you can also search through blog post tags and other blog directories.
Digg
Digg is a new form of search engine based on social community as the driving force for relevance. The technique is fairly new, but seems to be catching on with increasing popularity. Digg is currently the 2nd highest trafficked site in the "tech" category according to several resources.
Other Librarian Search Resources
Additional Librarian search resources added courtesy of Gary Price's insight.
Miscellaneous Resources
To round out the discussion, here are a couple of other resources that may be of interest to librarians or anyone doing regular research.
API/ SDKs For White-Labelling Custom Engines
Several search engines offer APIs (Application Programmer Interfaces) and SDKs (Software Development Kits) that allow you to embed their functionality into your own web applications. Thus, you could very easily use, say, a customized Google to build a special librarian's search engine, which would index a select set of websites and weblogs pertaining to library sciences.
Bloglines Web Feed Subscription Tool
If you plan to browse dozens or even hundreds of websites and weblogs on a daily or otherwise regular basis, one of the best tools for this is Bloglines. Professional bloggers and online researchers have been known to use this tool to monitor new articles/ blog posts from as many as 1,000 sites. The drawback is that only websites/ weblogs that publish a web feed can be tracked in this manner. Bloglines is owned by the same company as Ask.com and Excite.
Meta Search Engines
There are a couple of search tools, such as Dogpile and Metacrawler, that take your query and submit it to several engines simultaneously, returning aggregate SERPs to you.
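As a rough illustration of what aggregating SERPs can mean, the Python sketch below interleaves ranked result lists from several engines and drops duplicate URLs. This is one simple merging strategy chosen for the example, not a description of how Dogpile or Metacrawler actually combine results, and the two result lists are made up.

```python
from itertools import zip_longest


def aggregate(result_lists):
    """Interleave ranked result lists rank by rank, keeping only the first
    occurrence of each URL."""
    seen = set()
    merged = []
    for same_rank in zip_longest(*result_lists):
        for url in same_rank:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Made-up results from two hypothetical engines.
engine_a = ["http://lii.org", "http://www.librarian.net", "http://example.org/a"]
engine_b = ["http://www.librarian.net", "http://example.org/b"]

print(aggregate([engine_a, engine_b]))
```

A page that several engines return appears once, at the best rank any engine gave it.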
Search Tutorials
The Learning Site has a six-part tutorial on web searching, sleuthing and sifting through information on the Internet.
Web Search Start Point
Accesscom.com has a somewhat out of date list of 200+ categorized hyperlinks, in case you are looking for something but don't know where to start, as they put it.
Query Views
Ever wonder what other people are searching for? Metacrawler's SearchSpy gives you a scrolling, near-realtime list of actual search query strings. There are two versions: unexposed and exposed, with the latter being unfiltered - that is, with possible adult content.