The "visible web" is what you see in the results pages of general web search engines, and in almost all subject directories. The "invisible web" is what you cannot retrieve ("see") in the search results and other links contained in these types of tools.
The first version of this web page was written in 2000, when this topic was new and baffling to many web searchers. Since then, search engine crawlers and indexing programs have overcome many of the technical barriers that once made it impossible for them to find and index invisible web pages. These types of pages used to be invisible but can now be found in most search engine results:
- Pages in non-HTML formats (PDF, Word, Excel, Corel suite, etc.) are now "translated" into HTML by most search engines and can be "seen" in search results.
- Script-based
pages, whose links contain a ? or other script coding, no longer cause most
search engines to exclude them.
- Pages generated dynamically by other types of database software (e.g., Active Server Pages, Cold Fusion) can be indexed if there is a stable URL somewhere that search engine spiders can find. Such pages were once largely shunned by search engines; now many types of dynamically generated pages like these are found in most general web search engines.
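To make the "? or other script coding" in those links concrete, here is a small sketch using Python's standard urllib.parse (the URL and parameter names are made up for illustration): everything after the "?" is a query handed to a server-side program, not the name of a stored page.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical script-based link of the kind described above.
url = "https://catalog.example.edu/search.cgi?title=deep+web&format=html"

parts = urlparse(url)
params = parse_qs(parts.query)

print(parts.path)   # the script that runs: /search.cgi
print(params)       # the query it answers: {'title': ['deep web'], 'format': ['html']}
```

Early crawlers refused to follow links like this because each combination of parameters could produce a different (potentially endless) set of pages; modern crawlers will follow them so long as a stable link to the URL exists somewhere.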
Why does an Invisible Web still exist? There are still some hurdles search engine spiders cannot leap, and these still create a HUGE set of web pages not found in general search engines:
- Search engines still cannot type or think. If access to a web page requires typing, web crawlers encounter a barrier they cannot pass. They cannot search our online catalogs, and they cannot enter a password or log in.
- The
Contents of Searchable Databases. Most of the invisible or deep web is
made up of the contents of thousands of specialized searchable databases
made available via the web. When you type a search in one of these
databases, the search results are delivered to you in web pages that are
generated just in answer to your search. Rarely are such pages stored
anywhere: it is easier and cheaper to dynamically generate the answer page
for each query than to store all the possible pages containing all the
possible answers to all the possible queries people could make to the
database.
- Google Scholar is a collection of citations with links to publishers or other sources where one can try to access the publication in full text. In many academic libraries (and some others), Google Scholar provides convenient links to the online holdings of those libraries, purchased for the exclusive use of their constituents. If you search Google Scholar, you find a lot of journal article references. But what you are seeing when you search Google Scholar is only a tiny fraction of all the scholarly publications that exist online. Much more lurks in a new type of Invisible or Deep Web.
- WHY? Google Scholar is only able to provide citations to journal contents for which its crawlers can find stable links. It cannot construct searches or enter passwords to get into the password-protected, copyright-protected articles in many publishers' databases. In some experiments conducted at UC Berkeley, we estimate that Google Scholar accesses about 10% of all we subscribe to for our students, faculty, staff, and users present on campus. Think about the millions of articles in Lexis/Nexis, or the many thousands of articles indexed in the privately licensed databases libraries buy the rights for their users to read (e.g., Sociological Abstracts, ERIC, PsycINFO, JSTOR, INSPEC).
- Excluded Pages. There are some types of pages that search engine companies exclude by policy. There is no technical reason they could not include them if they wanted to. It's a matter of selecting what to include and what to leave out of databases that are already huge, expensive to operate, and whose search function is a low revenue producer.
- Dynamically generated pages of little value beyond single use. Think of the billions of possible web pages that can be generated by all the people who have looked for books in our online catalogs. Each of them creates a results page in response to a specific need. Search engines do not want all of these pages in their web databases. They would be clutter of little interest to anyone.
- Many databases fall in this category. There are many thousands of public-record, official, and special-purpose databases containing government, financial, logistical, and other types of information needed to answer very specific inquiries of interest to very few people. Even if stable links existed to such pages, search engines would not want them. More clutter.
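The points above about searchable databases can be sketched in a few lines (a toy illustration with made-up data and names, not any real catalog's code): the results page is built on demand from the query, so there is no stored page, and no stable URL, for a spider to discover.

```python
# Tiny in-memory "database" standing in for a searchable catalog (hypothetical data).
RECORDS = [
    "Aviation accident report, 1998",
    "Chemical spill inventory, 2003",
    "Census tract summary, 2000",
]

def results_page(query: str) -> str:
    """Generate an HTML results page for one query, on the fly."""
    hits = [r for r in RECORDS if query.lower() in r.lower()]
    rows = "".join(f"<li>{hit}</li>" for hit in hits)
    return f"<html><body><h1>Results for '{query}'</h1><ul>{rows}</ul></body></html>"

# The page exists only after someone types the query; nothing is stored,
# and a crawler that cannot type never sees it.
page = results_page("report")
print(page)
```

Generating the page per query, rather than storing every possible answer page, is exactly the economic choice the text describes: it is cheaper to compute answers on demand than to pre-build a page for every query anyone might make.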
Simply think
"databases" and keep your eyes open. You can find searchable
databases containing invisible web pages in the course of routine searching in
most general web directories.
Use Google and other search engines to locate searchable databases by searching a subject term and the word "database". If a database uses the word "database" in its own pages, you are likely to find it in Google. The word "database" is also useful in searching a topic in the Google Directory or the Yahoo! directory, because they sometimes use the term to describe searchable databases in their listings.
- EXAMPLES for Google & Yahoo:
  - plane crash database
  - languages database
  - toxic chemicals database
Remember that the Invisible Web exists. Remember that, in addition to what you find in search engine results (including Google Scholar) and most web directories, there are these gold mines you have to search directly. These include all of the licensed article, magazine, reference, and news-archive collections and other research resources that libraries and some industries buy for those authorized to use them. The contents of these are not freely available: libraries and corporations buy the rights for their authorized users to view the contents. If they appear free, it's because you are somehow authorized to search and read the contents (library card holder, member of the company, etc.).
As part of your wise web search strategy, spend a little time looking for databases in your field or topic of study or research. Remember, however, that most proprietary information -- most of the journals, magazines, news, and books -- is not freely available. Publishers and authors control it under copyright and other distribution rules. You will be prompted to pay or enter a password to see full text. A library you have the rights to use may have access to what you want, however.
The Ambiguity Inherent in the Invisible Web:
It is very
difficult to predict what sites or kinds of sites or portions of sites will or
won't be part of the Invisible Web. There are several factors
involved:
- Which sites
replicate some of their content in static pages (hybrid of visible and
invisible in some combination)?
- Which
replicate it all (visible in search engines if you construct a search
matching terms in the page)?
- Which
databases replicate none of their dynamically generated pages in links and
must be searched directly (totally invisible)?
- Search engines can change their policies on what they exclude and include.
Want to
learn more about the Invisible Web?
- The Wikipedia "Deep Web" article provides a fairly up-to-date summary of the problems, current state, and technologies associated with the phenomenon. I defer to its links to other resources and readings.
Copyright (C) 2006 by the Regents of the
University of California. All rights reserved.
Document created &
maintained on server: http://www.lib.berkeley.edu/ by Joe Barker
Last updated 1 August 2006.