The "visible web" is what you see in the results pages of general web search engines, and in almost all subject directories. The "invisible web" is what you cannot retrieve ("see") in the search results and other links contained in these types of tools.
The first version of this web page was written in 2000, when this topic was new and baffling to many web searchers. Since then, search engine crawlers and indexing programs have overcome many of the technical barriers that once made it impossible for them to find and index invisible web pages. These types of pages used to be invisible but can now be found in most search engine results:
- Pages in non-HTML formats (PDF, Word, Excel, Corel suite, etc.) are now "translated" into HTML by most search engines and can be "seen" in search results.
- Script-based
pages, whose links contain a ? or other script coding, no longer cause most
search engines to exclude them.
- Pages generated dynamically by other types of database software (e.g., Active Server Pages, Cold Fusion) can be indexed if there is a stable URL somewhere that search engine spiders can find. Such pages were once largely shunned by search engines; now many types of dynamically generated pages like these are found in most general web search engines.
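To make the "? or other script coding" in those links concrete, here is a small sketch using Python's standard urllib.parse (the URL and parameter names are made up for illustration): everything after the "?" is a query handed to a server-side program, not the name of a stored page.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical script-based link of the kind described above.
url = "https://catalog.example.edu/search.cgi?title=deep+web&format=html"

parts = urlparse(url)
params = parse_qs(parts.query)

print(parts.path)   # the script that runs: /search.cgi
print(params)       # the query it answers: {'title': ['deep web'], 'format': ['html']}
```

Early crawlers refused to follow links like this because each combination of parameters could produce a different (potentially endless) set of pages; modern crawlers will follow them so long as a stable link to the URL exists somewhere.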
Why does an Invisible Web still exist? There are still some hurdles search engine spiders cannot leap, and these still create a HUGE set of web pages not found in general search engines:
- Search engines still cannot type or think. If access to a web page requires typing, web crawlers encounter a barrier they cannot pass. They cannot search our online catalogs, and they cannot enter a password or log in.
- The
Contents of Searchable Databases. Most of the invisible or deep web is
made up of the contents of thousands of specialized searchable databases
made available via the web. When you type a search in one of these
databases, the search results are delivered to you in web pages that are
generated just in answer to your search. Rarely are such pages stored
anywhere: it is easier and cheaper to dynamically generate the answer page
for each query than to store all the possible pages containing all the
possible answers to all the possible queries people could make to the
database.
- Google Scholar is a collection of citations with links to publishers or other sources where one can try to access the publication in full text. In many academic libraries (and some others), Google Scholar provides convenient links to the online holdings of those libraries, purchased for the exclusive use of their constituents. If you search Google Scholar, you find a lot of journal article references. But what you are seeing when you search Google Scholar is only a tiny fraction of all the scholarly publications that exist online. Much more lurks in a new type of Invisible or Deep Web.
- WHY? Google Scholar is only able to provide citations to journal contents for which its crawlers can find stable links. It cannot construct searches or enter passwords to get into the password-protected, copyright-protected articles in many publishers' databases. In some experiments conducted at UC Berkeley, we estimate that Google Scholar accesses about 10% of all we subscribe to for our students, faculty, staff, and users present on campus. Think about the millions of articles in Lexis/Nexis, or the many thousands of articles indexed in the privately licensed databases libraries buy the rights for their users to read (e.g., Sociological Abstracts, ERIC, PsycINFO, JSTOR, INSPEC).
- Excluded Pages. There are some types of pages that search engine companies exclude by policy. There is no technical reason they could not include them if they wanted to. It's a matter of selecting what to include and what to leave out of databases that are already huge, expensive to operate, and whose search function is a low revenue producer.
- Dynamically generated pages of little value beyond single use. Think of the billions of possible web pages that can be generated by all the people who have looked for books in our online catalogs. Each of them creates a results page in response to a specific need. Search engines do not want all of these pages in their web databases. They would be clutter of little interest to anyone.
- Many databases fall in this category. There are many thousands of public-record, official, and special-purpose databases containing government, financial, logistical, and other types of information needed to answer very specific inquiries of interest to very few people. Even if stable links existed to such pages, search engines would not want them. More clutter.
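The points above about searchable databases can be sketched in a few lines (a toy illustration with made-up data and names, not any real catalog's code): the results page is built on demand from the query, so there is no stored page, and no stable URL, for a spider to discover.

```python
# Tiny in-memory "database" standing in for a searchable catalog (hypothetical data).
RECORDS = [
    "Aviation accident report, 1998",
    "Chemical spill inventory, 2003",
    "Census tract summary, 2000",
]

def results_page(query: str) -> str:
    """Generate an HTML results page for one query, on the fly."""
    hits = [r for r in RECORDS if query.lower() in r.lower()]
    rows = "".join(f"<li>{hit}</li>" for hit in hits)
    return f"<html><body><h1>Results for '{query}'</h1><ul>{rows}</ul></body></html>"

# The page exists only after someone types the query; nothing is stored,
# and a crawler that cannot type never sees it.
page = results_page("report")
print(page)
```

Generating the page per query, rather than storing every possible answer page, is exactly the economic choice the text describes: it is cheaper to compute answers on demand than to pre-build a page for every query anyone might make.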
Simply think
"databases" and keep your eyes open. You can find searchable
databases containing invisible web pages in the course of routine searching in
most general web directories.
Use Google and other search engines to locate searchable databases by searching a subject term and the word "database". If a database uses the word "database" in its own pages, you are likely to find it in Google. The word "database" is also useful in searching a topic in the Google Directory or the Yahoo! directory, because they sometimes use the term to describe searchable databases in their listings.
- EXAMPLES for Google & Yahoo:
  - plane crash database
  - languages database
  - toxic chemicals database
Remember that the Invisible Web exists. Remember that, in addition to what you find in search engine results (including Google Scholar) and most web directories, there are these gold mines you have to search directly. These include all of the licensed article, magazine, reference, and news-archive collections and other research resources that libraries and some industries buy for those authorized to use them. The contents of these are not freely available: libraries and corporations buy the rights for their authorized users to view the contents. If they appear free, it's because you are somehow authorized to search and read the contents (library card holder, member of the company, etc.).
As part of your wise web search strategy, spend a little time looking for databases in your field or topic of study or research. Remember, however, that most proprietary information -- most of the journals, magazines, news, and books -- is not freely available. Publishers and authors control it under copyright and other distribution rules. You will be prompted to pay or enter a password to see full text. A library you have the rights to use may have access to what you want, however.
The Ambiguity Inherent in the Invisible Web:
It is very
difficult to predict what sites or kinds of sites or portions of sites will or
won't be part of the Invisible Web. There are several factors
involved:
- Which sites
replicate some of their content in static pages (hybrid of visible and
invisible in some combination)?
- Which
replicate it all (visible in search engines if you construct a search
matching terms in the page)?
- Which
databases replicate none of their dynamically generated pages in links and
must be searched directly (totally invisible)?
- Search engines can change their policies on what they exclude and include.
Want to
learn more about the Invisible Web?
- The Wikipedia "Deep Web" article provides a fairly up-to-date summary of the problems, current state, and technologies associated with the phenomenon. I defer to its links to other resources and readings.
Copyright (C) 2006 by the Regents of the
University of California. All rights reserved.
Document created &
maintained on server: http://www.lib.berkeley.edu/ by Joe Barker
Last updated 1 August 2006.