03 December 2009

The incredible blogosphere world (for newbies)


...You might prefer IE anyway, but the results you get with it are not 
necessarily visible to your viewers. This dichotomy reveals the 
underside of internet site design: What you think you’re 
showing people might not be what they’re seeing...

Digging for traffic

   Digg is an experiment in collaboratively filtering Web content. Collaborative filtering is a way of finding high quality through group intelligence. The theory is that mobs are smart even if the taste and intelligence of individuals vary. Get enough people to vote Yes or No about something, and the cream rises to the top. (Unsatisfied election voters disagree, of course.)
   Digg invites users to submit Web pages, which get categorized and put on long lists. Those lists continually have new pages added to them. When submitting, you can add a short comment or summary of the page. New items stay on the first page of the list until they get pushed to later pages by the continual influx of newer items. Once pushed off, an item’s visibility diminishes. At any time during this marching process, Digg visitors can click through to the item’s page (or not), and “digg” the item by clicking a special link. A tally is kept of the number of times the item is “dugg.” After a certain threshold is reached, that item is moved to Digg’s home page, where it receives tremendously more visibility. Front-page Digg items generate tons of traffic to their pages.
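   To make the vote-and-promote idea concrete, here is a minimal Python sketch of a tally with a front-page threshold. It is only a toy model: the class name, the URLs, and the threshold of 3 are assumptions made for illustration, and Digg's real promotion rules are not described here.

```python
from collections import Counter

PROMOTION_THRESHOLD = 3  # hypothetical value; the real cutoff is not stated in this post


class DiggBoard:
    """Toy model of a vote tally with a front-page threshold."""

    def __init__(self, threshold=PROMOTION_THRESHOLD):
        self.threshold = threshold
        self.upcoming = []      # newest submissions first
        self.front_page = []    # items that crossed the threshold
        self.diggs = Counter()  # url -> number of diggs

    def submit(self, url):
        # New items go to the top of the list, pushing older ones down.
        self.upcoming.insert(0, url)

    def digg(self, url):
        self.diggs[url] += 1
        # Promote the item once its tally reaches the threshold.
        if self.diggs[url] >= self.threshold and url in self.upcoming:
            self.upcoming.remove(url)
            self.front_page.append(url)


board = DiggBoard()
board.submit("example.com/post-1")
board.submit("example.com/post-2")
for _ in range(3):
    board.digg("example.com/post-1")

print(board.front_page)  # items promoted to the front page
print(board.upcoming)    # items still waiting in the list
```

   In this sketch, only the item that collected three diggs leaves the upcoming list; the other one stays where it is and keeps sliding down as newer items arrive.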
   Some bloggers submit every single one of their entries to Digg, hoping for the one that clicks hard and delivers throngs of visitors. (Remember that each blog entry resides on a separate page with its own URL.) Even a modestly dugg item can generate substantial traffic. There is no one-to-one correspondence between diggs (votes for the item) and visits to the item’s page; on the contrary, the item’s page almost inevitably receives much more traffic than the “dugg” number reflects.
   Here again, in Digg as on a blog, some discretion is advisable. Digg users can comment on any submitted item, and if the item is lame by the Digg standard of cool, the submitter is likely to get flamed. Because everyone’s submissions are collected onto a single page, it’s quite possible to ruin your reputation by making trivial submissions that waste time and Digg space. Digg’s purpose is to showcase the best of the best; it is everyone’s responsibility to focus on quality whether voting or submitting.

Del.icio.us links

   Del.icio.us is a popular social (collaborative) tagging site. Tagging is a way of organizing many items into categories such that each item can inhabit many categories. Instead of creating categories like boxes, and dumping items into those boxes, tagging starts with the item and assigns it several (or just one) descriptive tags (lately even browser bookmarks make use of tags). Visitors to a tagging site can click any descriptive tag to see all items that have been assigned that tag. As a voting system, Del.icio.us is not as organized as Digg (described in the preceding section). But Del.icio.us preceded Digg and has a strong following.
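   For readers who think in code, tagging can be sketched as an index from each tag to the set of items that carry it, so one item can sit under many tags at once. The function names and URLs below are made up for illustration; this is not how Del.icio.us itself is implemented.

```python
from collections import defaultdict

# tag -> set of URLs carrying that tag
tag_index = defaultdict(set)


def save_bookmark(url, tags):
    """File one URL under every tag given."""
    for tag in tags:
        tag_index[tag.lower()].add(url)


def items_for(tag):
    """Return all URLs saved under a tag, sorted alphabetically."""
    return sorted(tag_index.get(tag.lower(), set()))


save_bookmark("example.com/atom-guide", ["blogging", "feeds"])
save_bookmark("example.com/blogger-tips", ["blogging"])

print(items_for("blogging"))  # both pages share this tag
print(items_for("feeds"))     # only the first page has this tag
```

   Clicking a tag on a tagging site corresponds to the `items_for` lookup here: the same page shows up under every tag it was given.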
   Instead of submitting items (Web pages or blog entries) for group voting, in del.icio.us, you keep a personal store of favorite pages (actually links to pages), each of which is tagged. Del.icio.us provides a bookmarklet (a button for your browser) that lets you save any page you visit with a single click. Once saved and tagged, a page becomes publicly viewable in the del.icio.us site. Others might click through to your saved page and save it themselves in their own del.icio.us cubbyholes. Each saved page displays the number of times it has been saved by the entire community. As in Digg, success breeds success; most people are more curious about popular pages than unpopular pages. The result of a popularly saved page is lots of traffic to that page.
   With no comment system to discourage poor submissions, as in Digg, nothing stops bloggers from saving their entries — perhaps even all their entries. I don’t encourage obsessive promotion of this sort, though; you’d end up spending more time promoting entries than writing them.

Yahoo! 360

   Yahoo! 360 was started in March 2005 as a blog and community site with a few unusual and interesting innovations. Yahoo! is a gigantic suite of Web sites, online communities, and services. It’s hard to imagine anyone being online for one year (or one month) and not touching the Yahoo! empire, even inadvertently. Yahoo! publishes news, hosts Web sites, runs auctions and personal-ad services, stores photos, was the Web’s first major directory and is still one of the three most important search engines, operates music services enjoyed by millions, hosts a gigantic online chatting platform, is one of the world’s largest e-mail providers, and has the world’s most popular personalized home page service. The Yahoo! domain is consistently, month after month and year after year, the first or second most-visited Internet destination.
   The beauty of Yahoo! 360 is that it ties together some parts of the Yahoo! platform and throws them onto your page with little effort on your part. That means if you have ever uploaded pictures to Yahoo! Photos (photos.yahoo.com) or to Flickr (a photo-sharing site owned by Yahoo!), you can put those photos on your Yahoo! 360 page with a click or two — no further uploading. If you have ever written a review of a restaurant in Yahoo! Local (local.yahoo.com), you can easily have that review displayed on your 360 page. This type of integration, or bundling — easily making your Yahoo! stuff appear on your 360 page — is just beginning, and will get more developed, with more options, over time. Yahoo! 360 is already nicely integrated with Yahoo! Messenger, the instant messaging program. That means you can see which of your 360 friends is online at any moment. If you use Yahoo! Groups — another social network that does not include blogging — you can post messages to your groups from your 360 page.
   Yahoo! 360 is neither pretty nor ugly. It is plain; the focus is more on smooth functionality than on dazzling appearance. Recently, Yahoo! added color schemes called Themes to 360. You can change the colors but not the layout style, so Yahoo! 360 pages look similar to one another, except for the content loaded into them. That content is a mix of what you write and your photos.

Finding a Home in Blogger

All roads lead to Blogger, it sometimes seems. Blogger.com is the first stop on the blogging highway for innumerable newcomers. Here are three good reasons for Blogger’s popularity:
  • Blogger is free
  • Blogger is unlimited; there is no restriction on the number of blogs you can have or the number of pictures you can upload
  • Blogger is easy. Advertising “push-button publishing,” Blogger gets you up and posting faster than any other service.
But Blogger has drawbacks, too. Despite its user-friendliness, Blogger has the soul of a geek and sometimes makes simple tasks unnecessarily complicated. One complaint: the difficulty of adding sidebar content such as a blogroll.




Blogroll: a list of blogs on a blog (usually placed in the sidebar) that reads as the blogger’s list of recommended blogs.
Blogger makes you stick your hands into your site’s code to add a blogroll, which is why so many Blogger sites don’t display one (nowadays that feature works out of the box). Blogger stands somewhere between social networks and TypePad. Blogger offers some of the community tools that are characteristic of social sites such as Yahoo! 360 and MSN Spaces.


For example, each listed interest in your Blogger profile links to search results showing everyone else in Blogger with that interest — one of the bedrock features of social networks.
But Blogger lives up to its name and is a more blog-intensive, blog-centric service than the social networks. Blogger is primarily about blogging, not primarily about meeting people.

The Blogger Look



Because Blogger offers relatively few (but fairly attractive) templates, many blogs are instantly recognizable as belonging to Blogger. One typical example of the Blogger look is the blog you are reading right now. That template and similar ones in different colors are widely used. If you scroll down the page, you might notice that a feed link is now visible. The lack of a visible feed link was another drawback of Blogger in the past. However, Blogger used a common feed link that you can simply add to the end of any blog’s home-page URL:


/atom.xml


Using Jeff Siever’s Blogger site as an example, the feed link is this:


drumacrat.blogspot.com/atom.xml


Blogger uses an alternative to the RSS feed format called Atom. Atom feeds work just like RSS feeds in the important ways. Atom feeds display blog entries just like RSS feeds in a feed newsreader. Blogger’s choice of Atom has nothing to do with the lack of a feed link (in the past) on Blogger blogs. The missing feed link was simply a design choice at Blogger, inexplicable perhaps, but there it is. If you use a feed-enabled Web browser (such as Firefox), Blogger feeds appear in the browser just as reliably as on pages that do contain feed links. Firefox finds the feed link in the page’s code, where it lurks invisibly.
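
If you are curious what a newsreader does with such a feed, here is a small Python sketch. It builds the feed address by appending /atom.xml to a blog’s home-page URL, as described above, and pulls the entry titles out of an Atom document. The helper names and the sample feed are invented for illustration, and the sketch assumes the Atom 1.0 namespace; older feeds may use a different one.

```python
import xml.etree.ElementTree as ET

# Assumption: the feed uses the Atom 1.0 namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def feed_url(blog_home):
    """Append /atom.xml to a blog's home-page URL."""
    return blog_home.rstrip("/") + "/atom.xml"


def entry_titles(atom_xml):
    """Return the titles of all entries in an Atom document."""
    root = ET.fromstring(atom_xml)
    return [
        entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        for entry in root.findall("atom:entry", ATOM_NS)
    ]


# A tiny hand-written feed standing in for a real download.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample blog</title>
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>"""

print(feed_url("drumacrat.blogspot.com"))
print(entry_titles(SAMPLE))
```

The sketch works on a string rather than fetching anything from the network, so you can run it offline and swap in the text of a real feed later.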
You can host your blog at your own domain (XYZ.com, for example) rather than at Blogger’s free hosting domain, blogspot.com. Some people take advantage of Blogger’s free FTP service, which enables Webmasters to use Blogger templates and tools while keeping the blog on their own domain. If you prefer the FTP option, choose it during sign-up, or choose it after sign-up on the Settings, Publishing page of the control panel. You can also stream photos from Flickr, a photo-sharing site that allows members to tie their photo collections to their blogs.
A moblog is a mobile blog, where entries are posted remotely using portable computers, PDAs, or cell phones. Many moblogs are rich in photos. Notice that Blogger templates resemble one another, differing mainly in color scheme, so blogs that use them are instantly recognizable as Blogger blogs.

Getting Started with Blogger

   You must start an account with Blogger to begin blogging, but your account is completely free. Furthermore, one account gives you a theoretically unlimited number of blogs. Each blog has a theoretically unlimited amount of space for entries and photos. I say “theoretically” because I have heard of restrictions being placed on accounts that are overused. No published restrictions exist, and with normal use even active bloggers should be able to expand their sites without constraint.
   Starting up is easy. This section walks you through the path of least resistance: the easiest way to start posting entries in Blogger. You can make changes to your account, and to your blog, later. I generally recommend doing some blog setup before starting to write. With Blogger, however, which is designed for “push button publishing,” I advise pushing those buttons and getting your blog published without delay (except for a few small settings). Later, I describe how you can customize, personalize, and otherwise adjust other blog settings. Here we go. Go to the Blogger site, and then follow along with these steps:
  1. On the Blogger home page, click the Create Your Blog button.
  2. Fill in the username, password, display name, and e-mail address; select the Terms of Service acceptance box; and then click the Continue button. The username and password log you into your Blogger account when you want to post entries or change blog settings. The display name, which can be changed later, appears on your blog. The e-mail address does not necessarily appear on the blog but is used for communication between Blogger and you.
  3. Fill in a name and address for your blog. You can give your blog any title at all, and you can always change it. The blog’s address is partly determined by the Blogger host, which is located at blogspot.com. This means that your blog address will be yourblog.blogspot.com. Simply type what you want that first word of the address to be (click Check Availability to make sure the name is not already in use); it doesn’t need to be the same as your blog title or display name.

    Note the Advanced Setup option; I am ignoring that option for now. Use it if you have your own domain that resides on a Web host. You can use Blogger tools to operate the blog on your Web host.
  4. Fill in the Word Verification, and then click the Continue button. The word verification step ensures that you are a real person, and not a software robot. Because, yes, software robots do try to create Blogger accounts.
  5. Select a template by clicking the radio button below one of the designs. The template determines the color theme and design layout of your site. Currently, you can select from 12 templates. Click the preview template link below any template to see a pop-up window illustrating a sample blog in that design.
  6. Click the Continue button. You might have to wait a few seconds at this point, as Blogger creates your blog template and plugs in your information, until the “Your blog has been created!” message appears.
  7. Click the Start Posting button to . . . well, start posting.
After following these steps, you will have created a blog. It’s that painless. Perhaps that’s enough for one day. If you put aside this project for now, you can return to it at any time by visiting the Blogger.com site and logging in using the Username and Password boxes in the upper-right corner. Doing so presents the Dashboard page; here, Blogger displays news about the service and hints for using it better. On this page, also, are direct links to
  • editing your profile, 
  • changing your password, 
  • adding a photo to your profile, 
  • and — most important to daily blogging — creating a new post and accessing your other Blogger controls. 
Drill this into your brain: The Dashboard is where you access your profile information. Don’t ask me why Blogger does not put your profile settings inside the control panel along with all the other settings. I have no idea why not. To get to the Dashboard, click the Dashboard link near the upper-right corner of the control panel. In the Dashboard screen, click the Edit Profile link to see your profile settings.

Three Crucial Settings

I’ll get to posting in a bit. First, you should address three default settings in your new blog. I don’t mean that you must change the settings, but you should be aware of them before you start writing entries into your blog. Follow these steps to review these crucial settings:
  1. Go to Blogger.com and sign in to your account.
  2. Click the Settings link. Blogger displays the Basic page of your Settings tab.
  3. Use the Add Your Blog to Our Listings drop-down menu to make your blog public or private. The default setting is Yes; this means that your blog is included in the Blogger directory and might be included in Blogger’s list of highlighted blogs. Select No if you prefer keeping the blog private, either temporarily while you build up some entries and practice your Blogger skills or permanently.
  4. Click the Save Settings button. You might have to scroll down the page to see this button. You must save changes you make to any page in the Settings tab before moving to another page; otherwise, your changes will revert to previous settings.
  5. Still in the Settings tab, click the Formatting link.
  6. On the Formatting page, use the Show Title drop-down menu to determine whether your entries are individually titled. Curiously, in the past this setting defaulted to No, meaning that no Title box was available when you wrote an entry. When this option is set to No, your blog contains no entry titles. Such a design might be to your taste, but I find it odd and recommend changing this setting to Yes. When you do so, the Title box appears on the page on which you compose entries. Then, you have a choice with each entry: create a title or not. You might also find a field on this page that offers to replace the WYSIWYG editor with an updated version. If so, choose the new version. Once you have chosen it, that field disappears from the Formatting page.
  7. Click the Save Settings button.
  8. Still in the Settings tab, click the Comments link.
  9. On the Comments page, use the Comments radio buttons to choose whether your entries will allow comments. The default here is Show, which allows visitors to leave comments on your entry pages.
  10. Click the Save Settings button. Now you are ready to blog!

Writing and Posting Entries in Blogger

With your basic settings ready to go, you might want to write and post an entry. That’s what it’s all about. If you’re in your Blogger control panel (where the Settings tab is), a single click gets you to the page where you compose an entry. If you’re entering Blogger after being away from it, go to Blogger.com and sign in, and then click the name of your blog on your Dashboard page. Clicking your blog name takes you to the control panel.

Composing an entry and publishing it

However you get there, follow these steps to write and post your first blog entry. Doing so is just about as easy as sending an e-mail:
  1. Click the Posting tab. You are delivered to the Create page within the Posting tab — just where you want to be.
  2. Enter a title for your blog post. As noted in the preceding section, you must select an option that puts a title field on the Create page. If you don’t, you can’t title your entries.
  3. In the large box, type your entry. Don’t worry about being brilliant, profound, or even interesting. You can delete or edit the post later.
  4. Click the Publish Post button.
That is all there is to writing and posting a blog entry in Blogger. Well, those are the basics. Other features are available. Note the icons and drop-down menus just above the entry-writing box (click the Post options link). You might be familiar with similar controls if you use e-mail in AOL, Gmail, or another system where you can select fonts, colors, and other formatting options. Run your mouse cursor over the icons to see their labels.

Here is a rundown of Blogger formatting choices:
  • Use the Font and Normal Size drop-down menus to select a typeface and type size, respectively. If you’re uncertain what this means, experiment! Remember, nothing gets published on the blog until you click the Publish Post button.
  • Use the b icon to make bold text. Simply select any text you’ve already typed, and then click the icon. Alternatively, click the icon, start typing (even in the middle of a sentence), and then click the icon again to end the bold formatting.
  • Use the i icon to make italicized text.
  • Use the T icon to change the color of your text. Again, this control may be used on words within sentences. The truth is, most people don’t insert colored text into blog posts. Blogs are more about the writing than fancy and useless formatting. Bold and italic text makes a point; colored text rarely does.
  • That little chain icon next to the T icon is the link icon; use it to insert a link to an outside Web page. Highlight any word or group of words, and then click the link icon and type (or paste) a copied link into the pop-up box.
  • The alignment icons line up your text as follows: on the left (with a ragged right edge), in the middle (ragged on both sides), on the right (with a ragged left edge), or with full justification (even on both sides). The default setting is left alignment and there is little reason to change it.
  • Use the numbered list and bulleted list options to create indented lists in your entry. The list you’re reading right now is a bullet list. The list before this is a numbered list.
  • The blockquote option creates an indented portion of the entry that many people use when quoting material from another site.
    That illustrates a blockquote.
  • The ABC icon represents a spell-check feature.
  • The small picture is for adding images; I get to that later in the section called “Inserting a photo in an entry.”
  • The rightmost icon, which looks like an eraser, removes formatting from any portion of highlighted text.

That’s it for the formatting. Tempting though these one-click options are, try not to gunk up your blog too much. I mean, it’s your blog; do what you want. But most people don’t like having their eyes savaged by colored text, weirdly sized characters, bizarre fonts, or willy-nilly bolding and italicizing. Sorry to be a sourpuss, but use formatting with discretion.

You might notice a second tab above the entry-writing box, labeled Edit Html. Use this tab if you prefer to manually code your formats with HTML tags. HTML specialists can also use HTML tags that are not represented by the format icons just described. When you want to check your hand-coding efforts, click back to the Compose tab. You can toggle between the two, editing your code and checking it. The Preview link invites you to see what your entry will look like. Unfortunately, it does no such thing. That link displays a composed and formatted version of your entry, but it does not display it in your blog’s template design.
Because the page on which you compose your entry shows all the formatting choices you make, the Preview link seems pointless to me. So my tip is: Ignore it (the Preview feature, not the tip).
Note two final features on the Create page (click the Post options link).
  1. Use the Allow and Don’t Allow options to determine whether comments will be allowed on the entry you’re writing. You can use this setting to override the global setting you made in the preceding section.
  2. Use the Time & Date menus to change the recorded time of the post. You can place the entry in the past or the future (more on why you might want to do this in a moment), and the blog will sort it accordingly in your archives and index page.

Editing your entries

Blogger makes it easy to edit or delete entries. To edit an entry, you use the same basic tools as those for writing a new entry. Proceed like this:
  1. In the Posting tab (control panel), click the Edit posts link.
  2. On the Edit posts page, click the Edit button corresponding to any entry. Use the drop-down menu to select how many published entries appear on this page. You see only the titles, not the full entries.
  3. Alter your entry using the familiar writing and formatting tools. See the preceding section for details on these tools.
  4. Click the Publish Post button.
You have an opportunity to change the date and time of the changed entry. The default setting of the Time & Date menus is the original moment of the entry’s publication. If you’re updating previous information, you can bring the time and date up to the present. If you’re merely correcting a mistake, standard blogging protocol calls for leaving the entry in its original chronological placement.

To delete an entry, go to the Edit Posts page and click the Delete link corresponding to any entry. Blogger displays the post and asks whether you really want to delete it. Click the Delete It button. A deleted post cannot be recovered.


Inserting a photo in an entry

Including a photo in a blog entry is not a problem in Blogger. The photo can come from your computer or from a Web location, the former being more common. Blogger can upload the photo from your machine, store it on its computer, and resize it for display in the blog entry (a big picture could stretch out the entry grotesquely). Visitors who click the resized picture see the full-sized photo on another page. Follow these steps to insert a picture in a Blogger entry:
  1. In your control panel, click the Posting tab.
  2. On the Create page, type an entry title and begin writing your entry. It doesn’t matter when you insert the photo in the entry-writing process. The photo can come first, or the writing, or a portion of the writing.
  3. Click the Add Image icon. To find the Add Image control, run your mouse over the formatting icons. A pop-up window appears to handle your picture selection and upload.
  4. Click the Browse button to select a picture from your computer. If you’re inserting a picture found on the Web, copy that picture’s location (Web address) in the URL box.
  5. In the File Upload window, click the image file you want to insert and then click the Open button.
  6. Back in the Blogger pop-up window, choose a layout by clicking a radio button. The layout selections let you determine whether the picture is positioned to the left of the entry’s text, to the right of the text, or centered above the text.
    With the newer post editor, you can also position images with your mouse. You can easily resize or remove an image with the image size “bubble.” Click the image (Firefox 3 users may need to double-click) to bring up the bubble, and resize the image instantly. You can resize any image, including ones added by URL, but if you resize an image that was uploaded through the post editor, Blogger resamples the image on its servers to keep the download size small. When you upload an image in the new post editor, it appears as a thumbnail in the image dialog box. That way, you can upload several images at once and then add them to your post at your convenience. The thumbnails remain available until you close the post editor.
    When you add an image from the dialog to your post, it is placed at the insertion point instead of at the top of the post.
    If you don’t like where an image is in your post, you can drag it to another spot. If you drag it toward the left side of the editor, it floats to the left; likewise for the right; and if you leave it in the center, it is centered. You can drag the image between paragraphs and other block elements. Unlike in the former editor, dragging in the new editor preserves the link to the full-size version of the image.
  7. Select an image size. The size selections are vague: Small, Medium, or Large. I always use the Small setting, so the picture doesn’t dominate the entry. Readers who click the picture see it in full size on a new page.
  8. Click the Upload Image button.
  9. In the confirmation window, click the Done button. Back on the Create page, you can see your uploaded photo in the entry box. Now is a good time to complete your writing of the entry, if necessary.
  10. Click the Publish Post button. You can use this process to insert multiple pictures in a blog entry.

Personalizing Your Blogger Blog

You’ve made basic settings and you’ve posted at least one entry. Your blog is launched. However, you can do more to make it unmistakably your blog. Blogger personalization consists of
  • choosing a site design (which you already did but can change at any time), 
  • creating a personal profile that’s displayed on a unique page, and 
  • setting a few formats. 
You can also change the title and description of the blog at any time. Let’s get started finding these features.

Switching Blogger templates

Blogger templates determine the color scheme and layout design of your blog. Templates are interchangeable. You can switch from one to another with a few clicks, and the changes ripple out through every page (except the Profile page) of your blog.

Blogger offers a modest selection of templates: currently 16 main designs, each with variations (certainly more than Blogger showed you during sign-up). Follow these steps to see how your blog looks with a whole new design:

  1. In your Blogger control panel, on the Layout page, click the Pick New Template tab.
  2. In the Template tab, click the Pick new link.
  3. On the Pick New page, click any template picture (choose any variant) to see a full-size sample of that template. The sample opens in a new browser window and illustrates what an active blog looks like.
  4. Close the sample browser window. You don’t want it hanging around your screen forever. Leaving it open doesn’t affect the template selection process, but closing it reveals the original browser window that you use to select a template.
  5. Select a template, and then click the Save Template button at the upper right. Your selected template can be the one you just previewed or any other one.
  6. In the pop-up verification window, click OK. This pop-up window warns that if you switch templates, you will lose any customizations to your old (current) template. These customizations are not the settings in the control panel that I have discussed to this point, or those I discuss later in this section. Blogger is referring to customized code inserted into the template’s CSS stylesheets.
  7. Wait while Blogger republishes your blog. Republishing the blog forces the template change to propagate through the entire site. If your blog has many entries, republishing could take a minute or two. New blogs republish in a few seconds.
  8. Click the View Blog link to see your newly designed site.
The question is, how often should you change your template? Frequent changes are disconcerting to regular visitors. But occasional changes keep the site fresh for you, who must look at it more often than anyone else. Every few months, at the most, is an appropriate guide. It doesn’t hurt to ask your readers for feedback before and after the change.

Republish, republish, republish

I am hammering on this point because it is a frequent source of confusion among Blogger users, and I forget about republishing myself sometimes. Here is the point: In Blogger (as in many other blog services), saving changes is not the same as publishing changes. I’m not talking about posting entries here. The changes I’m talking about are in the Settings and Template tabs of the control panel. Those pages allow you to make global changes that affect many, or all, pages of your blog. A three-stage process effects those changes:
  1. Make the changes.
  2. Save the changes.
  3. Republish the blog.
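
If it helps to see the three stages spelled out, here is a toy Python model of the idea that saved settings and published pages are two separate things. The class, its method names, and the template names are invented for illustration; this is not Blogger’s actual code or API.

```python
class ToyBlog:
    """Toy model: settings are saved first; only republishing updates the pages."""

    def __init__(self, template):
        self.pending = {}                        # edits not yet saved
        self.saved = {"template": template}      # what the control panel stores
        self.published = {"template": template}  # what visitors actually see

    def change(self, key, value):
        self.pending[key] = value                # stage 1: make the change

    def save(self):
        self.saved.update(self.pending)          # stage 2: save the change
        self.pending.clear()

    def republish(self):
        self.published = dict(self.saved)        # stage 3: republish the blog


blog = ToyBlog("old-design")
blog.change("template", "new-design")
blog.save()
print(blog.published["template"])  # visitors still see the old design here
blog.republish()
print(blog.published["template"])  # now the saved change is visible
```

The first print still shows the old design even though the change was saved, which is exactly the confusion described above: nothing reaches your readers until the republish step runs.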

Building your Blogger profile

Perhaps the most important personalization work you can do on your Blogger blog is creating a profile. The Blogger profile is a page dedicated to who you are and what you like. The profile page is attached to your blog but resides outside the chronological organization of the blog. Your profile never appears on the blog’s index page. Visitors view your profile by clicking the View my complete profile link in the blog’s sidebar.

The profile can be as complete or sketchy as you want, within the options provided by Blogger. A profile setup page is where you determine what information about you appears. You can make it easy, or not so easy, for people to contact you. Specific and direct contact information, such as your address or phone number, is not a profile option. Profile options are not part of the main control panel; perhaps you have already noticed that there is no Profile tab in the control panel. This confusing wrinkle might be why many bloggers in the system have blank profiles. You access your profile setup page through the Blogger Dashboard. Here’s the step-by-step for finding and setting your profile choices, starting from scratch:
  1. Go to Blogger.com and sign in to your blog account.
  2. On the Dashboard page, click the Edit Profile link. The link is in the right sidebar.
  3. On the Edit User Profile page, fill in the options you’d like to appear on your profile page. You need to scroll down to see some options.
  4. Click the Save Profile button. It’s at the bottom of the page.
  5. Click the View Updated Profile link to see your profile page. You can amend and alter your profile as much as you like, whenever you like.
An interesting option on the Edit User Profile page might give you pause. It also might give you indigestion, because Blogger does not make the feature particularly easy(in old days). I’m talking about the feature that lets you add a picture to your profile. The next section clarifies how to do that.

Adding a photo to your Blogger profile (old method)

Blogger lets you put a photo in your profile, but makes it outrageously complicated. Most blogging services provide a Browse button for finding a photo in your computer and adding it to your profile. Perhaps you’ve seen those buttons in certain sites. Blogger puts that button in other places. Yet Blogger persists in tormenting its users with a convoluted method of adding a profile photo.

Here is the problem. Notice, on the Edit User Profile page, that Blogger offers to include a photo, but only if it is already on the Web. The option asks for a “Photo URL.” That means Blogger wants the Web address of a photo that has already been posted online. Quite likely, you have a photo of yourself on your hard drive and have never before uploaded that photo to a Web site. (If you have posted the photo on another Web site, plug the URL into this option and be done with it.) This is where Blogger should put the Browse button, enabling you to locate the photo in your computer and upload it to your Blogger profile. But noooo. Enough complaining. There is a way around Blogger’s cumbersome and user-hostile insufficiency. I intend to make the profile-photo process crystal clear.

If you are familiar with HTML tags, you’ll have no trouble with this. If not, don’t back away. Take this process one step at a time, and you’ll complete it with surprising ease. Here we go:
  1. On the Edit User Profile page, click the Dashboard link. You need to back out of the Edit User Profile page and return to the Dashboard — your starting point when first signing in to your Blogger account. You can get there also by clicking the Home link at the bottom of the Edit User Profile page.
  2. On the Dashboard page, click the New Post icon. As you can see, adding a profile photo has something to do with creating a blog entry. We are going to use the Create page to upload a photo from your computer to Blogger. This tactic is roundabout, but effective.
  3. On the Create page (in the Posting tab), click the Add Image icon. Clicking causes a pop-up window to appear; the same window you used to insert a picture in a blog entry (see the preceding section).
  4. Using the Browse button, select an image from your computer. At last, there is the Browse button. Better here than nowhere. At this point you are following the same steps as those when choosing a picture to insert in a blog entry. The difference this time is that you are not going to actually post the blog entry. We are going through this charade to upload the picture and get it into your profile.
  5. After selecting your photo and layout, click the Upload Image button. It doesn’t matter which photo display size you select — small, medium, or large. Remember, you are not going to actually post this entry. Whichever setting you choose, the full-size photo is uploaded from your computer to Blogger.
    One important note: The Blogger profile does not accept photo files larger than 50 kilobytes. That is a small file and an inexplicable limitation. Photo files in blog entries can be any size, but not so with the profile photo. Using Windows Explorer, right-click your photo file and select Properties to check the file size. This size has nothing to do with the size selections (small, medium, or large) in the uploading window. Blogger always uploads the full file size, even when altering the display size (small, medium, or large).
  6. In the confirmation window, click the Done button.
  7. Back on the Create page, click the Edit Html tab. Before clicking this tab, you can see your uploaded photo in the entry writing box. When you click the Edit Html tab, that photo disappears and is replaced by a few lines of code.
  8. Highlight the URL address of your photo. This step is the trickiest part, so read carefully. In most cases, the photo URL appears twice, on the second and fourth lines of code — but different screens show the code in different ways. You are looking for a Web address that begins with http:// and is entirely enclosed in quotation marks. If your photo is in the common JPG format, the final part of the address is .jpg. Highlight the entire address but don’t highlight the quotation marks.
  9. Use the Ctrl+C keyboard command to copy the highlighted address to the Windows clipboard.
  10. Click the Dashboard link near the upper-right corner of the page. You are finished with the Create page; you have what you came for, namely an uploaded photo and its address. Now you can abandon the entry. (The entry, not you.)
  11. On the Dashboard page, click the Edit Profile link.
  12. On the Edit User Profile page, use the Ctrl+V keyboard combination to paste the photo URL into the Photo URL box.
  13. Scroll down and click the Save Profile button.
  14. Click the View Updated Profile link to see your newly photo-enhanced Blogger profile.
The photo appears in your blog sidebar under the About Me heading, in addition to appearing on your profile page. Blogger automatically sizes the photo to fit in the sidebar. A small version is displayed on the profile page, too; when a visitor clicks that picture, a full-sized version is displayed on a new page.
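
Steps 7 through 9 amount to pulling a quoted address out of a few lines of HTML. For readers comfortable with a little scripting, here is a minimal Python sketch of the same idea. The sample HTML and its example.com address are made up for illustration, and the code Blogger actually generates may look different. The sketch also checks the 50-kilobyte limit from Step 5, assuming a kilobyte means 1,024 bytes.

```python
from html.parser import HTMLParser

MAX_PROFILE_PHOTO_BYTES = 50 * 1024  # assumption: 1 kilobyte = 1,024 bytes


class PhotoUrlFinder(HTMLParser):
    """Collects every http:// address found in href and src attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value and value.startswith("http://"):
                self.urls.append(value)


def find_photo_url(entry_html):
    """Return the first address ending in .jpg, or None if there is none."""
    finder = PhotoUrlFinder()
    finder.feed(entry_html)
    for url in finder.urls:
        if url.lower().endswith(".jpg"):
            return url
    return None


def fits_profile_limit(size_in_bytes):
    """True if a file of this size is within the 50-kilobyte profile limit."""
    return size_in_bytes <= MAX_PROFILE_PHOTO_BYTES


# Hypothetical entry code; the address appears twice, as the steps describe.
sample_html = (
    '<a href="http://photos.example.com/me.jpg">\n'
    '<img src="http://photos.example.com/me.jpg" border="0" /></a>'
)

print(find_photo_url(sample_html))
```

The parser returns the address without its quotation marks, which is exactly the part you would highlight and copy by hand.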

Audioblogging in Blogger

Believe it or not, you can put your actual voice right into a Blogger entry. Doing so is free to all Blogger users and fairly easy. Blogging in audio is called audioblogging, and Blogger uses a service called Audioblogger. Once you get the hang of putting an audio message in a blog entry, you can also place one in your profile page. How does it all work?
  1. First you establish an account with Audioblogger (again, it’s free).
  2. Then you call a phone number to record your voice entry.
Your recording, up to five minutes in length, is automatically posted to your blog within seconds after you hang up. The entry consists of an audioblogging icon; visitors click the icon to hear your recording. The audio file is recorded in MP3 format; to hear it, a visitor must have MP3-playing software on his or her computer. (Such software is installed on nearly all computers built in the last several years.) When someone clicks the audio file icon in your blog entry, that software opens and plays your voice entry. Some people use audioblogging as their main, or sole, type of blog post. Others use it occasionally and support each audio entry with written text. You are free to experiment. You might use it only once, or you might fall in love with blogging in this manner. Follow these steps to set up a free Audioblogger account:
  1. Go to Audioblogger.
  2. Click the Start Audioblogging Now button.
  3. Log in with your Blogger username and password, and then click the Continue button.
  4. Select the blog in which you want to audioblog from the drop-down menu, and then click the Continue button. If you have just one Blogger blog, the selection is obvious.
  5. Enter your phone number, type a four-digit identification number, and then click the Finish Setup button. Entering a phone number does not restrict you to that number when audioblogging. But you must remember that number as an identifier, even though you also create a four-digit ID number. Consider the username plus password combination required in most Web site registrations. In Audioblogger, you can think of your phone number as your username and the four-digit number as your password. You might have to wait a few minutes while Audioblogger completes your setup.
At the end of this setup, you might see a confirmation screen, and you might see a page of incomprehensible gibberish. No fooling. But even if you see gibberish, chances are good that your Audioblogger account was set up satisfactorily. Try an audio post before you repeat the setup process. To make an audio entry, call the Audioblogger dial-in number: 1-415-856-0205. Use any phone, and have your entered phone number handy. Audioblogger’s robotic assistant asks for that phone number as a kind of username, and then asks for your PIN as a kind of password. In both cases, you use the phone’s keypad to enter the numbers. When robot-guy cues you, speak your blog entry, and then press the phone’s pound key (#) to end the message.

If you have a cell phone, put the Audioblogger phone number in your phone’s memory. Then you can post an audio entry from anywhere, at any time, without having to recall the number.

Your audio entries are editable just as your written entries are. The audio entries appear on the Edit Posts page along with text entries. You cannot edit the audio, however. You can add a title — I try to do this as soon as possible, because audio entries get posted without titles. And you can add text that explains or enhances the audio.

You can add an audio file to your Blogger profile page. Doing so is no easier than jumping through hoops to add a photo — as I describe in the preceding section. In fact, the process is pretty much the same:
  1. Create an audio entry using Audioblogger. Your recording should be appropriate for the profile page — perhaps a brief welcoming message divulging a bit about you or the blog.
  2. In your Blogger control panel, click the Posting tab.
  3. Click the Edit posts link.
  4. Click the Edit button corresponding to your newly created audio entry.
  5. Click the Edit Html tab.
  6. Highlight the audio file’s URL. The URL address ends in .mp3 and is enclosed in quotation marks.
  7. Press Ctrl+C to copy the address.
  8. Click the Dashboard link near the upper-right corner of the page.
  9. Click the Edit Profile link.
  10. Press Ctrl+V to paste the address into the Audio Clip URL box.
  11. Scroll down and click the Save Profile button.
Your audio clip is now featured as a playable link and icon on your profile page. It does not play automatically when a visitor lands on your profile page; the visitor must click it, just as with an audio blog entry.

Audioblogger and Audioblog

Two audio blogging services are named similarly, and operate similarly, so they are sometimes confused. Audioblogger is paired with Blogger.com, and is free to Blogger users. Audioblog is an independent service that charges a monthly fee for phone-in recording and posting to blogs on any platform. You can use Audioblog with Blogger, but you must pay that monthly fee ($4.95). One advantage of Audioblog over Audioblogger is that it creates MOV files instead of MP3 files, and posts the MOV files in a miniature player right in the blog entry. Most browsers have no trouble playing these audio entries without the need for another program to open. That seamless performance is a convenience to your readers. If you intend to audioblog seriously, the somewhat more sophisticated service offered by Audioblog is worth considering.

E-mailing Entries to Blogger

Blogger has developed a beautifully simple method of posting entries by e-mail. Using this feature means you can avoid logging into your Blogger account to write a post. This feature is convenient for people who keep their e-mail running on the computer screen all the time. Because e-mail is always handy, posting through that program or Web interface is quicker than signing in to Blogger.

The service is called Mail-to-Blogger, and it works only with text messages, not pictures (in the old days, at least). You simply create your own personalized e-mail address on Blogger’s computer, and Blogger assigns all mail received at that address to your blog. When you send an e-mail, Blogger makes the e-mail title a blog entry title, and the body of the e-mail becomes the entry text. Follow these steps to set it up:
  1. In your Blogger control panel, click the Settings tab.
  2. In the Settings tab, click the Email & Mobile link.
  3. On the Email page, create your special address in the Mail2Blogger Address box. You need to think of just one word, because most of the address is set and unchanging. The first part of the address is your account username; the last part of the address is the blogger.com domain. You need to put a word in between, as shown on the page, and it’s best to use a word that you easily remember — perhaps blog. Or entry. Notice the BlogSend Address box. Put an e-mail address in there if you want to be notified of your mailed entry being posted.
  4. Click the Save Settings button. You do not need to republish the blog.
  5. In your e-mail program or interface, address an e-mail to your special e-mail address, title and compose the e-mail, and send it. The body of your e-mail becomes the text of your blog entry.
  6. Check your blog for the entry’s appearance. It can take up to a minute for the mailed entry to be posted.
You can use the Mail-to-Blogger feature from any connected computer and e-mail program in the world. It doesn’t even have to be your e-mail; Blogger doesn’t know or care who owns the e-mail account sending the entry to your special address. That’s why it’s important to keep your special address secret. Anyone who knows it can post to your blog. If that happens, change the secret word on the Settings, Email page.
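
To make the mapping concrete, here is a small Python sketch that builds (but does not send) a message of the kind Mail-to-Blogger expects. The username janedoe, the secret word entry, and the dot that joins them in the address are assumptions made for illustration; use the exact address shown on your own Settings page.

```python
from email.message import EmailMessage


def build_blog_email(username, secret_word, title, body):
    """Build, without sending, an e-mail that Mail-to-Blogger could post."""
    message = EmailMessage()
    # Assumed address layout: username, a dot, the secret word, then the domain.
    message["To"] = f"{username}.{secret_word}@blogger.com"
    message["Subject"] = title   # the e-mail title becomes the entry title
    message.set_content(body)    # the plain-text body becomes the entry text
    return message


entry = build_blog_email(
    "janedoe", "entry", "Hello from e-mail", "Posted without signing in."
)
print(entry["To"])
print(entry["Subject"])
```

Actually sending the message would take a mail server and Python’s smtplib module, which this sketch leaves out. Notice that nothing in the message proves who you are: the address alone is the key, which is why the secret word matters.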

On the Road with Blogger

Mobile blogging, or moblogging, is supported nicely in Blogger Mobile. Blogger Mobile is a free service included in a standard free Blogger account; you don’t have to sign up for it separately. You must, however, go through a set of steps to get it working for you; once through those steps, moblogging to Blogger is easy. You can take a cell phone that has a built-in camera (a camera phone) and Internet connectivity, snap a picture, send the picture to Blogger, and have it posted to your blog — all in seconds. You can send text, too, or a mix of text and a picture.

Blogger Mobile uses an unusual but effective method of starting a mobile blog path for each user. You can start a new blog in your account for mobile posts, or you can assign mobile posts to an existing blog. (I step you through the exact process here.)

First, you send a picture from your cell phone to a generic, public e-mail address:
go@blogger.com
This e-mail address rejects incoming pictures from non-cell locations. In other words, you can’t use your normal computer e-mail address (such as an AOL, Hotmail, Yahoo! Mail, Gmail, Comcast, or Earthlink address) to remotely post pictures through go@blogger.com. Blogger Mobile works with entries sent from Verizon, AT&T, Cingular, Sprint, or T-Mobile accounts. When go@blogger.com receives your first mobile post, it instantly creates a blog for it. That’s right — Blogger builds an entire blog around one entry and assigns it a blogspot.com Web address.

At this point, the new blog doesn’t belong to anybody, even though it displays your entry. You have to claim the new blog to your account, either keeping it as a separate blog or assigning the entry (and future mobile entries) to an existing blog in your account. Blogger Mobile sends a claim token to the cell phone that sent the entry — that would be your cell phone. You enter the claim token on a special Blogger Mobile page and take control of the new blog. At that moment, you can make the new blog disappear and put the first entry into an existing blog or keep the new blog and rename it.

That’s the general process; here are the specific steps:
  1. Send a text message or picture to go@blogger.com. Check your camera phone’s user manual to find out how to send pictures to e-mail addresses. Within a minute (at most) of sending, Blogger Mobile sends back a text message containing a claim token. That message is sent to your cell phone, not to your computer e-mail.
  2. Read the text message sent to your cell phone by Blogger, and write down the claim token. The claim token is a short string of letters and numbers. Blogger also sends the Web address of the new blog created around your first entry; you can visit the blog but can’t access its controls until you claim it as yours.
  3. In your computer browser, go to go.blogger.com.
  4. Type your token in the Claim Token box, type the correct word in the Verification box, and then click the Continue button. The Mobile Blog Found page appears, showing your Blogger sign-in username.
  5. Click the Continue as this user link.
  6. On the Claim Mobile Blog page, choose to keep the new blog or switch to your existing blog, and then click the Continue button. In this example, I am keeping the new blog, not switching. If you do switch, your mobile post is transferred to the existing blog, and future mobile posts go to the existing blog. Of course, you have access to that blog’s controls. Keeping the new blog (not switching) separates mobile posts from regular posts, for better or worse.
  7. On the Name Your Blog page, select a blog title and domain address, and then click the Continue button. Naming and addressing the blog should remind you of how you first started in Blogger. Name the blog whatever you want. The address will be something like mymobilepics.blogspot.com. You type only the first part of that address; blogspot.com is the default and unchangeable domain.
  8. On the Choose a Template page, click the radio button below your selected template design, and then click the Continue button.
  9. On the You’re Done page, click the View Blog Now button. You’re off and running with a new mobile blog.
Now that your mobile blogging setup is complete, you can continue sending pictures, text, and pictures-plus-text entries from your camera phone to go@blogger.com. Blogger Mobile knows that incoming entries from your phone belong to your Blogger account, and directs them to the blog you selected.


to be continued...

Resources


Blogging For Dummies
by Brad Hill (2006)
ISBN-10: 0-471-77084-1

25 November 2009

The Invisible Web


The world is full of mostly invisible things,

And there is no way but putting the mind’s eye,
Or its nose, in a book, to find them out.
—Howard Nemerov




Intro


Internet search engines, not readily available to the general public until the mid-1990s, have in a few short years made themselves part of our everyday lives. It’s hard to imagine going about our daily routines without them. Indeed, one study from the Fall of 2000 on how people seek answers found that search engines were the top information resource consulted, used nearly 1/3 of the time.

Of course, it’s common to hear gripes about search engines. Almost like bad weather, our failures in locating information with them provide a common experience that everyone can commiserate with. Such complaints overlook the fact that we do indeed tend to find what we are looking for most of the time with search engines. If not, they would have long been consigned to the Internet’s recycle bin and replaced with something better. Nevertheless, it is the search failures that live in our memories, not the successes. “What a stupid search engine! How could it not have found that?” we ask ourselves. The reasons why are multifold.

  • Sometimes we don’t ask correctly, and the search engine cannot interrogate us to better understand what we want.
  • Sometimes we use the wrong search tool, for example, looking for current news headlines on a general-purpose Webwide search engine. It’s the cyberspace equivalent of trying to drive a nail into a board with a screwdriver. Use the right tool, and the job is much easier.
  • Sometimes the information isn’t out there at all, and so a search engine simply cannot find it. Despite the vast resources of the World Wide Web, it does not contain the answers to everything. During such times, turning to information resources such as books and libraries, which have served us valiantly for hundreds of years, may continue to be the best course of action.
  • Of course, sometimes the information is out there but simply hasn’t been accessed by search engines. Web site owners may not want their information to be found. Web technologies may pose barriers to search engine access. Some information simply cannot be retrieved until the right forms are processed. These are all examples of information that is essentially “invisible” to search engines, and if we had a means to access this “Invisible (or Deep) Web” then we might more readily find the answers we are looking for.
The good news is that the Invisible Web is indeed accessible to us, though we might need to look harder to find it. Though we can’t see it easily, there’s nothing to fear from the Invisible Web and plenty to gain from discovering it.

If the Web has become an integral part of your daily life, you probably enjoy search engines and Web directories; these pathfinders are crucial guides that help you navigate through an exploding universe of constantly changing information. Yet you also hate them, because all too often they fail miserably at answering even the most basic questions or satisfying the simplest queries. They waste your time, they exasperate and frustrate, even provoking an extreme reaction, known as “Web rage,” in some people. It’s fair to ask, “What’s the problem here? Why is it so difficult to find the information I’m looking for?”

The problem is that vast expanses of the Web are completely invisible to general-purpose search engines like AltaVista, HotBot, and Google. Even worse, this “Invisible Web” is in all likelihood growing significantly faster than the visible Web that you’re familiar with. It’s not that the search engines and Web directories are “stupid” or even badly engineered. Rather, they simply can’t “see” millions of high-quality resources that are available exclusively on the Invisible Web.

So what is this Invisible Web and why aren’t search engines doing anything about making it visible? Good question. There is no dictionary definition for the Invisible Web. Several studies have attempted to map the entire Web, including parts of what we call the Invisible Web. To our knowledge, however, there is little consensus among the professional Web search community regarding the cartography of the Invisible Web.

Many people—even those “in the know” about Web searching—make many assumptions about the scope and thoroughness of the coverage by Web search engines that are simply untrue. In a nutshell, the Invisible Web consists of material that general purpose search engines either cannot or, perhaps more importantly, will not include in their collections of Web pages (called indexes or indices). The Invisible Web contains vast amounts of authoritative and current information that’s accessible to you, using your Web browser or add-on utility software—but you have to know where to find it ahead of time, since you simply cannot locate it using a search engine like HotBot or Lycos.

Why? There are several reasons. One is technical: search engine technology is actually quite limited in its capabilities, despite its tremendous usefulness in helping searchers locate text documents on the Web. Another reason relates to the costs involved in operating a comprehensive search engine. It’s expensive for search engines to locate Web resources and maintain up-to-date indices. Search engines must also cope with unethical Web page authors who seek to subvert their indexes with millions of bogus “spam” pages that, like their unsavory e-mail kin, are either junk or offer deceptive or misleading information. Most of the major engines have developed strict guidelines for dealing with spam, which sometimes has the unfortunate effect of excluding legitimate content. These are just a few of the reasons the Invisible Web exists.

The bottom line for the searcher is that understanding the Invisible Web and knowing how to access its treasures can save both time and frustration, often yielding high-quality results that aren’t easily found any other way.


The paradox of the Invisible Web is that it’s easy to understand why it exists, but it’s very hard to actually define in concrete, specific terms. In a nutshell, the Invisible Web consists of content that’s been excluded from general-purpose search engines and Web directories such as Lycos and LookSmart. There’s nothing inherently “invisible” about this content. But since this content is not easily located with the information-seeking tools used by most Web users, it’s effectively invisible because it’s so difficult to find unless you know exactly where to look.

The visible Web is easy to define. It’s made up of HTML Web pages that the search engines have chosen to include in their indices. It’s no more complicated than that. The Invisible Web is much harder to define and classify for several reasons.
  • First, many Invisible Web sites are made up of straightforward Web pages that search engines could easily crawl and add to their indices, but do not, simply because the engines have decided against including them. This is a crucial point: much of the Invisible Web is hidden because search engines have deliberately chosen to exclude some types of Web content. We’re not talking about unsavory “adult” sites or blatant spam sites. Quite the contrary! Many Invisible Web sites are first-rate content sources. These exceptional resources simply cannot be found by using general-purpose search engines because they have been effectively locked out. There are a number of reasons for these exclusionary policies. But keep in mind that should the engines change their policies in the future, sites that today are part of the Invisible Web will suddenly join the mainstream as part of the visible Web.
  • Second, it’s relatively easy to classify some sites as either visible or Invisible based on the technology they employ. Some sites using database technology, for example, are genuinely difficult for current generation search engines to access and index. These are “true” Invisible Web sites. Other sites, however, use a variety of media and file types, some of which are easily indexed, and others that are incomprehensible to search engine crawlers. Web sites that use a mixture of these media and file types aren’t easily classified as either visible or Invisible. Rather, they make up what we call the “opaque” Web.
  • Finally, search engines could theoretically index some parts of the Invisible Web, but doing so would simply be impractical, either from a cost standpoint, or because data on some sites is ephemeral and not worthy of indexing—for example, current weather information, moment-by-moment stock quotes, airline flight arrival times, and so on.
Now we define the Invisible Web, and delve into the reasons search engines can’t “see” its content. We also discuss the four different “types” of invisibility, ranging from the “opaque” Web, which is relatively accessible to the searcher, to the truly invisible Web, which requires specialized finding aids to access effectively.


Invisible Web Defined


The definition given above is deliberately very general, because the general-purpose search engines are constantly adding features and improvements to their services. What may be invisible today may become visible tomorrow, should the engines decide to add the capability to index things that they cannot or will not currently index.

Let’s examine the two parts of our definition in more detail. First, we’ll look at the technical reasons search engines can’t index certain types of material on the Web. Then we’ll talk about some of the other non-technical but very important factors that influence the policies that guide search engine operations. At their most basic level, search engines are designed to index Web pages. Search engines use programs called crawlers to find and retrieve Web pages stored on servers all over the world. From a Web server’s standpoint, it doesn’t make any difference if a request for a page comes from a person using a Web browser or from an automated search engine crawler. In either case, the server returns the desired Web page to the computer that requested it.

A key difference between a person using a browser and a search engine crawler is that the person is able to manually type a URL into the browser window and retrieve that Web page. Search engine crawlers lack this capability. Instead, they’re forced to rely on links they find on Web pages to find other pages. If a Web page has no links pointing to it from any other page on the Web, a search engine crawler can’t find it. These “disconnected” pages are the most basic part of the Invisible Web. There’s nothing preventing a search engine from crawling and indexing disconnected pages—there’s simply no way for a crawler to discover and fetch them.

Disconnected pages can easily leave the realm of the Invisible and join the visible Web in one of two ways.
  1. First, if a connected Web page links to a disconnected page, a crawler can discover the link and spider the page.
  2. Second, the page author can request that the page be crawled by submitting it to search engine “add URL” forms.
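
The link-following behavior described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the page names and the LINKS table are invented, and a real crawler fetches pages over the network rather than reading a dictionary. Still, it shows why a page with no inbound links is never discovered, and how an “add URL” submission changes that.

```python
from collections import deque

# Hypothetical miniature Web: each page maps to the pages it links to.
LINKS = {
    "home.html": ["about.html", "news.html"],
    "about.html": ["home.html"],
    "news.html": ["archive.html"],
    "archive.html": [],
    "orphan.html": ["home.html"],  # links out, but nothing links to it
}


def crawl(link_graph, seed_pages):
    """Return the set of pages reachable by following links from the seeds."""
    discovered = set(seed_pages)
    queue = deque(seed_pages)
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in discovered:
                discovered.add(target)
                queue.append(target)
    return discovered


found = crawl(LINKS, ["home.html"])
print(sorted(found))
print("orphan.html" in found)

# Submitting the page through an "add URL" form acts like an extra seed.
found_after_submit = crawl(LINKS, ["home.html", "orphan.html"])
print("orphan.html" in found_after_submit)
```

The first crawl never reaches orphan.html even though the page exists and is perfectly readable; the second crawl finds it only because it was handed to the crawler directly.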
Technical problems begin to come into play when a search engine crawler encounters an object or file type that’s not a simple text document. Search engines are designed to index text, and are highly optimized to perform search and retrieval operations on text. But they don’t do very well with non-textual data, at least in the current generation of tools. Some engines, like AltaVista and HotBot, can do limited searching for certain kinds of non-text files, including images, audio, or video files. But the way they process requests for this type of material is reminiscent of early Archie searches, typically limited to a filename or the minimal alternative (ALT) text that’s sometimes used by page authors in the HTML image tag. Text surrounding an image, sound, or video file can give additional clues about what the file contains. But keyword searching with images and sounds is a far cry from simply telling the search engine to “find me a picture that looks like Picasso’s Guernica” or “let me hum a few bars of this song and you tell me what it is.” Pages that consist primarily of images, audio, or video, with little or no text, make up another type of Invisible Web content. While the pages may actually be included in a search engine index, they provide few textual clues as to their content, making it highly unlikely that they will ever garner high relevance scores. Researchers are working to overcome these limitations.
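
As a rough illustration of how little a text-based engine has to work with, the Python sketch below gathers the only two clues mentioned above for each image: its filename and its ALT text. The sample page and its example.com addresses are invented, and no real engine’s indexing code is being reproduced here.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImageClueCollector(HTMLParser):
    """Records (filename, ALT text) for every image tag on a page."""

    def __init__(self):
        super().__init__()
        self.clues = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        filename = posixpath.basename(urlparse(src).path)
        alt_text = attributes.get("alt") or ""
        self.clues.append((filename, alt_text))


def image_clues(page_html):
    """Return the textual clues a keyword-based engine could index."""
    collector = ImageClueCollector()
    collector.feed(page_html)
    return collector.clues


def matches(keyword, clue):
    """True if the keyword appears in the filename or the ALT text."""
    filename, alt_text = clue
    keyword = keyword.lower()
    return keyword in filename.lower() or keyword in alt_text.lower()


# Hypothetical page with two images; only the first carries useful text.
page = (
    '<img src="http://art.example.com/pics/guernica.jpg" alt="Picasso mural">'
    '<img src="http://art.example.com/pics/img0042.jpg">'
)

clues = image_clues(page)
print(clues)
print([matches("picasso", clue) for clue in clues])
```

The second image might show the very same painting, but with a meaningless filename and no ALT text, a keyword search has nothing to match against.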

While search engines have limited capabilities to index pages that are primarily made up of images, audio, and video, they have serious problems with other types of non-text material. Most of the major general-purpose search engines simply cannot handle certain types of formats. These formats include:
  • PDF or Postscript (Google excepted)
  • Flash
  • Shockwave
  • Executables (programs)
  • Compressed files (.zip, .tar, etc.)
The problem with indexing these files is that they aren’t made up of HTML text. Technically, most of the formats in the list above can be indexed. The search engines choose not to index them for business reasons.
  • For one thing, there’s much less user demand for these types of files than for HTML text files.
  • These formats are also “harder” to index, requiring more computing resources. For example, a single PDF file might consist of hundreds or even thousands of pages.
  • Indexing non-HTML text file formats tends to be costly.
Pages consisting largely of these “difficult” file types currently make up a relatively small part of the Invisible Web. However, we’re seeing a rapid expansion in the use of many of these file types, particularly for some kinds of high-quality, authoritative information. For example, to comply with federal paperwork reduction legislation, many U.S. government agencies are moving to put all of their official documents on the Web in PDF format. Most scholarly papers are posted to the Web in Postscript or compressed Postscript format. For the searcher, Invisible Web content made up of these file types poses a serious problem. We discuss a partial solution to this problem later.

The biggest technical hurdle search engines face lies in accessing information stored in databases. This is a huge problem, because there are thousands—perhaps millions—of databases containing high-quality information that are accessible via the Web.
Web content creators favor databases because they offer flexible, easily maintained development environments. And increasingly, content-rich databases from universities, libraries, associations, businesses, and government agencies are being made available online, using Web interfaces as front-ends to what were once closed, proprietary information systems.
Databases pose a problem for search engines because every database is unique in both the design of its data structures, and its search and retrieval tools and capabilities. Unlike simple HTML files, which search engine crawlers can simply fetch and index, content stored in databases is trickier to access, for a number of reasons that we’ll describe in detail here.

Search engine crawlers generally have no difficulty finding the interface or gateway pages to databases, because these are typically pages made up of input fields and other controls. These pages are formatted with HTML and look like any other Web page that uses interactive forms. Behind the scenes, however, are the knobs, dials, and switches that provide access to the actual contents of the database, which are literally incomprehensible to a search engine crawler.

Although these interfaces provide powerful tools for a human searcher, they act as roadblocks for a search engine spider. Essentially, when an indexing spider comes across a database, it’s as if it has run smack into the entrance of a massive library with securely bolted doors. A crawler can locate and index the library’s address, but because the crawler cannot penetrate the gateway it can’t tell you anything about the books, magazines, or other documents it contains.

These Web-accessible databases make up the lion’s share of the Invisible Web. They are accessible via the Web, but may or may not actually be on the Web (see Table 4.1). To search a database you must use the powerful search and retrieval tools offered by the database itself. The advantage to this direct approach is that you can use search tools that were specifically designed to retrieve the best results from the database. The disadvantage is that you need to find the database in the first place, a task the search engines may or may not be able to help you with.

There are several different kinds of databases used for Web content, and it’s important to distinguish between them. Just because Web content is stored in a database doesn’t automatically make it part of the Invisible Web. Indeed, some Web sites use databases not so much for their sophisticated query tools, but rather because database architecture is more robust and makes it easier to maintain a site than if it were simply a collection of HTML pages.
  • One type of database is designed to deliver tailored content to individual users. Examples include My Yahoo!, Personal Excite, Quicken.com’s personal portfolios, and so on. These sites use databases that generate “on the fly” HTML pages customized for a specific user. Since this content is tailored for each user, there’s little need to index it in a general-purpose search engine.
  • A second type of database is designed to deliver streaming or realtime data—stock quotes, weather information, airline flight arrival information, and so on. This information isn’t necessarily customized, but is stored in a database due to the huge, rapidly changing quantities of information involved. Technically, much of this kind of data is indexable because the information is retrieved from the database and published in a consistent, straight HTML file format. But because it changes so frequently and has value for such a limited duration (other than to scholars or archivists), there’s no point in indexing it. It’s also problematic for crawlers to keep up with this kind of information. Even the fastest crawlers revisit most sites monthly or even less frequently. Staying current with real-time information would consume so many resources that it is effectively impossible for a crawler.
  • The third type of Web-accessible database is optimized for the data it contains, with specialized query tools designed to retrieve the information using the fastest or most effective means possible. These are often “relational” databases that allow sophisticated querying to find data that is “related” based on criteria specified by the user. The only way of accessing content in these types of databases is by directly interacting with the database. It is this content that forms the core of the Invisible Web. Let’s take a closer look at these elements of the Invisible Web, and demonstrate exactly why search engines can’t or won’t index them.


Why Search Engines Can’t See the Invisible Web
Text—more specifically hypertext—is the fundamental medium of the Web. The primary function of search engines is to help users locate hypertext documents of interest. Search engines are highly tuned and optimized to deal with text pages, and even more specifically, text pages that have been encoded with the HyperText Markup Language (HTML). As the Web evolves and additional media become commonplace, search engines will undoubtedly offer new ways of searching for this information. But for now, the core function of most Web search engines is to help users locate text documents.

HTML documents are simple. Each page has two parts: a “head” and a “body,” which are clearly separated in the source code of an HTML page.
  • The head portion contains a title, which is displayed (logically enough) in the title bar at the very top of a browser’s window. The head portion may also contain some additional metadata describing the document, which can be used by a search engine to help classify the document. For the most part, other than the title, the head of a document contains information and data that help the Web browser display the page but is irrelevant to a search engine.
  • The body portion contains the actual document itself. This is the meat that the search engine wants to digest.
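The head and body split described above can be sketched with Python's standard `html.parser` module. This is only an illustration of the idea, not how any particular search engine works: the class name `HeadBodyParser` and the sample page are made up for the example.

```python
from html.parser import HTMLParser


class HeadBodyParser(HTMLParser):
    """Collects the <title> text and the words found inside <body>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.body_words = []
        self._open = []  # the tags we care about that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "body"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        if "title" in self._open:
            self.title += data.strip()
        elif "body" in self._open:
            # Body text is the part an indexer would want to digest.
            self.body_words.extend(data.lower().split())


# Hypothetical page, used only to exercise the parser.
page = """<html><head><title>Dr. Toy's Guide</title>
<meta name="description" content="Toy information"></head>
<body><h1>Award winning toys</h1><p>Over 1,000 toys described.</p></body></html>"""

parser = HeadBodyParser()
parser.feed(page)
parser.close()
print(parser.title)
print(parser.body_words)
```

Note that the metadata in the head lives in tag attributes, so it never reaches the list of body words; a real indexer would decide separately whether to use it.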
The simplicity of this format makes it easy for search engines to retrieve HTML documents, index every word on every page, and store them in huge databases that can be searched on demand. Problems arise when content doesn’t conform to this simple Web page model. To understand why, it’s helpful to consider the process of crawling and the factors that influence whether a page either can or will be successfully crawled and indexed.
  • The first determination a crawler attempts to make is whether access to pages on a server it is attempting to crawl is restricted. Webmasters can use three methods to prevent a search engine from indexing a page. Two methods use blocking techniques specified in the Robots Exclusion Protocol that most crawlers voluntarily honor, and one creates a technical roadblock that cannot be circumvented. The Robots Exclusion Protocol is a set of rules that enables a Webmaster to specify which parts of a server are open to search engine crawlers, and which parts are off-limits. The Webmaster simply creates a list of files or directories that should not be crawled or indexed, and saves this list on the server in a file named robots.txt. This optional file, stored by convention at the top level of a Web site, is nothing more than a polite request to the crawler to keep out, but most major search engines respect the protocol and will not index files specified in robots.txt.

  • The second means of preventing a page from being indexed works in the same way as the robots.txt file, but is page-specific. Webmasters can prevent a page from being crawled by including a “noindex” meta tag instruction in the “head” portion of the document. Either robots.txt or the noindex meta tag can be used to block crawlers. The only difference between the two is that the noindex meta tag is page specific, while the robots.txt file can be used to prevent indexing of individual pages, groups of files, or even entire Web sites.

  • Password protecting a page is the third means of preventing it from being crawled and indexed by a search engine. This technique is much stronger than the first two because it uses a technical barrier rather than a voluntary standard. Why would a Webmaster block crawlers from a page using the Robots Exclusion Protocol rather than simply password protecting the pages? Password-protected pages can be accessed only by the select few users who know the password. Pages excluded from engines using the Robots Exclusion Protocol, on the other hand, can be accessed by anyone except a search engine crawler. The most common reason Webmasters block pages from indexing is that their content changes so frequently that the engines cannot keep up.
Pages using any of the three methods described here are part of the Invisible Web. In many cases, they contain no technical roadblocks that prevent crawlers from spidering and indexing the page. They are part of the Invisible Web because the Webmaster has opted to keep them out of the search engines.
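The two voluntary methods can be sketched in a few lines of Python. The robots.txt rules, the crawler name "ExampleBot", and the URLs below are invented for illustration; `urllib.robotparser` and `html.parser` are standard-library modules. A polite crawler might check both signals roughly like this:

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/report.html
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


class NoindexDetector(HTMLParser):
    """Sets self.noindex when a robots meta tag asks not to be indexed."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True


def may_index(url, html, agent="ExampleBot"):
    """True only if neither robots.txt nor a noindex meta tag objects."""
    if not rules.can_fetch(agent, url):
        return False  # excluded site-wide by robots.txt
    detector = NoindexDetector()
    detector.feed(html)
    return not detector.noindex  # excluded page by page via the meta tag


blocked_page = '<html><head><meta name="robots" content="noindex"></head><body>x</body></html>'
print(may_index("http://example.com/private/a.html", "<html></html>"))
print(may_index("http://example.com/index.html", blocked_page))
print(may_index("http://example.com/index.html", "<html><body>x</body></html>"))
```

Both checks depend entirely on the crawler's good manners. Password protection, the third method, needs no cooperation from the crawler at all, which is why it is the stronger barrier.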

Once a crawler has determined whether it is permitted to access a page, the next step is to attempt to fetch it and hand it off to the search engine’s indexer component. This crucial step determines whether a page is visible or invisible. Let’s examine some variations that crawlers encounter as they discover pages on the Web, using the same logic they do to determine whether a page is indexable.

  • Case 1. The crawler encounters a page that is straightforward HTML text, possibly including basic Web graphics. This is the most common type of Web page. It is visible and can be indexed.
  • Case 2. The crawler encounters a page made up of HTML, but it’s a form consisting of text fields, check boxes, or other components requiring user input. It might be a sign-in page, requiring a user name and password. It might be a form requiring the selection of one or more options. The form itself, since it’s made up of simple HTML, can be fetched and indexed. But the content behind the form (what the user sees after clicking the submit button) may be invisible to a search engine. There are two possibilities here:


    • The form is used simply to select user preferences. Other pages on the site consist of straightforward HTML that can be crawled and indexed (presuming there are links from other pages elsewhere on the Web pointing to the pages). In this case, the form and the content behind it are visible and can be included in a search engine index. Quite often, sites like this are specialized search sites. A good example is Hoover’s Business Profiles, which provides a form to search for a company, but presents company profiles in straightforward HTML that can be indexed.
    • The form is used to collect user-specified information that will generate dynamic pages when the information is submitted. In this case, although the form is visible the content “behind” it is invisible. Since the only way to access the content is by using the form, how can a crawler—which is simply designed to request and fetch pages—possibly know what to enter into the form? Since forms can literally have infinite variations, if they function to access dynamic content they are essentially roadblocks for crawlers. A good example of this type of Invisible Web site is The World Bank Group’s Economics of Tobacco Control Country Data Report Database, which allows you to select any country and choose a wide range of reports for that country. It’s interesting to note that this database is just one part of a much larger site, the bulk of which is fully visible. So even if the search engines do a comprehensive job of indexing the visible part of the site, this valuable information still remains hidden to all but those searchers who visit the site and discover the database on their own.
    In the future, forms will pose less of a challenge to search engines. Several projects are underway aimed at creating more intelligent crawlers that can fill out forms and retrieve information. One approach uses preprogrammed “brokers” designed to interact with the forms of specific databases. Other approaches combine brute force with artificial intelligence to “guess” what to enter into forms, allowing the crawler to “punch through” the form and retrieve information. However, even if general-purpose search engines do acquire the ability to crawl content in databases, it’s likely that the native search tools provided by each database will remain the best way to interact with them.
  • Case 3. The crawler encounters a dynamically generated page assembled and displayed on demand. The telltale sign of a dynamically generated page is the “?” symbol appearing in its URL. Technically, these pages are part of the visible Web. Crawlers can fetch any page that can be displayed in a Web browser, regardless of whether it’s a static page stored on a server or generated dynamically. A good example of this type of Invisible Web site is Compaq’s experimental SpeechBot search engine, which indexes audio and video content using speech recognition, and converts the streaming media files to viewable text. Somewhat ironically, one could make a good argument that most search engine result pages are themselves Invisible Web content, since they generate dynamic pages on the fly in response to user search terms.
    Dynamically generated pages pose a challenge for crawlers. Dynamic pages are created by a script, a computer program that selects from various options to assemble a customized page. Until the script is actually run, a crawler has no way of knowing what it will actually do. The script should simply assemble a customized Web page. Unfortunately, unethical Webmasters have created scripts to generate millions of similar but not quite identical pages in an effort to “spamdex” the search engine with bogus pages. Sloppy programming can also result in a script that puts a spider into an endless loop, repeatedly retrieving the same page.
    These “spider traps” can be a real drag on the engines, so most have simply made the decision not to crawl or index URLs that generate dynamic content. They’re “apartheid” pages on the Web—separate but equal, making up a big portion of the “opaque” Web that potentially can be indexed but is not. Inktomi’s FAQ about its crawler, named “Slurp,” offers this explanation:



    “Slurp now has the ability to crawl dynamic links or dynamically
    generated documents. It will not, however, crawl them by default. There
    are a number of good reasons for this. A couple of reasons are that
    dynamically generated documents can make up infinite URL spaces,
    and that dynamically generated links and documents can be different
    for every retrieval so there is no use in indexing them.”
    As crawler technology improves, it’s likely that one type of dynamically generated content will increasingly be crawled and indexed. This is content that essentially consists of static pages that are stored in databases for production efficiency reasons. As search engines learn which sites providing dynamically generated content can be trusted not to subject crawlers to spider traps, content from these sites will begin to appear in search engine indices. For now, most dynamically generated content is squarely in the realm of the Invisible Web.
  • Case 4. The crawler encounters an HTML page with nothing to index. There are thousands, if not millions, of pages that have a basic HTML framework, but which contain only Flash, images in the .gif, .jpeg, or other Web graphics format, streaming media, or other non-text content in the body of the page. These types of pages are truly parts of the Invisible Web because there’s nothing for the search engine to index. Specialized multimedia search engines, such as ditto.com and WebSeek are able to recognize some of these non-text file types and index minimal information about them, such as file name and size, but these are far from keyword searchable solutions.
  • Case 5. The crawler encounters a site offering dynamic, real-time data. There are a wide variety of sites providing this kind of information, ranging from real-time stock quotes to airline flight arrival information. These sites are also part of the Invisible Web, because these data streams are, from a practical standpoint, unindexable. While it’s technically possible to index many kinds of real-time data streams, the value would only be for historical purposes, and the enormous amount of data captured would quickly strain a search engine’s storage capacity, so it’s a futile exercise. A good example of this type of Invisible Web site is TheTrip.com’s Flight tracker, which provides real-time flight arrival information taken directly from the cockpit of in-flight airplanes.
  • Case 6. The crawler encounters a PDF or Postscript file. PDF and Postscript are text formats that preserve the look of a document and display it identically regardless of the type of computer used to view it. Technically, it’s a straightforward task to convert a PDF or Postscript file to plain text that can be indexed by a search engine. However, most search engines have chosen not to go to the time and expense of indexing files of this type. One reason is that most documents in these formats are technical or academic papers, useful to a small community of scholars but irrelevant to the majority of search engine users, though this is changing as governments increasingly adopt the PDF format for their official documents. Another reason is the expense of conversion to plain text. Search engine companies must make business decisions on how best to allocate resources, and typically they elect not to work with these formats.
    An experimental search engine called ResearchIndex, created by computer scientists at the NEC Research Institute, not only indexes PDF and Postscript files, it also takes advantage of the unique features that commonly appear in documents using the format to improve search results. For example, academic papers typically cite other documents, and include lists of references to related material. In addition to indexing the full text of documents, ResearchIndex also creates a citation index that makes it easy to locate related documents. It also appears that citation searching has little overlap with keyword searching, so combining the two can greatly enhance the relevance of results. We hope that the major search engines will follow Google’s example and gradually adopt the pioneering work being done by the developers of ResearchIndex. Until then, files in PDF or Postscript format remain firmly in the realm of the Invisible Web.
  • Case 7. The crawler encounters a database offering a Web interface. There are tens of thousands of databases containing extremely valuable information available via the Web. But search engines cannot index the material in them. Although we present this as a unique case, Web-accessible databases are essentially a combination of Cases 2 and 3. Databases generate Web pages dynamically, responding to commands issued through an HTML form. Though the interface to the database is an HTML form, the database itself may have been created before the development of HTML, with a legacy system that is incompatible with the protocols used by the engines, or it may require registration to access the data. Finally, a database may be proprietary, accessible only to select users, or users who have paid a fee for access. Ironically, the original HTTP specification developed by Tim Berners-Lee included a feature called format negotiation that allowed a client to say what kinds of data it could handle and allow a server to return data in any acceptable format. Berners-Lee’s vision encompassed the information in the Invisible Web, but this vision—at least from a search engine standpoint—has largely been unrealized.
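Several of the cases above can be told apart from the URL and the page source alone. The following is a deliberately crude sketch of that decision logic, assuming a hypothetical `classify` function and a hand-picked extension list; it is not how any real crawler is implemented, and it ignores Cases 5 and 7, which cannot be recognized this simply.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Extensions taken from the list of formats most engines skip (assumed spellings).
UNINDEXED_EXTENSIONS = (".pdf", ".ps", ".swf", ".exe", ".zip", ".tar")


class FormAndTextCounter(HTMLParser):
    """Records whether a page has a <form> and how many words of text it holds."""

    def __init__(self):
        super().__init__()
        self.has_form = False
        self.word_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.has_form = True

    def handle_data(self, data):
        self.word_count += len(data.split())


def classify(url, html=""):
    """Return a rough label for the case a crawler is facing."""
    parts = urlparse(url)
    if parts.path.lower().endswith(UNINDEXED_EXTENSIONS):
        return "case 6: non-HTML format"
    if parts.query:  # the telltale "?" in the URL
        return "case 3: dynamically generated page"
    counter = FormAndTextCounter()
    counter.feed(html)
    counter.close()
    if counter.has_form:
        return "case 2: form, content behind it may be invisible"
    if counter.word_count == 0:
        return "case 4: nothing to index"
    return "case 1: plain HTML"


print(classify("http://example.com/report.pdf"))
print(classify("http://example.com/search?q=tobacco"))
print(classify("http://example.com/db", "<form><input name='country'></form>"))
print(classify("http://example.com/art", "<html><body><img src='a.gif'></body></html>"))
print(classify("http://example.com/", "<html><body><p>Hello world</p></body></html>"))
```

The order of the checks matters: the URL is examined before the page is parsed, mirroring the way an engine can decide not to fetch a dynamic URL at all.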
These technical limitations give you an idea of the problems encountered by search engines when they attempt to crawl Web pages and compile indices. There are other, non-technical reasons why information isn’t included in search engines. We look at those next.



What You See Is Not What You Get
In theory, the results displayed in response to a search engine query accurately reflect the pages that are deemed relevant to the query. In practice, however, this isn’t always the case. When a search index is out of date, search results may not match the current content of the page simply because the page has been changed since it was last indexed. But there’s a more insidious problem: spiders can be fooled into crawling one page that’s masquerading as another. This technique is called “cloaking” or, more technically, “IP delivery.”
By convention, crawlers have unique names, and they identify themselves by name whenever they request pages from a server, allowing servers to deny them access during particularly busy times so that human users won’t suffer performance consequences. The crawler’s name also provides a means for Webmasters to contact the owners of spiders that put undue stress on servers. But the identification codes also allow Webmasters to serve pages that are created specifically for spiders in place of the actual page the spider is requesting.
This is done by creating a script that monitors the IP (Internet Protocol) addresses making page requests. All entities, whether Web browsers or search engine crawlers, have their own unique IP addresses. IP addresses are effectively “reply to” addresses—the Internet address to which pages should be sent. Cloaking software watches for the unique signature of a search engine crawler (its IP address), and feeds specialized versions of pages to the spider that aren’t identical to the ones that will be seen by anyone else.
Cloaking allows Webmasters to “break all the rules” by feeding specific information to the search engine that will cause a page to rank well for specific search keywords. Used legitimately, cloaking can solve the problem of unscrupulous people stealing metatag source code from a high-ranking page. It can also help sites that are required by law to have a “search-unfriendly” disclaimer page as their home page. For example, pharmaceutical companies Eli Lilly and Schering-Plough use IP delivery techniques to assure that their pages rank highly for their specific products, which would be impossible if the spiders were only able to index the legalese on pages required by law.
Unfortunately, cloaking also allows unscrupulous Webmasters to employ a “bait and switch” tactic designed to make the search engine think the page is about one thing when in fact it may be about something completely different. This is done by serving a totally bogus page to a crawler, asserting that it’s the actual content of the URL, while in fact the content at the actual URL of the page may be entirely different. This sophisticated trick is favored by spammers seeking to lure unwary searchers to pornographic or other unsavory sites.
IP delivery is difficult for search crawlers to recognize, though a careful searcher can often recognize the telltale signs by comparing the title and description with the URL in a search engine result. For example, look at these two results for the query “child toys”:
Dr. Toy’s Guide: Information on Toys and Much More
Toy Information! Over 1,000 award winning toys and children’s
products are fully described with company phone numbers, photos
and links to useful resources...
URL: www.drtoy.com/
AAA BEST TOYS
The INTERNET’S LARGEST ULTIMATE TOY STORE for
children of all ages.
URL: 196.22.31.6/xxx-toys.htm
In the first result, the title, description, and URL all suggest a reputable resource for children’s toys. In the second result, there are several clues that suggest that the indexed page was actually served to the crawler via IP delivery. The use of capital letters and a title beginning with “AAA” (a favorite but largely discredited trick of spammers) are blatant red flags. What really clinches it is the use of a numeric URL, which makes it difficult to know what the destination is, and the actual filename of the page, suggesting something entirely different from wholesome toys for children. The important thing to remember about this method is that the titles and descriptions, and even the content of a page, can be faked using IP delivery, but the underlying URL cannot. If a search result looks dubious, pay close attention to the URL before clicking on it. This type of caution can save you both frustration and potential embarrassment.
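The warning signs described above are easy to turn into a rough checklist. The sketch below is only a heuristic built from the clues named in the text (a title beginning with “AAA,” all-capital titles, and a numeric URL); the function names are invented, and a clean result from it proves nothing about a page.

```python
import ipaddress
from urllib.parse import urlparse


def has_numeric_host(url):
    """True when the host part of the URL is a raw IP address."""
    if "//" not in url:
        url = "//" + url  # lets urlparse find the host in scheme-less URLs
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def red_flags(title, url):
    """List the warning signs found in a search result's title and URL."""
    flags = []
    if title.startswith("AAA"):
        flags.append("title starts with AAA")
    letters = [c for c in title if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        flags.append("title is all capitals")
    if has_numeric_host(url):
        flags.append("numeric URL")
    return flags


print(red_flags("Dr. Toy's Guide: Information on Toys and Much More", "www.drtoy.com/"))
print(red_flags("AAA BEST TOYS", "196.22.31.6/xxx-toys.htm"))
```

Run against the two results shown earlier, the first comes back with no flags and the second with all three. The filename clue still needs a human eye.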

Four Types of Invisibility
Technical reasons aside, there are other reasons that some kinds of material that can be accessed either on or via the Internet are not included in search engines. There are really four “types” of Invisible Web content. We make these distinctions not so much to draw hard and fast lines between the types, but rather to help illustrate the amorphous boundary of the Invisible Web that makes defining it in concrete terms so difficult. The four types of invisibility are:
  • The Opaque Web
  • The Private Web
  • The Proprietary Web
  • The Truly Invisible Web

The Opaque Web
The Opaque Web consists of files that can be, but are not, included in search engine indices. The Opaque Web is quite large, and presents a unique challenge to a searcher. Whereas the deep content in many truly Invisible Web sites is accessible if you know how to find it, material on the Opaque Web is often much harder to find.

The biggest part of the Opaque Web consists of files that the search engines can crawl and index, but simply do not. There are a variety of reasons for this; let’s look at them.


DEPTH OF CRAWL
Crawling a Web site is a resource-intensive operation. It costs money for a search engine to crawl and index every page on a site. In the past, most engines would merely sample a few pages from a site rather than performing a “deep crawl” that indexed every page, reasoning that a sample provided a “good enough” representation of a site that would satisfy the needs of most searchers. Limiting the depth of crawl also reduced the cost of indexing a particular Web site.

In general, search engines don’t reveal how they set the depth of crawl for Web sites. Increasingly, there is a trend to crawl more deeply, to index as many pages as possible. As the cost of crawling and indexing goes down, and the size of search engine indices continues to be a competitive issue, the depth of crawl issue is becoming less of a concern for searchers. Nonetheless, simply because one, fifty, or five thousand pages from a site are crawled and made searchable, there is no guarantee that every page from a site will be crawled and indexed. This problem gets little attention and is one of the top reasons why useful material may be all but invisible to those who only use general-purpose search tools to find Web materials.


FREQUENCY OF CRAWL
The Web is in a constant state of dynamic flux. New pages are added constantly, and existing pages are moved or taken off the Web. Even the most powerful crawlers can visit only about 10 million pages per day, a fraction of the entire number of pages on the Web. This means that each search engine must decide how best to deploy its crawlers, creating a schedule that determines how frequently a particular page or site is visited.

Web search researchers Steve Lawrence and Lee Giles, writing in the July 8, 1999, issue of Nature, state that “indexing of new or modified pages by just one of the major search engines can take months” (Lawrence, 1999). While the situation appears to have improved since their study, most engines only completely “refresh” their indices monthly or even less frequently.

It’s not enough for a search engine to simply visit a page once and then assume it’s still available thereafter. Crawlers must periodically return to a page to not only verify its existence, but also to download the freshest copy of the page and perhaps fetch new pages that have been added to a site. According to one study, it appears that the half-life of a Web page is somewhat less than two years and the half-life of a Web site is somewhat more than two years. Put differently, this means that if a crawler returned to a site spidered two years ago, the site would contain the same number of URLs, but only half of the original pages would still exist, having been replaced by new ones (Koehler, 2000).
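To get a feel for what a half-life implies, here is a small calculation. It assumes simple exponential decay and rounds the page half-life to exactly two years; both are simplifications for illustration, not figures from the cited study.

```python
def surviving_fraction(years, half_life_years=2.0):
    """Fraction of originally crawled pages still in place after `years`.

    Assumes exponential decay with the given half-life.
    """
    return 0.5 ** (years / half_life_years)


for years in (1, 2, 4):
    print(years, "years:", round(surviving_fraction(years), 2))
```

Under these assumptions about 71 percent of pages survive one year, half survive two, and only a quarter survive four, which is why an index that is not refreshed goes stale so quickly.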

New sites are the most susceptible to oversight by search engines because relatively few other sites on the Web will have linked to them compared to more established sites. Until search engines index these new sites, they remain part of the Invisible Web.


MAXIMUM NUMBER OF VIEWABLE RESULTS
It’s quite common for a search engine to report a very large number of results for any query, sometimes into the millions of documents. However, most engines also restrict the total number of results they will display for a query, typically between 200 and 1,000 documents. For queries that return a huge number of results, this means that the majority of pages the search engine has determined might be relevant are inaccessible, since the result list is arbitrarily truncated. Those pages that don’t make the cut are effectively invisible.

Good searchers are aware of this problem, and will take steps to circumvent it by using a more precise search strategy and using the advanced filtering and limiting controls offered by many engines. However, for many inexperienced searchers this limit on the total number of viewable hits can be a problem. What happens if the answer you need is available (with a more carefully crafted search) but cannot be viewed using your current search terms?


DISCONNECTED URLS
For a search engine crawler to access a page, one of two things must take place. Either the Web page author uses the search engine’s “Submit URL” feature to request that the crawler visit and index the page, or the crawler discovers the page on its own by finding a link to the page on some other page. Web pages that aren’t submitted directly to the search engines, and that don’t have links pointing to them from other Web pages, are called “disconnected” URLs and cannot be spidered or indexed simply because the crawler has no way to find them.

Quite often, these pages present no technical barrier for a search engine. But the authors of disconnected pages are clearly unaware of the requirements for having their pages indexed. A May 2000 study by IBM, AltaVista, and Compaq discovered that the total number of disconnected URLs makes up about 20 percent of the potentially indexable Web, so this isn’t an insignificant problem (Broder et al., 2000).
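Why a disconnected page stays hidden is easiest to see by treating the Web as a graph and walking it the way a crawler does. The link graph below is made up for illustration, and `reachable` is a hypothetical helper doing a plain breadth-first traversal from the pages the crawler already knows.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["widget"],
    "widget": [],
    "orphan": ["home"],  # links out, but no page links to it
}


def reachable(seeds, links):
    """Return every page a crawler can discover by following links from seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


found = reachable(["home"], LINKS)
disconnected = set(LINKS) - found
print(sorted(disconnected))

# Submitting the URL directly amounts to adding it to the seed list.
found_after_submit = reachable(["home", "orphan"], LINKS)
print(sorted(set(LINKS) - found_after_submit))
```

In this toy graph the page named "orphan" is never discovered from "home", even though it links to other pages, because links are only followed in one direction. Adding it to the seeds, the equivalent of using a “Submit URL” feature, makes it visible.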

In summary, the Opaque Web is large, but not impenetrable. Determined searchers can often find material on the Opaque Web, and search engines are constantly improving their methods for locating and indexing Opaque Web material. The three other types of Invisible Webs are more problematic, as we’ll see.


The Private Web
The Private Web consists of technically indexable Web pages that have deliberately been excluded from search engines. There are three ways that Webmasters can exclude a page from a search engine:
  • Password protect the page. A search engine spider cannot go past the form that requires a username and password.
  • Use the robots.txt file to disallow a search spider from accessing the page.
  • Use the “noindex” meta tag to prevent the spider from reading past the head portion of the page and indexing the body.
For the most part, the Private Web is of little concern to most searchers. Private Web pages simply use the public Web as an efficient delivery and access medium, but in general are not intended for use beyond the people who have permission to access the pages.

There are other types of pages that have restricted access that may be of interest to searchers, yet they typically aren’t included in search engine indices. These pages are part of the Proprietary Web, which we describe next.


The Proprietary Web
Search engines cannot for the most part access pages on the Proprietary Web, because they are only accessible to people who have agreed to special terms in exchange for viewing the content.
Proprietary pages may simply be content that’s only accessible to users willing to register to view them. Registration in many cases is free, but a search crawler clearly cannot satisfy the requirements of even the simplest registration process.

Examples of free proprietary Web sites include The New York Times, Salon’s “The Well” community, Infonautics’ “Company Sleuth” site, and countless others.

Other types of proprietary content are available only for a fee, whether on a per-page basis or via some sort of subscription mechanism. Examples of proprietary fee-based Web sites include the Electric Library, Northern Light’s Special Collection Documents, and The Wall Street Journal Interactive Edition.

Proprietary Web services are not the same as traditional online information providers, such as Dialog, LexisNexis, and Dow Jones. These services offer Web access to proprietary information, but use legacy database systems that existed long before the Web came into being. While the content offered by these services is exceptional, they are not considered to be Web or Internet providers.


The Truly Invisible Web
Some Web sites or pages are truly invisible, meaning that there are technical reasons that search engines can’t spider or index the material they have to offer. A definition of what constitutes a truly invisible resource must necessarily be somewhat fluid, since the engines are constantly improving and adapting their methods to embrace new types of content. But at the end of 2001, truly invisible content consisted of several types of resources.

The simplest, and least likely to remain invisible over time, are Web pages that use file formats that current generation Web crawlers aren’t programmed to handle. These file formats include PDF, PostScript, Flash, Shockwave, executables (programs), and compressed files. There are two reasons search engines do not currently index these types of files.
  1. First, the files have little or no textual context, so it’s difficult to categorize them, or compare them for relevance to other text documents. The addition of metadata to the HTML container carrying the file could solve this problem, but it would nonetheless be the metadata description that got indexed rather than the contents of the file itself.
  2. The second reason certain types of files don’t appear in search indices is simply that the search engines have chosen to omit them. They can be indexed, but aren’t. You can see a great example of this in action with the Research Index engine, which retrieves and indexes PDF, PostScript, and even compressed files in real time, creating a searchable database that’s specific to your query. AltaVista’s Search Engine product for creating local site search services is capable of indexing more than 250 file formats, but the flagship public search engine includes only a few of these formats. With file formats, the obstacle is typically a lack of willingness, not a lack of ability.
More problematic are dynamically generated Web pages. Again, in some cases, it’s not a technical problem but rather unwillingness on the part of the engines to index this type of content. This occurs specifically when a non-interactive script is used to generate a page. These are static pages, and generate static HTML that the engine could spider. The problem is that unscrupulous use of scripts can also lead crawlers into “spider traps” where the spider is literally trapped within a huge site of thousands, if not millions, of pages designed solely to spam the search engine. This is a major problem for the engines, so they’ve simply opted not to index URLs that contain script commands.

Finally, information stored in relational databases, which cannot be extracted without a specific query to the database, is truly invisible. Crawlers aren’t programmed to understand either the database structure or the command language used to extract information.

Now that you know the reasons that some types of content are effectively invisible to search engines, let’s move on and see how you can apply this knowledge to actual sites on the Web, and use this understanding to become a better searcher.


Visible or Invisible?


How can you determine whether what you need is found on the visible or Invisible Web? And why is this important? Learning the difference between visible and Invisible Web resources is important because it will save you time, reduce your frustration, and often provide you with the best possible results for your searching efforts. It’s not critical that you immediately learn to determine whether a resource is visible or invisible; the boundary between visible and invisible sources isn’t always clear, and search services are continuing their efforts to make the invisible visible. Your ultimate goal should be to satisfy your information need in a timely manner using all that the Web has to offer.

The key is to learn the skills that will allow you to determine where you will likely find the best results—before you begin your search. With experience, you’ll begin to know ahead of time the types of resources that will likely provide you with best results for a particular type of search.


Navigation vs. Content Sites
Before you even begin to consider whether a site is invisible or not, it’s important to determine what kind of site you’re viewing. There are two fundamentally different kinds of sites on the Web:
  • Sites that provide content
  • Sites that facilitate Web navigation and resource discovery
All truly invisible sites are fundamentally providers of content, not portals, directories, or even search engines, though most of the major portal sites offer both content and navigation. Navigation sites may use scripts in the links they create to other sites, which may make them appear invisible at first glance. But if their ultimate purpose is to provide links to visible Web content, they aren’t really Invisible Web sites because there’s no “there” there. Navigation sites using scripts are simply taking advantage of database technology to facilitate a process of pointing you to other content on the Web, not to store deep wells of content themselves.

On the other hand, true Invisible Web sites are those where the content is stored in a database, and the only way of retrieving it is via a script or database access tool. How the content is made available is key—if the content exists in basic HTML files and is not password protected or restricted by the robots exclusion protocol, it is not invisible content. The content must be stored in the database and must only be accessible using the database interface for content to be truly invisible to search engines.

Some sites have both visible and invisible elements, which makes categorizing them all the more challenging. For example, the U.S. Library of Congress maintains one of the largest sites on the Web. Much of its internal navigation relies on sophisticated database query and retrieval tools. Much of its internal content is also contained within databases, making it effectively invisible. Yet the Library of Congress site also features many thousands of basic HTML pages that can be and have been indexed by the engines. Later we’ll look more closely at the Library of Congress site, pointing out its visible and invisible parts.

Some sites offer duplicate copies of their content, storing pages both in databases and as HTML files. These duplicates are often called “mirror” or “shadow” sites, and may actually serve as alternate content access points that are perfectly visible to search engines. The Education Resource Information Clearinghouse (ERIC) database of educational resource documents on the Web is a good example of a site that does this, with some materials in its database also appearing in online journals, books, or other publications.

In cases where visibility or invisibility is ambiguous, there’s one key point to remember: where you have a choice between using a general-purpose search engine and the query and retrieval tools offered by a particular site, you’re usually better off using the tools offered by the site. Local site search tools are often finely tuned to the underlying data; they’re limited to the underlying data, and won’t include the “noise” that you’ll invariably get in the results from a general search engine. That said, let’s take a closer look at how you tell the difference between visible and Invisible Web sites and pages.


Direct vs. Indirect URLs

The easiest way to determine if a Web page is part of the Invisible Web is to examine its URL. Most URLs are direct references to a specific Web page. Clicking a link containing a direct URL causes your browser to explicitly request and retrieve a specific HTML page. A search engine crawler follows exactly the same process, sending a request to a Web server to retrieve a specific HTML page.

Examples of direct URLs:
  • http://www.yahoo.com (points to Yahoo!’s home page)
  • http://www.invisible-web.net/about.htm (points to the information page for this text companion Web site)
  • http://www.forbes.com/forbes500/ (points to the top-level page for the Forbes 500 database. Though this page is visible, the underlying database is an Invisible Web resource)
Indirect URLs, on the other hand, often don’t point to a specific physical page on the Web. Instead, they contain information that will be executed by a script on the server, and this script is what generates the page you ultimately end up viewing. Search engine crawlers typically won’t follow URLs that appear to have calls to scripts. The key tip-offs that a page can’t or won’t be crawled by a search engine are symbols or words that indicate that the page will be dynamically generated by assembling its component parts from a database. The most common symbol used to indicate the presence of dynamic content is the question mark, but be careful: although question marks are used to execute scripts that generate dynamic pages, they are often simply used as “flags” to alert the server that additional information is being passed along using variables that follow the question mark. These variables can be used to track your route through a site, to represent items in a shopping cart, and for many other purposes that have nothing to do with Invisible Web content. Typically, URLs with the words “cgi-bin” or “javascript” included will also execute a script to generate a page, but you can’t simply assume that a page is invisible based on this evidence alone. It’s important to conduct further investigations.

Examples of indirect URLs:
  • http://us.imdb.com/Name?Hitchcock,+Alfred (points to the listing for Alfred Hitchcock in the Internet Movie Database)
  • http://www.sec.gov/cgi-bin/srch-edgar?cisco+adj+systems (points to a page showing results for a search on Cisco Systems in the SEC EDGAR database)
  • http://adam.ac.uk/ixbin/hixserv?javascript:go_to(‘0002’,current_level+1) (points to a top-level directory in the ADAM Art Resources database)
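
The tip-offs described above can be collected by a few lines of Python. This is only a rough sketch of the heuristic, and the function name `indirect_url_signs` is our own invention; as the text warns, these signs are hints that call for further investigation, not proof that a page is invisible.

```python
# Words that, per the text, often signal a call to a script.
SCRIPT_HINTS = ("cgi-bin", "javascript")

def indirect_url_signs(url):
    """List the tip-offs suggesting that a URL may call a script."""
    signs = []
    if "?" in url:
        signs.append("question mark")
    lowered = url.lower()
    for hint in SCRIPT_HINTS:
        if hint in lowered:
            signs.append(hint)
    return signs

for url in (
    "http://www.forbes.com/forbes500/",
    "http://us.imdb.com/Name?Hitchcock,+Alfred",
    "http://www.sec.gov/cgi-bin/srch-edgar?cisco+adj+systems",
):
    print(url, indirect_url_signs(url))
```

An empty list means none of the tip-offs were found; a non-empty list means the URL deserves the closer look described in the next section.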

The URL Test
If a URL appears to be indirect, and looks like it might execute a script, there’s a relatively easy test to determine if the URL is likely to be crawled or not.
  1. Place the cursor in the address window immediately to the left of the question mark, and erase the question mark and everything to the right of it.
  2. Then press your computer’s Enter key to force your browser to attempt to fetch this fragment of the URL.
  3. Does the page still load as expected? If so, it’s a direct URL. The question mark is being used as a flag to pass additional information to the server, not to execute a script. The URL points to a static HTML page that can be crawled by a search engine spider.
  4. If a page other than the one you expected appears, or you see some sort of error message, it likely means that the information after the question mark in the URL is needed by a script in order to dynamically generate the page. Without the information, the server doesn’t know what data to fetch from the database to create the page; these types of URLs represent content that is part of the Invisible Web, because the crawler won’t read past the question mark. Note carefully: most crawlers can read past the question mark and fetch the page, just as your browser can, but they won’t for fear of spider traps.
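
The four steps of the URL test can be sketched in Python. This is a minimal illustration under stated assumptions: `fetch` is a stand-in you supply that returns a page's text (or None on an error), the example.com addresses are made up, and comparing page text for exact equality is a crude substitute for judging whether the page "still loads as expected."

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Erase the question mark and everything to the right of it."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def url_test(url, fetch):
    """Apply the URL test; fetch(url) returns page text, or None on an error."""
    original = fetch(url)
    trimmed = fetch(strip_query(url))
    if trimmed is not None and trimmed == original:
        return "direct"    # the question mark was only a flag
    return "indirect"      # a script needed the stripped information

# Hypothetical site used in place of real network requests.
pages = {
    "http://example.com/catalog.html": "catalog",
    "http://example.com/catalog.html?session=42": "catalog",
    "http://example.com/lookup?id=7": "record 7",
}

print(url_test("http://example.com/catalog.html?session=42", pages.get))
print(url_test("http://example.com/lookup?id=7", pages.get))
```

Passing the fetching function in as a parameter keeps the sketch independent of any particular way of retrieving pages.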
Sometimes it’s trickier to determine if a URL points to content that will be generated dynamically. Many browsers save information about a page in variables that are hidden to the user. Clicking “refresh” may simply send the data used to build the page back to the server, recreating the page. Alternately, the page may have been cached on your computer. The best way to test URLs that you suspect are invisible is to start up another instance of your browser, cut and paste the URL into the new browser’s address box, and try to load the page. The new instance of the browser won’t have the same previously stored information, so you’ll likely see a different page or an error message if the page is invisible.

Browsable directories, given their hierarchical layout, may appear at first glance to be part of the visible Web. Test the links in these directories by simply holding your cursor over a link and examining its structure. If the links have question marks indicating that scripts generate the new pages, you have a situation where the top level of the directory, including its links and annotations, may be visible, but the material it links to is invisible. This is a case where the content of the directory itself is visible, but the content that it links to is not. Human Resources Development Canada’s Labor Market Information directory is an example of this phenomenon.

It’s important to do these tests, because to access most material on the Invisible Web you’ll need to go directly to the site providing it. Many huge, content-specific sites may at first glance appear to be part of the Invisible Web, when in fact they’re nothing more than specialized search sites. Let’s look at this issue in more detail.


Specialized vs. Invisible
There are many specialized search directories on the Web that share characteristics of an Invisible Web site, but are perfectly visible to the search engines. These sites often are structured as hierarchical directories, designed as navigation hubs for specific topics or categories of information, and usually offer both sophisticated search tools and the ability to browse a structured directory. But even if these sites consist of hundreds, or even thousands of HTML pages, many aren’t part of the Invisible Web, since search engine spiders generally have no problem finding and retrieving the pages. In fact, these sites typically have an extensive internal link structure that makes the spider’s job even easier. That said, remember our warning about the depth of crawl issue: because a site is easy to index doesn’t mean that search engines have spidered it thoroughly or recently.

Many sites that claim to have large collections of invisible or “deep” Web content actually include many specialized search services that are perfectly visible to search spiders. They make the mistake of equating a sophisticated search mechanism with invisibility. Don’t get us wrong—we’re all in favor of specialized sites that offer powerful search tools and robust interfaces. It’s just that many of these specialized sites aren’t invisible, and to label them as such is misleading.

For example, we take issue with a highly popularized study performed by Bright Planet claiming that the Invisible Web is currently 400 to 550 times larger than the commonly defined World Wide Web (Bright Planet, 2000). Many of the search resources cited in the study are excellent specialized directories, but they are perfectly visible to search engines. Bright Planet also includes ephemeral data such as weather and astronomy measurements in their estimates that serve no practical purpose for searchers. Excluding specialized search tools and data irrelevant to searchers, we estimate that the Invisible Web is between 2 and 50 times larger than the visible Web.

How can you tell the difference between a specialized vs. Invisible Web resource? Always start by browsing the directory, not searching. Search programs, by their nature, use scripts, and often return results that contain indirect URLs. This does not mean, however, that the site is part of the Invisible Web. It’s simply a byproduct of how some search tools function.
  • As you begin to browse the directory, click on category links and drill down to a destination URL that leads away from the directory itself. As you’re clicking, examine the links. Do they appear to be direct or indirect URLs? Do you see the telltale signs of a script being executed? If so, the page is part of the Invisible Web—even if the destination URLs have no question marks. Why? Because crawlers wouldn’t have followed the links to the destination URLs in the first place.
  • But if, as you drill down the directory structure, you notice that all of the links contain direct links, the site is almost certainly part of the visible Web, and can be crawled and indexed by search engines.
This may sound confusing, but it’s actually quite straightforward. To illustrate this point, let’s look at some examples in several categories. We’ll put an Invisible Web site side-by-side with a high-quality specialized directory and compare the differences between them.
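
The browsing check described above can be roughed out in Python by collecting the links on a directory page and sorting them into direct and indirect. This is a sketch only: the sample page is invented, and `looks_indirect` applies the same surface tip-offs (question marks, “cgi-bin,” “javascript”) that the text says are hints rather than proof.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def looks_indirect(url):
    """Surface check for the tip-offs of a script call."""
    lowered = url.lower()
    return "?" in url or "cgi-bin" in lowered or "javascript" in lowered

def survey_links(html):
    """Return (direct, indirect) lists of the links found in html."""
    collector = LinkCollector()
    collector.feed(html)
    direct = [u for u in collector.links if not looks_indirect(u)]
    indirect = [u for u in collector.links if looks_indirect(u)]
    return direct, indirect

# Hypothetical directory page with one direct and one indirect link.
sample_page = """
<ul>
  <li><a href="/science/index.html">Science</a></li>
  <li><a href="/cgi-bin/browse?cat=arts">Arts</a></li>
</ul>
"""

direct, indirect = survey_links(sample_page)
print("direct:", direct)
print("indirect:", indirect)
```

If the indirect list is empty at every level as you drill down, the directory is probably part of the visible Web.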


Visible vs. Invisible
The Gateway to Educational Materials Project is a directory of collections of high-quality educational resources for teachers, parents, and others involved in education. The Gateway features annotated links to more than 12,000 education resources.
  • Structure: Searchable directory, part of the Visible Web. Browsing the categories reveals all links are direct URLs. Although the Gateway’s search tool returns indirect URLs, the direct URLs of the directory structure and the resulting offsite links provide clear linkages for search engine spiders to follow.
AskERIC allows you to search the ERIC database, the world’s largest source of education information. ERIC contains more than one million citations and abstracts of documents and journal articles on education research and practice.
  • Structure: Database, limited browsing of small subsets of the database available. These limited browsable subsets use direct URLs; the rest of the ERIC database is only accessible via the AskERIC search interface, making the contents of the database effectively invisible to search engines.
Very important point: Some of the content in the ERIC database also exists in the form of plain HTML files; for example, articles published in the ERIC digest. This illustrates one of the apparent paradoxes of the Invisible Web. Just because a document is located in an Invisible Web database doesn’t mean there aren’t other copies of the document existing elsewhere on visible Web sites. The key point is that the database containing the original content is the authoritative source, and searching the database will provide the highest probability of retrieving a document. Relying on a general-purpose search engine to find documents that may have copies on visible Web sites is unreliable.

The International Trademark Association (INTA) Trademark Checklist is designed to assist authors, writers, journalists/editors, proofreaders, and fact checkers with proper trademark usage. It includes listings for nearly 3,000 registered trademarks and service marks with their generic terms and indicates capitalization and punctuation.
  • Structure: Simple HTML pages, broken into five extensively cross-linked pages of alphabetical listings. The flat structure of the pages combined with the extensive crosslinking make these pages extremely visible to the search engines.
The Delphion Intellectual Property Network allows you to search for, view, and analyze patent documents and many other types of intellectual property records. It provides free access to a wide variety of data collections and patent information including United States patents, European patents and patent applications, PCT application data from the World Intellectual Property Office, Patent Abstracts of Japan, and more.
  • Structure: Relational database, browsable, but links are indirect and rely on scripts to access information from the database. Data contained in the Delphion Intellectual Property Network database is almost completely invisible to Web search engines.
Key point: Patent searching and analysis is a very complex process. The tools provided by the Delphion Intellectual Property Network are finely tuned to help patent researchers home in on only the most relevant information pertaining to their search, excluding all else. Search engines are simply inappropriate tools for searching this kind of information. In addition, new patents are issued weekly or even daily. The Delphion Intellectual Property Network is constantly refreshed. Search engines, with gaps of a month or more between recrawls of a Web site, couldn’t possibly keep up with this flood of new information.

Hoover’s Online offers in-depth information for businesses about companies, industries, people, and products. It features detailed profiles of hundreds of public and private companies.
  • Structure: Browsable directory with powerful search engine. All pages on the site are simple HTML; all links are direct (though the URLs appear complex). Note: some portions of Hoover’s are only available to subscribers who pay for premium content.
Thomas Register features profiles of more than 155,000 companies, including American and Canadian companies. The directory also allows searching by brand name, product headings, and even some supplier catalogs. As an added bonus, material on the Thomas Register site is updated constantly, rather than on the fixed update schedules of the printed version.
  • Structure: Database access only. Further, access to the search tool is available to registered users only. This combination of database-only access available to registered users puts the Thomas Register squarely in the universe of the Invisible Web.
WebMD aggregates health information from many sources, including medical associations, colleges, societies, government agencies, publishers, private and non-profit organizations, and for-profit corporations.
  • Structure: MyWebMD site features a browsable table of contents to access its data, using both direct links and javascript relative links to many of the content areas on the site. However, the site also provides a comprehensive site map using direct URLs, allowing search engine spiders to index most of the site.
The National Health Information Center’s Health Information Resource Database includes 1,100 organizations and government offices that provide health information upon request. Entries include contact information, short abstracts, and information about publications and services that the organizations provide.
  • Structure: You may search the database by keyword, or browse the keyword listing of resources in the database. Each keyword link is an indirect link to a script that searches the database for results. The database is entirely an Invisible Web site.
As these examples show, it’s relatively easy to determine whether a resource is part of the Invisible Web or not by taking the time to examine its structure. Some sites, however, can be virtually impossible to classify since they have both visible and invisible elements. Let’s look at an example.


The Library of Congress Web Site: Both Visible and Invisible
The U.S. Library of Congress is the largest library in the world, so it’s fitting that its site is also one of the largest on the Web. The site provides a treasure trove of resources for the searcher. In fact, it’s hard to even call it a single site, since several parts have their own domains or subdomains.

The library’s home page has a simple, elegant design with links to the major sections of the site. Mousing over the links to all of the sections reveals only one link that might be invisible: the link to the America’s Library site. If you follow the link to the American Memory collection, you see a screen that allows you to access more than 80 collections featured on the site. Some of the links, such as those to “Today in History” and the “Learning Page,” are direct URLs that branch to simple HTML pages. However, if you select the “Collection Finder” you’re presented with a directory-type menu for all of the topics in the collection. Each one of the links on this page is not only an indirect link but contains a large amount of information used to create new dynamic pages. However, once those pages are created, they include mostly direct links to simple HTML pages.

The point of this exercise is to demonstrate that even though the ultimate content available at the American Memory collection consists of content that is crawlable, following the links from the home page leads to a “barrier” in the form of indirect URLs on the Collection Finder directory page. Because they generally don’t crawl indirect URLs, most crawlers would simply stop spidering once they encounter those links, even though they lead to perfectly acceptable content.

Though this makes much of the material in the American Memory collection technically invisible, it’s also probable that someone outside of the Library of Congress has found the content and linked to it, allowing crawlers to access the material despite the apparent roadblocks. In other words, any Web author who likes content deep within the American Memory collection is free to link to it—and if crawlers find those links on the linking author’s page, the material may ultimately be crawled, even if the crawler couldn’t access it through the “front door.” Unfortunately, there’s no quick way to confirm that content deep within a major site like the Library of Congress has been crawled in this manner, so the searcher should utilize the Library’s own internal search and directory services to be assured of getting the best possible results.


The Robots Exclusion Protocol
Many people assume that all Webmasters want their sites indexed by search engines. This is not the case. Many sites that feature timely content that changes frequently do not want search engines to index their pages. If a page changes daily and a crawler only visits the page monthly, the result is essentially a permanently inaccurate page in a search index. Some sites make content available for free for only a short period before moving it into archives that are available to paying customers only—the online versions of many newspaper and media sites are good examples of this.

To block search engine crawlers, Webmasters employ the Robots Exclusion Protocol. This is simply a set of rules that enable a Webmaster to tell a crawler which parts of a server are off-limits. The Webmaster simply creates a list of files or directories that should not be crawled or indexed, and saves this list in a file called robots.txt. CNN, Canadian Broadcasting Corporation, the London Times, and the Los Angeles Times all use robots.txt to exclude some or all of their content.

Here’s an example of the robots.txt file used by the Los Angeles Times:
User-agent: *
Disallow: /RealMedia
Disallow: /archives
Disallow: /wires/
Disallow: /HOME/
Disallow: /cgi-bin/
Disallow: /class/realestate/dataquick/dqsearch.cgi
Disallow: /search
The User-agent field specifies which spiders must pay attention to the following instructions. The asterisk (*) is a wildcard, meaning all crawlers must read and respect the contents of the file. Each “Disallow” command is followed by a path on the Los Angeles Times Web server, usually a directory but sometimes a single file, that spiders are prohibited from accessing and crawling. In this case, the spider is blocked from reading streaming media files, archive files, real estate listings, and so on.
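
To see how a crawler might apply these rules, here is a small sketch using Python's standard `urllib.robotparser` module, fed with the Los Angeles Times file shown above. The crawler name `ExampleBot` and the paths being tested are made up for illustration; note that this parser treats each Disallow value as a prefix, so a rule blocks everything whose path begins with it.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted in the text.
ROBOTS_TXT = """\
User-agent: *
Disallow: /RealMedia
Disallow: /archives
Disallow: /wires/
Disallow: /HOME/
Disallow: /cgi-bin/
Disallow: /class/realestate/dataquick/dqsearch.cgi
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path, agent="ExampleBot"):
    """Return True if the rules permit the (hypothetical) agent to fetch path."""
    return parser.can_fetch(agent, path)

# Hypothetical paths, chosen only to exercise the rules.
for path in ("/news/today.html", "/archives/2001/story.html", "/cgi-bin/lookup"):
    print(path, "allowed" if allowed(path) else "blocked")
```

A well-behaved crawler would run a check like this before requesting each page and skip anything that comes back blocked.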

It’s also possible to prevent a crawler from indexing a specific page by including a “noindex” meta tag instruction in the “head” portion of the document. Here’s an example:
<head>
<title>Keep Out, Search Engines!</title>
<META name="robots" content="noindex, nofollow">
</head>
Either the robots.txt file or the noindex meta tag can be used to block crawlers. The only difference between the two is that the noindex meta tag is page specific, while the robots.txt file can be used to prevent indexing of individual pages, groups of files—even entire Web sites.
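
A crawler that honors the meta tag has to read the head of each page and look for it. The following sketch, built on Python's standard `html.parser` module, shows one way that check might look; the class and function names are our own, and real crawlers may handle more variations than this does.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())

def robots_directives(html):
    """Return the set of robots directives found in a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

page = """<head>
<title>Keep Out, Search Engines!</title>
<META name="robots" content="noindex, nofollow">
</head>"""

directives = robots_directives(page)
print("index this page?", "noindex" not in directives)
print("follow its links?", "nofollow" not in directives)
```

Unlike robots.txt, this check can only happen after the page has been fetched, which is why the tag works page by page.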

As you can see, it’s important to look closely at a site and its structure to determine whether it’s visible or invisible. One of the wonderful things many Invisible Web resources can do is help you focus your search and allow you to manipulate a “subject oriented” database in ways that would not be possible with a general-purpose search tool. Many resources allow you to organize your results via various criteria or are much more up-to-date than a general search tool or print versions of the same material. For example, lists published by Forbes and Fortune provide the searcher with all kinds of ways to sort, limit, or filter data that are simply impossible with the print-based versions. Also, you could have a much smaller haystack of “focused” data to search through to find the necessary “needles” of information. In a later section we’ll show you some specific cases where resources on the Invisible Web provide a superior, if not the only, means of locating important and dependable information online.


Using the Invisible Web


How do you decide when the Invisible Web is likely to be your best source for the information you’re seeking? After all, Invisible Web resources aren’t always the solution for satisfying an information need. Although we’ve made a strong case for the value of the resources available on the Invisible Web, we’re not suggesting that you abandon the general-purpose search engines like AltaVista, HotBot, and Google. Far from it! Rather, we’re advocating that you gain an understanding of what’s available on the Invisible Web to make your Web searching time more efficient. By expanding the array of tools available to you, you’ll learn to select the best available tool for every particular searching task.

In this section, we’ll examine the broad issue of why you might choose to use Invisible Web resources instead of a general-purpose search engine or Web directory. Then we’ll narrow our focus and look at specific instances of when to use the Invisible Web. To illustrate these specifics, we’ve compiled a list of 25 categories of information where you’ll likely get the best results from Invisible Web resources. Then we’ll look at what’s not available on the Web, visible or Invisible.

It’s easy to get seduced by the ready availability and seeming credibility of online information. But just as you would with print materials, you need to evaluate and assess the quality of the information you find on the Invisible Web. Even more importantly, you need to watch out for bogus or biased information that’s put online by charlatans more interested in pushing their own point of view than publishing accurate information.

The Invisible Web, by its very nature, is highly dynamic. What is true on Monday might not be accurate on Thursday. Keeping current with the Invisible Web and its resources is one of the biggest challenges faced by the searcher. We’ll show you some of the best sources for keeping up with the rapidly changing dynamics of the Invisible Web. Finally, as you begin your own exploration of the Invisible Web, you should begin to assemble your own toolkit of trusted resources. As your personal collection of Invisible Web resources grows, your confidence in choosing the appropriate tool for every search task will grow in equal proportions.


Why Use the Invisible Web?
General-purpose search engines and directories are easy to use, and respond rapidly to information queries. Because they are so accessible and seemingly all-powerful, it’s tempting to simply fire up your favorite Web search engine, punch in a few keywords that are relevant to your search, and hope for the best. But the general-purpose search engines are essentially mass audience resources, designed to provide something for everyone. Invisible Web resources tend to be more focused, and often provide better results for many information needs. Consider how a publication like Newsweek would cover a story on Boeing compared to an aviation industry trade magazine such as Aviation Week and Space Technology. Or how a general newsmagazine like Time would cover a story on currency trades vs. a business magazine like Forbes or Fortune.

In making the decision whether to use an Invisible Web resource, it helps to consider the point of view of both the searcher and the provider of a search resource. The goal for any searcher is relatively simple: to satisfy an information need in a timely manner.

Of course, providers of search resources also strive to satisfy the information needs of their users, but they face other issues that complicate the equation. For example, there are always conflicts between speed and accuracy. Searchers demand fast results, but if a search engine has a large, comprehensive index, returning results quickly may not allow for a thorough search of the database.
For general-purpose search engines, there’s a constant tension between finding the correct answer vs. finding the best answer vs. finding the easiest answer. Because they try to satisfy virtually any information need, general-purpose search engines resolve these conflicts by making compromises. It costs a significant amount of money to crawl the Web, index pages, and handle search queries. The bottom line is that general-purpose search engines are in business to make a profit, a goal that often works against the mission to provide comprehensive results for searchers with a wide variety of information needs. On the other hand, governments, academic institutions, and other organizations that aren’t constrained by a profit-making motive operate many Invisible Web resources. They don’t feel the same pressures to be everything to everybody. And they can often afford to build comprehensive search resources that allow searchers to perform exhaustive research within a specific subject area, and keep up-to-date and current.

Why select an Invisible Web resource over a general-purpose search engine or Web directory? Here are several good reasons:
  • Specialized content focus = more comprehensive results. Like the focused crawlers and directories, Invisible Web resources tend to be focused on specific subject areas. This is particularly true of the many databases made available by government agencies and academic institutions. Your search results from these resources will be more comprehensive than those from most visible Web resources for two reasons.
  1. First, there are generally no limits imposed by databases on how quickly a search must be completed—or if there are, you can generally select your own time limit that will be reached before a search is cut off. This means that you have a much better chance of having all relevant results returned, rather than just those results that were found fastest.
  2. Second, people who go to the trouble of creating a database-driven information resource generally try to make the resource as comprehensive as possible, including as many relevant documents as they are able to find. This is in stark contrast to general-purpose search engine crawlers, which often arbitrarily limit the depth of crawl for a particular Web site. With a database, there is no depth of crawl issue—all documents in the database will be searched by default.
  • Specialized search interface = more control over search input and output. Here’s a question to get you thinking. Let’s assume that everything on the Web could be located and accessed via a general search tool like Google or HotBot. How easy and efficient would it be to use one of these general-purpose engines when a specialized tool was available? Would you begin a search for a person’s phone number with a search of an encyclopaedia? Of course not. Likewise, even if the general-purpose search engines suddenly provided the capability to find specialized information, they still couldn’t compete with search services specifically designed to find and easily retrieve specialized information. Put differently, searching with a general-purpose search engine is like using a shotgun, whereas searching with an Invisible Web resource is more akin to taking a highly precise rifle-shot approach.
    As an added bonus, most databases provide customized search fields that are subject-specific. History databases will allow limiting searches to particular eras, for example, and biology databases by species or genomic parameters. Invisible Web databases also often provide extensive control over how results are formatted. Would you like documents to be sorted by relevance, by date, by author, or by some other criteria of your own choosing? Contrast this flexibility with the general-purpose search engines, where what you see is what you get.
  • Increased precision and recall. Consider two informal measures of search engine performance: recall and precision.


    Recall represents the total number of relevant documents retrieved in response to a search query, divided by the total number of relevant documents in the search engine’s entire index. One hundred percent recall means that the search engine was able to retrieve every document in its index that was relevant to the search terms.
    Measuring recall alone isn’t sufficient, however, since the engine could always achieve 100 percent recall simply by returning every document in its index. Recall is balanced by precision.


    Precision is the number of relevant documents retrieved divided by the total number of documents retrieved. If 100 pages are found, and only 20 are relevant, the precision is (20/100), or 20 percent.
    Relevance, unfortunately, is strictly a subjective measure. The searcher ultimately determines relevance after fully examining a document and deciding whether it meets the information need. To maximize potential relevance, search engines strive to maximize recall and precision simultaneously. In practice, this is difficult to achieve. As the size of a search engine index increases, there are likely to be more relevant documents for any given query, leading to a higher recall percentage. As recall increases, precision tends to decrease, making it harder for the searcher to locate relevant documents. Because they are often limited to specific topics or subjects, many Invisible Web and specialized search services offer greater precision even while increasing total recall. Narrowing the domain of information means there is less extraneous or irrelevant information for the search engine to process. Because Invisible Web resources tend to have smaller databases, recall can be high while still offering a great deal of precision, leading to the best of all possible worlds: higher relevance and greater value to the searcher.

  • Invisible Web resources = highest level of authority. Institutions or organizations that have a legitimate claim on being an unquestioned authority on a particular subject maintain many Invisible Web resources. Unlike with many sites on the visible Web, it’s relatively easy to determine the authority of most Invisible Web sites. Most offer detailed information about the credentials of the people responsible for maintaining the resource. Others feature awards, citations, or other symbols of recognition from other acknowledged subject authorities. Many Invisible Web resources are produced by book or journal publishers with sterling reputations among libraries and scholars.

  • The answer may not be available elsewhere. The explosive growth of the Web, combined with the relative ease of finding many things online, has led to the widely held but wildly inaccurate belief that “if it’s not on the Web, it’s not online.” There are a number of reasons this belief simply isn’t true. For one thing, there are vast amounts of information available exclusively via Invisible Web resources. Much of this information is in databases, which can’t be directly accessed by search engines, but it is definitely online and often freely available.
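The recall and precision measures defined above can be made concrete with a short sketch. This is only an illustration: the function name `precision_recall`, the document ids, and the 1,000-document index are assumptions invented for the example, and relevance is treated as a simple yes/no judgment even though in practice the searcher decides it subjectively.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the documents the engine returned
    relevant:  ids of the documents in the index judged relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical index of 1,000 documents; 40 of them are relevant to the query.
relevant_docs = set(range(40))

# The engine returns 100 pages, only 20 of which are relevant.
returned = set(range(20)) | set(range(500, 580))
p, r = precision_recall(returned, relevant_docs)
print(f"precision={p:.0%} recall={r:.0%}")  # precision=20% recall=50%

# Returning the whole index reaches 100 percent recall, but precision collapses.
everything = set(range(1000))
p_all, r_all = precision_recall(everything, relevant_docs)
print(f"precision={p_all:.0%} recall={r_all:.0%}")  # precision=4% recall=100%
```

The second call shows why recall alone is not a sufficient measure: an engine that returns everything scores perfectly on recall while burying the relevant documents in noise.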

When to Use the Invisible Web
It’s not always easy to know when to use an Invisible Web resource as opposed to a general search tool. As you become more familiar with the landscape of the Invisible Web, there are several rules of thumb you can use when deciding to use an Invisible Web resource.
  • When you’re familiar with a subject. If you know a particular subject well, you’ve likely already discovered one or more Invisible Web resources that offer the kind of information you need. Familiarity with a subject also offers another advantage: knowledge of which search terms will find the “best” results in a particular search resource, as well as methods for locating new resources.
  • When you’re familiar with specific search tools. Some Invisible Web resources cover multiple subjects, but since they often offer sophisticated interfaces you’ll still likely get better results from them compared to general-purpose search tools. Restricting your search through the use of limiters, Boolean logic, or other advanced search functions generally makes it easier to pull a needle from a haystack.
  • When you’re looking for a precise answer. When you’re looking for a simple answer to a question, the last thing you want is a list of hundreds of possible results. No matter—an abundance of potential answers is what you’ll end up with if you use a general-purpose search engine, and you’ll have to spend the time scanning the result list to find what you need. Many Invisible Web resources are designed to perform what are essentially lookup functions, when you need a particular fact, phone number, name, bibliographic record, and so on.
  • When you want authoritative, exhaustive results. General-purpose search engines will never be able to return the kind of authoritative, comprehensive results that Invisible Web resources can. Depth of crawl, timeliness, and the lack of selective filtering fill any result list from a general-purpose engine with a certain amount of noise. And, because the haystack of the Web is so huge, a certain number of authoritative documents will inevitably be overlooked.
  • When timeliness of content is an issue. Invisible Web resources are often more up-to-date than general-purpose search engines and directories.
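To illustrate what limiters and Boolean logic buy you, here is a minimal sketch of a fielded search over a tiny catalog. Everything in it is hypothetical: the `Record` fields, the sample titles, and the `search` function are invented for the example and do not describe how any particular Invisible Web database works.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    author: str
    year: int
    subject: str


# Hypothetical catalog records, invented for illustration only.
CATALOG = [
    Record("Rivers of the Roman Empire", "A. Example", 1998, "history"),
    Record("Roman Coins and Trade", "B. Sample", 1975, "history"),
    Record("River Ecology Field Guide", "C. Placeholder", 2001, "biology"),
]


def search(records, all_words=(), none_words=(), subject=None,
           year_from=None, year_to=None):
    """Boolean AND / NOT on title words, plus subject and date-range limiters."""
    results = []
    for rec in records:
        words = set(rec.title.lower().split())
        if not all(w.lower() in words for w in all_words):
            continue  # AND: every required word must appear in the title
        if any(w.lower() in words for w in none_words):
            continue  # NOT: no excluded word may appear in the title
        if subject is not None and rec.subject != subject:
            continue  # subject limiter
        if year_from is not None and rec.year < year_from:
            continue  # lower bound of the date limiter
        if year_to is not None and rec.year > year_to:
            continue  # upper bound of the date limiter
        results.append(rec)
    return results


# Keyword only: two records mention "roman".
broad = search(CATALOG, all_words=["roman"])

# Same keyword with a NOT term and a date limiter: a single record remains.
narrow = search(CATALOG, all_words=["roman"], none_words=["coins"], year_from=1990)
```

A general-purpose engine gives you roughly the first call; a specialized interface with subject-specific fields gives you the second, which is what makes it easier to pull the needle from the haystack.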


Top 25 Invisible Web Categories
To give you a sense of what’s available on the Invisible Web, we’ve put together a list of categories where, in general, you’ll be far better off searching an Invisible Web resource than a general-purpose search engine. Our purpose here is simply to provide a quick overview of each category, noting one or two good Invisible Web resources for each. Detailed descriptions of and annotated links to many more resources for all of these categories can be found in the online directory available at the companion website.
  1. Public Company Filings. The U.S. Securities and Exchange Commission (SEC) and regulators of equity markets in many other countries require publicly traded companies to file certain documents on a regular schedule or whenever an event may have a material effect on the company. These documents are available in a number of locations, including company Web sites. While many of these filings may be visible and findable by a general-purpose search engine, a number of Invisible Web services have built comprehensive databases incorporating this information. FreeEDGAR, 10K Wizard, and SEDAR are examples of services that offer sophisticated searching and limiting tools as well as the assurance that the database is truly comprehensive. Some also offer free e-mail alert services to notify you that the companies you choose to monitor have just filed reports.
  2. Telephone Numbers. Just as telephone white pages serve as the quickest and most authoritative offline resource for locating telephone numbers, a number of Invisible Web services exist solely to find telephone numbers. InfoSpace, Switchboard.com, and AnyWho offer additional capabilities like reverse-number lookup or correlating a phone number with an e-mail address. Because these databases vary in currency, it is often important to search more than one to obtain the most current information.
  3. Customized Maps and Driving Directions. While some search engines, like Northern Light, have a certain amount of geographical “awareness” built in, none can actually generate a map of a particular street address and its surrounding neighborhood. Nor do they have the capability to take a starting and ending address and generate detailed driving directions, including exact distances between landmarks and estimated driving time. (Translator’s note: nowadays all of this is possible.) Invisible Web resources such as Mapblast and Mapquest are designed specifically to provide these interactive services.
  4. Clinical Trials. Clinical trials by their very nature generate reams of data, most of which is stored from the outset in databases. For the researcher, sites like the New Medicines in Development database are essential. For patients searching for clinical trials to participate in, ClinicalTrials.gov and CenterWatch’s Clinical Trials Listing Service are invaluable.
  5. Patents. Thoroughness and accuracy are absolutely critical to the patent searcher. Major business decisions involving significant expense or potential litigation often hinge on the details of a patent search, so using a general-purpose search engine for this type of search is effectively out of the question. Many government patent offices maintain Web sites, but Delphion’s Intellectual Property Network allows full-text searching of U.S. and European patents and abstracts of Japanese patents simultaneously. Additionally, the United States Patent Office provides patent information dating back to 1790, as well as U.S. Trademark data.
  6. Out of Print Books. The growth of the Web has proved to be a boon for bibliophiles. Countless out of print booksellers have established Web sites, obliterating the geographical constraints that formerly limited their business to local customers. Simply having a Web presence, however, isn’t enough. Problems with depth of crawl issues, combined with a continually changing inventory, make catalog pages from used booksellers obsolete or inaccurate even if they do appear in the result list of a general-purpose search engine. Fortunately, sites like Alibris and Bibliofind allow targeted searching over hundreds of specialty and used bookseller sites.
  7. Library Catalogs. There are thousands of Online Public Access Catalogs (OPACs) available on the Web, from national libraries like the U.S. Library of Congress and the Bibliothèque Nationale de France, academic libraries, local public libraries, and many other important archives and repositories. OPACs allow searches for books in a library by author, title, subject, keywords, or call number, often providing other advanced search capabilities. webCATS, Library Catalogs on the World Wide Web (now at http://www.lights.ca/webcats/ ) is an excellent directory of OPACs around the world. OPACs are great tools to verify the title or author of a book.
  8. Authoritative Dictionaries. Need a word definition? Go directly to an authoritative online dictionary. Merriam-Webster’s Collegiate and the Cambridge International Dictionary of English are good general dictionaries. Scores of specialized dictionaries also provide definitions of terms from fields ranging from aerospace to zoology. Some Invisible Web dictionary resources even provide metasearch capability, checking for definitions in hundreds of online dictionaries simultaneously. OneLook is a good example.
  9. Environmental Information. Need to know who’s a major polluter in your neighborhood? Want details on a specific country’s position in the Kyoto Treaty? Try the Envirofacts multiple database search.
  10. Historical Stock Quotes. Many people consider stock quotes to be ephemeral data, useful only for making decisions at a specific point in time. Stock market historians and technical analysts, however, can use historical data to compile charts of trends that some even claim to have a certain amount of predictive value. There are numerous resources available that contain this information. One of our favorites is from BigCharts.com.
  11. Historical Documents and Images. You’ve seen that general-purpose search engines don’t handle images well. This can be a problem with historical documents, too, as many historical documents exist on the Web only as scanned images of the original. The U.S. Library of Congress American Memory Project is a wonderful example of a continually expanding digital collection of historical documents and images. The American Memory Project also illustrates that some data in a collection may be “visible” while other portions are “invisible.”
  12. Company Directories. Competitive intelligence has never been easier thanks to the Web. We wrote about Hoover’s and the Thomas Register. There are numerous country or region specific company directories, including the Financial Times’ European Companies Premium Research (http://www.globalarchive.ft.com/cb/cb_search.html) and Wright Investors’ Services (http://profiles.wisi.com/profiles/comsrch.htm).
  13. Searchable Subject Bibliographies. Bibliographies are gold mines for scholars and other researchers. Because bibliographies generally conform to rigid formats specified by the MLA or the APA, most are stored in searchable online databases, covering subjects ranging from Architecture to Zoology. The Canadian Music Periodical Index provided by the National Library of Canada is a good example, as it contains a very large number of citations.
  14. Economic Information. Governments and government agencies employ entire armies of statisticians to monitor the pulse of economic conditions. This data is often available online, but rarely in a form visible to most search engines. RECON-Regional Economic Conditions is an interactive database from the Federal Deposit Insurance Corporation that illustrates this point.
  15. Award Winners. Who won the Nobel Peace Prize in 1937? You might be able to learn that it was Viscount Cecil of Chelwood (Lord Edgar Algernon Robert Gascoyne Cecil) via a general-purpose search engine, but the Nobel e-museum site will provide the definitive answer. Other Invisible Web databases have definitive information on major winners of awards ranging from the Oscars (http://www.oscars.org/awards_db/) to the Peabody Awards (http://www.peabody.uga.edu/recipients/search.html).
  16. Job Postings. Looking for work? Or trying to find the best employee for a job opening in your company? Good luck finding what you’re looking for using a general-purpose search engine. You’ll be far better off searching one of the many job-posting databases, such as CareerBuilder.Com, the contents of which are part of the Invisible Web. Better yet, try one of our favorites, the oddly named Flipdog. Flipdog is unique in that it scours both company Web sites and other job posting databases to compile what may be the most extensive collection of job postings and employment offers available on the Web.
  17. Philanthropy and Grant Information. Show me the money! If you’re looking to give or get funding, there are literally thousands of clearinghouses on the Invisible Web that exist to match those in need with those willing and able to give. The Foundation Finder from the Foundation Center is an excellent place to begin your search.
  18. Translation Tools. Web-based translation services are not search tools in their own right, but they provide a valuable service when a search has turned up documents in a language you don’t understand. Translation tools accept a URL, fetch the underlying page, translate it into the desired language and deliver it as a dynamic document. AltaVista provides such a service. Please note the many limitations and frequent translation issues that often arise. These tools, while far from perfect, will continue to improve with time. Another example of an Invisible Web translation tool is EuroDicAutom, described as “the multilingual terminological database of the European Commission’s Translation Service.”
  19. Postal Codes. Even though e-mail is rapidly overtaking snail mail as the world’s preferred method of communication, we all continue to rely on the postal service from time to time. Many postal authorities such as the Royal Mail in the United Kingdom provide postal code look-up tools.
  20. Basic Demographic Information. Demographic information from the U.S. Census and other sources can be a boon to marketers or anyone needing details about specific communities. One of many excellent starting points is the American FactFinder. The utility that this site provides seems to almost never end!
  21. Interactive School Finders. Before the Web, finding the right university or graduate school often meant a trek to the library and hours scanning course catalogs. Now it’s easy to locate a school that meets specific criteria for academic programs, location, tuition costs, and many other variables. Peterson’s GradChannel is an excellent example of this type of search resource for students, offered by a respected provider of school selection data.
  22. Campaign Financing Information. Who’s really buying (or stealing) the election? Now you can find out by accessing the actual forms filed by anyone contributing to a major campaign. The Federal Election Commission provides several databases (http://www.fec.gov/finance_reports.html) while a private concern called Fecinfo.Com “massages” government-provided data for greater utility. Fecinfo.com has a great deal of free material available in addition to several fee-based resources. Many states are also making this type of data available.
  23. Weather Data. If you don’t trust your local weatherman, try an Invisible Web resource like AccuWeather. This extensive resource offers more than 43,000 U.S. 5-day forecasts, international forecasts, local NEXRAD Doppler radar images, customizable personal pages, and fee-based premium services. Weather information clearly illustrates the vast amount of real-time data available on the Internet that the general search tools do not crawl. Another favorite is Automated Weather Source. This site allows you to view local weather conditions in real-time via instruments placed at various sites (often located at schools) around the country.
  24. Product Catalogs. It can be tricky to determine whether pages from many product catalogs are visible or invisible. One of the Web’s largest retailers, Amazon.com, is largely a visible Web site. Some general-purpose search engines include product pages from Amazon.com’s catalogs in their databases, but even though this information is visible, it may not be relevant for most searches. Therefore, many engines either demote the relevance ranking of product pages or ignore them, effectively rendering them invisible.
    However, in some cases general search tools have arrangements with major retailers like Amazon to provide a “canned” link for search terms that attempt to match products in a retailer’s database.
  25. Art Gallery Holdings. From major national exhibitions to small co-ops run by artists, countless galleries are digitizing their holdings and putting them online. An excellent way to find these collections is to use ADAM, the Art, Design, Architecture & Media Information Gateway. ADAM is a searchable catalogue of more than 2,500 Internet resources whose entries are all invisible. Specifically, the Van Gogh Museum in Amsterdam provides a digital version of the museum’s collection that is invisible to general search tools.


What’s NOT on the Web—Visible or Invisible

There’s an entire class of information that’s simply not available on the Web, including the following:
  • Proprietary databases and information services. These include Thomson’s Dialog service, LexisNexis, and Dow Jones, which restrict access to their information systems to paid subscribers.
  • Many government and public records. Although the U.S. government is the most prolific publisher of content both on the Web and in print, there are still major gaps in online coverage. Some proprietary services such as KnowX offer limited access to public records for a fee. Coverage of government and public records is similarly spotty in other countries around the world. While there is a definite trend toward moving government information and public records online, the sheer mass of information will prohibit all of it from going online. There are also privacy concerns that may prevent certain types of public records from going digital in a form that might compromise an individual’s rights.
  • Scholarly journals or other “expensive” information. Thanks in part to the “publish or perish” imperative at modern universities, publishers of scholarly journals or other information that’s viewed as invaluable for certain professions have succeeded in creating a virtual “lock” on the market for their information products. It’s a very profitable business for these publishers, and they wield an enormous amount of control over what information is published and how it’s distributed. Despite ongoing, increasingly acrimonious struggles with information users, especially libraries, who often have insufficient funding to acquire all of the resources they need, publishers of premium content see little need to change the status quo. As such, it’s highly unlikely that this type of content will be widely available on the Web any time soon. There are some exceptions. Northern Light’s Special Collection, for example, makes available a wide array of reasonably priced content that previously was only available via expensive subscriptions or site licenses from proprietary information services. ResearchIndex can retrieve copies of scholarly papers posted on researchers’ personal Web sites, bypassing the “official” versions appearing in scholarly journals. But this type of semi-subversive “Napster-like” service may come under attack in the future, so it’s too early to tell whether it will provide a viable alternative to the official publications or not. For the near future, public libraries are one of the best sources for this information, made available to community patrons and paid for by tax dollars.
  • Full Text of all newspapers and magazines. Very few newspapers or magazines offer full-text archives. For those publications that do, the content only goes back a limited time, 10 or 20 years at the most. There are several reasons for this. Publishers are very aware that the content they have published quite often retains value over time. Few economic models have emerged that allow publishers to unlock that value as yet. Authors’ rights are another concern. Many authors retained most re-use rights to the materials printed in magazines and newspapers. For content published more than two decades ago, reprints in digital format were not envisioned or legally accounted for. It will take time for publishers and authors to forge new agreements and for consumers of Web content to become comfortable with the notion that not everything on the Web is free. New micropayment systems, or “all you can eat” subscription services will emerge that should remove some of the current barriers keeping magazine and newspaper content off the Web. Some newspapers are placing archives of their content on the Web. Often the search function is free but retrieval of full text is fee based, as with the services offered by Newslibrary.

And finally, perhaps the reason users cannot find what they are looking for on either the visible or Invisible Web is simply that it’s just not there. While much of the world’s print information has migrated to the Web, there are and always will be millions of documents that will never be placed online. The only way to locate these printed materials will be via traditional methods: using libraries or asking for help from people who have physical access to the information.


Spider Traps, Damned Lies, and Other Chicanery
Though there are many technical reasons the major search engines don’t index the Invisible Web, there are also “social” reasons having to do with the validity, authority, and quality of online information. Because the Web is open to everybody and anybody, a good deal of its content is published by non-experts or, even worse, by people with a strong bias that they seek to conceal from readers. Search engines must also cope with unethical Web page authors who seek to subvert their indexes with millions of bogus “spam” pages. Most of the major engines have developed strict guidelines for dealing with spam that sometimes have the unfortunate effect of excluding legitimate content.

No matter whether you’re searching the visible or Invisible Web, it’s important always to maintain a critical view of the information you’re accessing. For some reason, people often lower their guard when it comes to information on the Internet. People who would scoff if asked to participate in an offline chain-mail scheme cast common sense to the wind and willingly forward hoax e-mails to their entire address books. Urban legends and all manner of preposterous stories abound on the Web.

Here are some important questions to ask and techniques to use for assessing the validity and quality of online information, regardless of its source.
  • Who Maintains the Content? The first question to ask of any Web site is who’s responsible for creating and updating it. Just as you would with any offline source of information, you want to be sure that the author and publishers are credible and the information they are providing can be trusted.
    Corporate Web sites should provide plenty of information about the company, its products and services. But corporate sites will always seek to portray the company in the best possible light, so you’ll need to use other information sources to balance favorable bias. If you’re unfamiliar with a company, try searching for information about it using Hoover’s. For many companies, AltaVista provides a link to a page with additional “facts about” the company, including a capsule overview, news, details of Web domains owned, and financial information.
    Information maintained by government Web sites or academic institutions is inherently more trustworthy than other types of Web content, but it’s still important to look at things like the authority of the institution or author. This is especially true in the case of academic institutions, which often make server space available to students who may publish anything they like without worrying about its validity.
    If you’re reading a page created by an individual, who is the author? Do they provide credentials or some other kind of proof that they write with authority? Is contact information provided, or is the author hiding behind the veil of anonymity? If you can’t identify the author or maintainer of the content, it’s probably not a good idea to trust the resource, even if it appears to be of high quality in all other respects.
  • What Is the Content Provider’s Authority? Authority is a measure of reputation. When you’re looking at a Web site, is the author or producer of the content a familiar name? If not, what does the site provide to assert authority?
    For an individual author, look for a biography of the author citing previous work or awards, a link to a resume or other vita that demonstrates experience, or similar relevant facts that prove the author has authority. Sites maintained by companies should provide a corporate profile, and some information about the editorial standards used to select or commission work.
    Some search engines provide an easy way to check on the authority of an author or company. Google, for example, tries to identify authorities by examining the link structure of the entire Web to gauge how often a page is cited in the form of a link by other Web page authors. It also checks to see if there are links to these pages from “important” sites of the Web that have authority. Results in Google for a particular query provide an informal gauge of authority. Beware, though, that this gauge is only informal: even a page created by a Nobel laureate may not rank highly on Google if other important pages on the Web don’t link to it.
  • Is There Bias? Bias can be subtle, and can be easily camouflaged in sites that deal with seemingly non-controversial subjects. Bias is easy to spot when it takes the form of a one-sided argument. It’s harder to recognize when it dons a Janusian mask of two-sided “argument” where one side consistently (and seemingly reasonably) prevails. Bias is particularly insidious on so-called “news” sites that exist mainly to promote specific issues or agendas. The key to avoiding bias is to look for balanced writing.
    Another form of bias on the Web arises when a page appears to be objective, but is sponsored by a group or organization with a hidden agenda that may not be apparent on the site. It’s particularly important to look for this kind of thing in health or consumer product information sites. Some large companies fund information resources for specific health conditions, or advocate a particular lifestyle that incorporates a particular product. While the companies may not exert direct editorial influence over the content, content creators nonetheless can’t help but be aware of their patronage, and may not be as objective as they might be. On the opposite side of the coin, the Web is a powerful medium for activist groups with an agenda against a particular company or industry. Many of these groups have set up what appear to be objective Web sites presenting seemingly balanced information when in fact they are extremely one-sided and biased.
    There’s no need to be paranoid about bias. In fact, recognizing bias can be very useful in helping understand an issue in depth from a particular point of view. The key is to acknowledge the bias and take steps to filter, balance, and otherwise gain perspective on what is likely to be a complex issue.
  • Examine the URL. URLs can contain a lot of useful clues about the validity and authority of a site. Does the URL seem “appropriate” for the content? Most companies, for example, use their name or a close approximation in their primary URL. A page stored on a free service like Yahoo’s GeoCities or Lycos-Terra’s Tripod is not likely to be an official company Web site. URLs can also reveal bias.
    Deceptive page authors can also feed search engine spiders bogus content using cloaking techniques, but once you’ve actually retrieved a page in your browser, its URLs cannot be spoofed. If a URL appears to contain suspicious or irrelevant words to the topic it represents, it’s likely a spurious source of information.

  • Examine Outbound Links. The hyperlinks included in a document can also provide clues about the integrity of the information on the page. Hyperlinks were originally created to help authors cite references, and can provide a sort of online “footnote” capability. Does a page link to other credible sources of information? Or are most of the links to other internal content on a Web site?
    Well-balanced sites have a good mix of internal and external links. For complex or controversial issues, external links are particularly important. If they point to other authorities on a subject, they allow you to easily access alternative points of view from other authors. If they point to less credible authors, or ones that share the same point of view as the author, you can be reasonably certain you’ve uncovered bias, whether subtle or blatant.
  • Is the Information Current? Currency of information is not always important, but for timely news, events, or for subject areas where new research is constantly expanding a field of knowledge, currency is very important.
    Look for dates on a page. Be careful—automatic date scripts can be included on a page so that it appears current when in fact it may be quite dated. Many authors include “dateline” or “updated” fields somewhere on the page.
    It’s also important to distinguish between the date in search results and the date a document was actually published. Some search engines include a date next to each result. These dates often have nothing to do with the document itself—rather, they are the date the search engine’s crawler last spidered the page. While this can give you a good idea of the freshness of a search engine’s database, it can be misleading to assume that the document’s creation date is the same. Always check the document itself if the date is an important part of your evaluation criteria.
  • Use Common Sense. Apply the same filters to the Web as you do to other sources of information in your life. Ask yourself: “How would I respond to this if I were reading it in a newspaper, or in a piece of junk mail?” Just because something is on the Web doesn’t mean you should believe it; quite the contrary, in many cases. For more information about evaluating the quality of Web resources, we recommend Genie Tyburski’s excellent Evaluating The Quality Of Information On The Internet.
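
The "Examine Outbound Links" check above can be roughed out in a few lines of Python. This is only a sketch, assuming you already have the page's HTML as a string: the names LinkCollector and count_links and the sample page are made up for illustration, and "internal" here simply means "same host name as the page", which is a simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def count_links(html, page_url):
    """Return (internal, external) link counts for a page.

    Assumption: a link is "internal" when its host matches the page's host.
    """
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc.lower()
    internal = external = 0
    for href in collector.hrefs:
        # Resolve relative links such as "/about.html" against the page URL.
        parsed = urlparse(urljoin(page_url, href))
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        if parsed.netloc.lower() == page_host:
            internal += 1
        else:
            external += 1
    return internal, external


# Hypothetical sample page, for illustration only.
sample = """
<a href="/about.html">About</a>
<a href="http://www.example.org/study">A study</a>
<a href="mailto:me@example.com">Mail</a>
"""
print(count_links(sample, "http://www.example.com/index.html"))
```

A page whose links are almost all internal is not necessarily biased, but the count tells you where to start looking. You still have to judge the credibility of the external targets yourself.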


Keeping Current with the Invisible Web

Just as with the visible Web, new Invisible Web resources are being made available all the time. How do you keep up with potentially useful new additions? Several useful, high-quality current awareness services publish newsletters that cover Invisible Web resources. These newsletters don’t limit themselves to the Invisible Web, but the news and information they provide is exceptionally useful for all serious Web searchers. All of these newsletters are free.
  • The Scout Report The Scout Report provides the closest thing to an “official” seal of approval for quality Web sites. Published weekly, it provides organized summaries of the most valuable and authoritative Web resources available. The Scout Report Signpost provides full-text search of nearly 6,000 of these summaries. The Scout Report staff is made up of a group of librarians and information professionals, and their standards for inclusion in the report are quite high.
  • Librarians’ Index to the Internet (LII) This searchable, annotated directory of Web resources, maintained by Carole Leita and a volunteer team of more than 70 reference librarians, is organized into categories including “best of,” “directories,” “databases,” and “specific resources.” Most of the Invisible Web content reviewed by LII falls in the “databases” and “specific resources” categories. Each entry also includes linked cross-references, making it a browser’s delight.
    Leita also publishes a weekly newsletter that includes 15-20 of the resources added to the Web site during the previous week.
  • ResearchBuzz ResearchBuzz is designed to cover the world of Internet research. To that end this site provides almost daily updates on search engines, new data-managing software, browser technology, large compendiums of information, Web directories, and Invisible Web databases. If in doubt, the final question is, “Would a reference librarian find it useful?” If the answer’s yes, in it goes.
    ResearchBuzz’s creator, Tara Calishain, is author of numerous Internet research books, including Official Netscape Guide to Internet Research. Unlike most of the other current awareness services described here, Calishain often writes in-depth reviews and analyses of new resources, pointing out both useful features and flaws in design or implementation.
  • Free Pint Free Pint is an e-mail newsletter dedicated to helping you find reliable Web sites and search the Web more effectively. It’s written by and for knowledge workers who can’t afford to spend valuable time sifting through junk on the Web in search of a few valuable nuggets of e-gold. Each issue of Free Pint has several regular sections. William Hann, Managing Editor, leads off with an overview of the issue and general news announcements, followed by a “Tips and Techniques” section, where professionals share their best searching tips and describe their favorite Web sites.
    The Feature Article covers a specific topic in detail. Recent articles have been devoted to competitive intelligence on the Internet, central and eastern European Web sources, chemistry resources, Web sites for senior citizens, and a wide range of other topics. Feature articles are between 1,000-2,000 words long, and are packed with useful background information, in addition to numerous annotated links to vetted sites in the article’s subject area. Quite often these are Invisible Web resources. One nice aspect of Free Pint is that it often focuses on European resources that aren’t always well known in North America or other parts of the world.
  • Internet Resources Newsletter Internet Resources Newsletter’s mission is to raise awareness of new sources of information on the Internet, particularly for academics, students, engineers, scientists, and social scientists. Published monthly, Internet Resources Newsletter is edited by Heriot-Watt University Library staff and published by Heriot-Watt University Internet Resource Centre.


Build Your Own Toolkit

As you become more familiar with what’s available on the Invisible Web, it’s important to build your own collection of resources. Knowing what is available before beginning your search is in many ways the greatest challenge in mastering the Invisible Web. But isn’t this a paradox? If Invisible Web resources can’t be found using general-purpose search tools, how do you go about finding them?

A great way to become familiar with Invisible Web resources is to do preemptive searching, a process much like the one professional librarians use in collection development.

  • Explore the Invisible Web gateways, cherry-picking resources that seem relevant to your information needs, asking yourself what kinds of questions each resource might answer in the future.
  • As your collection grows, spend time organizing and reorganizing it for easier access.
  • Be selective—choose Invisible Web resources the same way you build your personal collection of reference works.
  • Consider saving your collection of Invisible Web resources with a remote bookmark service such as Backflip, Delicious, or Hotlinks. This will give you access to your collection from any Web-accessible computer.

Your ultimate goal in building your own toolkit should draw on one of the five laws of library science: to save time. Paradoxically, as you become a better searcher and are able to build your own high-quality toolkit, you’ll actually need to spend less time exercising your searching skills, since in many cases you’ll already have the resources you need close at hand. With your own collection of the best of the Invisible Web, you’ll be able to boldly, and quickly, go where no search engine has gone before.


The Best of the Invisible Web


You face a challenge similar to the one confronted by early explorers of Terra Incognita. Without the benefit of a search engine to guide you, exactly where do you begin your search for information on the Invisible Web?

In this section, we discuss several Invisible Web pathfinders that make excellent starting points for the exploration of virtually any topic. We also introduce our directory. This introduction takes the form of the familiar “Frequently Asked Questions” (FAQ) section you see on many Web sites. We talk about the structure of the directory, how we selected our resources, and how to get the most out of the directory for doing your own searching.

Finally, we’ll leave you with a handy “pocket reference” that you can refer to on your explorations—the top ten concepts to understand about the Invisible Web.


Invisible Web Pathfinders
Invisible Web pathfinders are, for the most part, Yahoo!-like directories with lists of links to Invisible Web resources. Most of these pathfinders, however, also include links to searchable resources that aren’t strictly invisible. Nonetheless, they are useful starting points for finding and building your own collection of Invisible Web resources.
  • direct search direct search is a growing compilation of links to the search interfaces of resources that contain data not easily or entirely searchable/accessible from general search tools like AltaVista, Google, and HotBot. The goal of direct search is to get as close as possible to the search form offered by a Web resource (rather than having to click through one or two pages to get there); hence the name “direct search.”
  • InvisibleWeb The InvisibleWeb Catalog contains over 10,000 databases and searchable sources that have been frequently overlooked by traditional searching. Each source is analyzed and described by editors to ensure that every user of the InvisibleWeb Catalog will find reliable information on hundreds of topics, from Air Fares to Yellow Pages. All of this material can be accessed easily by Quick or Advanced Search features or a browsable index of the InvisibleWeb Catalog. Unlike other search engines, this takes you directly to the searchable source within a Web site, even generating a search form for you to perform your query.
  • Librarians’ Index to the Internet The Librarians’ Index to the Internet is a searchable, annotated subject directory of more than 7,000 Internet resources selected and evaluated by librarians for their usefulness to users of public libraries. LII only includes links to the very best Net content. While not a “pure” Invisible Web pathfinder, LII categorizes each resource as Best Of, Directories, Databases, and Specific Resources. Databases, of course, are Invisible Web resources. By using LII’s advanced search feature, you can limit your search to return only databases in the results list. Advanced search also lets you restrict your results to specific fields of the directory (author name, description, title, URL, etc.). In effect, the Librarians’ Index to the Internet is a laser-sharp searching tool for finding Invisible Web databases.
  • WebData General portal Web sites like Yahoo!, Excite, Infoseek, Lycos, and Goto.com are page-oriented search engine sites (words on pages are indexed), whereas WebData.com’s searches are content-oriented (forms and databases on Web sites are indexed). WebData.com and the traditional search engines are often confused with each other when compared side by side because they look alike. However, searches on WebData.com return databases, where the others return Web pages that may or may not be what a user is looking for.
  • AlphaSearch The primary purpose of AlphaSearch is to access the finest Internet “gateway” sites. The authors of these gateway sites have spent significant time gathering into one place all relevant sites related to a discipline, subject, or idea. You have instant access to hundreds of sites by entering just one gateway site. http://www.calvin.edu/library/searreso/internet/as/
  • ProFusion ProFusion is a meta search engine from Intelliseek, the same company that runs InvisibleWeb.com. In addition to providing a sophisticated simultaneous search capability for the major general-purpose search engines, ProFusion provides direct access to the Invisible Web with the ability to search over 1,000 targeted sources of information, including sites like TerraServer, Adobe PDF Search, Britannica.com, The New York Times, and the U.S. Patent database. http://www.profusion.com


An Invisible Web Directory
In general, we like the idea of comparing the resources available on the Invisible Web to a good collection of reference works. The challenge is to be familiar with some key resources prior to needing them. Information professionals have always done this with canonical reference books, and often with traditional, proprietary databases like Dialog and LexisNexis. We encourage you to approach the Invisible Web in the same way: consider each specialized search tool as you would an individual reference resource.


In Summary: The Top 10 Concepts to Understand about the Invisible Web
As you begin your exploration and charting of the Invisible Web, here’s a list of the top ten concepts that you should understand about the Invisible Web.
  1. In most cases, the data found in an Invisible Web database or opaque Web database cannot be accessed entirely or easily via a general-purpose search engine.
  2. The Invisible Web is not the sole solution to all of one’s information needs. For optimal results, Invisible Web resources should be used in conjunction with other information resources, including general-purpose Web search engines and directories.
  3. Because many Invisible Web databases (as well as opaque databases) search a limited universe of material, the opportunity for a more precise and relevant search is greater than when using a general search tool.
  4. Often, Invisible Web and Opaque Web databases will have the most current information available online, since they are updated more frequently than most general-purpose search engines.
  5. In many cases, Invisible Web resources clearly identify who is providing the information, making it easy to judge the authority of the content and its provider.
  6. Material accessible “on the Invisible Web” is not the same as what is found in proprietary databases, such as Dialog or Factiva. In many cases, material on the Invisible Web is free or available for a small fee. In some cases material is available in multiple formats.
  7. Targeted crawlers, which commonly focus on Opaque Web resources, often offer more comprehensive coverage of their subject, since they crawl more pages of each site that they index and crawl them more often than a general-purpose search engine.
  8. To use the Invisible Web effectively, you must make some effort to have an idea of what is available prior to searching. Consider each resource as if it were a traditional reference book. Ask yourself, “What questions can this resource answer?” Think less of an entire site and more of the tools that can answer specific types of questions.
  9. Invisible Web databases can make non-textual material searchable and accessible.
  10. Invisible Web databases offer specialized interfaces that enhance the utility of the information they access. Even if a general-purpose search engine could somehow access Invisible Web data, the shotgun nature of its search interface simply is no match for the rifle-shot approach offered by most Invisible Web tools.



to be continued...



Google tricks

Google is clearly the best general-purpose search engine on the Web. But most people don't use it to its best advantage. Do you just plug in a keyword or two and hope for the best? That may be the quickest way to search, but with more than 3 billion pages in Google's index, it's still a struggle to pare results to a manageable number.
    Google's search options go beyond simple keywords, the Web, and even its own programmers. Let's look at some of Google's lesser-known options.


Syntax Search Tricks

Using a special syntax is a way to tell Google that you want to restrict your searches to certain elements or characteristics of Web pages. Google has a fairly complete list of its syntax elements. Here are some advanced operators that can help narrow down your search results.
  • Intitle: at the beginning of a query word or phrase (intitle:"Three Blind Mice") restricts your search results to just the titles of Web pages.
  • Intext: does the opposite of intitle:, searching only the body text, ignoring titles, links, and so forth. Intext: is perfect when what you're searching for might commonly appear in URLs. If you're looking for the term HTML, for example, and you don't want to get results such as www.mysite.com/index.html, you can enter intext:html.
  • Link: lets you see which pages are linking to your Web page or to another page you're interested in. For example, try typing in link:http://www.pcmag.com
  • Site: restricts results to a particular site or top-level domain (for example, site:pcmag.com or site:edu).
  • Daterange: (start date end date). You can restrict your searches to pages that were indexed within a certain time period. Daterange: searches by when Google indexed a page, not when the page itself was created. This operator can help you ensure that results will have fresh content (by using recent dates), or you can use it to avoid a topic's current-news blizzard and concentrate only on older results. 
Daterange: is actually more useful if you go elsewhere to take advantage of it, because daterange: requires Julian dates, not standard Gregorian dates. You can find converters on the Web (such as here), but an easier way is to do a Google daterange: search by filling in a form here or here.
If one special syntax element is good, two must be better, right? Sometimes. Though some operators can't be mixed (you can't use the link: operator with anything else), many can be, quickly narrowing your results to a less overwhelming number.
Try using site: with intitle: to find certain types of pages. For example, get scholarly pages about Mark Twain by searching for intitle:"Mark Twain" site:edu. Experiment with mixing various elements; you'll develop several strategies for finding the stuff you want more effectively. The site: command is very helpful as an alternative to the mediocre search engines built into many sites.
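
If you'd rather not hunt for a converter, the Julian date arithmetic for daterange: is easy to sketch yourself. The Python below is a rough illustration, not an official Google tool. The helper names julian_day and daterange_query are made up here, and it assumes the operator takes whole Julian day numbers joined by a hyphen (daterange:start-end), which is how the syntax was commonly written.

```python
from datetime import date


def julian_day(d):
    """Convert a Gregorian date to its whole Julian day number.

    Python's date.toordinal() counts days from 0001-01-01; adding the
    constant 1721425 shifts that count onto the Julian day scale
    (for example, 2000-01-01 becomes 2451545).
    """
    return d.toordinal() + 1721425


def daterange_query(terms, start, end):
    """Append a daterange: restriction to an existing query string.

    Assumption: the operator is written daterange:START-END.
    """
    return f"{terms} daterange:{julian_day(start)}-{julian_day(end)}"


# Mixes three operators: intitle:, site:, and daterange:
query = daterange_query('intitle:"Mark Twain" site:edu',
                        date(2003, 1, 1), date(2003, 12, 31))
print(query)
```

Paste the printed string into the search box as-is. Remember that daterange: goes by when Google indexed the page, not when the page was written.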


Swiss Army Google

Google has a number of services that can help you accomplish tasks you may never have thought to use Google for. For example, the new calculator feature lets you do both math and a variety of conversions from the search box. For extra fun, try the query "Answer to life the universe and everything."
    Let Google help you figure out whether you've got the right spelling and the right word for your search. Enter a misspelled word or phrase into the query box (try "thre blund mise") and Google may suggest a proper spelling. This doesn't always succeed; it works best when the word you're searching for can be found in a dictionary. Once you search for a properly spelled word, look at the results page, which repeats your query. (If you're searching for "three blind mice," underneath the search window will appear a statement such as Searched the web for "three blind mice.") You'll discover that you can click on each word in your search phrase and get a definition from a dictionary.
    Suppose you want to contact someone and don't have his phone number handy. Google can help you with that, too. Just enter a name, city, and state. (The city is optional, but you must enter a state.) If a phone number matches the listing, you'll see it at the top of the search results along with a map link to the address. If you'd rather restrict your results, use rphonebook: for residential listings or bphonebook: for business listings.
If you'd rather use a search form for business phone listings, try Yellow Search.


Extended Googling

Google offers several services that give you a head start in focusing your search.
You're probably used to using Google in your browser. But have you ever thought of using Google outside your browser?
    Google Alert  monitors your search terms and e-mails you information about new additions to Google's Web index. (Google Alert is not affiliated with Google; it uses Google's Web services API to perform its searches.)
    If you're more interested in news stories than general Web content, check out the beta version of Google News Alerts. This service (which is affiliated with Google) will monitor up to 50 news queries per e-mail address and send you information about news stories that match your query. (Hint: Use the intitle: and source: syntax elements with Google News to limit the number of alerts you get.)
    Google on the telephone? Yup. This service is brought to you by the folks at Google Labs, a place for experimental Google ideas and features (which may come and go, so what's there at this writing might not be there when you decide to check it out).
    With Google Voice Search, you dial the Voice Search phone number, speak your keywords, and then click on the indicated link. Every time you say a new search term, the results page will refresh with your new query (you must have JavaScript enabled for this to work). Remember, this service is still in an experimental phase, so don't expect 100 percent success.
    In 2002, Google released the Google API (application programming interface), a way for programmers to access Google's search engine results without violating the Google Terms of Service. A lot of people have created useful (and occasionally not-so-useful but interesting) applications not available from Google itself, such as Google Alert. For many applications, you'll need an API key, which is available free here.
    Thanks to its many different search properties, Google goes far beyond a regular search engine. You'll be amazed at how many different ways Google can improve your Internet searching.



More Google API Applications

CapeMail is an e-mail search application that allows you to send an e-mail to google@capeclear.com with the text of your query in the subject line and get the first ten results for that query back. Maybe it's not something you'd do every day, but if your cell phone does e-mail and doesn't do Web browsing, this is a very handy address to know.






Resources


The Invisible Web: Uncovering Information Sources Search Engines Can’t See
by Chris Sherman and Gary Price (2001)
ISBN 0-910965-51-X

How To Find and Search the Invisible Web
Google hacks