Reprinted from WebTechniques
You have a problem. Your users can't find anything on your Web site or intranet. They can find what's on the home page, but the home page changes every day. Behind it, your site has more than thirty thousand pages, hosted on fifty servers. What to do?
The time, energy, and money you've spent building your site are of no benefit if your users can't find what they're looking for. One way to start seeing a return on your investment is to install a search engine. But which one? And how do search engines work, anyway?
A variety of off-the-shelf search engine packages are available to developers. Some are commercial, while others are freely available open-source software. My favorite is a versatile, open-source package called ht://dig (http://www.htdig.org/). If you aren't interested in configuring and maintaining your own search engine, you can partner with a full-service search provider.
But what should you consider when choosing a search engine solution? Start by asking a few questions.
First of all, what content needs to be indexed? Sites commonly encompass many kinds of content: plain text, HTML pages, Word documents, Adobe PDF files, and so forth. You may also need support for SGML entities and extended character sets, like Unicode, if some of your content is in non-English languages.
Does the content need to be indexed from a local file system, or should it be indexed via the Web? This will mostly depend upon how dynamic your content is. If your content consists largely of Word documents and PDFs, then indexing from the local file system will be the fastest approach. If your site consists of dynamically generated pages assembled on the fly, then you'll need to retrieve those pages over the Web, the way a browser would.
How large is the site (or collection of sites) to be indexed? Is the content publicly accessible, or on an intranet? These questions can reveal hidden needs for scalability and efficiency of indexing.
What is your budget? Enterprise-level indexing and document management systems can be extremely expensive. On the other end of the spectrum is free software. These packages offer the additional advantage of having their source code available, should you need to extend them.
Is the standard search results screen supplied with the software acceptable, or will you need to customize the look, feel, and behavior of the various pages?
Should it be possible to restrict the search to a subset of the index? How about searching across multiple indexes? And do you need support for indexing meta tag keywords and descriptions, or for more advanced metadata features, such as Dublin Core (http://www.dublincore.org/)? Requirements (and support) for advanced metadata handling vary from package to package.
Finally, don't forget to make sure the software you're evaluating is available for your server platform. No package will live up to its performance claims if it isn't going to run!
Most search engines operate on the principle that pre-indexed data is easier and faster to search than raw text. The form and quality of the index created from your original HTML pages are of paramount importance to how searches are performed: how fast, how accurately, and with which advanced features. For most search engines, the index takes the form of a highly optimized lookup database.
At the core, however, you can think of search indexes as the electronic equivalent of a Biblical concordance. In a concordance, every word in the Bible is listed along with every place it occurs, for ease of reference. Need to know where Moses appears in the New Testament? Check the concordance: it'll show you that he first shows up in Matthew 8:4, and last in Revelation 15:3. Search engines, for the most part, operate on the same principle.
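To make the concordance idea concrete, here is a minimal inverted index sketched in Python. The documents and words are invented for illustration; a real engine would also handle punctuation, stop words, and ranking.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Two tiny pretend documents, echoing the concordance example.
docs = {
    "matthew": "moses appeared on the mountain",
    "revelation": "they sang the song of moses",
}
index = build_index(docs)
print(sorted(index["moses"]))  # ['matthew', 'revelation']
```

Answering a query then becomes a cheap dictionary lookup rather than a scan of every page, which is why pre-indexing pays off as sites grow.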
To create an index, you need a piece of software that actually goes out and finds all of this content, builds the searchable index, and in some cases, schedules subsequent indexing. These mostly unseen components are variously known as spiders, indexers, and robots. Depending on the complexity and scale of the technology in use, sometimes these beasts are pieces of a larger software package, and other times they are complete packages themselves.
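The heart of any spider is simple: fetch a page, pull out its links, and queue them for indexing. Here is a sketch of just the link-extraction step, using Python's standard html.parser module; the sample page is made up, and real crawlers add URL normalization, politeness rules, and duplicate detection.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets, the way a spider discovers pages to index."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/docs/">Docs</a> <a href="/faq.html">FAQ</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/docs/', '/faq.html']
```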
To achieve success with a search engine you must have a carefully planned user interface as well. What sort of searches must be supported? The simplest search of all is the full text search, where the entire text of all indexed documents is scanned for keywords that match the query. However, many other options are available:
Proximity Search. With this technique, words must appear close to one another in the original document in order to match. A more restrictive variation, "phrase search," only matches documents containing words in an exact sequence.
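Phrase and proximity searches are usually built on a positional index that records not just which words a document contains, but where. The helper names below are my own, not from any particular package; this is a sketch of the idea, not a production implementation.

```python
def positions(text):
    """Map each word to the list of positions where it occurs."""
    pos = {}
    for i, word in enumerate(text.lower().split()):
        pos.setdefault(word, []).append(i)
    return pos

def phrase_match(text, phrase):
    """True if the words of `phrase` occur consecutively in `text`."""
    pos = positions(text)
    words = phrase.lower().split()
    if any(w not in pos for w in words):
        return False
    # A phrase starts wherever word i appears exactly i slots after word 0.
    starts = set(pos[words[0]])
    for offset, word in enumerate(words[1:], start=1):
        starts &= {p - offset for p in pos[word]}
    return bool(starts)

def near_match(text, a, b, window=3):
    """True if words a and b occur within `window` positions of each other."""
    pos = positions(text)
    return any(abs(pa - pb) <= window
               for pa in pos.get(a.lower(), [])
               for pb in pos.get(b.lower(), []))

text = "the spider builds an index of the whole site"
print(phrase_match(text, "the whole site"))   # True
print(phrase_match(text, "whole the site"))   # False
print(near_match(text, "spider", "index"))    # True: two words apart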
Fuzzy Search. This method matches documents containing words that are like requested keywords, even if the keywords contain misspellings. Similarly, "substring searches" let users enter a fragment of text, perhaps not even a whole word, and will return documents based on the partial match.
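Python's standard difflib module offers a quick way to experiment with fuzzy matching against a vocabulary of indexed terms. The vocabulary and cutoff here are invented for illustration; real engines typically use specialized edit-distance structures for speed.

```python
import difflib

# A pretend vocabulary of indexed terms.
vocabulary = ["carburetor", "champion", "search", "engine"]

def fuzzy_lookup(word, cutoff=0.7):
    """Return indexed terms similar to `word`, tolerating misspellings."""
    return difflib.get_close_matches(word.lower(), vocabulary, n=3, cutoff=cutoff)

print(fuzzy_lookup("carberator"))  # ['carburetor']
print(fuzzy_lookup("serch"))       # ['search']
```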
Concept Search. This advanced technique lets users search for things that fall into the same general category as the keywords the users input. A search on "carburetor" might return matches on "fuel injection," or a link to the Holley site.
Language-Aware Matching. If an engine incorporates a thesaurus, a search on "soccer" might return results that match "football". Some search engines also support stemming, where you can search on "key" and get matches for "keys," "keying," "keyed," and so forth, as the engine adds common linguistic "stems" to the original keyword.
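A toy suffix-stripping stemmer shows the idea behind stemming; production engines use more careful algorithms, such as Porter's, to avoid mangling short or irregular words.

```python
# Crude suffix stripping; the suffix list and length guard are illustrative only.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Reduce a word to a rough stem by stripping common English endings."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("keys"), stem("keying"), stem("keyed"))  # key key key
```

By indexing stems rather than raw words, the engine matches "keys," "keying," and "keyed" against a single query for "key."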
Regular Expressions. One other method, found mainly in engines aimed at programmers and other hard-core users, is the regular expression, or regex, search. Regular expressions are symbols users add to their queries to describe complex patterns to match, and they can be difficult to master. For example, a search for "[Ss]t\wve [Cc]h?amp[ie]*" might match "steve champi," "Stove Camper," "Steve Champeon," and more. Regex searching is highly processor intensive, which further limits its applications.
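Python's re module makes it easy to try such a pattern. Note that the character classes below are written [Ss] rather than [S|s]: inside square brackets, "|" is a literal pipe character, not alternation, a common regex slip.

```python
import re

# [Ss]    capital or lowercase s
# \w      any single word character
# [Cc]h?  C or c, optionally followed by h
# [ie]*   any run (including none) of i's and e's
pattern = re.compile(r"[Ss]t\wve [Cc]h?amp[ie]*")

for name in ("steve champi", "Stove Camper", "Steve Champeon", "Steve Kemp"):
    print(name, "->", bool(pattern.search(name)))
```

The cost the article mentions is real: unlike an index lookup, a regex must be run against the text of every candidate document.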
Other engines may offer still more options, including natural-language queries and searches that can be modified by Boolean logic. Often, your audience will determine your searching needs. If you're hosting technical documentation for a bunch of Perl hackers, you may be required to support regular expression searches and Boolean searches. More general audiences will have different needs.
Several resources for choosing a search engine are available online. Two good ones are Danny Sullivan's Search Engine Watch (http://www.searchenginewatch.com/) and Search Tools for Web Sites, Intranets, and Portals (http://www.searchtools.com/). No matter how carefully you craft your search engine's index and interface, however, remember that a search engine is only a tool. When used well, a search engine can dramatically reduce the time it takes for your users to find the information they need on your site. But no quick fix can ever take the place of a well-designed, intuitive site architecture. A good Web site should give your users access to information, not conceal it from them.