Let Your Agents Do the Walking

by Charles A. Gimon

for INFO NATION

Ever tried to find something in cyberspace? Just a couple of years ago, it was no walk in the park. If you were lucky, the thing that you were looking for might be mentioned in a FAQ-list, a friend-of-a-friend might know where some archive was kept, or you might have insider info that your research topic was being studied at the University of Wollongong. It was like learning about the facts of life on a school playground.

The tools for finding your way around the Internet have developed with frightening speed. We've gone from lists of lists, to hypertext directories, to intelligent agents, to big money corporate buyouts--most of this within the last year. Just when you start getting comfortable calling 1992 "ancient history", you have to apply the same cliche to things that happened in January of 1994. The wave isn't slowing down, and there's a chance it could lead to all sorts of futuristic software servants fetching info from all over the globe and feeding it to you while you relax. Yet some of the best, most reliable search tools on the Internet use simple ideas that hammer away at your request until they get an answer.

Text lists of useful Internet sites like the Yanoff list or the "Big Dummy" list were handed around back in the dark ages (a couple of years ago), and they're still popular today. The concept has migrated to the World Wide Web, in O'Reilly's Global Network Navigator, [http://nearnet.gnn.com/, note 1] for instance. It's a huge, annotated list of hypertext links arranged by subject, the Web version of Ed Krol's Whole Internet Catalog. It's a nice service, and interesting to browse through, but it's just a big, big version of what everyone else and their grandmother has in their home page. Not that there's anything wrong with that--GNN is a big, big list, and the HTML links mean that you're a mouse click away from the resource that's described in their list. Plus there's a search capability that lets you search the page for a certain word. The technology is strictly entry-level HTML, though. You or I could do it with enough time and a comfortable text editor.

The next step up from a straight list of sites, or a hypertext list with links, is a searchable database. The most familiar database of Internet info is probably archie, a database of files in ftp archives, originally developed by a group of researchers at McGill University in Montreal. Traditionally, an archie server made you telnet in and use unix-ish commands to find addresses for sites that were archiving your file; today there are easy-to-use Web pages for archie and archie clients for Windows that can be pointed at servers all over the world. Sites that mirror the archie database have been some of the first Internet sites to get bogged down with traffic. (If you have trouble getting into archie.rutgers.edu [telnet://archie.rutgers.edu], try archie.unl.edu [telnet://archie.unl.edu] in Nebraska instead. Use "archie" for both the login name and the password.) [note 2]

It's a little smarter than a simple list. Archie will telnet out to ftp archives at other sites automatically, usually once a day, to update its lists. Still, archie is only a list of files you can get using ftp. Its scope is limited, and if you're not looking for a specific bit of software, your results are going to be limited, too.

By far the best searchable database on the Internet is Lycos at Carnegie-Mellon University. Lycos (so-called after the Latin name for the wolf spider) searches a humongous list of World Wide Web pages. Access to the Lycos search engine was officially opened to the public on July 20, 1994. At that time, they had summaries of 54,000 web documents on file. Today, they're getting a couple of million hits per week on a database of over five and a half million web pages as of the middle of August [1995].

Using Lycos is painfully easy. You go to the Lycos Web page, put your search keywords in the blank, and click on the start button. What you'll get is a list of Web pages that have your keywords in them. Lycos uses a simple scoring technique to decide how important any particular Web page might be to you. Is the keyword in the title, or deep in the text? The system works well, and Lycos is generous in how many hits it gives you. I've paged through sixty or seventy from one query, and still not hit any limit. You want to go to that Web page? Just click on it--each hit is also a hypertext link to that site.
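To make the scoring idea concrete, here is a minimal sketch in Python of how a "title beats deep-in-the-text" ranking could work. It is not Lycos's actual formula, which isn't described here; the weights (10 for a title match, 5 divided by position for a body match) and the function names score_page and rank are made up for illustration.

```python
import re

def score_page(title, body, keywords):
    """Toy relevance score: a title match counts most, an early body match next."""
    title_words = re.findall(r"[a-z0-9]+", title.lower())
    body_words = re.findall(r"[a-z0-9]+", body.lower())
    score = 0.0
    for kw in keywords:
        kw = kw.lower()
        if kw in title_words:
            score += 10.0  # assumed weight for a keyword in the title
        if kw in body_words:
            position = body_words.index(kw)
            # Assumed decay: the deeper the first mention, the less it counts.
            score += 5.0 / (1 + position)
    return score

def rank(pages, keywords):
    """Return URLs of pages with a nonzero score, best first."""
    scored = [(score_page(p["title"], p["body"], keywords), p["url"]) for p in pages]
    return [url for s, url in sorted(scored, key=lambda t: -t[0]) if s > 0]
```

Under these assumed weights, a page titled "Indonesia travel guide" would land above a page that only mentions Indonesia in its ninth word, and a page that never mentions it would be dropped.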

Lycos got so big, so fast not through smart technology, but through slick psychology. Regular folks are encouraged to register their Web page addresses with Lycos. You give Lycos your URL, and Lycos will put it in its database. When it gets a chance, Lycos will look at your Web page to make sure it's working, and to grab the text for the database as well. This gives the average Joe a warm fuzzy as a reward--"My Web page is listed in Lycos!"--while Lycos rakes in more and more information. It's more than a win-win situation: it makes Lycos' growth a direct function of its user base, a cybernetic insight worthy of Norbert Wiener himself. Simple, yet brilliant.

Other search tools are moving on into more sophisticated ways of getting the job done. Brian Pinkerton, then a grad student at the University of Washington, started working on WebCrawler just in January of 1994. His motive was fairly uncomplicated: "I wrote it because I could never find information when I wanted it and because I do not have time to follow endless links."

WebCrawler has looked at 150,000 documents, and has another million and a half web addresses on file. There's more than just a database search happening here. Put in a query, and WebCrawler will search hypertext links in documents it has in its database, but then it will use software agents to check out links in those documents to other web pages it hasn't looked at yet. The intelligent agents inside WebCrawler duplicate the sort of selecting and filtering that a human does when using the World Wide Web. Pinkerton, a good "net citizen", was worried about the enthusiasm of his software agents, so he put limits on how much material WebCrawler would try to suck out of any one server at a time. [note 3]
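The "follow the links, but don't hammer any one server" idea can be sketched in a few lines of Python. This is only an illustration, not Pinkerton's code: the crawl function, the max_per_host limit, and the get_links callback (which stands in for actually fetching a page and pulling out its links) are all assumptions.

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start_urls, get_links, max_per_host=2, max_pages=100):
    """Breadth-first walk over links, visiting at most max_per_host pages per server."""
    queue = deque(start_urls)
    seen = set(start_urls)
    per_host = {}   # how many pages we have taken from each server so far
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        host = urlparse(url).netloc
        if per_host.get(host, 0) >= max_per_host:
            continue  # politeness cap reached for this server; skip the page
        per_host[host] = per_host.get(host, 0) + 1
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

In a real crawler get_links would fetch the page over the network; passing it in as a function keeps the sketch runnable against a small in-memory table of links.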

WebCrawler is fairly modest when it comes to hardware. Right now, it's running on three Pentiums using NextStep as their operating system. The big innovation in WebCrawler is its smart software. Mainly due to that software, WebCrawler was bought out by one of the 500-pound-gorillas of cyberspace, America On-Line, in a deal that was finalized April 25th of this year and announced on June 1st. At the same time, AOL announced that it was buying Global Network Navigator from O'Reilly and Associates (the publishers of those useful unix manuals with the animals on the cover you see everywhere). The search capabilities and smart software in WebCrawler would complement the big GNN list nicely. Add all this to AOL's earlier purchase of ans.net, one of the big Internet backbones, and you have the makings of America On-Line's planned Internet-only service, as of this writing due to roll out in late August of this year, remarkably close to the debut of the Microsoft Network.

When the deal between AOL and WebCrawler was announced, there was plenty of worrying that AOL would take WebCrawler and start charging for it. Actually, America On-Line won't be charging people to use WebCrawler. It turns out that they're well aware of their poor reputation out on the Net, and they want to keep WebCrawler free for the public relations value, specifically to "give something back to the Internet community" after releasing so many AOL users onto the Internet who seem to take more than they give.

Lycos hasn't been ignored by corporations looking to purchase hot Internet properties, either. CMG Information Services, Inc., has purchased exclusive rights to the Lycos "spider" technology, and they've set up Lycos, Inc. as a subsidiary to manage the whole concept. CMG's previous enterprises involved direct marketing--not Elvis figurines or veg-o-matics, but things like textbook sales to college professors. CMG has also searched and filtered sales leads on behalf of mutual funds. Direct marketing, searching, filtering--it's all sort of a premonition of their Internet involvement. Lycos, Inc. has sold a non-exclusive license to Microsoft to use the software for running search engines on the new Microsoft Network. Notice that that's a non-exclusive license; other licenses have been sold, and the original Lycos at Carnegie-Mellon will stay in place. "CMU remains committed to widespread access to Lycos, and we are working hard to provide the service at no cost to the user (we expect to recover costs through advertising)." For right now, it's still a freebie.

Yet another acquisition for the America On-Line side is Martijn Koster, a Dutch computer scientist. He has been working at Nexor in Britain, but he's been lured away by the new AOL-WebCrawler team. The CUSI that Koster developed for Nexor has been exported to plenty of sites around the world, including Spry (now a tentacle of Compuserve) where it's called the Internet Wizard. [note 4] CUSI is supposed to stand for Configurable Unified Search Engine--I know, it's an E, not an I, but that's how he has it in his web pages. (Probably he meant "interface".) The idea here is to make your query on one web page; that server will take your query and plug it into other web searchers, compile the results and spit them back out at you in a standard format. CUSI in Switzerland lets the end user configure their search using a regular expression in the Perl programming language, which is a plus for unix gurus and their ilk, but no help at all to the rest of us. The main drawback of these services is the lack of depth in the search they carry out, and the lack of common sense in the search. A search through the Swiss CUSI server using the one keyword "Indonesia" turned up a web page about photovoltaic cells--one of which happened to be in, yes, that's right, Indonesia. Spry's Internet Wizard turned up a nice list of sites, but many of those had similar trivial mentions of Indonesia down in the text. But again, you combine the unified search idea of Koster's with Pinkerton's smart agents, and you see where AOL has a chance at fielding a killer search application in the future.
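The unified search idea is simple enough to sketch in Python: hand one query to several searchers, then fold the answers into one standard format. This is a rough illustration under stated assumptions, not how CUSI is actually built; the unified_search function and the shape of the engines table (a name mapped to a function returning url/title pairs) are invented for the example.

```python
def unified_search(query, engines):
    """Send one query to every engine and merge the hits into one list.

    `engines` maps an engine name to a function: query -> list of (url, title).
    """
    merged = {}
    for name, search in engines.items():
        for url, title in search(query):
            # Keep one record per URL and remember which engines reported it.
            hit = merged.setdefault(url, {"url": url, "title": title, "sources": []})
            hit["sources"].append(name)
    # Hits reported by more engines come first; ties keep first-seen order.
    return sorted(merged.values(), key=lambda h: -len(h["sources"]))
```

Recording the sources for each hit also makes it easy to see when every result is really coming from a single engine.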

SavvySearch at Colorado State is a much better version of this "unified search" idea. It's fast and returns useful information. Interestingly enough, the same "Indonesia" search done through SavvySearch turned up a nice selection of Web pages, but the original source of all those hits was Lycos! This may have been because the WebCrawler home page was just moved from the U of Washington to its new home at webcrawler.com, and maybe SavvySearch hadn't updated its address book yet. It does make one wonder about the value of unified searches, if all the info is coming from just one site anyway. The answer will be to sign up more and more databases to participate in these unified search schemes. [note 5]

Now, if you cross the Lycos idea of getting your info right from users with the CUSI idea of unified searching, what do you get? Unified registration, of course. A web page called Submit-It will do just that--you put in the URL for your Web page, and Submit-It will send the info along to all the search engines it knows about. Submit-It is a service of Permalink of Independence, Missouri.

These info services are so useful, and have so much potential to grow along with the Internet itself, that the chance to make a buck off of them is just too tempting. Case in point: DejaNews, a searcher that digs through Usenet posts. It doesn't carry every newsgroup, but it's fast, and for the moment it's free. Internic has them registered as being Bob Gustwick Associates of Austin, Texas. They're coy about their future plans: "we may eventually need to charge for some queries. We will try to avoid this but we can not rule it out." Draw your own conclusions. [note 6]

InfoSeek is out to make money off of Web searches, and they're quite up-front about it. InfoSeek was started in January 1994 by Steve Kirsch, previously the founder and president of Frame Technology, the people who make FrameMaker. InfoSeek takes the multiple-search concept to the next level. It searches lots of databases at the same time, both on and off the Internet: a truckload of mostly computer-related publications, posts from 10,000 Usenet newsgroups for the last four weeks, 400,000 Web sites, NewsBytes and other industry wire services, movie reviews, company profiles, and many more databases "in planning". At $9.95 per month, InfoSeek may seem to be a waste of money, when you can search the Web for free at so many sites. But when you compare InfoSeek with existing information services like Lexis/Nexis, which can be breathtakingly expensive, InfoSeek seems like a real deal, and will look even more attractive when they put in all the databases they say they plan to add. They're not out to compete with the freebies, they're out to undercut the overpriced premium services in the market. InfoSeek has a chance to put real information services in the hands of the masses.

The most far-reaching business venture in information so far is NLightN, a product of the generic-sounding Library Corporation. They started out in 1975 putting Library of Congress bibliographic info on microfiche; they've been working on the NLightN concept for a geologic age (since 1990). Just like Microsoft, they've bought a license to use the Lycos technology.

NLightN is as much a clearinghouse as it is a search engine. It advertises itself as the "world's largest table of contents". (More than that, the folks at NLightN call their service "seductively useful".) They have Internet info, off-Internet databases, library catalogs, those transcripts of TV shows you can mail-order from Journal Graphics for a dollar or two, movie reviews, corporate info--you name it, they got it.

Unlike InfoSeek, NLightN charges by the "information unit". One IU is a dime. Small wire service articles appear to be priced at, well, ten cents, but you can order paper copies of magazine articles, audio tapes, even whole books by clicking on a button. This is the most intriguing feature of NLightN--you put in "Indonesia", and beyond all the info you can get on-line, you're offered the chance to order a whole book on the subject, delivered to you by express. Cyber-purists will complain that things like that ought to be digitized, and they're right, but imagine if this concept were extended to other products. You could put in a search for "Greek cheese" and find the best places to buy feta--at the best price. The potential is staggering.

Other companies and universities are forging ahead with research into smart searches, data retrieval, intelligent agents, and all sorts of things that are supposed to lead us into a lotus-eating future. MCC Inc of Austin, Texas has its Carnot Project, where researchers are working on using intelligent agents to get info from storage spots across a big corporation that might not be compatible otherwise. Who cares if one office is using Sun workstations while another one in California is using Macs? Let the software agent deal with it. The software application is called InfoSleuth ("advertising, discovery and fusion" of info from "heterogenous and distributed" sources) and it's scheduled to roll out in 1997. At MIT, Yezdi Lashkari's WebHound is an experimental agent that looks at a list of Web pages you like, and suggests other related ones you might like to try. Similar projects like HOMR are running at MIT, too: they don't predict what Web pages you'd like, but what music CD's you'd like, based on what you prefer now. At the United Nations Trade Point Development Centre in Melbourne, Australia, bureaucrats are applying smart searches and intelligent agents to electronic commodities trading. And at the University of Chicago, decision-making and recognition software is being put into a real-life robot named "Chip"--if you're lucky, you might even be able to get a live view of their lab on their web page. Let your imagination run a little bit: instead of telling a software robot to get artichoke prices for you on the Internet, you could be telling a hardware robot to get a cold beer for you from the fridge.
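For the curious, here is one way an agent could turn "a list of Web pages you like" into suggestions, sketched in Python. The method is an assumption on my part (suggest pages liked by people whose lists overlap with yours), and the suggest function and its inputs are made up for the example; the WebHound description above doesn't say how it actually decides.

```python
from collections import Counter

def suggest(my_pages, other_users, top_n=3):
    """Suggest pages liked by users whose lists overlap with mine."""
    mine = set(my_pages)
    votes = Counter()
    for pages in other_users:
        theirs = set(pages)
        overlap = len(mine & theirs)
        if overlap == 0:
            continue  # nothing in common, so this user's taste tells us little
        for page in theirs - mine:
            votes[page] += overlap  # weight each vote by how much we agree
    # Highest vote first; ties broken alphabetically so the order is stable.
    ordered = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page for page, _ in ordered[:top_n]]
```

The same sketch would work for music CDs as well as Web pages, since it only compares lists of names.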

Anybody with World Wide Web access can use today's generation of Internet searchers now, of course. Here's a whole load of URLs for you to try:

http://nearnet.gnn.com
Global Network Navigator

http://www.lycos.com
Lycos

http://webcrawler.com
WebCrawler

http://cuiwww.unige.ch/meta-index.html
CUSI site in Switzerland

http://www.spry.com/wizard/index.html
Spry's Internet Wizard

http://submit-it.permalink.com/submit-it
Submit-It

http://www.dejanews.com
DejaNews

http://www.infoseek.com
InfoSeek

http://www.nlightn.com
NLightN

http://webhound.www.media.mit.edu
WebHound

http://cosmos.kaist.ac.kr/na/netagent.html
Net Agent site in Korea

http://www.cs.colostate.edu/~dreiling/smartform.html
Savvy Search

http://www.mcc.com/projects/InfoSleuth
InfoSleuth

http://www.cs.uchicago.edu/~firby/aap/index.html
University of Chicago agents page (featuring "Chip")

http://www.unicc.org/untpdc/eto/etoagent/
UN Electronic Trading site in Australia

http://www.cs.umbc.edu/agents/agents.html
Intelligent Agents page at the University of Maryland, Baltimore County

Charles A. Gimon teaches an Intro to the PC class at the English Learning Center in South Minneapolis. He can be reached at gimonca@skypoint.com

NOTE 1: GNN was purchased by America Online in 1995. No longer in existence.

NOTE 2: As of 2002, there appear to be no archie servers still in operation.

NOTE 3: As of 2003, Webcrawler still exists as an aggregator or meta-search engine, combining search results from multiple other search engines. Good "net citizenship" is in theory supported by the Robot Exclusion Standard (the robots.txt file).

NOTE 4: As of 2003, spry.com is a web-hosting provider, and no longer a search engine, although Internet Wizard may still be a trademark of Compuserve.

NOTE 5: Compare the descendant of Savvy Search at http://www.search.com/.

NOTE 6: DejaNews has since been acquired by http://www.google.com/.
