http://www.guardianunlimited.co.uk/Columnists/Column/0,5673,478293,00.html Second sight
Traditional search engines only access a small fraction of the web, says Victor Keegan
Thursday April 26, 2001 The Guardian
The internet has quickly become the biggest source of instant knowledge the world has ever known. It sometimes seems, as we type in requests to a search engine, that the whole world of learning and entertainment is at the tips of our fingers. But we haven't seen the half of it yet. The store of knowledge on the internet is like a rapidly expanding iceberg in which most of the mass remains stubbornly invisible. Some estimates suggest that the "invisible" or "deep" web of impenetrable databases could be up to 500 times larger than the part of the web that current search engines are able to penetrate. In terms of exploration, we are still in the pre-Columbus era.
But the search companies are working on it. One of this year's Big Things will be the emergence of a family of new and improved search engines, with minds of their own, aiming to get at the parts of the web today's engines can't reach.
Where is all this data hiding? A lot of it is on our own hard disks. Most of us have forgotten exactly what we have stored there and the rest of the world hasn't got a clue. But all that is changing as the peer-to-peer (P2P) revolution (enabling data to be exchanged from hard disk to hard disk without going through a central server) gathers pace. Napster started it with the exchange of music files online - but made the mistake of channelling all requests through a central server (ie a huge computer serving lots of individual ones), thereby attracting the attention of teams of lawyers from the record companies.
But the dozens of other P2P companies inspired by Napster don't need to go through a central node. The P2P revolution turns home and corporate computers - including personal digital assistants (PDAs) - into servers themselves. Suddenly, the hard disks of all participants will become a huge addition to the information sources of the web. And it won't just be music on tap. Users can exchange personal files, spreadsheets, poetry, operating systems, audio files - almost anything that can be digitised.
This includes full-length movies as well. Check out DivX, a compression technology for film which is tipped to become the video sister of MP3. Jordan Greenhall, chief executive officer of Project Mayo, claims the technology is spreading faster than MP3 did at an equivalent stage in its development. DivX claims to be able to compress a 4GB DVD disc into only 650MB (small enough to fit on a CD-rom).
Another part of the treasure trove of the web ripe for mining is the labyrinth of chat rooms. Predictably, demand from corporations is leading the way. Moreover.com, the news monitoring company from Clerkenwell that is building up a worldwide reputation, is in the forefront. It searches financial chat rooms where comments - true or false - are made that could affect a company's share price. Moreover can search these rooms every 15 minutes and sell the information it gleans about what potential punters are saying, which has a ready outlet with marketing departments and small companies prepared to sign up for a deal. Nick Denton, founder of Moreover, says he advises companies not to get too "heavy handed" with people making adverse comments about their activities - but there are obvious privacy issues here. They could become explosive if search engines start to comb private chat rooms or secretly monitor employees searching the web for illicit reasons during working hours. PA's www.ewatch.com also monitors news sites commenting on companies and keeps an eye on which writers are covering particular industries.
The most profitable part of the submerged web is hidden in the bowels of corporate databases where up to 85% of information is said to be "unstructured". That makes it ripe for tackling by the new army of intelligent search engines, including Britain's Autonomy. Others include www.purpleyogi.com, which claims to connect users with information from multiple sources based on their interests and what they are currently reading on the screen. In the consumer sector, www.mysimon.com searches 2,500 shopping sites to glean product prices for cost-conscious consumers.
A new-generation search facility attracting much attention is www.groove.com. Groove is a P2P application that enables groups within an organisation to share files and do different activities at the same time. It has been sold to a number of corporations including GlaxoSmithKline, which took out 10,000 licences. The software will enable the company's scientists to coordinate research projects internally and in collaboration with other companies and universities. Groove claims its software, which doesn't go through a central server, is "undiscoverable". Another new-age search engine offering access to content that ordinary engines miss is www.brightplanet.com. BrightPlanet believes that traditional search engines only have access to 1% of what exists on the web.
That, in turn, raises the bigger question of all the knowledge that isn't on the web but should be. One obvious example is the British Library which, despite putting unique material on its web site, such as the Gutenberg Bible, is only scratching the surface of its potential. Well under 1% of its riches are accessible through the web. If the invisible web represents all the material on the web below the surface that can't easily be seen, then the BL and other storers of knowledge represent everything underneath the iceberg. The land grab on the web is just beginning.