Uncovering The Dark Secrets Of The Web
Sun Herald
Sunday June 2, 2002
TRAWL the internet using any of the major search engines and you will usually turn up thousands of references that match some part of your query.
If you really want to spin yourself out, consider this. Standard engines such as Google and MSN Search scan only about 1 per cent of what's out there.
That's the figure Jared Cope came across when researching his honours project with the Faculty of Engineering and Information Technology at the Australian National University.
"It was kind of hard to believe at first," he said. "So I came up with a device which you can add to a search engine and dramatically increase your coverage of the web.
"It's not that thousands of hits aren't enough. The problem has more to do with what's being searched rather than what's found."
When you search the internet, a program looks for matches, or hits, on web pages. Most home pages have links that will get you to what you want, and sometimes there's a search facility. All this information is readily available to search engines.
The sticking point is websites that are accessible only through their own search box. To search engines, these sites may as well be solid brick walls because they can't see the web pages behind them. Borrowing a term from astronomy, this hidden information is called dark matter.
You can get to some dark matter using meta-searches, where your initial search engine puts your query to many other engines. The difficulty is that these have to be set up manually. You have to tell the computer which engines to use, so they have to be ones that you already know about.
Cope designed a program that recognises sites that have a search box as their sole entry point. He did this by studying HTML, the language used to write web pages.
His program kicks into action when a search engine runs up against a wall. It asks questions about the page to decide if the site is using a search box. If it is, the program brings it to your attention so you can come back and have a look later on.
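To make the idea concrete, here is a minimal sketch in Python of the kind of questions such a program might ask about a page. It is not Cope's actual program, which the article does not describe in detail. The names FormScanner and classify, the max_links threshold, and the rule that a password field marks a log-in box are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class FormScanner(HTMLParser):
    """Collects the input types inside each <form> and counts ordinary links."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of input types per completed form
        self.links = 0        # number of <a href=...> links on the page
        self._current = None  # input types of the form being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
        elif tag == "input" and self._current is not None:
            # An <input> with no type attribute is a text box by default.
            self._current.append((attrs.get("type") or "text").lower())
        elif tag == "a" and attrs.get("href"):
            self.links += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def classify(html, max_links=3):
    """Return 'login', 'search-only' or 'browsable' for a page of HTML.

    Assumed rules (illustrative only): a form with a password field is a
    log-in box; a text box on a page with few links is a sole search entry.
    """
    scanner = FormScanner()
    scanner.feed(html)
    scanner.close()
    if any("password" in inputs for inputs in scanner.forms):
        return "login"
    has_text_box = any(
        kind in ("text", "search")
        for inputs in scanner.forms
        for kind in inputs
    )
    if has_text_box and scanner.links <= max_links:
        return "search-only"
    return "browsable"


if __name__ == "__main__":
    page = '<form action="/find"><input type="text" name="q"></form>'
    print(classify(page))
```

A real tool would need far more careful rules than a link count and a single text box, but the sketch shows why the page's HTML alone can be enough to flag a site for a later look.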
In theory, Cope's tool could increase accessibility to 85 per cent.
"But it only works for publicly available stuff," he said. "It doesn't let you get into password-protected sites because they have a log-in box and the code for that is completely different."
The CSIRO hopes to use Cope's search tool to enhance search engines on public service sites. As for Cope, the big guns haven't knocked on his door yet.
What does HTML stand for?
Hypertext markup language: hypertext gives directions to other web pages via links, and markup formats the text.
© 2002 Sun Herald