Sensible Bits: Meta Search Engines... What lies beneath...

Sunday, May 28, 2006

Meta Search Engines... What lies beneath...

Web search is a topic that always excites me. I am doing my research in this area and would like to summarize a wonderful paper I found in this field. The paper was written by Weiyi Meng, Clement Yu and King-Lup Liu from University of Illinois Chicago. The metasearch engine is one of the fastest growing concepts in the field of information retrieval.

The web is a huge information resource and it keeps growing every day, so a good search engine is needed to find information on it effectively. There are two types of search engines: general-purpose engines, which try to search all the pages on the web, and special-purpose engines, which search only specific areas of the web.

There are many search engines available and they all use different techniques to search for data. But as the web grows rapidly in size, we will eventually have to live with the idea of having special-purpose search engines for different parts of the web. A metasearch engine is a system that provides unified access to multiple existing search engines. A metasearch engine doesn’t maintain its own index of documents. It just takes the results returned by other search engines and reorganizes them.

There are many advantages to using a metasearch engine. It can increase the search coverage of the web: the web grows faster than the indexing capacity of any single search engine, so searching with more than one engine helps to cover more of it. It also addresses the scalability problem faced by a single general-purpose search engine. It makes it easier to use multiple search engines together. For example, we may need data from many newspapers, each with its own search system that gives different results; a metasearch engine lets us combine these results. Finally, a metasearch engine helps improve retrieval effectiveness.

In a typical session, a user submits a query to the metasearch engine through a user-friendly interface. The metasearch engine sends the query to a number of underlying search engines, called component (or local) search engines. Different component search engines may accept queries in different formats. After the results from the component search engines are received, the metasearch engine merges them into a single ranked list and presents the merged list to the user. The result could be a list of documents or, more likely, a list of document identifiers, each with a short description.
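To picture the session described above, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not the paper's method: each component engine is modeled as a plain function returning (document id, score) pairs, the engine names and documents are made up, and the merge step simply keeps the best raw score per document, which ignores the fact that scores from different engines are usually not comparable.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a component engine is a function query -> [(doc_id, local_score)].
Engine = Callable[[str], List[Tuple[str, float]]]


def metasearch(query: str, engines: Dict[str, Engine], top_m: int) -> List[Tuple[str, float]]:
    """Send the query to every engine and merge the answers into one ranked list."""
    best: Dict[str, float] = {}
    for engine in engines.values():
        for doc_id, score in engine(query):
            # A document returned by several engines keeps its highest score.
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    # Highest score first; ties are broken by document id for a stable order.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_m]


# Hypothetical stand-ins for real component search engines.
def news_engine(query: str) -> List[Tuple[str, float]]:
    return [("news/1", 0.9), ("shared/7", 0.4)]


def sports_engine(query: str) -> List[Tuple[str, float]]:
    return [("sports/3", 0.7), ("shared/7", 0.6)]


session_results = metasearch(
    "world cup", {"news": news_engine, "sports": sports_engine}, top_m=3
)
print(session_results)
```

A real system would put database selection and query translation before the dispatch loop and a more careful merger after it.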

The main components of a metasearch engine are the user interface, database selector, document selector, query dispatcher, component search engines and result merger. Each of them is described below.

The user interface is where the user types queries and sees the results. It may look the same as that of any other search engine. The query is then passed to the database selector. If the number of component search engines is very small, we can send the query to every one of them. But if we have thousands of component search engines, sending each query to all of them no longer works. Also, if the search engines are topic-specific, submitting the query to every engine is wasteful: we use up the resources of the local systems, we get results we don’t need, and we need more time to merge the larger number of results.

So we need a database selector that identifies all the useful databases while minimizing the number of useless databases wrongly identified as useful. The three approaches to database selection are rough representative, statistical representative and learning-based approaches. In the rough representative approach, a local database is represented by only a few selected keywords or paragraphs. Statistical approaches usually represent the contents of a database with rather detailed statistical information, such as the document frequency of a term or the average weight of a term. Learning-based approaches predict the usefulness of a database for new queries from retrieval experience with past queries.
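As a rough illustration of the statistical approach, the sketch below scores each database by summing, over the query terms, the fraction of its documents that contain the term (document frequency divided by database size). The scoring formula, the threshold and the sample statistics are all assumptions of mine for illustration; the paper surveys several more refined estimators.

```python
from typing import Dict, List

# Hypothetical database representatives: size of the database and, per term,
# the document frequency (number of documents containing that term).
representatives = {
    "medical": {"size": 1000, "df": {"heart": 400, "surgery": 250}},
    "sports": {"size": 2000, "df": {"heart": 20, "football": 900}},
    "cooking": {"size": 500, "df": {"recipe": 300}},
}


def usefulness(query_terms: List[str], rep: dict) -> float:
    """Sum of df/size over the query terms; unseen terms contribute 0."""
    return sum(rep["df"].get(term, 0) / rep["size"] for term in query_terms)


def select_databases(query_terms: List[str], reps: Dict[str, dict], threshold: float) -> List[str]:
    """Return databases scoring at least `threshold`, most useful first."""
    scores = {name: usefulness(query_terms, rep) for name, rep in reps.items()}
    selected = [name for name, score in scores.items() if score >= threshold]
    return sorted(selected, key=lambda name: (-scores[name], name))


selected = select_databases(["heart", "surgery"], representatives, threshold=0.05)
print(selected)
```

Raising the threshold sends the query to fewer engines and saves resources; lowering it reduces the risk of missing a useful database.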

The next important part of a metasearch engine is the document selector. It determines which documents to retrieve from the database of each selected search engine. The goal is to retrieve as many useful documents as possible while keeping the number of useless ones low. The different methods are user determination, weighted allocation, learning-based approaches and guaranteed retrieval. User determination lets the user decide how many documents to retrieve from each component database. Weighted allocation retrieves proportionally more documents from the component databases that are ranked higher for the query, so most of the retrieved documents come from the most promising databases. Learning-based approaches decide how many documents to retrieve from each component database based on past experience. Guaranteed retrieval aims at retrieving all the potentially useful documents for any given query.
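Here is a small sketch of how weighted allocation could work. I am assuming that each selected database already has a usefulness score for the query and that the number of documents requested from a database is proportional to that score; the rounding rule (largest remainder) is my own choice so that the quotas add up exactly to the requested total.

```python
from typing import Dict


def allocate(scores: Dict[str, float], total_docs: int) -> Dict[str, int]:
    """Split `total_docs` among databases in proportion to their scores."""
    total_score = sum(scores.values())
    if total_score <= 0:
        return {name: 0 for name in scores}
    # Exact (fractional) share of each database.
    exact = {name: total_docs * s / total_score for name, s in scores.items()}
    # Start from the rounded-down share.
    quota = {name: int(share) for name, share in exact.items()}
    leftover = total_docs - sum(quota.values())
    # Hand the remaining documents to the largest fractional parts.
    by_fraction = sorted(scores, key=lambda n: (-(exact[n] - quota[n]), n))
    for name in by_fraction[:leftover]:
        quota[name] += 1
    return quota


# Hypothetical database scores for one query.
quotas = allocate({"medical": 6, "sports": 3, "cooking": 1}, total_docs=10)
print(quotas)
```

The metasearch engine would then ask each component engine only for its quota of top-ranked documents.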

The query dispatcher establishes a connection with the server of each selected search engine and passes the query to it. The query dispatcher must be compatible with all the component search engines and must be able to pass the query in the format that each engine recognizes. Some Boolean queries must be translated before they are submitted to a search engine. For vector space model queries, query translation is as straightforward as retaining all the terms.
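The translation step can be pictured as one small adapter per component engine. In the sketch below the engine names, the two query formats and the `send` callback are hypothetical; a real dispatcher would issue network requests in whatever syntax each engine documents.

```python
from typing import Callable, Dict, List


def to_vector_space(terms: List[str]) -> str:
    """Vector space engines: just keep all the terms."""
    return " ".join(terms)


def to_boolean_and(terms: List[str]) -> str:
    """Boolean engines (assumed syntax): join the terms with AND."""
    return " AND ".join(terms)


# Hypothetical mapping from engine name to the query format it accepts.
translators: Dict[str, Callable[[List[str]], str]] = {
    "engine_a": to_vector_space,
    "engine_b": to_boolean_and,
}


def dispatch(terms: List[str], selected: List[str], send: Callable[[str, str], object]) -> Dict[str, object]:
    """Translate the query for each selected engine and pass it on via `send`."""
    return {name: send(name, translators[name](terms)) for name in selected}


# For illustration, `send` just echoes the translated query instead of
# contacting a real server.
sent = dispatch(["meta", "search"], ["engine_a", "engine_b"], lambda name, q: q)
print(sent)
```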

The result merger is another important component on which the success of a metasearch engine depends. It combines the results from the component search engines into a single ranked list. The top m documents in the list are then forwarded to the user interface to be displayed, where m is the number of documents desired by the user. A good result merger should rank all the returned documents in descending order of their similarity to the query. There are two approaches to result merging: local similarity adjustment and global similarity estimation. Local similarity adjustment approaches adjust the local similarities using additional information, such as the quality of the component databases. Global similarity estimation approaches attempt to compute or estimate the true global similarities of the returned documents.
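A minimal sketch of local similarity adjustment follows. I am assuming the simplest possible adjustment, multiplying each local similarity by a quality weight for the database it came from; the weights, document ids and scores are invented for illustration and the paper describes more elaborate schemes.

```python
from typing import Dict, List, Tuple


def merge_results(
    results: Dict[str, List[Tuple[str, float]]],
    db_weight: Dict[str, float],
    m: int,
) -> List[Tuple[str, float]]:
    """Adjust local similarities by database quality and return the top m."""
    adjusted: Dict[str, float] = {}
    for db, docs in results.items():
        for doc_id, local_sim in docs:
            # Assumed adjustment: scale by the quality of the source database.
            score = local_sim * db_weight.get(db, 1.0)
            adjusted[doc_id] = max(score, adjusted.get(doc_id, 0.0))
    # Descending adjusted similarity; ties broken by document id.
    ranked = sorted(adjusted.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:m]


# Hypothetical results from two component engines.
local_results = {
    "medical": [("d1", 0.8), ("d2", 0.5)],
    "sports": [("d3", 0.9)],
}
merged = merge_results(local_results, {"medical": 1.0, "sports": 0.5}, m=2)
print(merged)
```

Here the document from the lower-quality database drops out of the top two even though its local similarity was the highest, which is exactly the effect the adjustment is meant to have.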

But there are a lot of challenges in this process. The biggest one is that the component search engines are heterogeneous, because they are all built independently. Studying how these engines differ from each other can help us build a good metasearch engine. Some of the heterogeneities observed among component search engines are in indexing methods, document term weighting schemes, query term weighting schemes, similarity functions, document databases, document versions and result presentation.

As this is a survey paper, it also points out areas that should be worked on to improve existing metasearch engines. One of the major challenges is to integrate local search engines that employ different indexing techniques. We also need to integrate local systems that support different types of queries, i.e. Boolean and vector space queries. Efforts must be made to learn more about component search engines that don’t give out enough information about how they work. We need to develop new and effective result merging methods. We also have to study the appropriate cooperation between a metasearch engine and the local systems. New indexing and weighting techniques must be incorporated to build better metasearch engines. The paper also discusses issues with the effectiveness of metasearch and calls for a standard testbed to evaluate the proposed techniques for database selection, document selection and result merging.

In conclusion, the authors say that in this world of increasing numbers of search engines and digital libraries on the World Wide Web, providing efficient and effective access to text information from multiple sources is very important. The paper concentrates on the problems of database selection, document selection and result merging. The authors studied a wide variety of problems and tried to give directions for future research.

They have tried to point toward better solutions to each of the above-mentioned problems, such as obtaining more knowledge about the component search engines in the form of more detailed database representatives. They have also touched on the important issue of scalability with respect to data and access.

I felt that the paper was very well written and touched on most of the important aspects of metasearch engines. It explains what a metasearch engine is and why it is needed. It explains all the different components of a metasearch engine and the issues involved in each of them. And finally it gives some points that must be worked on in this field. I also found it very well organized and detailed.

Summary written by: Saket S Mengle (Research Assistant, IR Labs, IIT, Chicago).

Please leave your comments on this blog or mail me at smengle[at]iit[dot]edu if you are interested in knowing more.
