Comparison of Information Retrieval Models

Abstract

Information retrieval is an emerging field of computer science concerned with storing documents and retrieving them on a user's request. Its most essential task is retrieving the documents that are relevant to a given query. Efficient and effective retrieval models have been proposed and built for this task. This survey paper sheds light on some of these information retrieval models, which were built for different datasets and purposes, and presents a comparison among them.

Keywords: Information retrieval, retrieval models.

Introduction

A huge amount of information is available in electronic form, and its size is continuously increasing. Handling this information without an information retrieval system would be impossible. As the volume of data grew, researchers started paying attention to how relevant information can be obtained or extracted from it. Initially, much of information retrieval technology was based on experimentation and trial and error. Managing the increasing amount of textual information available in electronic form efficiently and effectively is critical. Different retrieval models, built on different principles, were developed to manage and extract information. Information is mostly stored in the form of documents, and the main purpose of a retrieval system is to find the information needed. An information retrieval system is a software program that stores and manages information on documents, often textual documents but possibly multimedia. The system assists users in finding the information they require. A perfect retrieval system would retrieve only the relevant documents, but in practice this is not possible because relevance depends on the subjective opinion of the user.

Basic patterns of models:

Almost every retrieval model includes the following basic steps:

Document content representation

Query representation

Query and collection comparison

Representation of results

Figure 1 information retrieval process (Hiemstra, November 2009)

Many models represent documents in indexed form because it is an efficient approach. Different algorithms have been developed specifically for indexing, since the better the data is stored, the more accurately and efficiently it can be retrieved.

Query formulation is the next important step. The user tries to search the data using keywords or phrases. In order to search for these phrases in the indexed collection, the query must be represented in the same form as the documents. Indexing can be done in different ways, depending on the content representation of both the documents in the collection and the user query (Cerulo, 2004) (Hiemstra, November 2009).
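
The indexing and query-representation steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the tokenizer, the function names (`tokenize`, `build_inverted_index`) and the toy collection are hypothetical and are not taken from any of the surveyed systems.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed, very simple scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

# Hypothetical toy collection.
docs = {
    1: "Information retrieval models",
    2: "Boolean retrieval of documents",
    3: "Neural network models",
}
index = build_inverted_index(docs)

# The query goes through the same tokenizer as the documents,
# so that query and collection are represented in the same form.
query_terms = tokenize("Retrieval MODELS")
postings = {term: sorted(index.get(term, set())) for term in query_terms}
print(postings)
```

Running the query through the same `tokenize` function as the documents is what lets the query be looked up in the index directly.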

The results of any retrieval system depend on its comparison algorithm, which therefore determines the accuracy of the system. The better the comparison, the better the results. The outcome of this comparison is a list of documents that may be relevant or irrelevant. The main objective of a retrieval model is to measure the degree of relevance of a document with respect to the given query (Paik, August 13, 2015).

Relevant documents are ranked higher than irrelevant ones and are shown at the top of the list, to minimize the time and effort the user spends searching for documents.

The paper is divided into sections, each explaining a different model and its results together with its advantages and limitations.

Retrieval Models

Exact match models

This model labels documents as either relevant or irrelevant. It is also known as the Boolean model, the earliest and the easiest model for retrieving documents. It uses logical functions in the query to retrieve the required data. George Boole's mathematical logic operators are combined with query terms and their respective document sets to form new sets of documents. There are three basic operators: AND (logical product), OR (logical sum) and NOT (logical difference) (Ricardo Baeza-Yates, 2009). The result of the AND operator is a set of documents smaller than or equal to the document set of any of the individual terms. The OR operator results in a document set that is bigger than or equal to the document set of any of the individual terms.
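
The three operators map directly onto set operations. The following is a small sketch, assuming each term has already been mapped to the set of ids of the documents that contain it; the postings and the function names are made up for illustration.

```python
# Hypothetical postings: term -> set of ids of the documents containing the term.
postings = {
    "information": {1, 2, 4},
    "retrieval": {1, 2, 3},
    "boolean": {2, 5},
}

def op_and(a, b):
    """Logical product: documents present in both sets."""
    return a & b

def op_or(a, b):
    """Logical sum: documents present in at least one set."""
    return a | b

def op_not(a, b):
    """Logical difference: documents in a that are not in b."""
    return a - b

# information AND retrieval
and_result = op_and(postings["information"], postings["retrieval"])
# information OR boolean
or_result = op_or(postings["information"], postings["boolean"])
# (information AND retrieval) NOT boolean
not_result = op_not(and_result, postings["boolean"])

print(and_result, or_result, not_result)
```

The AND result can never be larger than either operand and the OR result can never be smaller, which is the size property stated above.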

The Boolean model gives users a sense of control over the system. It distinguishes clearly between relevant and irrelevant documents if the query is accurate. The model does not rank documents, because the degree of relevance is ignored entirely. It either retrieves a document or does not, which can be frustrating for the end user.

Region models

Region models are an extension of the Boolean model that reasons about arbitrary parts of textual data, called segments, extents or regions. A region might be a word, a phrase, a text element such as a title, or a complete document. Regions are identified by a start position and an end position. Region systems are not restricted to retrieving documents.
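
As an illustration of regions identified by a start and an end position, the sketch below selects the regions that contain at least one region from another set. The containment operator, the `Region` type and the token positions are assumptions made for this example; actual region algebras define their own operator sets.

```python
from typing import NamedTuple

class Region(NamedTuple):
    start: int  # position of the first token in the region
    end: int    # position of the last token in the region (inclusive)

def contains(outer, inner):
    """True if inner lies completely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end

def containing(candidates, targets):
    """Return the candidate regions that contain at least one target region."""
    return [c for c in candidates if any(contains(c, t) for t in targets)]

# Hypothetical token positions: two title regions and the occurrences of one word.
titles = [Region(0, 3), Region(40, 45)]
word_hits = [Region(2, 2), Region(17, 17)]

matches = containing(titles, word_hits)
print(matches)
```

The result is again a set of regions rather than a set of documents, and it carries no ranking information.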

Region models did not have a big impact on the information retrieval research community, nor on the development of new retrieval systems. The reason for this is quite obvious: region models do not explain in any way how search results should be ranked. In fact, most region models are not concerned with ranking at all; one might say that they, like the relational model, are actually data models instead of information retrieval models (Mihajlović).

Ranking Models

Boolean models may miss important data because they do not support a ranking mechanism. Therefore, ranking algorithms needed to be introduced into retrieval systems. The results are ranked on the basis of the occurrence of the query terms in the documents. Some ranking algorithms depend only on the link structure of the documents, while others use a combination of both, that is, they use the document content as well as the link structure to assign a rank value to a given document (Gupta, 2013).

Similarity measures/coefficient

A similarity measure compares the query with the documents in the collection, and the documents that are most similar to the query are returned to the user. Several methods are used to measure similarity, for example cosine similarity, typically combined with a term weighting scheme such as tf-idf.

Cosine similarity

Cosine similarity computes the cosine of the angle between two vectors in n-dimensional space. The cosine similarity between two documents d and d' is given by:

$$\cos(d, d') = \frac{d \cdot d'}{|d| \, |d'|}$$
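
A direct Python transcription of the cosine similarity between two vectors might look as follows. The example vectors are arbitrary, and returning 0.0 for an all-zero vector is a convention assumed here, not something the measure itself defines.

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(d1, d2))
    norm1 = math.sqrt(sum(x * x for x in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm1 * norm2)

d = [1, 2, 0]
d_prime = [2, 4, 0]   # same direction as d, so the similarity is (numerically close to) 1
e = [0, 0, 3]         # shares no non-zero component with d, so the similarity is 0

print(cosine_similarity(d, d_prime))
print(cosine_similarity(d, e))
```

Because only the angle matters, a long document and a short document with the same term proportions receive the same score.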

The performance of the vector-based retrieval model can be improved by utilizing user-supplied information about which documents are relevant to the query in question (Kita, Oct 1, 2000).

Vaibhav Kant Singh and Vinay Kumar Singh (Vaibhav Kant Singh, 2015) describe the vector space model (VSM) for information retrieval. The VSM guides the user to the documents that are more similar and more significant by calculating the angle between the query and the terms or the documents. Here documents are represented as term vectors:

$$d = (t_1, t_2, t_3, \ldots, t_n)$$

where $t_i$ ($1 \le i \le n$) is a non-negative value that denotes the occurrences of term i in document d. With binary weights, $t_i$ takes values in {0, 1}.
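
To make the term-vector representation concrete, the sketch below builds weighted term vectors and ranks documents by their cosine with the query vector. The weighting used (raw term frequency multiplied by log(N / document frequency)) is one common tf-idf variant chosen for illustration, not necessarily the scheme of the cited paper; the documents and helper names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_model(tokenized_docs):
    """Return the sorted vocabulary and an idf value per term."""
    n_docs = len(tokenized_docs)
    vocab = sorted({term for doc in tokenized_docs for term in doc})
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocab}
    return vocab, idf

def to_vector(tokens, vocab, idf):
    """One weight per vocabulary term: term frequency times idf."""
    tf = Counter(tokens)
    return [tf[term] * idf[term] for term in vocab]

# Hypothetical, already tokenized collection.
docs = [
    ["boolean", "retrieval"],
    ["vector", "space", "retrieval"],
    ["neural", "network"],
]
vocab, idf = build_model(docs)
doc_vectors = [to_vector(doc, vocab, idf) for doc in docs]
query_vector = to_vector(["vector", "retrieval"], vocab, idf)

scores = [cosine(query_vector, v) for v in doc_vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The second document shares both query terms and is ranked first; the third shares none and receives a score of zero.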

Probabilistic model

The probabilistic model is based on the probability ranking principle. Statistics are used to estimate the probability that a retrieved document is relevant or non-relevant to the information need. Probabilistic models employ conditional probabilities based on the occurrence of terms. The probability ranking principle states that a retrieval system should rank the documents according to their probability of being relevant to the query, given all the available evidence. The documents are ranked in decreasing order of this probability. Index term weights are represented as binary values.
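
One way to turn binary term weights into a ranking is sketched below. It assumes a binary independence style score in which, in the absence of any relevance information, each query term that occurs in a document contributes log((N - n + 0.5) / (n + 0.5)), where N is the collection size and n the number of documents containing the term. This particular weight, the function name and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def bim_score(query_terms, doc_terms, doc_freq, n_docs):
    """Sum the term weights of the query terms that are present in the document."""
    score = 0.0
    for term in set(query_terms):
        if term in doc_terms:  # binary evidence: the term is present or absent
            n_t = doc_freq[term]
            score += math.log((n_docs - n_t + 0.5) / (n_t + 0.5))
    return score

# Hypothetical collection: each document is reduced to its set of terms.
docs = {
    "d1": {"probabilistic", "retrieval", "model"},
    "d2": {"retrieval", "system"},
    "d3": {"neural", "network"},
    "d4": {"boolean", "query"},
    "d5": {"speech", "recognition"},
}
doc_freq = Counter(term for terms in docs.values() for term in terms)
query = ["probabilistic", "retrieval"]

scores = {d: bim_score(query, terms, doc_freq, len(docs)) for d, terms in docs.items()}
# Rank in decreasing order of score, as the probability ranking principle requires.
ranking = sorted(docs, key=lambda d: scores[d], reverse=True)
print(ranking[:2])
```

Documents that contain none of the query terms all receive a score of zero, so they are tied at the bottom of the list.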

Bayesian network Model

A Bayesian network model (BNM) is a directed acyclic graphical model, which means it contains no directed cycles; its nodes are random variables. A BNM consists of a set of random variables and the conditional probability dependencies between them. It is also known as a belief network or causal net. A BNM ranks documents by using multiple sources of evidence to compute conditional probabilities. The graphical representation of the probability distribution makes it possible to analyse complex conditional independence assumptions.
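
The role of the graph can be made explicit with the standard factorization of a Bayesian network, in which every variable depends only on its parents. The three-node chain from a document variable d through a term variable t to a query variable q used in the second line is an illustrative assumption, not a structure taken from a specific system.

```latex
% Joint distribution of a Bayesian network over x_1, ..., x_n,
% where pa(x_i) denotes the parents of x_i in the directed acyclic graph.
P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)

% Assumed chain d -> t -> q with a binary term variable t:
P(q = 1 \mid d = 1) = \sum_{t \in \{0, 1\}} P(q = 1 \mid t) \, P(t \mid d = 1)
```

In the chain, the query depends on the document only through the term variable, which is the kind of conditional independence assumption the graph encodes.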

Inference Network Model

In the inference network retrieval model, the random variables are associated with four layers of nodes: a query node, a set of document nodes, representation nodes and index word nodes. The dependencies between the random variables are represented as edges. All the nodes in this model represent binary random variables with values in {0, 1}.

Figure 1 simplified inference network model (Hiemstra, November 2009)
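
A query node in such a network has to combine the beliefs of its parent nodes. The sketch below assumes the closed-form combinations commonly described for inference networks (product for AND, complement of the product of complements for OR, complement for NOT, and the mean for an unweighted sum); the belief values and function names are made up.

```python
from math import prod

def belief_and(beliefs):
    """All parents true: product of the parent beliefs."""
    return prod(beliefs)

def belief_or(beliefs):
    """At least one parent true: 1 minus the probability that all are false."""
    return 1.0 - prod(1.0 - b for b in beliefs)

def belief_not(belief):
    """Parent false."""
    return 1.0 - belief

def belief_sum(beliefs):
    """Unweighted average of the parent beliefs."""
    return sum(beliefs) / len(beliefs)

# Hypothetical beliefs that two index word nodes are true for one document.
term_beliefs = {"retrieval": 0.8, "model": 0.5}
values = list(term_beliefs.values())

print(belief_and(values), belief_or(values), belief_sum(values))
```

Because each operator returns a value between 0 and 1, documents can be ranked by the belief at the query node even when the query is written as a Boolean expression.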

Language based models:

Language-based models are retrieval models based on an idea from speech recognition. Speech recognition relies on two distinct models, the acoustic model and the language model. In retrieval, a language model is computed for each document in the collection and is based on its terms. Documents are ranked by the probability that their language model generates the query.
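
Ranking by the probability of generating the query can be sketched as follows. The example assumes unigram document models and linear interpolation with a collection model (often called Jelinek-Mercer smoothing) with an arbitrary mixing weight of 0.5; the documents, the query and the function name are hypothetical.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, coll_counts, coll_len, lam=0.5):
    """log P(query | document model), mixing document and collection estimates."""
    doc_counts = Counter(doc)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)      # maximum likelihood estimate in the document
        p_coll = coll_counts[term] / coll_len    # estimate from the whole collection
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in the entire collection
        score += math.log(p)
    return score

# Hypothetical, already tokenized collection.
docs = {
    "d1": ["language", "model", "retrieval"],
    "d2": ["speech", "recognition", "model"],
    "d3": ["boolean", "retrieval", "system", "retrieval"],
}
coll_counts = Counter(term for doc in docs.values() for term in doc)
coll_len = sum(len(doc) for doc in docs.values())
query = ["language", "retrieval"]

scores = {d: query_log_likelihood(query, doc, coll_counts, coll_len) for d, doc in docs.items()}
ranking = sorted(docs, key=lambda d: scores[d], reverse=True)
print(ranking)
```

Without the collection term, any document missing a single query word would get probability zero; the interpolation is one simple answer to that data sparsity problem.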

Alternative Algebraic model

Under this heading we discuss two further models: latent semantic indexing and the neural network model.

Latent Semantic Indexing

Latent semantic indexing (LSI) helps retrieve information accurately from large databases. The similarity of documents depends on the contexts in which words do and do not occur. LSI combines the idea of singular value decomposition (SVD) with the vector space model. LSI identifies documents that are semantically similar, i.e. on the same topic, even though they are not similar in the original vector space, and represents them in a reduced vector space in which their similarity is high. To compute LSI using SVD, a matrix A is decomposed into three further matrices:

$$A = U \Sigma V^{T}$$

where:
$\Sigma$ is a diagonal matrix,
$U$ is an orthogonal matrix, and
$V^{T}$ is the transpose of an orthogonal matrix.
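
The reduced vector space is obtained by keeping only the k largest singular values. The following lines sketch this step, assuming A is a term-by-document matrix (terms as rows) and q is a query vector over the same terms; the subscript k and the symbol for the projected query are notation introduced here for illustration.

```latex
% Rank-k approximation: keep the k largest singular values and the
% corresponding columns of U and V.
A_k = U_k \Sigma_k V_k^{T}

% Folding a query vector q into the reduced k-dimensional space:
\hat{q} = \Sigma_k^{-1} U_k^{T} q

% Documents are then compared with \hat{q}, for example by cosine similarity,
% using the rows of V_k as their k-dimensional representations.
```

Two documents that share no keyword can still end up close together in the k-dimensional space if their terms tend to co-occur elsewhere in the collection.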

Jin Wang et al. (Jin Wang, 9 May 2012) propose a model that uses the bag-of-words model for the analysis of human motions in video frames.

Ontology-based Information Retrieval

One of the most rapidly emerging fields of information retrieval and extraction nowadays is ontology-based information retrieval (OBIE). OBIE is defined as the use of ontologies in order to retrieve information. An ontology is a specification of a conceptualization of terms or words. Ontologies are generally domain-specific, which means that different domains have different ontologies. Because they are domain-specific, they define relationships between classes and entities, and they are application dependent. An ontology tree is a hierarchical representation of classes or entities and of the relationships between them, grouping and classifying entities on the basis of their similarities and dissimilarities.
Figure 2 Ontology based information extraction (Ritesh Shah, February 2014)
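
One simple use of such a hierarchy in retrieval is to expand a query concept with its subclasses before keyword matching. The sketch below is an illustrative assumption about how this could be done; the tiny ontology, the function names and the expansion strategy are hypothetical and do not describe a specific OBIE system.

```python
# Hypothetical ontology tree: each class maps to its direct subclasses.
ontology = {
    "vehicle": ["car", "bicycle"],
    "car": ["sedan", "hatchback"],
    "bicycle": [],
    "sedan": [],
    "hatchback": [],
}

def descendants(concept, tree):
    """Return the concept followed by all of its subclasses, depth-first."""
    result = [concept]
    for child in tree.get(concept, []):
        result.extend(descendants(child, tree))
    return result

def expand_query(terms, tree):
    """Replace each query term by itself plus its subclasses, without duplicates."""
    expanded = []
    for term in terms:
        for concept in descendants(term, tree):
            if concept not in expanded:
                expanded.append(concept)
    return expanded

print(expand_query(["car"], ontology))
```

A query for "car" would then also match documents that only mention "sedan" or "hatchback", at the cost of having to build and maintain the hierarchy.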

Neural Network Model

Neural network models consist of interconnected neurons and have a labelled, directed graph structure. The nodes of the graph perform calculations in order to produce an output. A directed graph contains nodes (vertices) and connections between the nodes, while a labelled graph is one in which every connection carries a label that identifies its properties. In a neural network model, the nodes of the graph behave as processing units, the edges behave as synaptic connections, and weights are assigned to the edges. In information retrieval, neural network models contain query-term nodes and document-term nodes. The query-term nodes initiate the retrieval process by sending signals to the document-term nodes. The document-term nodes then send signals to the document nodes. Igor Mokriš and Lenka Skovajsová (Igor MOKRIŠ, 2006) describe a neural network retrieval model for text documents in the Slovak language.
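
The signal flow from query-term nodes through document-term nodes to document nodes can be sketched as a single forward pass of weighted sums. This is a deliberately simplified illustration: the edge weights are made up, no activation function or feedback loop is applied, and the node names are hypothetical.

```python
# Hypothetical edge weights between the three layers of nodes.
query_to_term = {
    "q_retrieval": {"retrieval": 1.0},
    "q_neural": {"neural": 1.0},
}
term_to_doc = {
    "retrieval": {"d1": 0.7, "d2": 0.4},
    "neural": {"d2": 0.6, "d3": 0.9},
}

def propagate(activations, weights):
    """Send each active node's signal along its weighted edges and sum the inputs per target node."""
    out = {}
    for node, activation in activations.items():
        for target, weight in weights.get(node, {}).items():
            out[target] = out.get(target, 0.0) + activation * weight
    return out

# The query-term nodes start the process with full activation.
query_activation = {"q_retrieval": 1.0, "q_neural": 1.0}
term_activation = propagate(query_activation, query_to_term)
doc_activation = propagate(term_activation, term_to_doc)

ranking = sorted(doc_activation, key=doc_activation.get, reverse=True)
print(ranking)
```

Document d2 receives signals from both terms and therefore ends up with the highest activation in this toy network.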

Conclusion:

This survey paper discussed different information retrieval techniques together with their advantages and disadvantages. Each model has its own criteria for extracting the documents relevant to the user's query. We conclude that some methods work best for some applications, while others work best for other applications. Information retrieval systems are being used in different organizations, and new models are still being developed to obtain more relevant results.

Model | Related work | Methods | Advantages | Limitations

Exact match model
Related work and methods:
i. David E. Losada
ii. Based on set theory and Boolean algebra
iii. Query represented by a Boolean expression
iv. Terms combined with the operators AND, OR, NOT
v. Proximity
vi. Stemming
Advantages:
i. Easy to implement
ii. Exact match model
iii. Computationally efficient
Limitations:
i. No term weighting used in document and query
ii. Adds too much complexity and detail
iii. Difficult for end users to form a correct Boolean query
iv. No ranking
v. No partial matching

Vector space model
Related work and methods:
i. Weighting scheme used
ii. Cosine similarity
iii. Ranks documents by similarity
Advantages:
i. Improves retrieval performance by term weighting
ii. Similarity can be used for different elements
Limitations:
i. Term independence assumption
ii. Users cannot specify relationships between terms

Probabilistic model
Related work and methods:
i. Based on the probability ranking principle
ii. Based on relevance and non-relevance of data
Advantages:
i. Ranking of documents
ii. Does not consider index inside a document
Limitations:
i. Binary word-in-document weights
ii. Independence of terms
iii. Only partial ranking of documents
iv. Based on prior knowledge

Language-based models
Probability estimation of events in text
Query likelihood model
Speech recognition
Term-based, for each document in the entire collection
Length normalization of term frequencies
Data sparsity

Bayesian network model
Directed graphical model
Random-variable relationships are captured by directed edges
Deals with noisy data
Describes interaction between query and document space
Query specification based on Boolean expressions
Expensive computation
Bad performance for small collections

Inference network model
Random variables concerned with query, set of documents and index words
Provides a framework with possible ranking strategies
Boolean query formulation

Latent Semantic Indexing
Related work and methods:
Concept-based retrieval of text
Uses SVD
Advantages:
Retrieval of documents even if they share no keyword with the query
Solves problems of ambiguity (polysemy and synonymy)
Limitations:
Expensive
Works on small collections

Ontology-based Information Retrieval
Related work and methods:
i. Classification of entities in a hierarchical manner
ii. Based on keyword matching
Advantages:
Capability to reuse and share the ontology with other applications
Limitations:
High time consumption
Difficulties in creating the ontological tree
Adding a new concept to an existing ontology requires considerable time and effort

Neural network model
Related work and methods:
Neural based
Weights assigned to the edges between neurons
Advantages:
Easy to use but requires some statistical training
Deals with large collections of data
Detects relationships between the query and the retrieved documents
Limitations:
Difficult to design
Expensive
Complicated in nature
Does not deal with small documents
References
Cerulo, G. C. (2004). A Taxonomy of Information. Journal of Computing and Information Technology, 175–194.
Daniel Valcarce, J. P. (n.d.). A Study of Smoothing Methods for Relevance-Based Language Modelling of Recommender Systems. Information Retrieval Lab, Computer Science Department, University of A Coruña, Spain. {daniel.valcarce,javierparapar,barr iro}@udc.es.
Gupta, P. K. (2013). Survey Paper on Information Retrieval Algorithms and Personalized Information Retrieval Concept. International Journal of Computer Applications.
Hiemstra, D. (November 2009). Information Retrieval Models. In Goker, A., and Davies, J., Information Retrieval: Searching in the 21st Century. John Wiley and Sons, Ltd.
Hui Yang, M. S. (2014). Dynamic Information Retrieval Modeling. SIGIR'14, July 6–11, 2014, Gold Coast, Queensland, Australia. ACM 978-1-4503-2257-7/14/07. http://dx.doi.org/10.1145/2600428.2602297.
Igor MOKRIŠ, L. S. (2006). Neural Network Model Of System For Information Retrieval From Text Documents In Slovak Language. Acta Electrotechnica et.
Kita, X. T. (Oct 1, 2000). Improvement of vector space information retrieval model based on supervised learning. IRAL '00: Proceedings of the fifth international workshop on Information retrieval with Asian languages. ACM, New York, NY, USA, 69-74.
Koltun, E. K. (July 4, 2012). A Probabilistic Model for Component-Based Shape Synthesis. ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH 2012.
Mihajlović, D. H. (n.d.). A database approach to information retrieval: The remarkable. University of Twente, Centre for Telematics and Information Technology.
Paik, J. H. (August 13, 2015). A Probabilistic Model for Information Retrieval Based on Maximum Value Distribution. University of Maryland, College Park, USA, SIGIR'15.
Ricardo Baeza-Yates, B.-N. (2009). Modern Information Retrieval. ACM Press, New York.
Ritesh Shah, S. J. (February 2014). Ontology-based Information Extraction: An Overview. International Journal of Computer Applications (0975 – 8887).
Xi-Quan Yang, D. Y.-H. (2014). Scientific literature retrieval model based on weighted term frequency. IEEE Computing Society.