An Ontology Search Engine Based on Semantic Analysis

Abstract
Finding useful information and locating appropriate
Ontologies on the WWW or the Semantic Web is an important
task in Ontology research. The main difference between
Ontology information and ordinary information is that an
Ontology has semantic structure. To improve search precision
through semantic analysis, this paper proposes the
concepts--weights vectors matching algorithm (CWVMA). The
algorithm first parses the input message and the preliminary
keyword-based results into concepts sets. It then chooses a
matching rule according to the influence of each concept on
the Ontology's semantics and uses that rule to create weight
vectors. Finally, it computes a result vector that serves as
the basis for measuring the similarity between the input
message and each preliminary result. In addition, this paper
presents WI OntoSearch, a prototype Ontology search engine
based on this algorithm. The system can search about 4
billion web pages through the Google Web Service.
Experimental results indicate that the algorithm can improve
the precision of Ontology search.
1. Introduction
Sharing and reusing data among different
applications is a key task for the Internet and the
Semantic Web. As the relevant technologies develop, a
great deal of redundant information is created on the
Internet, and much of it is Ontology information.
Finding and accurately locating Ontology information
on the Internet or the Semantic Web quickly, that is,
Ontology search, is therefore an important task in
Ontology research.
Compared with a general search engine, an Ontology
search engine has its own characteristics, because the
most important difference between Ontology information
and other Web information is that an Ontology has a
rigid semantic structure. How to improve the precision
of Ontology search through semantic analysis is one of
the key problems in designing an Ontology search engine.
1.1. Related works
Searching for Ontologies with traditional keyword-based
search engines (see [1]) has its own problems. Such
engines do not check the semantics of the search objects
and treat them only as character strings. Much
irrelevant Ontology information is usually returned to
the user just because the keywords appear somewhere in
the files. To obtain the right Ontology information,
users, who are often specialists in a certain domain,
would like to enter a set of relevant concepts as query
words. However, keyword-based search engines cannot
offer satisfactory results for such queries.
Reference [2] introduces an Ontology search engine.
However, that engine only addresses visualization of the
semantic structure and does not change the keyword-based
nature of the search. Reference [3] describes an
intelligent search tool, TUCUXI. This tool captures the
semantics of Web pages through linguistic tools such as
WordNet and returns appropriate results by matching
structure. The tool takes semantics into account, but it
needs a whole Ontology file as input, so it is not
suitable for users who are still looking for an Ontology.
This paper takes the semantic structure of Ontology
information into account and proposes the
concepts--weights vectors matching algorithm (CWVMA).
The algorithm parses the input information and the
preliminary keyword-based results into sets of concepts
and then builds weight vectors according to the
influence of each concept on the semantics of the whole
Ontology. Finally, it combines the corresponding weight
vectors into a result vector according to which concepts
match. The sum of the result vector is used as a measure
value to filter out irrelevant Ontologies and to order
the remaining ones.
The algorithm provides the foundation for a prototype
system named WI OntoSearch, an Ontology search engine
that we designed and implemented. Through WI OntoSearch,
we search about 4 billion web pages provided by the
Google Web Service. Our experiments indicate that the
algorithm can improve the precision of Ontology search.
The paper is organized as follows. Section 2 describes
the concepts--weights vectors matching algorithm along
with some definitions. Section 3 introduces the WI
OntoSearch architecture. Experimental results are given
in Section 4. Section 5 concludes the paper and presents
future work.
2. Concepts--weights vectors matching algorithm
Definition 1 (weight vector): An Ontology is made up
of multiple terms (called concepts) that are related
and constrained by various structural frameworks. An
Ontology of $n$ concepts is mapped by a matching rule
into the vector $(r_1, r_2, \ldots, r_n)$, where
$r_i \in [0, 1]$. The value of $r_i$ denotes the
influence of the $i$-th concept on the semantics of the
whole Ontology and is decided by the matching rule. The
vector $(r_1, r_2, \ldots, r_n)$ is called the weight
vector.
Definition 2 (concepts--weights vectors): The concepts
set $(C_1, C_2, \ldots, C_n)$, where $C_i$ is the $i$-th
concept of the Ontology, and the corresponding weight
vector $(r_1, r_2, \ldots, r_n)$ are together called the
concepts--weights vectors.
Definition 3 (benchmark Ontology): A given Ontology
that serves as the standard against which other
Ontologies are measured or judged.
Definition 4 (evaluating Ontology): An Ontology that
is to be compared with the benchmark Ontology.
Reference [4] introduces the factors that need to be
considered when measuring Ontology semantics. The
concepts that compose an Ontology, and the influence of
these concepts on the Ontology's semantics, are among
the most important of these factors. The
concepts--weights vectors matching algorithm therefore
first decides the weight vectors according to a matching
rule and then matches the concepts sets to obtain the
result vector.
In most cases the matching rule is closely tied to the
practical application. If the application does not need
to consider the influence of individual concepts on the
Ontology's semantics, it can use a simple matching rule
such as the setting-one rule, which sets every component
of the weight vector to the constant 1. This rule can be
interpreted as every concept in the Ontology having the
same effect on the semantics of the whole Ontology. If
the application needs to discriminate fine variations in
the influence of concepts on the Ontology's semantics,
it must choose a more complex matching rule such as the
distance--root rule. The distance--root rule determines
every component of the weight vector according to the
concept's distance from the root concept (the root node
of the Ontology tree).
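As a rough illustration of the two rules, the sketch below builds weight vectors for an Ontology represented as a child-to-parent map. The paper does not give a formula for the distance--root rule; the linear decay of 0.25 per level used here is an assumption, chosen only because it reproduces the weights in the worked example later in this section. The function names, the `parents` representation and the placement of "fruit" under "fruit-vegetables" are likewise hypothetical.

```python
from typing import Dict, List, Optional


def setting_one_rule(concepts: List[str]) -> List[float]:
    """Setting-one rule: every concept gets the constant weight 1."""
    return [1.0 for _ in concepts]


def depth(concept: str, parents: Dict[str, Optional[str]]) -> int:
    """Number of edges between a concept and the root of its tree."""
    d = 0
    while parents[concept] is not None:
        concept = parents[concept]
        d += 1
    return d


def distance_root_rule(
    concepts: List[str],
    parents: Dict[str, Optional[str]],
    step: float = 0.25,  # assumed decay per level, not given in the paper
) -> List[float]:
    """Distance--root rule (assumed linear form): weight falls with depth."""
    return [max(0.0, 1.0 - step * depth(c, parents)) for c in concepts]


# Assumed shape of Ontology 2 in the example:
# food -> {fruit-vegetables, meats}, fruit-vegetables -> fruit -> apples.
parents = {
    "food": None,
    "fruit-vegetables": "food",
    "meats": "food",
    "fruit": "fruit-vegetables",
    "apples": "fruit",
}
concepts = ["food", "fruit-vegetables", "meats", "fruit", "apples"]
weights = distance_root_rule(concepts, parents)
```

Under these assumptions `weights` comes out as (1, 0.75, 0.75, 0.5, 0.25), matching the example, while `setting_one_rule` returns all ones without ever walking the tree, which is why it is the cheaper choice.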
WI OntoSearch uses the algorithm with the setting-one
rule, for two reasons. First, the number of input
concepts is generally small, and the algorithm with the
setting-one rule fully satisfies the user's needs.
Second, a search engine needs an acceptable response
time, while building weight vectors with a complex rule
such as the distance--root rule takes considerable time.
The basic idea of the algorithm is to estimate the
semantic similarity between the input message and the
preliminary keyword-based results, which are regarded as
the benchmark Ontology and the evaluating Ontologies
respectively. The algorithm is as follows.
Algorithm: concepts--weights vectors matching algorithm
Input: benchmark Ontology: Ontology1; evaluating Ontology: Ontology2
Output: result vector $(R_1, R_2, \ldots, R_m)$
  parse Ontology1 and Ontology2 into $(I_1, I_2, \ldots, I_m)$
    and $(C_1, C_2, \ldots, C_n)$;
  create corresponding weight vectors $(t_1, t_2, \ldots, t_m)$
    and $(r_1, r_2, \ldots, r_n)$ by matching rules;
  for all $I_i$ in $(I_1, I_2, \ldots, I_m)$ do
    compare $I_i$ with $C_j$ in $(C_1, C_2, \ldots, C_n)$;
    if $I_i = C_j$ then $R_i = t_i \times r_j$;
    else $R_i = 0$;
    end if
  end for
  output result vector $(R_1, R_2, \ldots, R_m)$.
Each $R_i$ in the result vector $(R_1, R_2, \ldots, R_m)$
expresses the semantic similarity between the $i$-th
concept of the benchmark Ontology and the matching
concept of the evaluating Ontology. The sum of the
result vector, $\sum_{i=1}^{m} R_i$, denotes the
semantic similarity between the evaluating Ontology and
the benchmark Ontology, so it is used as the similarity
value between the two.
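The matching step can be sketched in a few lines of Python. This is a minimal illustration rather than the WI OntoSearch implementation: it assumes concepts are compared by exact string equality and that the first matching concept decides $R_i$, neither of which the paper specifies. The names `cwvma` and `similarity` and the sample concepts are hypothetical.

```python
from typing import List, Sequence


def cwvma(
    bench_concepts: Sequence[str],
    bench_weights: Sequence[float],
    eval_concepts: Sequence[str],
    eval_weights: Sequence[float],
) -> List[float]:
    """Return the result vector (R_1, ..., R_m) for the benchmark concepts."""
    result = []
    for i_concept, t in zip(bench_concepts, bench_weights):
        r_i = 0.0  # stays 0 when no concept of the evaluating Ontology matches
        for c_concept, r in zip(eval_concepts, eval_weights):
            if i_concept == c_concept:
                r_i = t * r
                break  # assumption: the first match decides R_i
        result.append(r_i)
    return result


def similarity(result_vector: Sequence[float]) -> float:
    """Similarity value: the sum of the result vector."""
    return sum(result_vector)


# Setting-one rule: all weights are 1, so the sum counts matched concepts.
query = ["book", "author", "price"]
candidate = ["book", "title", "author", "publisher"]
R = cwvma(query, [1.0] * len(query), candidate, [1.0] * len(candidate))
score = similarity(R)
```

With the setting-one rule the similarity value reduces to the number of input concepts found in the candidate, here 2 out of 3.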
For example, consider the following two Ontology
snippets.
[Figure: Ontology 1 has the levels Food; Fruit; Apples,
from root to leaf. Ontology 2 has the levels Food;
Fruit-Vegetables and Meats; Fruit; Apples.]
They are parsed into the concepts sets (food, fruit,
apples) and (food, fruit-vegetables, meats, fruit,
apples). The weight vectors obtained by the
distance--root rule are (1, 0.75, 0.5) and
(1, 0.75, 0.75, 0.5, 0.25) respectively. The algorithm
above gives the result vector
(1, 0.75*0.5, 0.5*0.25) = (1, 0.375, 0.125). That is,
the similarity between “food” in Ontology 2 (the
evaluating Ontology) and the same word in Ontology 1
(the benchmark Ontology) is 1, the similarity for
“fruit” is 0.375, and the similarity for “apples” is
0.125. The similarity value between the evaluating
Ontology and the benchmark Ontology is
$\sum_{i=1}^{m} R_i = 1.5$.
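Written out term by term, with $t$ the weight vector of Ontology 1 and $r$ that of Ontology 2 (the indices follow the order of the two concepts sets, so "fruit" and "apples" are the fourth and fifth concepts of Ontology 2), the arithmetic behind the result vector and its sum is:

```latex
\begin{align*}
R_1 &= t_1 \times r_1 = 1 \times 1 = 1 \\
R_2 &= t_2 \times r_4 = 0.75 \times 0.5 = 0.375 \\
R_3 &= t_3 \times r_5 = 0.5 \times 0.25 = 0.125 \\
\sum_{i=1}^{3} R_i &= 1 + 0.375 + 0.125 = 1.5
\end{align*}
```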
With the setting-one rule the algorithm consists of two
nested loops, so its complexity is $O(m \times n)$,
where $m$ is the number of input concepts and $n$ is the
number of concepts in the evaluating Ontology.
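The nested comparison is where the quadratic cost comes from. As an aside that is not part of the paper's algorithm, one could first index the evaluating Ontology's concepts in a hash map so that each lookup takes expected constant time, for roughly $O(m + n)$ overall. The sketch below uses hypothetical names and assumes exact string matching, with the first occurrence of a repeated concept deciding its weight.

```python
from typing import Dict, List, Sequence


def cwvma_indexed(
    bench_concepts: Sequence[str],
    bench_weights: Sequence[float],
    eval_concepts: Sequence[str],
    eval_weights: Sequence[float],
) -> List[float]:
    """Result vector computed with a dictionary lookup instead of an inner loop."""
    # Build concept -> weight once: O(n). setdefault keeps the first occurrence.
    index: Dict[str, float] = {}
    for concept, r in zip(eval_concepts, eval_weights):
        index.setdefault(concept, r)
    # One dictionary lookup per benchmark concept: O(m) expected.
    # A missing concept contributes t * 0.0 = 0.0.
    return [
        t * index.get(concept, 0.0)
        for concept, t in zip(bench_concepts, bench_weights)
    ]


R = cwvma_indexed(
    ["food", "fruit", "apples"],
    [1.0, 0.75, 0.5],
    ["food", "fruit-vegetables", "meats", "fruit", "apples"],
    [1.0, 0.75, 0.75, 0.5, 0.25],
)
```

For the handful of input concepts a user typically enters, the difference is unlikely to matter; the variant only pays off when the evaluating Ontologies are large.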
3. WI OntoSearch architecture
WI OntoSearch has three parts: the front end, the
search engine and the back end. The front end includes
the Request Dispatcher and the Display engine. The
search engine retrieves, filters and orders the results
and computes rank values on the basis of the various
algorithms. The back end provides data and caching.
Figure 1 WI OntoSearch architecture
3.1. Input message preprocessing
Compared with users of ordinary search, users of
Ontology search often have some knowledge of
Ontologies, so it is easy for them to enter related
concepts as query words. For instance, suppose a user
needs an Ontology about “book”. Such an Ontology
generally involves concepts like “author” and “price”.
If the user enters these related concepts as the input
message, the results will be more precise.
WI OntoSearch accepts three formats of input message.
They are preprocessed as follows.
- One word: to acquire more candidate Ontologies, a
  lexical tool such as WordNet can be used in the system.
- Multiple words: the words are organized into the
  concepts set $(I_1, I_2, \ldots, I_m)$.
- Ontology segment: the segment is parsed by a dedicated
  parser to create the concepts set.
Note that current Ontology languages come in several
formats (RDF in [5], OWL in [6], DAML in [7]). WI
OntoSearch supports both single-language search and
combined search across these languages. To keep up with
the development of Ontology languages, WI OntoSearch
also provides custom search.
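The three input formats could be handled along the following lines. This is only a sketch under stated assumptions: a plain dictionary stands in for WordNet, concepts are lower-cased for comparison, and the segment parser recognises only `owl:Class` elements carrying an `rdf:ID` attribute, which is far narrower than a real RDF/OWL/DAML parser. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL_NS = "http://www.w3.org/2002/07/owl#"


def expand_query(word: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """One word: add synonyms so that more candidate Ontologies are retrieved."""
    key = word.lower()
    return [key] + [s.lower() for s in synonyms.get(key, [])]


def concepts_from_words(text: str) -> List[str]:
    """Multiple words: split on whitespace into the concepts set."""
    return [w.lower() for w in text.split()]


def concepts_from_owl_segment(xml_text: str) -> List[str]:
    """Ontology segment: collect the rdf:ID of every owl:Class element."""
    root = ET.fromstring(xml_text.strip())
    ids = []
    for cls in root.iter(f"{{{OWL_NS}}}Class"):
        rdf_id = cls.get(f"{{{RDF_NS}}}ID")
        if rdf_id:
            ids.append(rdf_id.lower())
    return ids


segment = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:ID="Book"/>
  <owl:Class rdf:ID="Author"/>
</rdf:RDF>
"""
words = concepts_from_words("book author price")
parsed = concepts_from_owl_segment(segment)
expanded = expand_query("picture", {"picture": ["image"]})
```

Whichever format is used, the output is a flat list of concept names, which is the form the matching algorithm of Section 2 takes as its benchmark concepts set.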
3.2. Results management
Definition 5 (rank value): A numerical value that
determines the rank of an evaluating Ontology among all
results. In this paper the rank value is the similarity
value between the evaluating Ontology and the input
message.
Results management consists of filtering the
preliminary results and ordering and displaying the
final results. The system selects an appropriate filter
threshold and precision threshold from the rank values,
and then filters and orders the preliminary
keyword-based results. In addition, WI OntoSearch
displays the semantic structure of each result Ontology
to help the user understand its structure, semantics and
application domain.
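The filtering and ordering step might look like the sketch below. It assumes that a result is kept only when its rank value is strictly greater than the filter threshold, so that a threshold of 0 drops results with no matching concept; the paper does not say whether the comparison is strict. The function name and the example.org URLs are purely illustrative.

```python
from typing import Dict, List, Tuple


def manage_results(
    rank_values: Dict[str, float],
    filter_threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Drop results at or below the threshold, then sort by rank value, highest first."""
    kept = [(url, v) for url, v in rank_values.items() if v > filter_threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Illustrative rank values for three preliminary keyword-based results.
rank_values = {
    "http://example.org/a.owl": 2.0,
    "http://example.org/b.owl": 0.0,
    "http://example.org/c.owl": 3.0,
}
ranked = manage_results(rank_values)
```

Here `b.owl` is filtered out and the remaining two results are returned with `c.owl` first.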
4. Experiment results
We introduce the evaluation criteria and how they are
computed before discussing the experimental results.
Recall in the usual sense cannot be computed because
Ontology search, like other Web searches, cannot gather
all Web pages. However, WI OntoSearch filters out many
irrelevant results, so we redefine recall so that it can
be used to evaluate the filtered-out results. A search
can return a large number of results, and all search
systems focus on precision. We redefine precision
according to a precision threshold.
Definition 6 (recall): Given an appropriate filter
threshold, the percentage of relevant messages among the
filtered-out results is called recall.
Definition 7 (precision): The percentage of results
appropriate for the user among the candidate results
that satisfy a selected precision threshold is called
precision.
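The two redefined measures can be computed as in the sketch below. It assumes that relevance and appropriateness judgments are available as booleans from manual checking, and that a result "satisfies" the precision threshold when its rank value is greater than or equal to it; both are assumptions, and the sample values are illustrative rather than experimental data.

```python
from typing import Sequence


def redefined_recall(filtered_out_is_relevant: Sequence[bool]) -> float:
    """Definition 6: share of relevant items among the filtered-out results."""
    if not filtered_out_is_relevant:
        return 0.0
    return sum(filtered_out_is_relevant) / len(filtered_out_is_relevant)


def redefined_precision(
    rank_values: Sequence[float],
    is_appropriate: Sequence[bool],
    precision_threshold: float,
) -> float:
    """Definition 7: share of appropriate results among those meeting the threshold."""
    selected = [
        ok for v, ok in zip(rank_values, is_appropriate) if v >= precision_threshold
    ]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)


# Illustrative judgments only.
recall = redefined_recall([False, False, True, False])
precision = redefined_precision(
    [3.0, 2.0, 2.0, 1.0], [True, True, False, False], 2.0
)
```

Note that, unlike conventional recall, a lower value is better under Definition 6: it means few relevant Ontologies were lost to the filter.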
The running environment was as follows. Operating
system: Windows 2000 Professional; CPU: P4 2.4 GHz;
memory: 256 MB; Web application server: Tomcat 4.1.
Table 1 Input and number of results (part)
                        Before filter       After filter
Input                   owl   rdfs  daml    owl   rdfs  daml
country                 32    21    47      12    6     22
book                    23    14    33      9     2     14
food                    21    7     21      2     0     4
country tourism         6     2     7       3     1     3
book author             12    12    23      5     3     9
food meat               5     2     7       1     1     2
country tourism city    3     2     4       2     0     2
book author price       6     7     11      1     1     3
food meat cake          2     0     1       0     0     1
The filter threshold in Table 1 is 0. The data in
Table 1 show that the concepts--weights vectors matching
algorithm is effective at filtering out irrelevant
Ontologies regardless of file format (owl, rdfs or daml)
and of whether one, two or three input words are used.
To estimate recall, the filtered-out messages can be
checked by random sampling. When we checked them, we
found that a few messages clearly related to the user's
requirement had been filtered out. Careful analysis
identified two sources of noise behind the problem. The
first is that WI OntoSearch cannot access a message,
because of a network timeout or because the message no
longer exists, even though it appears in the preliminary
results returned by the Google Web Service. Such messages
should be filtered out. The second is that the parser
fails because of malformed web pages or wrong file
formats, so WI OntoSearch cannot acquire the right
concepts. To keep such messages usable, WI OntoSearch
retains them even though a few of them are malformed web
pages or have wrong file formats. Apart from these two
kinds of noise, the filtered-out messages are almost all
unrelated to the search, so the value of recall is close
to 0.
[Figure 2: chart. Horizontal axis: number of experiments
(4, 8, 16, 20); vertical axis: precision (0% to 100%);
series: one word, two words, three words.]
Figure 2 Effect of input on precision
Figure 2 shows the effect of the number of input words
on precision. The horizontal axis denotes the number of
experiments and the vertical axis denotes precision. To
avoid the effect of any particular input word, the
vertical axis uses the average precision
$\frac{1}{n}\sum_{i=1}^{n} P_i$. Figure 2 leads to
several conclusions. First, when the input is a single
word, the average precision is always below 50%. Second,
precision increases as the number of input words
increases. Finally, when three words are entered,
precision is close to 90%-100%. We therefore advise
users to enter an appropriate set of concepts.
If an input concept has synonyms, users can choose a
lexical tool such as WordNet to expand the candidate
results. Table 2 gives the experimental data.
Table 2 Experiment result using WordNet
           Without WordNet             With WordNet
Input      Google   WI OntoSearch      Google   WI OntoSearch
Person     72       17                 109      20
Picture    8        1                  24       3
5. Conclusion and future work
This paper proposes CWVMA based on the factors that
[4] presents for measuring Ontology semantics. We
developed WI OntoSearch, an Ontology search engine based
on CWVMA, and carried out a number of experiments with
it. The experimental results show that the algorithm is
effective in improving the precision of Ontology search.
With different matching rules the algorithm can satisfy
different requirements. Another interesting direction
for future work is clustering Ontologies using the
algorithm with more complex matching rules.
References
[1] E. Alberdi and D. Sleeman, “ReTAX: A Step in the
Automation of Taxonomic Revision”, Artificial
Intelligence, 1997, pp. 257-279.
[2] Y. Zhang, W. Vasconcelos and D. Sleeman,
“OntoSearch: An Ontology Search Engine”, In
Proceedings of the Twenty-fourth SGAI International
Conference on Innovative Techniques and Applications
of Artificial Intelligence (AI-2004), Cambridge, UK, 2004.
[3] R. Benassi, S. Bergamaschi and M. Vincini, “TUCUXI:
The InTelligent Hunter Agent for Concept Understanding
and LeXical ChaIning”, IEEE/WIC/ACM Web
Intelligence International Conference, Beijing, China,
2004, pp. 249-255.
[4] C. Linn, “A Metric Framework for Quantifying Semantic
Reliability in Shared Ontology Environments”,
IEEE/WIC/ACM Web Intelligence International
Conference, Beijing, China, 2004, pp. 519-523.
[5] W3C, “RDF Vocabulary Description Language 1.0: RDF
Schema”, http://www.w3.org/TR/rdf-schema/, 2004.
[6] I. Horrocks, P. Patel-Schneider and F. van Harmelen,
“From SHIQ and RDF to OWL: the making of a Web Ontology
language”, Journal of Web Semantics, 1(1):7-26, 2003.
[7] I. Horrocks, “DAML+OIL: A Reason-able Web Ontology
Language”, In Proc. of EDBT 2002, number 2287 in
Lecture Notes in Computer Science, Springer, March
2002, pp. 2-13.
