How can we automatically construct an ontology?

The search for knowledge instead of websites

Research Report 2006 - Max Planck Institute for Computer Science

Suchanek, Fabian; Weikum, Gerhard

Databases and information systems (Prof. Dr. Gerhard Weikum)
MPI for Computer Science, Saarbrücken

How can we make Internet search engines so intelligent that they not only find web pages, but really "understand" our queries? This article sketches new technologies that were developed at the Max Planck Institute for Computer Science with this aim in mind.

Introduction

Over the past decade, the Internet has become a major source of information. Train timetables, news, scientific articles, company data and even entire encyclopedias are now available online. The majority of these Internet pages are indexed by search engines. Google, for example, lets us search billions of web pages for keywords in a fraction of a second. For many queries, this technology is completely sufficient. Most of the time we find what we are looking for after briefly browsing the displayed results. If you are interested in the physicist Max Planck, for example, a search for "Max Planck" yields, directly after the top-ranked link to the Max Planck Society, several biographies of the physicist. Even questions like "When was Max Planck born?" can be answered with a quick look at the results page. The English version of Google answers the query "When was Max Planck born?" in a flash: "Max Planck - Date of Birth: 23 April 1858".

Nevertheless, as an Internet user, you occasionally run up against the limits of this technology. If you want to know, for example, which physicists were born in the same year as Max Planck, this question can hardly be formulated in a Google-compatible way. Queries such as "physicist, born, year, Max Planck" return only Max Planck himself. So you are forced to first google Max Planck's year of birth and then ask for physicists who were also born in that year. If you then want to know more complex things (which of these physicists were also politically active?), there is no way around reading the relevant web pages yourself.

The reason for this inconvenience is that Google does not search knowledge; it searches web pages. Google can only satisfy information needs for which a ready-made answer already exists on some web page. If the answer is scattered over several pages, or if it can only be obtained through logical inference, Google is the wrong tool. In abstract terms, the problem is that today's computers have only texts, not knowledge. This lack of general knowledge is also responsible, for example, for the sometimes amusing results of machine translation. If the knowledge of this world could be made available to the computer in one large knowledge structure, it could cope with these tasks far more easily.

Knowledge representation in ontologies

Such a structured collection of knowledge is called an "ontology". In the simplest case, an ontology is a directed graph whose nodes are so-called "entities" and whose edges are "relations". For example, the entity "Max Planck" is connected to the entity "23 April 1858" by the relation "born on", because Max Planck was born on April 23, 1858. Although this model is subject to certain restrictions, many pieces of knowledge can already be expressed in this simple way (Fig. 1).

We group entities that share many properties into so-called classes. Max Planck, for example, belongs, like his physics-loving colleagues, to the class "physicist". In the ontology, the class "physicist" is nothing more than another entity, connected to Max Planck by the relation "is a". Every physicist is a scientist, so the classes "physicist" and "scientist" stand in the relation "subclass of". This yields a hierarchy of classes in which the upper (more general) classes include the lower (more specific) ones.
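To make this concrete, the following is a minimal Python sketch of such an ontology as a set of subject-relation-object triples, including the "is a" and "subclass of" edges just described. The entity and relation names are illustrative and not Yago's actual vocabulary.

# Minimal sketch: an ontology as a directed, edge-labeled graph,
# stored as (subject, relation, object) triples. Names are
# illustrative, not Yago's actual schema.
facts = {
    ("Max Planck", "born on", "23 April 1858"),
    ("Max Planck", "is a", "physicist"),
    ("physicist", "subclass of", "scientist"),
}

def objects(subject, relation):
    """All objects reachable from `subject` via `relation`."""
    return {o for (s, r, o) in facts if s == subject and r == relation}

print(objects("Max Planck", "born on"))  # {'23 April 1858'}
print(objects("Max Planck", "is a"))     # {'physicist'}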

As the next step in abstraction, we introduce a distinction between words and their meanings. We thus differentiate between "Max Planck" (the word) and Max Planck (the physicist). This makes sense because different words can refer to the same individual (for example, "Dr. Planck" or "M. Planck"). Conversely, the same word can refer to different individuals (there are, for example, several people named "Planck"). In addition, this distinction lets us abstract from the choice of language: simplified, the words "Physiker" (German), "physicist" (English) and "physicien" (French) can all refer to the class "physicist". In the ontology, the words are nothing more than further entities.
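This separation of words and meanings can be sketched as a simple mapping from surface forms to candidate entities; the entity identifiers below are invented for illustration.

# Sketch: words (surface forms) mapped to the entities they can
# refer to; one word may be ambiguous. All names are illustrative.
means = {
    "M. Planck":  {"MaxPlanck_physicist"},
    "Dr. Planck": {"MaxPlanck_physicist"},
    "Planck":     {"MaxPlanck_physicist", "ErwinPlanck_politician"},
}
print(means["Planck"])  # ambiguous: two candidate entities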

Such an ontology is often supplemented with rules for logical inference (axioms). One of the most basic axioms, for example, is that an entity belongs to all superclasses of its class. So if it is known that Max Planck was a physicist, it follows from the subclass relationship between physicist and scientist that Max Planck was a scientist. And if every physicist is a scientist and every scientist is a person, then every physicist is a person (transitivity of the subclass relation).
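This superclass axiom amounts to a transitive closure over the "subclass of" edges. A minimal sketch, using the chain from the text:

# Sketch of the axiom: an entity belongs to all superclasses of its
# class. The chain physicist -> scientist -> person is from the text.
subclass_of = {"physicist": "scientist", "scientist": "person"}

def all_classes(direct_class):
    """The class itself plus all its (transitive) superclasses."""
    c = direct_class
    while c is not None:
        yield c
        c = subclass_of.get(c)

print(list(all_classes("physicist")))  # ['physicist', 'scientist', 'person']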

A system of axioms can also express that two relations are inverse to each other, that certain relations causally entail one another, or that time intervals contain one another. In this way, the computer can draw logical conclusions from Max Planck's place of birth Kiel, his date of birth and the day of his death, for example that Max Planck was a German scientist who lived through two world wars.

Further knowledge representations use logical formulas to express relationships such as: every person has two parents of different sexes, a scientist with a doctorate has a doctoral advisor, a professor must have published, and so on. We may also want to represent speculative knowledge, or, because of ambiguities, leave room for different interpretations that we can then annotate with probabilities. For example, "Paris" can mean the French capital or the figure from the Iliad; or we may want to include competing hypotheses about the causes of a certain disease in the ontology; or we may want to record the measurement uncertainty of Mars's orbit around the sun. In such cases we have to combine logical and probabilistic methods of knowledge representation.
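One simple way to sketch such uncertain knowledge is to attach a weight to each competing interpretation; the probabilities below are invented purely for illustration.

# Sketch: competing interpretations annotated with probabilities.
# The weights are invented for illustration.
readings = {
    ("Paris", "means", "Paris (French capital)"):      0.9,
    ("Paris", "means", "Paris (figure in the Iliad)"): 0.1,
}
best = max(readings, key=readings.get)
print(best, readings[best])  # the most probable reading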

Ontologies play a central role in the vision of the "Semantic Web", which the WWW inventor Tim Berners-Lee sees as the successor to the current Web 2.0 wave. It should then be possible to link web pages directly to entities and the cognitive concepts behind them, and to draw intelligent conclusions using logical inference, for example to find the best clarinet teacher whom the daughter of the family can reach from her high school in less than half an hour. To do this, however, all websites must be explicitly annotated with ontological concepts and represented in a logical formalism. Today, such an undertaking still requires a very high and error-prone manual effort for each individual website, so that fundamental scalability problems (still) stand in the way of a quick realization of the vision. Our current research on intelligent knowledge search shares a goal with the Semantic Web vision, but uses methods that start from data sources directly available today and automatically build extensive knowledge collections from them.

Automatic construction and maintenance of ontologies

The key question now is how to fill an ontology with knowledge. There are several approaches. One possibility is to insert all entities and relations by hand. In fact, the most widespread ontologies today were created in painstaking manual work: WordNet is a lexicon of the English language with around 200,000 terms in an ontological structure, SUMO is an ontology with hundreds of thousands of entities, and the commercial ontology Cyc contains as many as two million facts and axioms. Despite this wealth of knowledge, a hand-built ontology will always lag behind current developments. None of the ontologies mentioned knows, for example, the latest Windows system or the soccer stars of the last World Cup.

At the Max Planck Institute for Computer Science, we therefore pursue different approaches to constructing and maintaining ontologies. One approach uses the large online encyclopedia Wikipedia. Wikipedia contains articles on thousands upon thousands of personalities, products, terms and organizations. Each of these articles is assigned to certain categories. The article about Max Planck, for example, is in the categories "German", "Physicist" and "Born 1858". We use this information to record the class and the year of birth of the entity "Max Planck" in the ontology. Wikipedia knows a large number of individuals, but does not provide a well-structured class hierarchy. The information that "physicists" are "scientists" and "scientists" are "people" is very hard to find in Wikipedia. We therefore combine the data from Wikipedia, using automated processes, with the data from the WordNet ontology mentioned above. This already gives us a very large knowledge structure in which all entities known to Wikipedia have their place. We also use other structured knowledge sources (such as the IMDb film database).
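The category-based extraction can be sketched as follows. The category strings follow the example above, while the relation names and the helper function are hypothetical, not Yago's actual extraction code.

import re

# Sketch: deriving facts from Wikipedia category names.
# Category strings follow the example in the text; relation names
# and this helper are illustrative.
def facts_from_categories(entity, categories):
    facts = set()
    for cat in categories:
        born = re.match(r"Born (\d{4})", cat)
        if born:
            facts.add((entity, "bornInYear", int(born.group(1))))
        elif cat in ("German", "Physicist"):
            facts.add((entity, "isa", cat.lower()))
    return facts

print(facts_from_categories("Max Planck", ["German", "Physicist", "Born 1858"]))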

Unfortunately, not all knowledge is already available in structured form. The most common form of Internet page is unstructured, natural-language text: biographies, lexicon entries or news texts, for example. To collect this information as well, one uses an approach called "pattern matching". If you want to add new dates of birth to the ontology, for example, you can first use known dates of birth to discover the patterns in which dates of birth are typically mentioned on web pages. A very common pattern for dates of birth is "X was born on Y" ("Max Planck was born on April 23, 1858"). If you search the Internet for further occurrences of this pattern, new pairs of person and date of birth come to light, which can then be entered into the ontology. This approach suffers from the fact that even slight changes in sentence structure can break the pattern. The pattern "X was born on Y", for example, does not match the sentence "Max Planck, the great physicist, was born on April 23, 1858". We have therefore refined the pattern matching approach so that it takes the grammatical structure of the sentences into account. The pattern then only requires that X be the subject of the predicate "was born", which is connected to Y via the preposition "on". This pattern now also matches the sentence "Max Planck, the great physicist, ...".
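As a sketch, the surface pattern "X was born on Y" can be written as a regular expression; this also demonstrates the brittleness discussed above, since the rephrased sentence no longer matches. The grammar-based variant used in our work is not reproduced here.

import re

# Sketch: the surface pattern "X was born on Y" as a regular
# expression. Real extraction in our work uses grammatical structure;
# this only illustrates the brittle surface variant.
pattern = re.compile(r"([A-Z][\w. ]+?) was born on (\w+ \d{1,2}, \d{4})")

print(pattern.search("Max Planck was born on April 23, 1858."))
# matches: X = "Max Planck", Y = "April 23, 1858"

print(pattern.search("Max Planck, the great physicist, was born on April 23, 1858."))
# None: the inserted apposition breaks the surface pattern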

The pattern extraction behind this learning process is implemented in the software tool Leila, developed at our institute. So that Leila is not led astray by the diversity and fuzziness of natural language and does not generate false pattern hypotheses too quickly, pattern candidates are tested for robustness with a statistical learning procedure. In this way, Leila extracts mostly correct facts. From the entirety of all Wikipedia articles, for example, Leila can learn with high confidence that Weltschmerz is a feeling, that Calcutta lies on the Ganges and Paris on the Seine (from sentences like "Calcutta lies in the delta of the Ganges" and "Paris has many museums on the banks of the Seine"), and that Saarlanders are an ethnic group while hamburgers (the sandwiches) are not.
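The idea of the robustness test can be sketched as a simple precision check of a pattern candidate against known facts and known counterexamples; the counts and the threshold below are invented and stand in for Leila's actual statistical learning procedure.

# Sketch of the statistical robustness test: keep a pattern candidate
# only if it reproduces known facts far more often than known
# counterexamples. Threshold and counts are invented for illustration.
def accept_pattern(correct_hits, wrong_hits, min_precision=0.9):
    total = correct_hits + wrong_hits
    return total > 0 and correct_hits / total >= min_precision

print(accept_pattern(47, 2))  # True: a robust pattern
print(accept_pattern(3, 5))   # False: led astray by fuzzy language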

Yago

By combining these techniques, we have succeeded in creating a very large ontology: Yago (Yet Another Great Ontology). Yago currently knows almost a million entities and around 6 million facts about these entities. The core of Yago contains, as an empirical evaluation showed, almost exclusively correct facts, which we extracted and organized with our most robust methods from Wikipedia articles and their connection with WordNet. Further knowledge can be added automatically by analyzing websites and databases with tools like Leila. When statistical learning methods and heuristics come into play, one would expect the correctness rate of the new knowledge to decrease. If, however, as in our case, a high-quality ontology is already available as a starting point, new hypotheses can be checked for consistency with this ontology: only those new facts are added that do not contradict the existing ones. In this way, the ontology is extended by new, high-quality facts, which in turn are available for assessing further hypotheses. In a certain way, the learning process is self-regulating: the more knowledge Yago contains, the more robustly and easily additional knowledge can be acquired.
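The consistency check can be sketched as follows. The only rule enforced here, that a person has a single year of birth, and all the data are illustrative; the real system checks many more kinds of constraints.

# Sketch of the self-regulating growth: a new hypothesis is added only
# if it does not contradict existing knowledge. The single rule here
# (one year of birth per person) and the data are illustrative.
ontology = {("Max Planck", "bornInYear"): 1858}

def add_fact(subject, relation, value):
    known = ontology.get((subject, relation))
    if known is not None and known != value:
        return False  # contradicts the ontology: reject the hypothesis
    ontology[(subject, relation)] = value
    return True

print(add_fact("Max Planck", "bornInYear", 1958))  # False: rejected
print(add_fact("Max Planck", "diedInYear", 1947))  # True: accepted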

Knowledge search

Our knowledge collection Yago is available online (http://www.mpii.mpg.de/~suchanek/yago) and can answer queries in a special query language. The question posed at the beginning, "Which physicists were born in the same year as Max Planck?", can be formulated for Yago as follows:
"Max Planck" bornInYear $ year

(the $ year variable will now contain Max Planck's date of birth)
$ otherPhysicist bornInYear $ year

(we ask about another person who was also born in $ year ...)
$ otherPhysicist isa physicist

(... and set the condition that he must also be a physicist)
Yago promptly replies with several dozen other physicists. If you want to know which of them were also politically active, you add the condition "$otherPhysicist isa politician". Yago answers with the New Zealander Thomas King, who in addition to his work as an astronomer was also active in Parliament (Fig. 2).
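To illustrate how such a query is answered, here is a sketch that evaluates the three triple patterns above as a conjunctive query over a tiny fact set. The joining of variable bindings is the essential step; the facts and the second person are invented, and this is not Yago's actual query engine.

# Sketch: evaluating the three-line query above as a conjunctive
# query over (subject, relation, object) triples. The facts and the
# second person are invented for illustration.
facts = {
    ("Max Planck", "bornInYear", 1858),
    ("Some Physicist", "bornInYear", 1858),
    ("Some Physicist", "isa", "physicist"),
}

def match(pattern):
    """Bindings of one triple pattern; '$'-prefixed terms are variables."""
    results = []
    for triple in facts:
        binding = {}
        for p, v in zip(pattern, triple):
            if str(p).startswith("$"):
                binding[p] = v
            elif p != v:
                break
        else:
            results.append(binding)
    return results

def answer(query):
    """Join the patterns left to right, keeping consistent bindings."""
    bindings = [{}]
    for pattern in query:
        bindings = [{**b, **m}
                    for b in bindings
                    for m in match(tuple(b.get(t, t) for t in pattern))]
    return bindings

query = [("Max Planck", "bornInYear", "$year"),
         ("$otherPhysicist", "bornInYear", "$year"),
         ("$otherPhysicist", "isa", "physicist")]
print(answer(query))
# [{'$year': 1858, '$otherPhysicist': 'Some Physicist'}]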

These methods of ontology-based knowledge search can be integrated into future search engines and lead to a more powerful form of knowledge search and knowledge networking over the largest corpora on our planet. At the Max Planck Institute for Computer Science, we are working on methods for an intelligent search engine that represents the content of all websites, digital libraries and e-science databases as explicit knowledge, with concepts (e.g. enzymes, quasars, poets) and entities (e.g. Steapsin, 3C 273, Bertolt Brecht) and their relations, and makes it findable with high precision. Such a search engine would be a breakthrough in the step from the advanced information society to a modern knowledge society, in which all of mankind's knowledge is not only available on the Internet but can also be used effectively [1-5].

Original publications

Suchanek, F.M., G. Kasneci and G. Weikum
Yago - A Core of Semantic Knowledge.
Technical Report MPI-I-2006-5-006, Max Planck Institute for Informatics, Saarbrücken 2006.
Suchanek, F.M., G. Ifrim and G. Weikum
Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents.
In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006). ACM, New York 2006, 712-717.
Baumgartner, Peter and F.M. Suchanek
Automated Reasoning Support for First Order Ontologies.
Fourth International Workshop on Principles and Practice of Semantic Web Reasoning (PPSWR). Lecture Notes in Computer Science 4187, Springer-Verlag, Heidelberg 2006, 18-31.
Etzioni, O., M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D.S. Weld and A. Yates
Unsupervised named-entity extraction from the Web: An experimental study.
In: Artificial Intelligence 165. Elsevier, Amsterdam 2005, 91-134.