Subtitle: Can We Automate Ontology Construction? Past Failures and Future Possibilities
1. Introduction: The Epistemological Crisis of Modern AI and Data Science
Modern Artificial Intelligence (AI) and data science are standing at a turning point. Large Language Models (LLMs) and deep learning technologies have made remarkable progress in text generation, image recognition, and pattern matching. But at the same time, they have shown us serious limitations.
Today’s data analysis and AI modeling methods rely mainly on ‘Implicit Knowledge’ based on statistical correlations. This comes with a fundamental flaw: such models cannot explicitly represent the semantic structures or causal relationships built into data.
These limitations lead to model hallucinations, Black-Box problems, and semantic inconsistency when we combine different data sources. These are major factors holding back AI adoption in high-risk decision-making areas.
In this article, we start with the core question “Can we automate Ontology construction?” We look at the key differences in data analysis and AI inference methods before and after we bring in ontology. We dig into why past automation attempts were bound to fail, from both historical and technical viewpoints.
We want to explore how moving from the ‘Data Lake’ approach of simply piling up data toward Knowledge Graph and ontology-based approaches changes everything. These approaches define meaning and relationships between data, turning AI from probabilistic guessing machines into intelligent agents that can carry out logical reasoning.
The core argument is this: while ontology construction used to be labor-intensive work that depended on highly trained experts doing everything by hand, things have changed. Through the recent combination of LLM and Neuro-Symbolic architectures, semi-automation and automation have become technically feasible.
This means more than just technical progress. It’s a leap toward ‘System 2’ thinking, where data can explain its own meaning, and AI can carry out complex logical tasks without human help.

2. Theoretical Foundation: The Battle Between Implicit and Explicit Knowledge
To talk about whether we can automate ontology construction, we need to first understand the two basic ways computer systems represent knowledge: Connectionism-based implicit knowledge and Symbolism-based explicit knowledge.
2.1 What Ontology Really Is
In information science, an ontology is not just a terminology dictionary. It’s a formal and explicit specification of how we conceptualize a specific domain in a shared way.
- Classes/Concepts: Categories of entities that exist within the domain.
For example, in the medical domain, this includes Patient, Diagnosis, Therapy, etc.
- Relations/Properties: Directional connections that define how classes interact.
This includes predicative relationships such as a doctor treating a patient or a drug alleviating symptoms.
- Axioms/Rules: Constraints that define logical truth.
This includes logical rules such as “An entity cannot be both a Patient and a Therapy at the same time (Disjointness)” and “If A is the parent of B, B cannot be the parent of A (Asymmetry)”.
Ontology stands as the pinnacle of Symbolic AI. It sets out to represent knowledge using Symbols and Logic that humans can make sense of.
- Structure:
- (Concepts): Employee, Manager
- (Relations): reports_to, works_in
- (Instances): John Doe, Sales Dept
- (Axioms): e.g., every Manager is an Employee (SubClassOf); reports_to cannot hold in both directions (Asymmetry)
This explicit structure lets systems carry out Deductive Reasoning on data.
In other words, it gives systems the ability to logically work out new facts through existing rules and relationships, even for facts that weren’t explicitly put in.
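Deductive reasoning over such a structure can be sketched with a tiny forward-chaining loop over triples. The rule below, that an employee who reports to someone inherits that person’s department, is a hypothetical axiom for illustration, not part of any standard reasoner:

```python
# Minimal forward-chaining sketch over (subject, predicate, object) triples.
facts = {
    ("John Doe", "reports_to", "Jane Smith"),
    ("Jane Smith", "works_in", "Sales Dept"),
}

def infer(facts):
    """Apply one hypothetical axiom until no new facts appear:
    if X reports_to Y and Y works_in D, then X works_in D."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        new = set()
        for (a, p1, b) in derived:
            for (c, p2, d) in derived:
                if p1 == "reports_to" and p2 == "works_in" and b == c:
                    triple = (a, "works_in", d)
                    if triple not in derived:
                        new.add(triple)
        if new:
            derived |= new
            changed = True
    return derived

closure = infer(facts)
# The fact that John Doe works in Sales Dept was never entered;
# it is deduced from the rule and the two explicit facts.
print(("John Doe", "works_in", "Sales Dept") in closure)  # True
```

A real reasoner (e.g., an OWL engine) generalizes this loop to a full rule language, but the principle is the same: new facts follow logically from explicit ones.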
2.2 Deep Learning’s Implicit Knowledge and Its Limits
On the other hand, Transformer-based deep learning models like GPT-5 / Gemini represent knowledge implicitly.
These models map discrete symbols (words, pixels) into high-dimensional continuous Vector Spaces.
In vector space, the word ‘cat’ is not defined by biological classification systems, but by how close it is geometrically to words like ‘meow’, ‘fur’, and ‘pet’.
While this approach shows excellent ability to generalize when processing unstructured data, it has these critical limitations:
- Lack of Causal Directionality: How close vectors are only shows association, not causal relationships.
The vectors for ‘fire’ and ‘smoke’ sit very close together, but the model has no built-in knowledge of whether ‘fire causes smoke’ or ‘smoke causes fire’.
This is because it’s based purely on statistical co-occurrence.
- Co-occurrence example: The words “coffee” and “cup” often show up together in phrases like “a cup of coffee,” so they have high co-occurrence frequency.
- Semantic Ambiguity: Models trained without ontology struggle with polysemy processing.
When a word like ‘bank (financial institution/riverbank)’ shows up, the model only guesses the meaning based on contextual probability. It doesn’t treat them as clearly separate concepts.
This can cause serious errors when processing Out-of-Distribution data.
- Hallucination: Probabilistic text generation can produce sentences that are grammatically fluent but factually wrong.
This is because there are no logical constraints (Ground Truth Logic) inside the model to check ‘truth’.
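The fire/smoke point can be verified directly: cosine similarity, the standard closeness measure in embedding spaces, is symmetric by construction, so it cannot encode causal direction. The 3-dimensional vectors below are made up purely for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings (invented values): the two words sit very close together.
fire = [0.9, 0.8, 0.1]
smoke = [0.85, 0.75, 0.15]

# High similarity, but the measure is symmetric: it tells us nothing
# about whether fire causes smoke or smoke causes fire.
print(cosine(fire, smoke) == cosine(smoke, fire))  # True
```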
So, Ontology construction is essential for moving current AI stuck in System 1 (intuitive, fast thinking) up to System 2 (deliberative, logical thinking).
We can call this a ‘structuring of intelligence’ task that goes beyond simple database design.
3. Comparing Before and After Ontology
Data integration efficiency, the depth of AI model inference, and system explainability differ dramatically before and after an ontology is built.
3.1 Data Integration & Analytics
In traditional data analysis environments, the lack of ontology leads to data silos and higher integration costs.
On the other hand, Ontology-Based Data Access (OBDA) virtually integrates data, changing how we analyze things.
3.1.1 Without Ontology: The ETL Swamp and Broken-Up Schemas
Without ontology, organizations depend entirely on the ETL (Extract, Transform, Load) process.
- Limits of Physical Integration: To analyze data from different sources (e.g., Oracle DB for CRM, MongoDB for logs, Excel files), we must physically pull out data and copy it to a single data warehouse.
In this process, we must manually write complex scripts to map column names from each source (e.g., CUST_ID vs CLIENT_NO).
- Rigidity: When source system schemas change (e.g., column name changes), all connected ETL pipelines break down.
Analysts must know all the physical structures of data (table names, join keys) and write complex SQL queries (e.g., joining 5 or more tables), creating very high chances of errors.
- Semantic Inconsistency: The ‘values’ of data exist, but ‘meaning’ gets lost.
For example, when there is data ‘Status: 1’, whether this means ‘active’ or ‘pending’ can only be known by that system’s developer. Analysts must spend an enormous amount of time figuring this out.
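The mapping burden described above can be sketched in a few lines; the column names and status codes are hypothetical, and the point is that every pipeline re-encodes this knowledge by hand:

```python
# Two sources describing the same customer with different schemas (invented names).
crm_row  = {"CUST_ID": "C-1001", "STATUS": 1}
logs_row = {"CLIENT_NO": "C-1001", "state": "active"}

# Each ETL job hand-codes the same knowledge: CUST_ID == CLIENT_NO,
# STATUS 1 == "active". Rename a source column and this breaks.
COLUMN_MAP = {"CUST_ID": "customer_id", "CLIENT_NO": "customer_id",
              "STATUS": "status", "state": "status"}
STATUS_CODES = {1: "active", 2: "pending"}

def normalize(row):
    out = {}
    for col, val in row.items():
        target = COLUMN_MAP[col]  # raises KeyError the moment a schema drifts
        out[target] = STATUS_CODES.get(val, val) if target == "status" else val
    return out

print(normalize(crm_row))   # {'customer_id': 'C-1001', 'status': 'active'}
print(normalize(logs_row))  # {'customer_id': 'C-1001', 'status': 'active'}
```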
3.1.2 After Bringing in Ontology: Semantic Data Fabric
When we build an ontology, data is managed not through physical storage but through a Semantic Layer.
- Virtual Integration: Data stays in source systems as is, while ontology acts as a ‘global conceptual schema’ that covers them.
When users query the business concept “Revenue,” the ontology reasoner automatically turns this into queries that fit each subsystem (Query Rewriting) and runs them.
- Semantic Clarity: ‘Status: 1’ gets mapped to the ActiveUser class within the ontology.
Analysts no longer need to figure out encrypted codes and can explore data using business terms.
- Logical Checking of Data Quality: Ontology axioms act as guardrails that watch over data quality in real-time.
For example, if we define an axiom stating “male patients cannot receive pregnancy diagnoses,” the system can immediately spot logically impossible data as errors when it shows up during data entry or integration stages.
This is Semantic Integrity verification that goes beyond simple syntactic checks (data type checking).
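A minimal sketch of a disjointness axiom acting as such a guardrail; the class names are hypothetical, and a production system would express this in OWL or SHACL rather than Python:

```python
# Axiom (hypothetical): MalePatient and PregnancyDiagnosis are disjoint,
# i.e., no record may belong to both classes at once.
DISJOINT = {("MalePatient", "PregnancyDiagnosis")}

def classify(record):
    """Map a raw record onto ontology classes (toy classification)."""
    classes = set()
    if record.get("sex") == "M":
        classes.add("MalePatient")
    if record.get("diagnosis") == "pregnancy":
        classes.add("PregnancyDiagnosis")
    return classes

def violates_axioms(record):
    """Semantic integrity check: does any disjointness axiom fire?"""
    classes = classify(record)
    return any(a in classes and b in classes for a, b in DISJOINT)

print(violates_axioms({"sex": "M", "diagnosis": "pregnancy"}))  # True
print(violates_axioms({"sex": "F", "diagnosis": "pregnancy"}))  # False
```

Note that both records would pass a purely syntactic type check; only the axiom catches the logically impossible one.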

3.2 AI Training & Inference
For AI models, bringing in ontology means a leap from ‘Correlation’ to ‘Causation’, and turns ‘inexplicable black boxes’ into ‘verifiable white boxes’.
3.2.1 The Correlation Trap vs Causal Reasoning
Deep learning models trained without an ontology rely on statistical patterns within data, namely associations.
This corresponds to the lowest level, level 1, of Judea Pearl’s ‘Ladder of Causation’.
- Before Building Ontology (Spurious Correlations): For example, an AI that picked up the pattern “patients sent to a certain ward (A) have high mortality rates” from hospital data might wrongly recommend not sending patients to ward A.
In reality, Ward A has high mortality rates because it is the Intensive Care Unit (ICU), but we can’t infer this causal structure from the data alone.
- After Building (Causal Reasoning): Ontology and Causal Graphs define causal directionality between variables (Severity → ICU admission → Mortality rate).
Neuro-symbolic AI can carry out Intervention and Counterfactual reasoning (levels 2 and 3 of the ladder) based on this structure, such as “ICU admission is a result that comes from patient severity, not a cause of death.”
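The ward example can be reproduced with a toy structural causal model. All coefficients below are invented so that severity confounds the ICU/mortality relationship; the point is that observation and intervention give opposite answers:

```python
import random

random.seed(0)

def simulate(do_icu=None, n=50_000):
    """Toy SCM: Severity -> ICU, Severity -> Death, ICU lowers death risk.
    do_icu=None observes the system; True/False forces an intervention."""
    icu_deaths, other_deaths = [], []
    for _ in range(n):
        severity = random.random()                       # 0..1
        icu = do_icu if do_icu is not None else severity > 0.7
        p_death = max(0.8 * severity - (0.3 if icu else 0.0), 0.0)
        died = random.random() < p_death
        (icu_deaths if icu else other_deaths).append(died)
    return icu_deaths, other_deaths

# Observational view: ICU patients die more often (severity confounds).
icu, non_icu = simulate()
obs_icu = sum(icu) / len(icu)
obs_non = sum(non_icu) / len(non_icu)

# Interventional view: force do(ICU=True) / do(ICU=False) on everyone.
int_icu = sum(simulate(do_icu=True)[0]) / 50_000
int_non = sum(simulate(do_icu=False)[1]) / 50_000

# Observation makes the ICU look deadly (obs_icu > obs_non), while the
# intervention reveals it is protective (int_icu < int_non).
```

Without the causal graph encoding Severity → ICU → Death, a model trained on the observational data alone would learn the misleading first comparison.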

3.2.2 Controlling Hallucinations (Grounding & Hallucination Mitigation)
The biggest problem with generative AI (LLM), the hallucination phenomenon, happens because models probabilistically make up content not in training data.
- Before Building Ontology: When a user asks about “iPhone 18 specifications released in 2026,” if the training data doesn’t have this content, the model may come up with plausible but false specifications based on past patterns.
- After Building (GraphRAG): GraphRAG, which combines Knowledge Graphs with Retrieval Augmented Generation (RAG) technology, forces LLMs to look up the ontology knowledge base before coming up with answers.
If there is no entity called iPhone 18 in the ontology, or if the ScreenSize property is not defined, the system returns “no information available” or puts together answers based only on facts pulled from the ontology.
This also acts as a filter to check whether generated text breaks logical constraints (e.g., “battery capacity cannot be negative”).
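A minimal grounding check in the spirit of GraphRAG; the in-memory dictionary below stands in for a real knowledge-graph store, and the entities and properties are illustrative:

```python
# Stand-in knowledge graph: entity -> {property: value}.
KG = {
    "iPhone 15": {"ScreenSize": "6.1 in", "Chip": "A16"},
}

def grounded_answer(entity, prop):
    """Answer only from facts present in the knowledge base."""
    if entity not in KG:
        return "no information available"
    value = KG[entity].get(prop)
    if value is None:
        return "no information available"
    return f"{entity} {prop}: {value}"

print(grounded_answer("iPhone 18", "ScreenSize"))  # no information available
print(grounded_answer("iPhone 15", "ScreenSize"))  # iPhone 15 ScreenSize: 6.1 in
```

In a full GraphRAG pipeline the lookup result is injected into the LLM prompt as context; the refusal path above is what prevents the model from inventing specifications.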
3.2.3 Explainable AI (XAI)
- Before Building Ontology: When AI turns down a loan, the reason is shown only as an opaque number: “because the vector computation result was 0.45.” While techniques like LIME or SHAP can show feature importance to some degree, they cannot explain the decision’s logical basis.
- After building: Ontology-based systems can show explicit reasoning paths such as “the applicant’s income is below standard (Rule A), collateral value is not enough (Rule B), and when these two conditions come together, it fits the loan rejection policy (Policy C).”
This provides auditability, which is essential in regulated industries such as finance, healthcare, and law.
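The loan example can be sketched as a rule engine that returns its own reasoning trace; the thresholds, rule names, and policy are hypothetical:

```python
# Explicit, auditable rules (hypothetical thresholds for illustration).
RULES = [
    ("Rule A: income below standard",  lambda a: a["income"] < 30_000),
    ("Rule B: insufficient collateral", lambda a: a["collateral"] < a["loan"] * 0.5),
]

def decide(applicant):
    """Return a decision plus the exact rules that produced it."""
    fired = [name for name, cond in RULES if cond(applicant)]
    if len(fired) == len(RULES):  # Policy C: both conditions together => reject
        return "reject", fired + ["Policy C: loan rejection policy"]
    return "approve", fired

decision, trace = decide({"income": 25_000, "collateral": 10_000, "loan": 40_000})
print(decision)  # reject
for step in trace:
    print(" -", step)
```

Unlike a SHAP plot over an opaque score, the trace is itself the justification, which is what auditors in regulated industries need.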
| Feature | Before Building Ontology (Data-Driven) | After Building Ontology (Knowledge-Driven) |
|---|---|---|
| Reasoning Method | • Inductive (statistical pattern matching) | • Deductive (logical entailment) & Abductive |
| Data Integration | • Physical integration (ETL) • Schema-dependent | • Virtual integration (OBDA) • Schema-independent |
| Knowledge Form | • Implicit (vector embedding) • Black box | • Explicit (classes, relations, axioms) • White box |
| Flexibility | • Low (pipeline breaks down when source changes) | • High (can handle changes with mapping modifications only) |
| Reliability | • High hallucination risk • Unexplainable | • Logically verifiable • Explainable |
4. Feasibility and Latest Technological Trends in Automated Ontology Construction
The realistic answer to the question posed at the beginning of this article, “Can we automate ontology construction?”, can be summarized as: “Yes. However, rather than complete autonomy, human-in-the-loop semi-automation is the realistic standard.”
The methods that once relied on manual labor have undergone a revolutionary change due to the emergence of generative AI.
4.1 LLM-Driven Ontology Learning
While traditional statistical NLP techniques relied on word frequency or simple grammatical patterns, LLMs use ‘common sense’ and ‘linguistic reasoning ability’ pre-trained on vast text corpora to automate complex stages of ontology learning.
- Entity & Relation Extraction: LLMs pull out entities and relationships between them in triple format (Subject, Predicate, Object) from unstructured text through zero-shot or few-shot prompting.
For example, from the sentence “diabetes patients must be given insulin,” it pulls out the triple (DiabetesPatient, needs, Insulin) with high accuracy without separate training.
- Axiom Induction: Even the toughest task of coming up with logical rules (axioms) is being automated.
LLMs can look at context and put forward abstract constraints such as “all A are B (SubClassOf)” and “A and B cannot hold at the same time (DisjointWith).”
This is an area that was impossible with past statistical methods.
- Ontology Expansion and Refinement: LLMs can take on the role of ‘ontology engineer’ by looking at new documents based on existing seed ontologies to suggest missing concepts or find and suggest fixes for logical contradictions in existing hierarchical structures.
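The parsing half of LLM-driven triple extraction can be sketched as follows; the `llm_response` string is a canned stand-in for what a few-shot prompt might return, so no model API is called:

```python
import re

# Canned stand-in for an LLM response asked to emit one triple per line.
llm_response = """
(DiabetesPatient, needs, Insulin)
(Insulin, treats, Diabetes)
(DiabetesPatient, SubClassOf, Patient)
"""

# Parse (Subject, Predicate, Object) triples out of the response text.
TRIPLE = re.compile(r"\(\s*([^,]+?)\s*,\s*([^,]+?)\s*,\s*([^)]+?)\s*\)")
triples = TRIPLE.findall(llm_response)

print(triples[0])  # ('DiabetesPatient', 'needs', 'Insulin')
```

In practice this parsing step feeds a validation stage: a logic reasoner checks the extracted triples against existing axioms before a human approves them, matching the draft/check/approve loop described later in this article.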
4.2 Case Study: Amazon’s AutoKnow System
The strongest evidence of automated ontology construction is Amazon’s AutoKnow system.
This system automatically builds and manages a ‘Product Knowledge Graph’ holding billions of products and attributes.
- Automatic Classification (Taxonomy Construction): Deep learning models extract product types such as “ice cream” from unstructured product names like “black cherry cheesecake ice cream” and automatically map them to higher-level categories such as “frozen foods.”
- Attribute and Relationship Discovery: Attributes such as “flavor,” “capacity,” and “battery life” are automatically pulled out and structured from product descriptions and customer reviews.
- Data Cleaning and Spotting Anomalies: Probabilistic models are used to automatically spot and remove logically impossible data, such as “100-ton laptops.”
On top of that, customer behavior logs (clickstreams) are analyzed to determine whether two products have a synonym or a substitute relationship, thereby strengthening the knowledge graph.
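The cleaning step can be caricatured as attribute-range guardrails; AutoKnow’s actual models are learned and probabilistic, and the product types and bounds below are invented for illustration:

```python
# Hypothetical plausible ranges per (product type, attribute) pair.
BOUNDS = {
    ("laptop", "weight_kg"): (0.5, 5.0),
    ("ice cream", "volume_l"): (0.1, 10.0),
}

def is_anomalous(product_type, attribute, value):
    """Flag attribute values outside the plausible range for the type."""
    lo, hi = BOUNDS.get((product_type, attribute), (float("-inf"), float("inf")))
    return not (lo <= value <= hi)

print(is_anomalous("laptop", "weight_kg", 100_000))  # True: the "100-ton laptop"
print(is_anomalous("laptop", "weight_kg", 1.4))      # False
```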
This case shows that ontology construction is no longer theoretical work in ivory towers; it can now be automated and run at a large scale in actual business environments.
5. Looking Back: Why Existing Methods Couldn’t Solve This
If automation is possible, why have Semantic Web and ontology learning failed to become mainstream over the past 20 years?
The answer lies in both technological limitations and a broken economic incentive structure.
5.1 Limits of Statistical NLP and the Semantic Gap
Before the Transformer architecture came out in 2017, ontology learning relied on statistical NLP and shallow parsing.
- Missing Context Understanding: Traditional methods (e.g., Word2Vec, TF-IDF) were based on word co-occurrence frequency.
This often failed at disambiguation problems, such as determining whether “Apple” refers to a fruit or an IT company.
Simple matching, without considering context, produced noisy, low-quality ontologies.
- Can’t Induce Axioms: Statistical techniques could find ‘associations’ between words, but could not pull out ‘logical relationships’.
They could know that “bird” and “fly” often co-occur, but it was impossible to infer exception rules or complex logical constraints like “most birds except penguins can fly” from the data alone.
As a result, automatically generated ontologies remained at the level of simple glossaries, unable to perform reasoning.
5.2 Knowledge Acquisition Bottleneck
- High-Cost Structure: Past ontology construction required domain experts and ontologists to manually define rules.
This was work requiring enormous time and cost.
Whenever domain knowledge changed (e.g., new product launches, new regulations), ontologies had to be manually updated, creating a ‘maintenance nightmare’.
- Missing Economic Incentives: The Semantic Web vision assumed that websites worldwide would voluntarily tag their data in complex formats like RDF.
However, individual website operators had no reason to bring in complex technology without an immediate payoff.
Only large corporations like Google or Amazon built knowledge graphs internally for the clear ROI of better search quality, while the semantic transformation of the public web failed.
5.3 Scalability Problems of Reasoning Engines
- Computational Complexity: Early ontology reasoners (e.g., Racer, FaCT++) used algorithms with very high computational complexity (exponential time complexity) to guarantee logical completeness.
This was so slow that it couldn’t handle large-scale datasets with millions of triples.
- Today’s Solutions: Neuro-Symbolic approaches and Vector Logic have been introduced.
Technologies like Logic Tensor Networks embed logical symbols in vector spaces, enabling fast logical operations even on large-scale data through approximate reasoning, thereby solving past scalability problems.
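A drastically simplified sketch of the vector-logic idea: predicate truth becomes a number in [0, 1] computed from embeddings, and logical conjunction becomes arithmetic (the product t-norm here). The embeddings and the sigmoid-of-dot-product scoring are illustrative, not the actual Logic Tensor Networks formulation:

```python
import math

def truth(u, v):
    """Fuzzy 'is-a' score in (0, 1): dot product squashed by a sigmoid."""
    return 1 / (1 + math.exp(-sum(a * b for a, b in zip(u, v))))

# Invented 2-d embeddings for three concepts.
penguin, bird, animal = [2.0, 0.1], [1.5, 0.2], [1.0, 0.5]

t1 = truth(penguin, bird)   # fuzzy truth of Penguin ⊑ Bird
t2 = truth(bird, animal)    # fuzzy truth of Bird ⊑ Animal
chain = t1 * t2             # fuzzy conjunction via the product t-norm

# Instead of an exponential-time exact proof, subsumption along the chain
# is scored with a handful of float multiplications.
print(round(chain, 3))
```

The trade-off is exactness for speed: answers are approximate truth degrees rather than guaranteed entailments, which is what makes reasoning tractable at millions of triples.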
6. Looking Ahead: The Era of Neuro-Symbolic AI and Agents
Ontology construction has now evolved from manual work to a collaborative process where ‘LLMs draft, logic reasoners check, and humans give final approval’.
These automated ontologies are positioning themselves as core infrastructure for next-generation AI systems, beyond simple data dictionaries.
6.1 The Brain of Autonomous Agents
Beyond simple chatbots, ontology is essential for ‘agentic AI’ that carries out complex tasks.
For autonomous driving robots or supply chain management agents to execute the command “optimize warehouse inventory,” they need an explicit world model of the physical space, product attributes, transportation constraints, etc.
Ontology provides agents with an action space and constraints, helping them plan safely and efficiently without trial and error.
6.2 Data Sovereignty and Customized AI
To get around LLM generality limitations, companies are building their own data as ontologies and combining them with RAG (Retrieval Augmented Generation) systems.
This is the most effective way to put companies’ unique knowledge assets into AI. It serves as the foundation for delivering high-performance AI services while maintaining data sovereignty and security, without relying on external models.
7. Conclusion
The answer to the question “Can we automate ontology construction?” has become more positive with technological progress.
Automation that failed in the past due to statistical NLP limitations and economic inefficiency is now a reality as Large Language Models’ (LLMs) language understanding capabilities team up with neuro-symbolic architectures’ logical verification capabilities.
Changes after we build the ontology include moving data analysis from ‘physical integration’ to ‘virtual integration’ and AI models from ‘correlation learning’ to ‘causal reasoning’.
This is the key to dramatically boosting the reliability, transparency, and utility of AI systems.
Modern organizations must put strategic priority on ‘automation of knowledge engineering’ that structures data into Knowledge Graphs and ontologies beyond just piling it up.
This is the only way forward from the “Big Data” era to the “Smart Data” era.