Yara Al Tassi & Hanan Zwaihed | Advised by: Dr. Fatima K. Abu Salem
Introduction
So much of Lebanese history has been lost to time, fragmented across missing records, fading memories, and inaccessible archives. Lebanese war history, in particular, remains clouded by missing narratives, undocumented experiences, and disjointed historical accounts. To truly understand the war’s impact, both on Lebanon’s sociopolitical landscape and on the individuals who lived through it, we must reconstruct these narratives in a way that makes them easily accessible, searchable, and interconnected.
One of the most devastating consequences of this historical fragmentation is the ongoing crisis of Lebanon’s missing persons. Decades after the war’s end, families continue to search for answers about the fate of their loved ones, with burial sites remaining unidentified and testimonies scattered across individual records. The lack of a comprehensive, structured archive of war testimonies has made it difficult to piece together information that could provide crucial insights into the locations of mass graves, detention centers, and other sites tied to the forcibly disappeared.
Historical data exists in two primary forms: written records (such as official reports, books, and archival documents) and oral sources (including interviews, eyewitness testimonies, and audio recordings). In Lebanon, oral histories are more abundant than written accounts, yet they are unstructured and difficult to analyze systematically. While many methods already exist for processing written records, models that can structure and link oral historical data are still not widespread. This gap prevents us from fully capturing and preserving the lived experiences of those affected by the war, and from recovering details that might reveal information about missing persons.
Existing initiatives, such as the Lebanon Memory Archive, have made efforts to collect historical records by linking various projects related to the war. However, these platforms primarily act as repositories rather than integrated systems that establish meaningful connections between different sources. They lack a mechanism to correlate narratives across different formats, time periods, and perspectives, connections that could help reveal previously unknown burial sites, patterns of enforced disappearances, or details about those who vanished.
This article proposes a unified knowledge framework that digitizes, structures, and links both written and oral Lebanese war data into an intelligent system that can analyze, contextualize, and preserve these historical narratives. By leveraging natural language processing (NLP), knowledge graphs, and retrieval-augmented generation (RAG), we seek to build a model that is capable of understanding, connecting, and visualizing relationships between different war accounts. This approach will not only help preserve Lebanon’s historical memory but will also help fill critical gaps in our collective understanding of the war, so that forgotten voices and lost stories are recovered and made accessible for future generations.
Oral History Archives
Oral history archives provide us with firsthand accounts of historical and cultural events and of personal experiences. However, in order to use this information, we need to represent it in a format that can be easily manipulated and organized. Oral data is inherently unstructured, so organizing it and retrieving information from it remains a challenge.
The digitization of oral history is an evolving field that raises many questions. A key discussion in this area is led by Dr. Julianne Nyhan, a scholar specializing in digital humanities and oral history, who will be presenting the 2025 Susan Hockey Lecture at the UCL Centre for Digital Humanities (UCLDH). In her talk, “Oral History as Data? Critically Approaching the Digital Turn in Method, Meaning, and Recollection”, she will explore the impact of digital methods on oral history analysis, particularly how structuring oral interviews into knowledge graphs or other structured formats might affect their interpretation.
Dr. Nyhan’s work investigates whether oral history can be meaningfully transformed into machine-readable data without epistemological loss. This aligns with recent research on LLM-based knowledge extraction like the LLM-RAG approach proposed by Sun et al. (2024), which similarly aims to structure oral history through knowledge graphs while preserving meaning and context.
Additionally, Nyhan’s collaborative projects, including the “Multimodal Digital Oral History” initiative (Smith, Nyhan, and Flinn, 2023) and the “Mixed-methods Digital Oral History” project (2024-27), provide frameworks for integrating semantic web technologies with historical-interpretative analysis. These approaches are highly relevant to our work in structuring Lebanese war oral narratives, as they highlight the potential and limitations of AI-driven knowledge representation in historical research.
In 2024, Yi Sun, Wanru Yang, and Yin Liu proposed a method for constructing knowledge graphs from oral historical archives in their article “The Application of Constructing Knowledge Graph of Oral Historical Archives Resources Based on LLM-RAG”. The article discusses the feasibility of transforming oral data into knowledge graphs, both to make the data easier to conceptualize and to facilitate linking between different events and ideas. This method presents a possible response to Dr. Nyhan’s concern about epistemological loss when oral data is transformed into knowledge graphs.
Advances in natural language processing have made it possible to integrate large language models into knowledge graph construction. The paper proposes an LLM-RAG approach to enhance knowledge extraction, visualization, and association within oral historical archives. Although their work targets Chinese-language archives, the methodology can be adapted to Arabic data, offering valuable insights for applying similar techniques to Lebanese war history.
Knowledge Graph Construction
A knowledge graph (KG) enables discovery, linkage, and retrieval of structured knowledge from unstructured sources. While KGs have been widely used in structured domains such as biomedical research and financial data, their application to oral historical data remains a growing area of exploration.
A key feature of knowledge graphs is their ability to surface new knowledge by linking entities and ideas. However, because KGs represent facts as binary relations (first-order predicates over two arguments), they have difficulty capturing more complex knowledge structures, such as an event that involves several participants, a place, and a time at once.
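To make the binary-relation constraint concrete, the following minimal sketch stores facts as (subject, predicate, object) triples and shows one common workaround for multi-argument facts: reifying the event as its own node. All identifiers (`person:A`, `event:E1`, the role names, the decade) are invented placeholders for illustration, not data from any archive, and the helper names are our own assumptions.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# A binary fact fits in a single triple.
simple_fact = Triple("person:A", "lived_in", "place:P1")


def reify_event(event_id, **roles):
    """Express an n-ary fact as several binary triples hanging off one event node."""
    return [Triple(event_id, role, value) for role, value in roles.items()]


# A fact with five arguments cannot be one binary relation,
# so an intermediate event node carries each argument separately.
# Every value here is a made-up placeholder.
event = reify_event(
    "event:E1",
    type="disappearance",
    person="person:A",
    place="place:P2",
    decade="1980s",
    reported_in="testimony:T1",
)


def objects(triples, subject, predicate):
    """Return every object attached to (subject, predicate)."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]


for triple in event:
    print(triple)
```

The cost of this workaround is that a single statement from a testimony becomes several triples, which is one reason extraction from long narratives needs careful validation.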
Recent advances in natural language processing (NLP) have improved KG construction by reducing information loss, shifting processing tasks to retrieval and language understanding algorithms, and retaining the graph structure for deeper relational inference. However, using LLMs for automatic KG construction from complex oral data still poses challenges, such as word segmentation ambiguity, nested entities, and the complexity of Chinese linguistic expressions, all of which can lead to erroneous extractions. Our own construction will involve Arabic rather than Chinese, but we expect comparable difficulties: Arabic has rich morphology, attached clitics, frequently omitted diacritics, and substantial dialectal variation, all of which complicate tokenization and entity extraction.
The LLM-RAG model addresses these issues as follows:
- Scalability: LLM-RAG integrates retrieval and generation into one model, allowing it to handle larger volumes of text.
- Contextual Understanding: By leveraging transformer-based models such as BERT, LLM-RAG captures deeper contextual and semantic relationships.
- Improved Accuracy: The retrieval mechanism within LLM-RAG helps focus on relevant textual segments, reducing errors and improving knowledge extraction performance.
Architecture
Sun et al. outline key components in building a knowledge graph for oral archives:
1. Data Resources:
- Digital resources for oral historical archives include books, manuscripts, photos, audio, video, and letters, each of which follows distinct description rules and preservation standards.
- Reference datasets are used to create and validate entities and relationships, such as geographic, event, and name databases.
2. Knowledge Extraction:
- Data preprocessing: uses NLP tools to clean and structure transcribed data, correcting automatic speech recognition errors with an LSTM (long short-term memory) model. BERT is also used at this stage to model context and restore punctuation.
- Text splitter: segments text while maintaining contextual integrity using a recursive character text splitter.
- Embedding model and vector storage: uses the BAAI General Embedding-Large-zh model for vectorization, with Faiss employed for efficient similarity search. The BAAI model is needed since Chinese terms can be difficult to vectorize. For Arabic, models such as AraBERT, AraELECTRA, ARBERT, and MARBERT may provide better results, particularly for handling dialectal Arabic.
- Retriever agent: selects relevant information using retrieval tools such as LangChain and OpenAI API functions.
- LLM output: processes structured queries and retrieves relevant textual chunks.
- Human correction: allows for validation and iterative refinement of extracted data.
3. Knowledge Association:
- Establishes relationships between documents, people, events, and locations to construct a time-based, entity-linked model.
4. Graph Applications:
- Visualization Tools: Includes character associations, event topology, GIS mapping, and narrative visualizations.
- Search & Query Mechanisms: Enables historical event analysis, inference-based searches, and topic-based browsing.
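The splitting and retrieval steps in the knowledge extraction stage can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the pipeline from Sun et al.: a bag-of-words count vector stands in for a real embedding model, a brute-force cosine ranking stands in for a Faiss index, and the transcript is an invented English example. The function names (`recursive_split`, `embed`, `retrieve`) are our own.

```python
import math
import re
from collections import Counter


def recursive_split(text, max_chars=100, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator present, recursing into pieces that are still too long."""
    if len(text) <= max_chars:
        return [text.strip()] if text.strip() else []
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for part in text.split(sep):
                candidate = current + sep + part if current else part
                if len(candidate) <= max_chars:
                    current = candidate  # keep merging small pieces
                else:
                    if current:
                        chunks.extend(recursive_split(current, max_chars, separators))
                    current = part
            if current:
                chunks.extend(recursive_split(current, max_chars, separators))
            return chunks
    # No separator left: fall back to a hard cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Rank chunks by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Invented transcript, for illustration only.
transcript = (
    "We left the village in the spring. The road to the coast was closed.\n\n"
    "My brother was stopped at a checkpoint near the bridge. We never saw him again.\n\n"
    "Years later a neighbour told us about a detention site in the valley."
)
chunks = recursive_split(transcript, max_chars=100)
top = retrieve("checkpoint bridge", chunks, k=1)
print(top[0])
```

In a real system the retrieved chunks would be passed to the LLM together with the extraction prompt, and an Arabic-capable embedding model would replace the word-count vectors, since exact word overlap handles morphological and dialectal variation poorly.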
Application to Oral Archives and Historical War Data
- Oral Narrative Graph: LLM-RAG extracts and organizes key entities from oral archives, constructing a narrative that sequences events, locations, and characters. This enables an intuitive representation of oral history interviews.
- Event Correlation Graph: Directed line segment diagrams visualize historical event interconnections by linking events through extracted knowledge. For instance, linking discussions on science-art integration reveals significant interactions and institutional relationships.
- Character Impression Cloud: Extracting evaluative statements about historical figures from oral records enables sentiment analysis and the creation of impression clouds, which provide insight into public perceptions and historical character evaluations.
- Character Relationship Graph: Graph-based visualization of character interactions helps researchers analyze interpersonal connections based on affiliations, collaborations, and shared events.
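As a rough illustration of the character relationship graph, the sketch below derives weighted links between people from the events they share. It assumes the extraction step has already produced a mapping from events to the people mentioned in them; the mapping, the identifiers, and the function names are hypothetical placeholders.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical extraction output: each event lists the people mentioned in it.
events = {
    "event:E1": ["person:A", "person:B"],
    "event:E2": ["person:B", "person:C"],
    "event:E3": ["person:A", "person:B", "person:C"],
}


def relationship_graph(events):
    """Undirected weighted graph: edge weight = number of events two people share."""
    weights = defaultdict(int)
    for people in events.values():
        # sorted() gives each pair one canonical orientation; set() drops repeats.
        for a, b in combinations(sorted(set(people)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def neighbours(graph, person):
    """People linked to `person`, strongest connection first (ties broken by name)."""
    linked = []
    for (a, b), weight in graph.items():
        if person == a:
            linked.append((b, weight))
        elif person == b:
            linked.append((a, weight))
    return sorted(linked, key=lambda item: (-item[1], item[0]))


graph = relationship_graph(events)
print(neighbours(graph, "person:A"))
```

Co-occurrence is only a starting signal: two people named in the same testimony are not necessarily connected, so edges of this kind would still need the human correction step described earlier.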
Complementing World War I and II Literature
The documentation of missing persons has evolved significantly across historical conflicts, with World War I and II establishing crucial methodologies for record-keeping and legal frameworks. In contrast, Lebanon’s missing persons crisis is primarily documented through oral testimonies, NGO reports, and forensic efforts. Recent advancements in Linked Open Data (LOD) and Natural Language Processing (NLP) provide new opportunities to structure and analyze oral histories in ways that align with archival methodologies from past conflicts. This section explores a comparative analysis of these approaches, highlighting their strengths, limitations, and potential applications for Lebanon’s missing persons crisis.
Documentation Techniques: Structured Archives vs. Oral History
In the aftermath of World War I and II, missing persons were documented primarily through government archives, war crime tribunals, and military records. These centralized records provided structured data that facilitated cross-referencing between official war logs, prisoner-of-war records, and post-war tribunal findings. The WarSampo project, a digital humanities initiative focused on WWII data, has successfully modernized this structured approach by utilizing LOD to interconnect historical documents into a publicly accessible knowledge system.
Conversely, Lebanon’s missing persons crisis lacks such centralized documentation. Instead, it is primarily recorded through oral testimonies, human rights reports, and forensic investigations, making data retrieval and verification highly fragmented. Oral histories provide rich, firsthand accounts but pose challenges in terms of data structuring, accessibility, and long-term preservation. This discrepancy highlights the need for AI-driven approaches, such as knowledge graphs and NLP-based information retrieval, to bridge the gap between archival documentation techniques and oral history narratives.
The Role of Linked Open Data (LOD) in Unifying Dispersed Records
The documentation processes of World War I and II and those of Lebanon’s missing persons crisis differ significantly in their approach to data centralization and accessibility. During the World Wars, missing persons records were meticulously archived by governmental agencies, military institutions, and post-war reconciliation bodies. These records were often stored in classified government repositories and later declassified to allow for public access and academic research. The structured nature of these records allowed historians and policymakers to establish clear patterns, identify casualties, and facilitate legal action. The WarSampo project has demonstrated how these structured records can be further enhanced through LOD, where multiple databases are interconnected to create a rich, interlinked historical narrative.
Lebanon’s case presents a stark contrast. Instead of structured archival records, information about missing persons is scattered across non-governmental organizations, human rights groups, forensic teams, and family-led initiatives. Unlike the systematic documentation efforts seen after the World Wars, Lebanon’s missing persons records are often incomplete, fragmented, and politically sensitive, making large-scale data integration difficult. A WarSampo-inspired approach could address these discrepancies by leveraging LOD to connect oral testimonies, forensic data, and historical accounts. By structuring oral histories into machine-readable formats and linking them with existing legal and forensic records, Lebanon could create a more cohesive and accessible repository.
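A minimal sketch of what such linking could look like: two records from different sources refer to the same person under different identifiers, and an `owl:sameAs` statement connects them so that a query over one identifier also finds the other source. The `rdf:type` and `owl:sameAs` URIs are the standard W3C ones; the `https://example.org/archive/` namespace, the predicate names `mentions` and `documents`, and every record identifier are hypothetical and are not taken from WarSampo or any Lebanese dataset.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
OWL_SAME_AS = "<http://www.w3.org/2002/07/owl#sameAs>"
BASE = "https://example.org/archive/"  # hypothetical namespace


def uri(local_name):
    """Wrap a local name as a full URI in N-Triples angle-bracket form."""
    return f"<{BASE}{local_name}>"


# Invented placeholder records from two different kinds of source.
triples = [
    (uri("testimony/T7"), RDF_TYPE, uri("OralTestimony")),
    (uri("testimony/T7"), uri("mentions"), uri("person/oral-12")),
    (uri("ngo-record/R3"), uri("documents"), uri("person/ngo-88")),
    (uri("person/oral-12"), OWL_SAME_AS, uri("person/ngo-88")),
]


def to_ntriples(triples):
    """Serialize as N-Triples: one 'subject predicate object .' statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def sources_for(person, triples):
    """Collect every record pointing at a person, following owl:sameAs links."""
    aliases = {person}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            # Grow the alias set when exactly one side of a sameAs link is known.
            if p == OWL_SAME_AS and (s in aliases) != (o in aliases):
                aliases |= {s, o}
                changed = True
    return sorted(s for s, p, o in triples if p != OWL_SAME_AS and o in aliases)


print(to_ntriples(triples))
print(sources_for(uri("person/ngo-88"), triples))
```

Deciding when two identifiers really denote the same person is the hard part, and for sensitive records that decision would need human review rather than automatic matching.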
Despite the potential benefits of LOD, several challenges arise when attempting to implement this model in Lebanon. The first issue is data inconsistency. Since oral testimonies are not standardized and often contain variations in storytelling, creating structured and reliable datasets becomes complex. Unlike World War archives, which were maintained by governmental bodies, Lebanon’s documentation relies on disparate sources with differing levels of credibility and preservation. Additionally, political resistance to open-access documentation remains a major obstacle. Many records are either classified, inaccessible, or contested by different factions, hindering efforts to create a unified dataset. Overcoming these challenges requires both technical advancements in data integration and political cooperation to ensure transparency and public accessibility.
References
Act for the Disappeared, https://waynoun.com/en.
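Al Aan TV تلفزيون الآن. “نهيد ضحية النظام السوري .. تروي تفاصيل اختطاف زوجها خلال اجتياح لبنان عام 1980” [Nahid, a victim of the Syrian regime, recounts the details of her husband’s abduction during the 1980 invasion of Lebanon]. YouTube, 21 Apr. 2015, www.youtube.com/watch?v=Q-sIAJ0uYnk.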
Lebanon Memory Archive, https://www.lebanonmemory.com/.
WarSampo, https://www.sotasampo.fi/en/.
Nyhan, Julianne. Oral History as Data? Critically approaching the digital turn in method, meaning and recollection. UCL Centre for Digital Humanities, https://www.ucl.ac.uk/digital-humanities/events/2025/may/oral-history-data-critically-approaching-digital-turn-method-meaning-and.
Yi Sun, Wanru Yang, and Yin Liu. 2024. The Application of Constructing Knowledge Graph of Oral Historical Archives Resources Based on LLM-RAG. In Proceedings of the 2024 8th International Conference on Information System and Data Mining (ICISDM ’24). Association for Computing Machinery, New York, NY, USA, 142–149. https://doi.org/10.1145/3686397.3686420