I received my Ph.D. in Computer Science from the University of Bonn, supervised by Prof. Jens Lehmann (AMiner Most Influential Scholar for Knowledge Engineering, Chief Scientist of Amazon AlexaAI, and Co-Founder of DBpedia). My doctoral research focused on temporal knowledge graph representation learning and reasoning. Before that, I obtained my bachelor's and master's degrees from Zhejiang University.
I have published 20+ papers as the first or corresponding author at top international conferences and in journals in natural language processing (NLP) and data mining (DM), including ICLR, NeurIPS, WWW, SIGIR, EMNLP, and TKDE. In total, I have published over 50 papers with more than 4,000 citations (Google Scholar; h-index: 24, i10-index: 38).
I am an AI Financial Research Scientist at IDEA (GBA) and have been recognized as a Category B Talent under the Shenzhen "Pengcheng Peacock Plan." In January 2025, I co-founded DataArc Technology Inc., an IDEA-incubated company, where I serve as CTO.
LLMs have become indispensable in natural language processing, excelling at tasks ranging from text generation and understanding to multi-step reasoning. Despite their remarkable performance, these models still face challenges with explainability, safety, hallucination, outdated knowledge, and deep reasoning, particularly on knowledge-intensive tasks.
Our research explores knowledge-driven LLM reasoning as a promising approach to these limitations. In this paradigm, an LLM interacts with an external environment consisting of diverse knowledge sources (e.g., KGs, textual corpora, databases, and code repositories) and retrieves the knowledge needed to ground its understanding and generation.
Among these knowledge sources, KGs offer structured, explicit, and editable representations of knowledge. We first propose the algorithmic framework Think-on-Graph (ToG, ICLR'24), in which the LLM iteratively explores a KG, performing beam search over entities and relations to discover explainable reasoning paths. Building on ToG, we introduce Think-on-Graph 2.0 (ICLR'25), a hybrid RAG framework that iteratively retrieves information from both unstructured documents and structured knowledge graphs.
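To make the idea concrete, here is a minimal sketch of a ToG-style exploration loop, not the official implementation: the toy KG, the beam width/depth, and the keyword heuristic standing in for the LLM-based scoring are all illustrative assumptions.

```python
import re

# Toy KG: entity -> list of (relation, neighbor) edges.
KG = {
    "Bonn": [("locatedIn", "Germany"), ("hasUniversity", "University of Bonn")],
    "Germany": [("capital", "Berlin"), ("memberOf", "EU")],
    "University of Bonn": [("foundedIn", "1818")],
}

def llm_score(path, question):
    """Stub for the LLM relevance scorer: count question keywords on the path."""
    text = " ".join(str(x) for step in path for x in step).lower()
    return sum(w in text for w in re.findall(r"\w+", question.lower()))

def think_on_graph(question, start_entity, width=2, depth=2):
    """Beam search over KG paths; each path is a list of (head, rel, tail)."""
    beams = [[("", "start", start_entity)]]
    for _ in range(depth):
        candidates = []
        for path in beams:
            tail = path[-1][2]
            for rel, nbr in KG.get(tail, []):
                candidates.append(path + [(tail, rel, nbr)])
        if not candidates:
            break
        # Keep the top-`width` paths; in ToG the LLM itself prunes here.
        candidates.sort(key=lambda p: llm_score(p, question), reverse=True)
        beams = candidates[:width]
    return beams

for p in think_on_graph("What is the capital of Germany?", "Bonn"):
    print(" -> ".join(f"{h} -[{r}]-> {t}" for h, r, t in p[1:]))
```

The retained paths (e.g., Bonn -[locatedIn]-> Germany -[capital]-> Berlin) are then handed back to the LLM as explicit, inspectable evidence for answer generation.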
Synthetic data has emerged as a critical technology for training and improving large language models, especially when dealing with limited or proprietary data. High-quality synthetic data can significantly enhance model performance while addressing privacy concerns and data scarcity issues.
Our research focuses on developing advanced synthetic data generation frameworks that leverage knowledge graphs to create more accurate and contextually rich training data. We propose Synthesize-on-Graph (SoG), a novel framework that incorporates cross-document knowledge associations for efficient corpus synthesis.
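The sketch below illustrates the cross-document idea under stated assumptions; it is not the released SoG pipeline. Documents are linked through shared entities (the entity annotations here are hypothetical; a real pipeline would use entity linking), and each bridging entity yields a synthesis prompt for an LLM.

```python
from itertools import combinations

docs = {
    "doc1": "Jens Lehmann co-founded DBpedia, a community knowledge graph project.",
    "doc2": "DBpedia extracts structured content from Wikipedia infoboxes.",
}
# Hypothetical entity annotations standing in for an entity-linking step.
entities = {"doc1": {"Jens Lehmann", "DBpedia"}, "doc2": {"DBpedia", "Wikipedia"}}

def cross_doc_pairs(entities):
    """Yield (entity, doc_a, doc_b) where the entity bridges two documents."""
    for (da, ea), (db, eb) in combinations(entities.items(), 2):
        for ent in ea & eb:
            yield ent, da, db

def build_prompt(ent, da, db):
    return (
        f"Using both passages below, write a new paragraph about {ent} "
        f"that combines facts from each.\n"
        f"Passage A: {docs[da]}\nPassage B: {docs[db]}"
    )

for ent, da, db in cross_doc_pairs(entities):
    print(build_prompt(ent, da, db))  # sent to an LLM in a real pipeline
```

Sampling along such entity bridges is what lets the synthesized corpus encode knowledge associations that no single source document contains.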
To address the challenge of long-context reasoning in LLMs, we introduce LongFaith (ACL'25 Findings), which synthesizes faithful long-context reasoning instruction datasets. Furthermore, we explore privacy-preserving approaches through continual pretraining on encrypted synthetic data.
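As a rough illustration of citation-grounded instruction synthesis, loosely in the spirit of LongFaith (the details below are assumptions, not the paper's exact pipeline): numbered passages are packed into a long context, the synthesized reasoning must cite passage IDs, and answers with invalid citations are filtered out.

```python
import re

passages = [
    "The University of Bonn is located in Germany.",
    "DBpedia was co-founded by Jens Lehmann.",
]

def build_instruction(question, passages):
    ctx = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"{ctx}\n\nQuestion: {question}\n"
        "Answer step by step, citing passage IDs like [1] for every claim."
    )

def citations_valid(answer, n_passages):
    """Faithfulness filter: keep answers whose citations point at real passages."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return bool(cited) and cited <= set(range(1, n_passages + 1))

prompt = build_instruction("Who co-founded DBpedia?", passages)
answer = "Jens Lehmann co-founded DBpedia [2]."  # would come from an LLM
print(citations_valid(answer, len(passages)))  # True
```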
See the full publication list on Google Scholar. (*equal contribution, #corresponding author)
xuchengjin@dataarctech.com / xuchengjin@idea.edu.cn
Room 3901, Building 1, Changfu Jinmao Mansion, No. 5 Shi Hua Road, Futian District, Shenzhen