Chengjin Xu
Huawei "Genius Youth" Program · CTO & Co-founder, DataArc · Shenzhen "Pengcheng Peacock Plan"
50+ Papers
4000+ Citations
h-index: 24
i10-index: 38

I received my Ph.D. in Computer Science from the University of Bonn, supervised by Prof. Jens Lehmann (AMiner Most Influential Scholar in Knowledge Engineering, Chief Scientist at Amazon Alexa AI, and co-founder of DBpedia). My primary research focus was temporal knowledge graph representation learning and reasoning. Previously, I obtained my bachelor's and master's degrees from Zhejiang University.

I have published 20+ papers as the first or corresponding author in top international conferences and journals in natural language processing (NLP) and data mining (DM), such as ICLR, NeurIPS, WWW, SIGIR, EMNLP, and TKDE. In total, I have published over 50 papers, with more than 4,000 citations (Google Scholar, h-index: 24, i10-index: 38).

Current Position

I am an AI Financial Research Scientist at IDEA (GBA) and have been approved as a Category B Talent under the Shenzhen "Pengcheng Peacock Plan." In January 2025, I co-founded DataArc Technology Inc., where I serve as CTO. DataArc is incubated by IDEA.

Research Interests

LLM Reasoning & Agents · Knowledge-enhanced LLMs · Multimodal LLMs · Synthetic Data Generation · LLM Pretraining & Fine-tuning

We're Hiring!

DataArc (Shenzhen) is looking for self-motivated interns and co-founding partners. If you are interested in synthetic data technologies, please send your CV to xuchengjin@dataarctech.com.

Knowledge-driven LLMs

Think-on-Graph

LLMs have become indispensable in the field of natural language processing, excelling in various reasoning tasks such as text generation and understanding. Despite their remarkable performance, these models encounter challenges related to explainability, safety, hallucination, out-of-date knowledge, and deep reasoning capabilities, particularly when dealing with knowledge-intensive tasks.

Our research delves into the potential of knowledge-driven LLM reasoning as a promising approach to address these limitations. More specifically, knowledge-driven LLM reasoning lets an LLM interact with an external environment consisting of diverse knowledge sources (e.g., KGs, textual corpora, databases, and code repositories) and retrieve the knowledge needed to improve its understanding and generation.

Among these knowledge sources, KGs offer structured, explicit, and editable representations of knowledge. We first proposed the algorithmic framework Think-on-Graph (ToG, ICLR'24). Building on ToG, we introduced Think-on-Graph 2.0 (ICLR'25), a hybrid RAG framework that iteratively retrieves information from both unstructured documents and structured knowledge graphs.
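To make the idea concrete, below is a minimal, self-contained sketch of the explore-then-reason loop that ToG-style frameworks follow: starting from a topic entity, candidate edges on the KG are scored (in the real framework, by prompting an LLM) and the best paths are kept in a beam until enough evidence is gathered. The toy KG, the `llm_score` stub, and all function names here are illustrative assumptions, not the paper's actual API.

```python
# Toy sketch of a ToG-style beam search over a knowledge graph.
# A real implementation would call an LLM to prune relations/entities
# and to decide when the collected path suffices to answer the question.

# Tiny illustrative KG: (head, relation) -> tail
TOY_KG = {
    ("Bonn", "located_in"): "Germany",
    ("Germany", "capital"): "Berlin",
}

def llm_score(question, path, relation):
    """Stub scorer standing in for an LLM: prefer relations whose
    first word appears in the question."""
    return 1.0 if relation.split("_")[0] in question.lower() else 0.1

def think_on_graph(question, start_entity, max_depth=3, beam=1):
    """Iteratively expand reasoning paths on the KG, keeping the
    top-`beam` paths by (stubbed) LLM relevance score."""
    paths = [[start_entity]]
    for _ in range(max_depth):
        candidates = []
        for path in paths:
            head = path[-1]
            for (h, r), t in TOY_KG.items():
                if h == head:
                    candidates.append((llm_score(question, path, r), path + [r, t]))
        if not candidates:  # no outgoing edges left to explore
            break
        candidates.sort(key=lambda c: -c[0])
        paths = [p for _, p in candidates[:beam]]
    return paths[0]

path = think_on_graph("What is the capital of the country containing Bonn?", "Bonn")
# path traces Bonn -> located_in -> Germany -> capital -> Berlin
```

In the full framework the retrieved path is handed back to the LLM as grounded evidence for answer generation, which is what makes the reasoning explainable and editable.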

Synthetic Data Technology

Synthetic Data

Synthetic data has emerged as a critical technology for training and improving large language models, especially when dealing with limited or proprietary data. High-quality synthetic data can significantly enhance model performance while addressing privacy concerns and data scarcity issues.

Our research focuses on developing advanced synthetic data generation frameworks that leverage knowledge graphs to create more accurate and contextually rich training data. We propose Synthesize-on-Graph (SoG), a novel framework that incorporates cross-document knowledge associations for efficient corpus synthesis.

To address the challenge of long-context reasoning in LLMs, we introduce LongFaith (ACL'25 Findings), which focuses on synthesizing faithful long-context reasoning instruction datasets. Furthermore, we explore privacy-preserving approaches through continual pretraining on encrypted synthetic data.
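The core idea behind cross-document synthesis can be sketched in a few lines: link documents that mention the same entity, then compose a training sample whose answer requires evidence from more than one source. The entity spotter, templates, and data below are illustrative stand-ins for the LLM-driven components of a real SoG-style pipeline, not the actual method.

```python
# Minimal sketch of cross-document synthetic data generation:
# documents sharing an entity mention are linked, and a multi-source
# sample is composed from their combined context.

DOCS = {
    "doc1": "Marie Curie was born in Warsaw.",
    "doc2": "Marie Curie won the Nobel Prize in Physics in 1903.",
}

def entities(text):
    """Naive entity spotter: capitalized bigrams (illustrative only)."""
    words = text.rstrip(".").split()
    return {f"{a} {b}" for a, b in zip(words, words[1:])
            if a[0].isupper() and b[0].isupper()}

def link_docs(docs):
    """Group document ids by shared entity mentions, keeping only
    entities that bridge at least two documents."""
    index = {}
    for doc_id, text in docs.items():
        for ent in entities(text):
            index.setdefault(ent, []).append(doc_id)
    return {e: ids for e, ids in index.items() if len(ids) > 1}

def synthesize(docs):
    """Compose one synthetic sample per cross-document entity link;
    a real pipeline would use an LLM instead of a fixed template."""
    samples = []
    for ent, ids in link_docs(docs).items():
        context = " ".join(docs[i] for i in ids)
        samples.append({
            "question": f"Combining the sources, what do we know about {ent}?",
            "context": context,
        })
    return samples
```

Samples produced this way force the model to integrate facts scattered across documents, which is precisely the association signal that plain single-document synthesis misses.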

See full list at Google Scholar. (*equal contribution, #corresponding author)

2025-2026
  • Shengjie Ma*, Chengjin Xu*, Chaozhuo Li, Haoran Luo, Weizhi Ma, Yingxia Shao, Xing Xie, Jian Guo#. Think-on-Graph 2.0: Deep and Faithful Large Language Model Reasoning with Knowledge-guided Retrieval Augmented Generation. In the 13th International Conference on Learning Representations (ICLR'25). [PDF]
  • Shengjie Ma, Xuhui Jiang, Chengjin Xu, Cehao Yang, Liang Zhang, Jian Guo. Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models. In Learning on Graphs Conference (LoG'26). [PDF]
  • Cehao Yang, Xin Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Anqi Liu, Hao Xiong, Jian Guo. LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data. In Findings of the Association for Computational Linguistics: ACL 2025 (ACL'25 Findings). [PDF]
  • Jian Gu, Xuhui Jiang, Zhichao Shi, Hanwen Tan, Xiaozhi Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A Survey on LLM-as-a-Judge. In The Innovation, 2025. [Link] (1185 citations)
  • Mingxian Peixian, Xin Zhuang, Chengjin Xu, Xuhui Jiang, Rui Chen, Jian Guo. SQL-R1: Training Natural Language to SQL Reasoning Model by Reinforcement Learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS'25). [PDF]
  • Zheng Xu, Bo Ou, Yiqi Qi, Shudong Du, Chengjin Xu, Chengquan Yuan, Jian Guo. ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding. In the 13th International Conference on Learning Representations (ICLR'25, Oral). [PDF]
  • Mingzhi Li, Cehao Yang, Chengjin Xu, Xuhui Jiang, Yiqi Qi, Jian Guo, Hau Leung, Irwin King. Retrieval, Reasoning, Re-ranking: A Context-Enriched Framework for Knowledge Graph Completion. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'25). [PDF]
  • Bingchen Cao, Xin Lin, Yiqi Qi, Chengjin Xu, Cehao Yang, Jian Guo. Financial Wind Tunnel: A Retrieval-Augmented Market Simulator. In Proceedings of the Web Conference 2026 (WWW'26). [PDF]
  • Honghao Liu, Xuhui Jiang, Chengjin Xu, Cehao Yang, Yiran Cheng, Jian Guo. Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs. In Proceedings of the 20th Conference of the European Chapter of the Association for Computational Linguistics (EACL'26). [PDF]
  • Chengjin Xu, DataArcTech Team. DataArc SynData Toolkit: A Modular Synthetic Data Generation Platform. Open-source project, 2026. [GitHub] (1k+ stars)
2024
  • Xuhui Jiang*, Chengjin Xu*, Yinghan Shen, Fenglong Su, Yuanzhuo Wang, Fei Sun, Zixuan Li, Huawei Shen. Toward Practical Entity Alignment Method Design: Insights from New Highly Heterogeneous Knowledge Graph Datasets. In Proceedings of the Web Conference 2024 (WWW'24). [PDF] (Oral Paper)
  • Jiashuo Sun*, Chengjin Xu*, Lumingyuan Tang*, Saizhuo Wang, Chen Lin, Yeyun Gong, Heung-Yeung Shum, Jian Guo#. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model with Knowledge Graph. In the 12th International Conference on Learning Representations (ICLR'24). [PDF] [Code]
  • Xuhui Jiang, Yinghan Shen, Zhichao Shi, Chengjin Xu, Wei Li, Zixuan Li, Jian Guo, Huawei Shen, Yuanzhuo Wang. Unlocking the Power of Large Language Models for Entity Alignment. In The 62nd Annual Meeting of the Association for Computational Linguistics (ACL'24). [PDF]
  • Xuhui Jiang, Yinghan Shen, Zhichao Shi, Chengjin Xu, Wei Li, Huixuan Zihe, Jian Guo, Yuanzhuo Wang. MM-ChatAlign: A Novel Multimodal Reasoning Framework based on Large Language Models for Entity Alignment. In Findings of the Association for Computational Linguistics: EMNLP 2024 (EMNLP'24 Findings). [PDF]
  • Mingzhi Li, Cehao Yang, Chengjin Xu, Zheng Song, Xuhui Jiang, Jian Guo, Hau Leung, Irwin King. Context-aware Inductive Knowledge Graph Completion with Latent Type Constraints and Subgraph Reasoning. In Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI'25). [PDF]
2022 and prior
  • Chengjin Xu, Fenglong Su, Bo Xiong, Jens Lehmann. Time-aware Entity Alignment using Temporal Relational Attention. In Proceedings of the ACM Web Conference 2022 (WWW'22).
  • Chengjin Xu, Mojtaba Nayyeri, Yung-Yu Chen, Jens Lehmann. Geometric Algebra based Embeddings for Static and Temporal Knowledge Graph Completion. In IEEE Transactions on Knowledge and Data Engineering (TKDE), 2022.
  • Chengjin Xu, Fenglong Su, Jens Lehmann. Time-aware Graph Neural Network for Entity Alignment between Temporal Knowledge Graphs. In EMNLP'21.
  • Chengjin Xu, Fenglong Su, Jens Lehmann. Knowledge Graph Representation Learning using Ordinary Differential Equations. In EMNLP'21 (Oral).
News

  • [2026.01] We released DataArc SynData Toolkit, an open-source synthetic data generation platform (1k+ stars).
  • [2026.01] Financial Wind Tunnel was accepted to WWW'26.
  • [2026.01] Synthesize-on-Graph was accepted to LoG'26.
  • [2025.09] SQL-R1 was accepted to NeurIPS'25.
  • [2025.05] LongFaith was accepted to ACL'25 Findings.
  • [2025.01] Think-on-Graph 2.0 was accepted to ICLR'25.
  • [2025.01] ChartMoE was accepted to ICLR'25 as an oral paper.
  • [2025.01] I co-founded DataArc Technology Inc. and serve as CTO. DataArc is incubated by IDEA.
  • [2024.11] Context-aware Inductive KGC was accepted to AAAI'25.
  • [2024.10] Retrieval, Reasoning, Re-ranking was accepted to NAACL'25.
  • [2024.10] MM-ChatAlign was accepted to EMNLP'24 Findings.
  • [2024.09] Our survey paper A Survey on LLM-as-a-Judge was published in The Innovation.
  • [2024.05] ChatEA was accepted to ACL'24.
  • [2024.03] Toward Practical Entity Alignment was accepted to WWW'24 as an oral paper.
  • PC Member: NeurIPS, KDD, ICLR, WWW, AAAI, ACL, EMNLP, ECAI
  • Reviewer: IEEE TKDE, IEEE TNNLS, IEEE/ACM TASLP
  • Honors:
    • ISWC'2020 Best Student Paper Nominee (3/170)
    • Shenzhen "Pengcheng Peacock Plan" Distinguished Talent (Category B)
    • Selected for Huawei "Genius Youth" Program
Contact

    xuchengjin@dataarctech.com / xuchengjin@idea.edu.cn
    Room 3901, Building 1, Changfu Jinmao Mansion, No. 5 Shi Hua Road, Futian District, Shenzhen
