Skip to main content

Text2VQL

Teaching a Model Query Language to Open-Source Language Models with ChatGPT

While large language models (LLMs) like ChatGPT has demonstrated impressive capabilities in addressing various software engineering tasks, their use in a model-driven engineering (MDE) context is still in an early stage. Since the technology is proprietary and accessible solely through an API, its use may be incompatible with the strict protection of intellectual properties in industrial models. While there are open-source LLM alternatives, they often lack the power of proprietary models and require extensive data fine-tuning to realize their full potential. Furthermore, open-source datasets tailored for MDE tasks are scarce, posing challenges for training such models effectively. In this work, we introduce Text2VQL, a framework that generates graph queries captured in the VIATRA Query Language (VQL) from natural language specifications using open-source LLMs. Initially, we create a high-quality synthetic dataset comprising pairs of queries and their corresponding natural language descriptions using ChatGPT and VIATRA parser. Leveraging this dataset, we use parameter-efficient tuning to specialize three open-source LLMs, namely, DeepSeek Coder 1b, DeepSeek Coder 7b, and CodeLlama 7b for VQL query generation. Our experimental evaluation demonstrates that the fine-tuned models outperform the base models in query generation, highlighting the usefulness of our synthetic dataset. Moreover, one of the fine-tuned models achieves performance comparable to ChatGPT.