22.12.2021 | Research & Teaching

A new approach to simplify the use of databases for everyone



In a bachelor thesis, a novel natural language processing approach was developed that makes databases, such as SQL databases, usable by everyone.

In recent decades, large amounts of data have been generated that need to be stored. Modern databases are therefore a fundamental innovation: they store data so that it can be read easily and put to productive use. However, since most people lack knowledge of database languages such as SQL, a barrier prevents them from fully harnessing the potential of databases. In a bachelor thesis at the DSC, an approach based on natural language processing and machine learning was developed to make databases usable by everyone.

Marcel Franzen’s bachelor thesis “Implementation of an automatic transformation of multilingual language input into database queries” (original title “Implementierung einer automatischen Transformierung multilingualer Spracheingaben in Datenbankabfragen”) presents a novel approach that extends existing approaches to multilingual use, so that input in different languages produces the same results. As a result, end users with little or no knowledge of SQL can generate database queries.

In his work, Mr. Franzen uses word embeddings, i.e. words encoded as vectors, for natural language processing (NLP). The vectors of each language initially lie in their own vector space; these spaces are then mapped into a common vector space. As a result, words with high semantic similarity have vectors that lie close to each other, and the same word in different languages also has highly similar embeddings.
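The idea of a common vector space can be illustrated with a small sketch. The article does not say how the thesis maps the spaces onto each other, so the code below assumes one simple option: a single rotation learned from a short seed dictionary of known translation pairs. All vectors, words and function names are invented for illustration and are not taken from the thesis.

```python
import math

# Toy 2-D embeddings. All vectors are invented for illustration.
english = {
    "player": (1.0, 0.0),
    "team": (0.8, 0.6),
    "nationality": (0.0, 1.0),
}
# The German space has the same geometry but is rotated by 90 degrees,
# so raw coordinates are not comparable across the two languages.
german = {
    "spieler": (0.0, 1.0),
    "mannschaft": (-0.6, 0.8),
    "nationalität": (-1.0, 0.0),
}

# Hypothetical seed dictionary of known translation pairs (German, English).
seed_pairs = [("spieler", "player"), ("mannschaft", "team")]


def fit_rotation(pairs):
    """Angle of the rotation that best maps German seed vectors onto English ones."""
    dot = cross = 0.0
    for de, en in pairs:
        gx, gy = german[de]
        ex, ey = english[en]
        dot += gx * ex + gy * ey
        cross += gx * ey - gy * ex
    return math.atan2(cross, dot)


def rotate(vec, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    x, y = vec
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))


def cosine(a, b):
    """Cosine similarity of two 2-D vectors."""
    norm = math.hypot(*a) * math.hypot(*b)
    return (a[0] * b[0] + a[1] * b[1]) / norm


def nearest_english(german_word, theta):
    """Map a German word into the English space and return the closest English word."""
    mapped = rotate(german[german_word], theta)
    return max(english, key=lambda w: cosine(mapped, english[w]))


theta = fit_rotation(seed_pairs)
# "nationalität" was not part of the seed dictionary.
print(nearest_english("nationalität", theta))
```

In this toy setting the rotation is learned from two word pairs only, yet it also places the unseen German word next to its English counterpart. Real embeddings have hundreds of dimensions, and the mapping would be estimated with a matrix method rather than a single angle.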
Thanks to these properties, questions can be posed to a database in German as well as in English and are then converted into the corresponding SQL queries. To this end, the question and the names of the table columns are transformed into vector representations by word embeddings, which are then processed further by neural networks. Several smaller models each predict an individual part of the SQL query, and together these parts form the complete query. Because the approach is multilingual, the language of the question may differ from the language of the table columns. Furthermore, the language of the dataset used to train the models becomes irrelevant.

Question: What is Terrence Ross's nationality?
Table: player nationality position years in toronto team
SQL query: SELECT nationality FROM table WHERE player = 'terrence ross'
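
The example above can be reproduced in a short sketch that assembles a query from separately predicted parts and runs it. The neural models themselves are not shown; `PredictedParts` simply holds values that such models might output, and its name, the helper functions, the column split of the table header and the sample row are all assumptions made for illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class PredictedParts:
    """Hypothetical outputs of the small per-clause models."""
    select_column: str
    where_column: str
    where_value: str


def quote_identifier(name):
    # Double-quote identifiers so that names containing spaces, or the
    # reserved word "table", remain valid SQL.
    return '"' + name.replace('"', '""') + '"'


def assemble_query(parts, table="table"):
    """Combine the predicted parts into one parameterised SQL query."""
    sql = (
        f"SELECT {quote_identifier(parts.select_column)} "
        f"FROM {quote_identifier(table)} "
        f"WHERE {quote_identifier(parts.where_column)} = ?"
    )
    return sql, (parts.where_value,)


# In-memory table; the column split and the row are illustrative placeholders.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "table" '
    '(player TEXT, nationality TEXT, position TEXT, "years in toronto" TEXT, team TEXT)'
)
conn.execute(
    'INSERT INTO "table" VALUES (?, ?, ?, ?, ?)',
    ("terrence ross", "united states", "guard-forward", "2012-present", "washington"),
)

# Values that the individual models might predict for the question above.
parts = PredictedParts(
    select_column="nationality",
    where_column="player",
    where_value="terrence ross",
)
sql, params = assemble_query(parts)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

The value is passed as a query parameter rather than pasted into the SQL string, which avoids quoting problems when the predicted value contains an apostrophe.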
This research is joint work between the AGRA group of Department 3 and the Data Science Center. It was submitted as a bachelor thesis at the University of Bremen.

We congratulate Mr. Franzen on passing his colloquium and wish him continued success in his professional and private life.

Author: Christopher Metz
Are you interested in writing a thesis with us?

Please contact:

Dr. Lena Steinmann
DSC Coordinator
+49 (421) 218 - 63941
lena.steinmann@uni-bremen.de


