The topic of this proposal is the finite model theory of concatenation (FC), a new logic that combines approaches from two different subfields of theoretical computer science and that has direct applications in information extraction and database theory, and eventual applications in all areas that deal with searching in or filtering of textual data.
Textual data is everywhere in modern life: No matter whether we are dealing with books, report, emails, social media, spreadsheets, or even HTML documents or log files, a huge amount of information is represented as a sequence of letters.
Information extraction (IE) deals with the problem of extracting relevant information from such data. One model for this are document spanners, a relational framework for IE that can be understood as querying text like one would query a database. Due to their similarity to relational queries, document spanners have recently become a trending topic in database theory.
Database theory uses methods from finite model theory (FMT), a subdiscipline of logic in computer science. FMT provided the inspiration for relational database, and both fields have maintained a close connection over these last fifty years. Of particular importance are methods for efficient evaluation of queries and deep insights of what can and cannot be expressed with certain types of queries. Every result in FMT is also a result on databases, as there is an immediate onetoone correspondence between logical formulas and database queries: Every database query corresponds to a logical formula, and every logical formula corresponds to a database query.
This is wellunderstood for relational databases, but it does directly translate to the textual setting of document spanners: The canonical FMTapproaches to text are provably too weak to model document spanners. But to be able to evaluate, optimize, and compare queries on texts, having this for document spanners would be tremendously useful.
As first steps in this direction, previous research used "the theory of concatenation", a different logic from combinatorics on words (CoW), another subdiscipline of theoretical computer science, that studies patterns in texts. But this logic corresponds to infinite model theory, not finite model theory, which makes many useful techniques from FMT unavailable.
To address this problem, the proposed new logic FC combines the theory of concatenation from CoW with the finite model approach from FMT. As shown in the preparatory research, FC has exactly the required expressive power of document spanners while still allowing the use of many classical FMT techniques.
Apart from the connection to document spanners, FC can be seen as a natural variant of the logics that have previously been used: From an FMTpoint of view, FC can be seen as extending the canonical firstorder logic on strings with a string equality operator. From a CoWpoint of view, FC is the finite model version of the theory of concatenation. As such, FC is a fundamental framework for using logic on textual data, and it can directly be extended to cover other operations, like length constraints as they are used in string verification.
The preparatory research introduces FC, proves it has the desired onetoone corresponds to document spanners, and shows that many FMTtechniques can be adapted. But many open questions remain, in particular regarding two of the most important aspects: The efficient evaluation of queries, and the fundamental limitations of the framework  or, less formally, which queries can be evaluated quickly, which algorithms should be used for this, and what is possible in this querying model?
This project will close this gap by creating a solid theoretical foundation for FC. This will require new proof techniques that at the very least merge those from CoW and FMT, or even go far beyond them. The ultimate goal is that FC will do for querying text what FMT did for databases.
