DPRI_doc_20-21/Przybylski/kompilator/Wizja.md

10 KiB

Project Vision Document

Project name: MealyCompiler (Solomonoff)

Authors: Aleksander Mendoza, Bogdan Bondar, Marcin Jabłoński

Date: 13.12.2020

1. Executive summary

This project focuses on research in the field of automata theory and inductive inference. While many existing libraries already provide support for general purpose automata and implement various related algorithms, this project takes a slightly different approach. The primary tool for working with the library, is through domain specific language of regular expressions. Most of the things can be done without writing even a single line of Java code. We provide commandline as well as web interface and integrated build system.

The main applications concern formal methods, natural language processing, state-based system modeling, pattern recognition, inductive inference and machine learning. It can be of great help for researchers as well as can be used on industrial scale.

The greatest competitor for Solomonoff is Google's OpenFST project. Solomonoff introduces dozens of improvements. The most important being efficiency improvements, symbolic ranges, nondeterminism detection and type system.

2. Goal and target audience

The goal is to provide a better alternative for OpenFST.

OpenFst was a niche project with sparse documentation and neglected user interface. Everything is done primarily in C++ with templates, while commandline interface is basic, regular expression language has many fundamental flaws, their build system consists of tool for generating Makefiles that only call the compiler. Solomonoff strives to bring improvement with the following features:

  • We have online REPL with examples and guides where everything works out of the box. Everything can be done with our rich and carefully designed regular expression language, so there is no need to write code.

  • We provide extensive documentation in form of 3 scientific papers, detailed technical documentation of compiler implementation, extensive GitHub page and interactive REPL tutorial.

  • We ship with integrated build system that supports parallelism and interacts with compiler directly for optimum performance. It can be configured in TOML build files.

  • Everything can be done in regular expressions and there is no need to interact with compiler's API (althought we do provide as an addition).

  • There are plentiful of technical improvements, performance optimisations and innovative algorithms. All of them are described in detail in our documentation. We also provide performance benchamrks on our GitHub

  • Unlike any other existing automata tool, we ship with out-of-the functions for box inductive inference.

The project found approval among members of Samsung's R&D team for Bixby developement and inductive inference researchers from Dortmund Technical University, Germany. Our system will be deployed in Samsung, our website can aid linguists with easily learning Solomonoff and our build system should make commandline usage more accessible.

The main products are:

  • compiler backend - the core of the project
  • commandline interface with build system - allows more technical users to automate Solomonoff with shell scripts
  • website and REPl - allows less technical users to experiene Solomonoff without much friction.

Our project brings performance, which was measured with benchmarks (available on GitGub). We provide innovation, which can be easily noticed by the sheer amount of features Solomonoff implements, which are not available in any other existing tool (more information in scientific papers). We are user-friendly and we assesed it during usability tests and by collecting feedback from end users.

3. Market

Currently there exists only one serious alternative, which is OpenFST library with their Thrax extension for writing regex-like grammars. Their solution has numerous problems. It's focus on probabilistic approach to modeling nondeterminism, made the library quite slow. It also became a double-edged sword, by making rule-based system difficult to maintain (compiler doesn't warn programmer when nondeterminism causes some rules to overshadow others). Compilation of grammars is lacking in many aspects. The grammar expression language is very basic and obscure. Compiler is not parallized and highly inefficient. On top of that, the probabilistic approach.

  • Our solution completely gets rid of nondeterminism. We will not attempt to model any probabilistic models. We will focus more heavily on making the expression language user-friendly and helpful in detecting potential non-determinism. The compiler should be able to process multiple rules in parallel to make compilation time faster. We implement Glushkov's construction and take advantage of special properties of functional transducers. For better user experience and easier management of code, we provide integrated build system and online REPL with extensive documentation.

There used to be another similar project, developed by AT&T but it's been long discontinued and replaced by OpenFST. Aside from those two competitors, the market is very niche. There do not exist any other tools (or at least, they are not available publicly) strictly for compiling transducer. The closest other competitors might include general-purpose automata libraries like Google's RE2 or Anders Møller's BRICS library. However, those projects are fundamentally different as they implement classical automata instead of tranasducers. As a result, Solomonoff is strictly more powerful and superior to those solutions. There is no other automata compiler with support for inductive inference. There is no ther symbolic transducer library (although there exist general-purpose libraries for symbolic automata, like symbolicautomata, Microsoft Automata Library, Rex, Bex, Fast, Mona).

4. Product description

A simple and efficient library written in C will be the main and primary component of our product. On top of that, it will have command-line interface equipped with compiler. For easy and quick access, we should support online repl for all curious people who want to give our library a try. The compiler should support parallelism, warn user about non-determinism and allow for possibly some extent of generic programming (by defining functions working on regular expressions or bulk-generation of rules according to some regularities). We should ,however pay extra attention, to not making this language turing complete/undecidable by accident (otherwise compilation might never end).

  • simple and efficient compiler-backend written in Java

    • regular expression (concatenation, Kleene closure, union, composition, projection, inverse, composition, difference)
    • algorithms of inductive inference
    • type system
    • integration with LearnLib
    • is optimised for functional ranged transducers (symbolic automata)
  • REPL and build system

    • support for parallelism
    • non-determinism warnings
    • packaging system
    • dependency resolver
    • supports everything that compiler does
    • additional directives
    • TOML configurations
  • online repl and interactive tutorial

    • can write regular expressions on-the-fly
    • has all functions of the compiler
    • saves work of user (cookies and session)
    • syntax highlighting
    • user can download the effects of their work for their local computer
    • visualizes graphs of automata
    • provides technical documentation

Project architecture and key components:

  • there is all theoretical work and background written in our PDF
  • theoretical paper serves as basis for formal specification of library functions
  • Java compiler uses as few libraries as possible: ANTLR for parsing, LearnLib for inductive inference
  • REPL is developed on top of compiler backend and is not included in the compiler itself (although compiler provides certain facilities necessary for implementing REPL)
  • build system is developed and shipped independently from compiler backend
  • online REPL with backend in Spring, uses compiler's Java API.
  • the REPL used by build system and web brwoser has overlaping features but due to inherent differences between the two enviroments their implementations differ a little.
  • website contains documentation, examples and tutorials

Our target audiences include:

  • researchers who want to:
    • have some ready-made library with functionalities they need for experiments.
    • verify formal properties of complex sequential systems
  • linguists
    • working on rule-based translation
  • machine-learning experts
    • who work in fields related to natural language processing who use Solomonoff inference
    • Companies developing artifical inteligence systems who might use it as one of their tools. (Especially if their employees are any of the people above)

5. Scope and limitations

Work schedule:

  • first semester
    • formal specification
    • C library prototype (unoptimised)
    • compiler prototype
    • online repl prototype (with dummy compiler functionality)
  • second semester
    • optimised Java compiler
    • compiler usable in practice and tested on real life users
    • online repl integrated with compiler
    • machine learning algorithms
    • build system

Team:

  • Aleksander Mendoza
    • formal specification and theoretical foundations
    • compiler implementation (Java)
    • unit tests (JUnit)
  • Bogdan Bondar
    • web design (Bootstrap)
    • backend development (Spring)
    • integration tests (Selenium)
  • Marcin Jabłoński
    • build system (Java)
    • REPL (ANTLR)
    • assisting with compiler implementation (initial but discontinued C version)

Limitations:

  • no official support for mobile devices. Website is responsive but no mobile-optimised version was added. End users should not code on their smartphones, which is not a common practice either way.
  • no/limited compiler backend for embedded devices. While JavaME does exist, the compiler was not meant to be deployed in embedded systems.
  • no (centralised) package manager for build-system. Developing package manager infrastructure is a prohibitively expensive operation.