Project Requirements Document

Project name: MealyCompiler (Solomonoff)

Authors: Aleksander Mendoza, Bogdan Bondar, Marcin Jabłoński

Date: 8.01.2021

0. Document version

  • 13.12.2020 - initial version
  • 8.01.2021 - minor improvements and final touches

1. Project's components (project's products)

Done in first semester:

  • C compiler backend (discontinued, because requirements shifted more towards Java)

    • regular expression (concatenation, Kleene closure, union, output)
  • Java prototype

    • regular expression (concatenation, Kleene closure, union)
    • type system
  • Online REPL prototype

    • runs C backend in WebAssembly
    • Ace editor with syntax highlighting

Second semester:

  • simple and efficient compiler-backend written in Java

    • regular expression (concatenation, Kleene closure, union, output, composition, projection, inverse, difference)
    • algorithms of inductive inference
    • type system
    • integration with LearnLib
    • optimised for functional ranged transducers (symbolic automata)
    • parser in ANTLR
  • REPL and build system

    • support for parallelism
    • non-determinism warnings
    • packaging system
    • dependency resolver
    • supports everything that compiler does
    • additional directives
    • TOML configurations
  • online REPL and interactive tutorial

    • users can write regular expressions on the fly with access to all functions of the compiler (REST calls to a Spring backend, which calls the compiler's Java API)
    • saves the user's work (cookies and session)
    • syntax highlighting (Ace editor)
    • users can download the results of their work to their local computer
    • visualizes graphs of automata (uses viz.js)
    • provides technical documentation (formulas with MathJax)
  • tests

    • integration tests in Python with Selenium (Firefox + Chrome)
    • all invariants, preconditions and postconditions of the specification expressed in the form of assertions. Runtime analysis of the specification with JUnit.
    • automatically generated tests for random automata
    • performance benchmarks
    • usability tests
  • theory and specification

    • scientific papers explaining the theory with appropriate mathematical rigour
    • papers with proofs of correctness of essential algorithms

2. Project limitations

  • The minimum required Java version is 1.8. Oracle dropped support for older versions long ago.
  • The website makes minimal use of CSS3, but older browsers should still be able to use it.
  • Internet Explorer is not supported, because Microsoft stopped developing it.
  • We did not test the website on Safari and Edge, but they should work as well.
  • The build system and command-line interface work on all systems that can run Java. Embedded devices are not supported, as such a use case is unlikely. In the future we might add a lightweight runtime that can execute automata in embedded environments.

Justifications:

  • Initially we started writing the compiler in C for best performance. Over the course of development it turned out that the performance gains were minimal compared to Java, while writing C code was much slower than higher-level development in Java. Moreover, the Samsung infrastructure relies heavily on Java, and we found out that Java libraries are always preferred over C. Later we also established cooperation with LearnLib from Dortmund University, whose entire library is written purely in Java. Hence we decided to switch to Java for better compatibility.
  • We decided to make a website, because this technology is universally accessible. A mobile app would require installation (and a touch screen would be uncomfortable for writing regexes), a command-line interface is only accessible to advanced users, and desktop GUI apps require downloading, installation and setup. An online REPL makes Solomonoff easily accessible to the masses.
  • The build system was implemented in Java for compatibility with the compiler backend. It is primarily targeted at more advanced users and large projects. The build system allows working with multiple files, which extends the compiler backend, as the backend alone is only capable of working with monolithic streams of code.
  • We considered adding user authorization, but we decided to keep things simple. Cookies and downloads are our only means of permanent storage. Hiding our REPL behind a "login wall" could turn away some impatient users. Many demo websites similar to ours follow the same strategy and do not retain any user data.
  • There are plenty of compiler features that we purposely did not implement. We do not support probabilistic automata, because their semantics tend to be unpredictable and difficult to control with regexes. We do not allow epsilon transitions, which enables many optimisations. More such examples and technical details can be found in our documentation.
  • The build system does not support namespaces. Instead we took an approach similar to C, where "modules" are not a first-class language feature and are based on a naming convention. When it comes to language features, we are strong believers in simplicity and follow the mantra of "less means more".

3. List of functional requirements

  • Java API:
    • load/save transducer from/to file
    • compile regular expression
    • run transducer
    • create multiple independent instances of compiler that can work in parallel
  • Build system
    • load one or more files
    • use transducers defined in other files
    • compile files in project in parallel
    • define list of source files in build configuration
    • store many independent configuration files, even in the same directory
    • run REPL after building project
  • Online REPL
    • open website and follow tutorial (shows additional tips on the first visit)
    • compile a larger piece of code and then experiment with it in REPL
    • compile code line by line in REPL
    • read the technical documentation
    • reopen website and continue where you left off (some state might be lost after a time limit, because the server should not store compiler instances indefinitely)
    • download work progress locally
    • go to GitHub page/download compiler and build system JAR
  • Language functionalities:
    • union, concatenation, Kleene closure, output, composition, difference, inversion, identity, clear output
    • inference: RPNI, RPNI-EDSM, RPNI-MEALY, OSTIA
    • weights, reflections, functional nondeterminism
    • ambiguous nondeterminism detection, typechecking
    • lazy composition, linear programs, Hoare triples
    • external native functions, optional user extensions
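
The Java API requirement that several independent compiler instances can work in parallel can be sketched as follows. `ToyCompiler` is a hypothetical stand-in invented for this example and does not reflect the real Solomonoff API. The sketch only shows the design idea: each instance keeps all of its state in instance fields (no static state), so instances can run on separate threads without interfering.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a compiler instance; NOT the real Solomonoff API.
final class ToyCompiler {
    // All state is per instance, so separate instances never share data.
    private final Map<String, String> definitions = new HashMap<>();

    // "Compiles" a line of the form  name = body  by simply storing it.
    void compileLine(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("expected 'name = body'");
        }
        definitions.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
    }

    int definitionCount() {
        return definitions.size();
    }
}

final class ParallelCompilersSketch {
    // Compiles every source with its own ToyCompiler on a small thread pool
    // and returns the number of definitions per source, in input order.
    static List<Integer> compileAll(List<List<String>> sources) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (final List<String> source : sources) {
                Callable<Integer> task = () -> {
                    ToyCompiler compiler = new ToyCompiler(); // one instance per task
                    for (String line : source) {
                        compiler.compileLine(line);
                    }
                    return compiler.definitionCount();
                };
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> f : futures) {
                counts.add(f.get());
            }
            return counts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while compiling", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("compilation task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The same idea applies to the build system requirement of compiling the files of a project in parallel: one task per file, with results collected in a deterministic order.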

4. List of non-functional requirements

  • scientific papers describing the theory in detail
  • end user tests
  • integration in Samsung
  • integration with LearnLib
  • performance benchmarks
  • accessible and easy tutorials even for less technical users like linguists

5. Measurable indicators

  • efficiency benchmarks on large datasets of regular expressions
    • RAM usage
    • disk usage
    • execution speed
    • compilation speed
  • list of features
  • contributions to LearnLib
  • deployment on http://solomonoff.projektstudencki.pl/
  • unit tests
  • integration tests
  • user experience feedback

6. Acceptance criteria for first semester

  • required:
    • C compiler implementation:
      • union
      • concatenation
      • Kleene closure
      • output
      • execution
    • Java prototype, theory and specification
      • Glushkov's construction with variables
      • type system
      • nondeterminism detection
      • binary search execution
    • online compiler
      • WebAssembly bindings
      • Ace editor and syntax highlighting
      • website design
  • expected:
    • usable precompiled delivery
    • optimised algorithms
    • additional operations (composition, inverse, subtraction)
  • planned:
    • support for formal verification
    • inductive inference
    • optimisations
    • fully developed compiler
    • tutorials, examples, how-tos
    • extensive testing
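
One plausible reading of the "binary search execution" item is that each state stores its outgoing transitions as sorted symbol ranges, and the transition matching an input symbol is found by binary search instead of a linear scan. The sketch below illustrates that idea for a small ranged Mealy machine. The class `RangedMealySketch`, its array layout and the example machine are assumptions made for illustration, not the data structures used in Solomonoff.

```java
import java.util.Arrays;

// Hypothetical sketch of a ranged (symbolic) Mealy machine; the layout is an assumption.
final class RangedMealySketch {
    private final int[][] bounds;     // bounds[s]: strictly ascending inclusive upper bounds of state s's ranges
    private final int[][] targets;    // targets[s][i]: next state for range i, or -1 for "no transition"
    private final String[][] outputs; // outputs[s][i]: string emitted when range i is taken
    private final boolean[] accepting;

    RangedMealySketch(int[][] bounds, int[][] targets, String[][] outputs, boolean[] accepting) {
        for (int[] b : bounds) {
            for (int i = 1; i < b.length; i++) {
                if (b[i - 1] >= b[i]) {
                    throw new IllegalArgumentException("bounds must be strictly ascending");
                }
            }
        }
        this.bounds = bounds;
        this.targets = targets;
        this.outputs = outputs;
        this.accepting = accepting;
    }

    // Example machine with one state: letters a-z emit "L", digits 0-9 emit "D".
    // Range i covers the symbols above the previous bound up to and including bound i.
    static RangedMealySketch lettersAndDigits() {
        return new RangedMealySketch(
            new int[][]{{'/', '9', '`', 'z'}},
            new int[][]{{-1, 0, -1, 0}},
            new String[][]{{"", "D", "", "L"}},
            new boolean[]{true});
    }

    // Returns the produced output, or null when the input is rejected.
    String run(String input) {
        StringBuilder out = new StringBuilder();
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            int idx = Arrays.binarySearch(bounds[state], input.charAt(i));
            if (idx < 0) {
                idx = -idx - 1; // insertion point: first bound greater than the symbol
            }
            if (idx >= bounds[state].length || targets[state][idx] < 0) {
                return null; // no range covers this symbol
            }
            out.append(outputs[state][idx]);
            state = targets[state][idx];
        }
        return accepting[state] ? out.toString() : null;
    }
}
```

With this layout, a lookup costs O(log k) for a state with k ranges, and a whole range of symbols such as a-z occupies a single entry regardless of how many symbols it contains.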

7. Acceptance criteria for second semester

  • required:
    • fully usable optimised compiler with all additional features
    • working with multiple source files
    • inductive inference
    • tutorials, examples, how-tos
    • compatibility with client's existing Java infrastructure
    • compatibility with LearnLib
  • expected:
    • secondary compiler features (graph visualisation, export/import, external utility functions)
    • parallel compilation
    • configurable build system
    • inductive inference artifacts as build dependencies
    • scripts for automated integration tests
    • great performance benchmarks
    • detailed technical documentation
  • planned:
    • partial inductive inference (OSTIA-C) for LearnLib
    • Thrax-Solomonoff converter for backward-compatibility with legacy systems
    • Video tutorials
    • advanced online code editor/full online IDE
    • extensible build system with plugins and repositories

8. Project work organization

  • Aleksander Mendoza (Product owner)
    • Glushkov's construction
    • weighted transducers
    • inductive inference
    • nondeterministic minimization
  • Bogdan Bondar (implementation)
    • Spring backend
    • frontend
    • compiler integration
    • testing (Selenium, unittest, JUnit)
  • Marcin Jabłoński (implementation)
    • build system (Java)
    • REPL
    • dependency resolver
    • compiler development (assistance and C implementation)
    • compiler extension for handling multi-file projects

Aleksander Mendoza is responsible for finding clients and communicating with them.

Initially our team attempted to use Scrum, but later we switched to an incremental methodology, because the workflow relied heavily on specification and long-term planning. Scrum's main advantage lies in its flexibility, which was not key for this project. It also imposed unrealistic and unnatural team dynamics, which only made work more complicated than it had to be. Scrum gives all team members a high degree of independence and autonomy, and in daily Scrum meetings implementers describe the progress they have made. In our project, on the other hand, the specification is more rigid and work progresses according to it. Hence, it is always well understood who does what at which moment, and future tasks are generally known ahead of time.

Tools:

  • JIRA
  • git & GitHub
  • CircleCI
  • Selenium for integration tests
  • MS Teams for video chats, Messenger for quick daily chats

We created a full, detailed list of planned tasks at the beginning of the semester and tried to follow it, but we also added unforeseen tasks on a rolling basis as needed. Every task corresponded to a tangible feature, and implementing that feature closed the task.

9. Project risks

  • The most important risk of our project was its heavy reliance on advanced theoretical concepts. It required plenty of rigour to make sure our foundations were correct and well defined. Should anything in our understanding of automata be wrong, the whole project would be at risk of becoming irrelevant.

  • The second most critical concern was time. There was plenty to do and very little time. It was hard to estimate how long any of the tasks would take. While missing initial deadlines due to unforeseen complications is typical for software engineering projects, our project was exposed to such risks on a much larger scale. Should anything be wrong in the formal specification, it could require months of additional research. In the worst case, if there was a mistake, some goals might turn out to be mathematically impossible. For this reason our team had to be rigorous about its promises.

  • The organization of work was a challenge. Project requirements often forced us to learn new technologies and solve nontrivial problems. Our team often got stuck on challenging problems, and sometimes we had to change plans because some of them turned out to be technically infeasible:

    • we struggled with JWebAssembly and in the end switched to Spring
    • the low-level C implementation was progressing too slowly and we faced the risk of not delivering on time
    • after the first semester we had gained plenty of experience developing the Java prototype and noticed many details that could be done better than initially planned. We made the drastic decision to rewrite the compiler in Java, which was seen as risky.

    Due to these and many other difficulties, our team could have failed on multiple occasions.

  • Our project is very niche and finding clients is not easy. If any of our clients lost interest in our solution, finding a new one might prove impossible.

10. Milestones

  • proof of concept and first implementation of Glushkov's construction (Deadline: end of first semester)
  • proof of concept for type system (Deadline: end of first semester)
  • establishing relations with Dortmund University
  • preparing code for adoption in Samsung. It requires writing a very specific feature that converts a legacy codebase from Thrax to Solomonoff. (Deadline: end of 2020)
  • getting build system ready (Deadline: end of 2020)
  • testing (Deadline: end of January 2021)
  • full integration in Samsung (Deadline: February 2021)