Lexesis

A language agnostic lexical analyser generator

Introduction

Lexesis is a language agnostic lexical analyser generator. It takes a description of tokens in the form of regular expressions and outputs source files for a lexer/scanner, which can then be used in building a larger application. The output can be in any language for which a backend has been built; currently only C++ is available. Its principle is very similar to that of well-known tools such as lex or flex, with the difference that Lexesis has a simpler input format and does not depend on language-specific actions being specified in the configuration file. The token description is independent of any programming language, so the same lexical analyser specification can be reused across different programming languages.

This project came into existence as an application exercise in a course on regular languages and automata for the University of Antwerp.

Requirements

  • git
  • CMake 3.2.2+
  • Boost variant header library (needed for mstch)

For those still on Ubuntu Trusty, the default CMake version is 2.8.12, which is older than the required 3.2.2. A PPA with a more up-to-date version is available.

Run

sudo apt-get update && sudo apt-get -y install software-properties-common; \
sudo add-apt-repository -y ppa:george-edison55/cmake-3.x; \
sudo apt-get update && sudo apt-get install -y cmake

to get this newer version.

Used dependencies

The following dependencies will be automatically downloaded with git while building

Building

Open a terminal in the source tree and run the following commands:

mkdir build
cd build
cmake ..
make
make install

This will place the Lexesis executable in the build/bin folder, with some extra data needed by Lexesis in build/share. You can now simply run ./bin/Lexesis with the arguments you like (see below and in the man pages for an overview).

Running tests

First, build Lexesis in debug mode. The only difference from the normal build is the line where you call cmake. That line should read

cmake .. -DCMAKE_BUILD_TYPE=Debug

instead. Afterwards, build all the examples as well, since they are used to test the functionality of the generated lexers.

If everything is correctly built, the only thing that remains is to call

python3 run_tests.py

in the project root, and watch the results.

Getting started

Now that Lexesis is successfully built and your terminal is in the build folder, it's time to generate the lexer based on your input file.

The input file

Input files for Lexesis have a .lxs extension and follow a few very simple rules. Each line specifies a new type of token, and each token type has a different priority: the highest at the top of the file and the lowest at the bottom. If your input matches more than one of the regexes in your input file, the generated lexer will choose the token with the highest priority. A line begins with the name of the new token type, followed by a = and finally the regex used to match tokens of that type. To add a comment to the file, start the line with a # and Lexesis will ignore that line.

Consider the following example:

CAPITAL = [A-Z]
NUMBER = [0-9]

# This is a comment
ALL = [a-zA-Z]

Here we have 3 different tokens: CAPITAL, NUMBER and ALL. Note that token names should consist only of capital letters, lowercase letters and underscores. Other characters are not recommended, so that the names work with as many backends as possible. When we run A through the generated lexer, it will report a CAPITAL, since that is specified higher than ALL.
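The priority rule can be illustrated with a small sketch. This is not how Lexesis is implemented, and it does not use the Lexesis regex syntax: it assumes Python's re module as a stand-in, and the helper names parse_rules and classify are made up for this illustration. It reads the example above, skips blank lines and comments, and returns the first (highest-priority) token type whose regex matches the whole input.

```python
import re

def parse_rules(text):
    """Parse .lxs-style text into (name, compiled regex) pairs, highest priority first."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines are ignored
        # Split on the first '=' only, so the regex itself may contain '='.
        name, _, pattern = line.partition("=")
        rules.append((name.strip(), re.compile(pattern.strip())))
    return rules

def classify(rules, lexeme):
    """Return the name of the first rule that matches the whole lexeme, or None."""
    for name, regex in rules:
        if regex.fullmatch(lexeme):
            return name
    return None

SPEC = """\
CAPITAL = [A-Z]
NUMBER = [0-9]

# This is a comment
ALL = [a-zA-Z]
"""

rules = parse_rules(SPEC)
print(classify(rules, "A"))  # CAPITAL (also matches ALL, but CAPITAL is listed first)
print(classify(rules, "q"))  # ALL
print(classify(rules, "7"))  # NUMBER
```

Because the rules are kept in file order, moving ALL above CAPITAL in the specification would make A come out as ALL instead.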

Regular expressions

More examples

More examples can be found in the examples subdirectory; go ahead and have a look at them. Feel free to play around and experiment with them. The keywords example simply prints the lexed token type, along with its content, while the SyntaxHighlighter example is a bit more complete, making use of multiple lexers at once and providing a simple form of syntax highlighting for XML in the terminal using ANSI escape codes. Be aware, however, that it also accepts anything that even remotely looks like XML, since the regular languages underlying the lexers cannot verify or parse XML. Additionally, the SyntaxHighlighter example contains a couple of CMake rules that allow automatic generation and regeneration of lexer source files when the .lxs file changes. See the CMakeLists.txt if you are curious.
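To give an idea of what token-based highlighting with ANSI escape codes looks like, here is a minimal sketch. It is not the SyntaxHighlighter example itself: it assumes a single lexer built on Python's re module instead of several generated C++ lexers, and the token names, patterns and colours are invented for this illustration. As with the real example, anything that vaguely resembles XML is accepted, since no parsing or validation takes place.

```python
import re

RESET = "\033[0m"
COLOURS = {
    "COMMENT": "\033[90m",  # bright black (grey)
    "TAG": "\033[34m",      # blue
    "STRING": "\033[32m",   # green
}

# Hypothetical token rules, highest priority first.
RULES = [
    ("COMMENT", r"<!--.*?-->"),
    ("TAG", r"</?[A-Za-z][A-Za-z0-9]*|/?>"),
    ("STRING", r'"[^"]*"'),
    ("OTHER", r'[^<>"]+|.'),  # catch-all, so every character is consumed
]

# Alternation tries the groups from left to right, which gives the priority order.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES), re.DOTALL)

def highlight(text):
    """Wrap each recognised token in the ANSI colour assigned to its type."""
    out = []
    for match in MASTER.finditer(text):
        colour = COLOURS.get(match.lastgroup)
        lexeme = match.group()
        out.append(f"{colour}{lexeme}{RESET}" if colour else lexeme)
    return "".join(out)

print(highlight('<a href="x">hi</a><!-- note -->'))
```

Removing the escape codes from the output gives back the original text unchanged, because the catch-all rule guarantees that no input character is dropped.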

Tested with

OS                 Compiler         Boost version
Ubuntu 14.04       gcc 5.3.0        Boost 1.54
Ubuntu 14.04       clang 3.5.0      Boost 1.54
Ubuntu 15.10       gcc 5.2.1        Boost 1.58
Ubuntu 15.10       clang 3.6.2-1    Boost 1.58
Arch Linux         gcc 6.1.1        Boost 1.60
Arch Linux         clang 3.8.0      Boost 1.60
OS X El Capitan    clang 7.3.0      Boost 1.60

Authors

  • Thomas Avé
  • Robin Jadoul