Lexesis is a language-agnostic lexical analyser generator: it takes a description of *tokens* in the form of regular expressions and outputs the source files for a lexer/scanner, which can then be used as part of a larger application. The generated code can be in any language for which a backend has been built (currently only C++).
Its principle is very similar to that of well-known tools such as [lex](https://en.wikipedia.org/wiki/Lex_(software)) or [flex](http://flex.sourceforge.net), with the difference that Lexesis has a simpler input format and does not require language-specific actions to be specified in the configuration file. It uses a programming-language-independent description of the tokens, so that the same lexical analyser specification can be reused across different programming languages.
This project came into existence as an application exercise for a course on regular languages and automata at the University of Antwerp.
instead. Afterwards, after calling `make install` (which now installs locally in the build folder, so you don't need `sudo`), build all the examples as well with
Here we have three different tokens: `CAPITAL`, `NUMBER` and `ALL`.
Note that token names should consist only of capital letters, lowercase letters and underscores; other characters are not recommended, so that the names work with as many backends as possible.
When we run `A` through the generated lexer, it will report a `CAPITAL` token, since that rule is specified above `ALL`.
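This tie-breaking rule (earlier rules win) can be sketched in plain C++. The snippet below is illustrative only: the generated lexer is a table-driven automaton, not a loop over `std::regex`, and the patterns used here for `CAPITAL`, `NUMBER` and `ALL` (`[A-Z]`, `[0-9]` and any single character) are assumptions made for the sake of the example.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Classify a single lexeme by trying each rule in the order it was
// specified; the first rule whose pattern matches the whole input wins.
// NOTE: rule patterns below are assumed, not taken from the Lexesis docs.
std::string classify(const std::string& lexeme) {
    static const std::vector<std::pair<std::string, std::regex>> rules = {
        {"CAPITAL", std::regex("[A-Z]")},
        {"NUMBER",  std::regex("[0-9]")},
        {"ALL",     std::regex(".")},   // any single character
    };
    for (const auto& rule : rules)      // earlier rules take priority
        if (std::regex_match(lexeme, rule.second))
            return rule.first;
    return "ERROR";                     // no rule matched
}
```

With these rules, `classify("A")` yields `CAPITAL` rather than `ALL`, because `CAPITAL` is listed first, exactly as described above.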
Most POSIX regular expression features have been implemented, with some notable exceptions:
* There is no way to match the beginning or ending of a line (`^` or `$`)
* Repetition using `{` and `}` is not (yet) supported
It should be noted that there are no escape characters inside character classes. A `-` that is part of the class should therefore be specified as the very first or very last element (and cannot be used as an endpoint of a range), and a `]` should be specified as the first element. A `^`, on the other hand, should not be used as the first element, unless it is meant as an inversion modifier for the character class.
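A few illustrative character classes, following the placement rules above (generic examples, not taken from a particular specification):

```
[abc-]   matches 'a', 'b', 'c' or a literal '-' (as last element)
[-abc]   same, with '-' as first element
[]ab]    matches ']', 'a' or 'b' (']' must come first)
[a^b]    matches 'a', '^' or 'b' ('^' is literal when not first)
[^ab]    inverted class: matches anything except 'a' or 'b'
```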
When needed (for example at the beginning of a rule, where leading whitespace is stripped when the input rules are read), a space can be specified as `\s` for convenience. Alternatively, `[ ]`, a character class containing only a space, can be used.
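For instance, a rule whose pattern must begin with a literal space could be written as follows (the token name `INDENTED` and the `NAME = pattern` rule layout are assumptions for illustration; see the example rules earlier for the exact syntax):

```
INDENTED = \s[a-z]+
```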
### Using the lexer
Of course, how you use the generated lexer depends largely on which backend generated it. For the default C++ backend, however, the easiest way to get to know the lexer is probably to have a look at the generated header file, usually named *<Lexername>.h*.
In general, there should be some way the tokens are defined, and there should be some way to generate a list of tokens (or to get each token separately).
The *keywords* example simply prints each lexed token type along with its content, while the *SyntaxHighlighter* example is a bit more complete: it uses multiple lexers at once and provides a simple form of syntax highlighting for XML in the terminal using ANSI escape codes. Be aware, however, that it also accepts anything that even remotely looks like XML, since the regular languages underlying the lexers cannot verify or parse XML. Additionally, the SyntaxHighlighter example contains a couple of CMake rules that automatically regenerate the lexer source files when the *.lxs* file changes. See the [CMakeLists.txt](examples/SyntaxHighlighter/src/CMakeLists.txt) if you are curious.