Lexesis is a language-agnostic lexical analyser generator: it takes a description of *tokens* in the form of regular expressions and outputs the source files for a lexer/scanner, which can then be used as part of a larger application. The generated code can be in any language for which a backend has been built (currently only C++).
Its principle is very similar to that of well-known tools such as [lex](https://en.wikipedia.org/wiki/Lex_(software)) or [flex](http://flex.sourceforge.net), with the difference that Lexesis has a simpler input format and does not require language-specific actions to be specified in the configuration file. It uses a programming-language-independent description of the tokens, so that the same lexical analyser specification can be reused across different programming languages.
This project came into existence as an application exercise for a course on regular languages and automata at the University of Antwerp.
instead. Afterwards, after calling `make install` (which now installs locally in the build folder, so you don't need `sudo`), build all the examples as well with
Here we have three different tokens: `CAPITAL`, `NUMBER` and `ALL`.
Note that token names should consist only of capital letters, lowercase letters and underscores; other characters are not recommended, so that the names work with as many backends as possible.
When we run `A` through the generated lexer, it will report a `CAPITAL` token, since that rule is specified above `ALL`.
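This tie-breaking rule (earlier rules win) can be sketched in plain C++. The snippet below is illustrative only: the generated lexer is a table-driven automaton, not a loop over `std::regex`, and the patterns used here for `CAPITAL`, `NUMBER` and `ALL` (`[A-Z]`, `[0-9]` and any single character) are assumptions made for the sake of the example.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Classify a single lexeme by trying each rule in the order it was
// specified; the first rule whose pattern matches the whole input wins.
// NOTE: rule patterns below are assumed, not taken from the Lexesis docs.
std::string classify(const std::string& lexeme) {
    static const std::vector<std::pair<std::string, std::regex>> rules = {
        {"CAPITAL", std::regex("[A-Z]")},
        {"NUMBER",  std::regex("[0-9]")},
        {"ALL",     std::regex(".")},   // any single character
    };
    for (const auto& rule : rules)      // earlier rules take priority
        if (std::regex_match(lexeme, rule.second))
            return rule.first;
    return "ERROR";                     // no rule matched
}
```

With these rules, `classify("A")` yields `CAPITAL` rather than `ALL`, because `CAPITAL` is listed first, exactly as described above.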
Most POSIX regular expression features have been implemented, with some notable exceptions:
* There is no way to match the beginning or ending of a line (`^` or `$`)
* Repetition using `{` and `}` is not (yet) supported
It should be noted that there are no escape characters inside character classes. A `-` that is part of the class should therefore be specified as the very first or very last element (and cannot be used as an endpoint of a range), and a `]` should be specified as the first element. A `^`, on the other hand, should not be used as the first element, unless it is meant as an inversion modifier for the character class.
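A few illustrative character classes, following the placement rules above (generic examples, not taken from a particular specification):

```
[abc-]   matches 'a', 'b', 'c' or a literal '-' (as last element)
[-abc]   same, with '-' as first element
[]ab]    matches ']', 'a' or 'b' (']' must come first)
[a^b]    matches 'a', '^' or 'b' ('^' is literal when not first)
[^ab]    inverted class: matches anything except 'a' or 'b'
```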
When needed (for example at the beginning of a rule, where leading whitespace is stripped when the input rules are read), a space can be specified as `\s` for convenience. Alternatively, `[ ]`, a character class containing only a space, can be used.
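For instance, a rule whose pattern must begin with a literal space could be written as follows (the token name `INDENTED` and the `NAME = pattern` rule layout are assumptions for illustration; see the example rules earlier for the exact syntax):

```
INDENTED = \s[a-z]+
```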
### Using the lexer
Of course, how you use the generated lexer depends largely on which backend generated it. For the default C++ backend, however, the easiest way to get to know the lexer is probably to have a look at the generated header file, usually named *<Lexername>.h*.
In general, there should be some way the tokens are defined, and there should be some way to generate a list of tokens (or to get each token separately).
The *keywords* example simply prints each lexed token type along with its content, while the *SyntaxHighlighter* example is a bit more complete: it uses multiple lexers at once and provides a simple form of syntax highlighting for XML in the terminal using ANSI escape codes. Be aware, however, that it also accepts anything that even remotely looks like XML, since the regular languages underlying the lexers cannot verify or parse XML. Additionally, the SyntaxHighlighter example contains a couple of CMake rules that automatically regenerate the lexer source files when the *.lxs* file changes. See the [CMakeLists.txt](examples/SyntaxHighlighter/src/CMakeLists.txt) if you are curious.