Implementation of a complete compiler (1) Lexical analysis

GitHub address Source code of each stage Collection of descriptions of each stage

In order to translate a program from one language to another, the compiler must first take the various components of the program apart and figure out their structure and meaning, and then put the components together in another way. The front end of the compiler performs analysis, and the back end does synthesis.

Analysis is generally divided into three types: lexical analysis, syntactic analysis, and semantic analysis.

Lexical analysis is performed at this stage, with the purpose of decomposing the input file into independent lexical symbols, that is, words.

According to Hushu’s prompts, this stage is divided into three modules:

Error handling module (errormsg.c errormsg.h): used to generate error messages containing file names and line numbers 2. Lexical analysis module (lexical.lex token.h): lexical analysis through Lex 3. Common tool module (util.c util.h): defines some commonly used functions

Lexical analysis module and error handling module: The two communicate through the variables and functions declared in errormsg.h: the EM_tokPos variable passes the position of each word in characters; the EM_newline() function records the line number; EM_error() outputs error information.

Error handling module and common tool module: The error handling module uses the checked_malloc() function declared in util.h to allocate memory

Also included is the driver (driver.c) test file (test.c) makefile

**The following mainly introduces the most important lexical analysis module at this stage. **

tokens.h: defines lexical word constants and yylval

typedef union {
	int ival;
    	char cval;
    	double dval;
	string sval;
	} YYSTYPE;
extern YYSTYPE yylval;

The above code defines yylval, which is a collection representing different semantic values. The ival cval dval sval are used to store the semantic values of integers, characters, floating point numbers, strings, and words respectively.

# define ID 128
... ... ... ...

This section defines some constants that are used by lexical.lex to specify what type of words are being matched.

lexical.lex: Lex source file, which can generate a lexical analyzer through Lex

Lex is a generator that can convert regular expressions into a lexical analyzer. It generates a C program (lex.yy.c) from the lexical specification. The specification contains a regular expression and an action. This action passes the word type (possibly along with other information) to the next processing stage of the compiler.

%{
#include 
#include "util.h"
#include "tokens.h"
#include "errormsg.h"

int charPos=1; //Record the position of each word

int yywrap(void) //Lex function, returns 1 to stop parsing, can be used to parse multiple files
{
    charPos=1;
    return 1;
}

void adjust(void) //Calculate the word position and pass it to the error message module through EM_tokPos
{
    EM_tokPos=charPos;
    charPos+=yyleng;
}

%}

%%
[" ""\t"] {adjust(); continue;}
"\n" {adjust(); EM_newline(); continue;}
(\")([A-Za-z0-9])*(\") {adjust(); yylval.sval = yytext; return STRING_V;}
string {adjust(); return STRING;}
'[A-Za-z0-9]' {adjust(); yylval.cval = yytext[1]; return CHAR_V;}
char {adjust(); return CHAR;}
short {adjust(); EM_error(EM_tokPos, "The short type is not supported yet");}
-?[0-9]+ {adjust(); yylval.ival=atoi(yytext); return INT_V;}
int {adjust(); return INT;}
unsigned {adjust(); EM_error(EM_tokPos, "unsigned type is not supported yet");}
long {adjust(); EM_error(EM_tokPos, "long type is not supported yet");}
float {adjust(); EM_error(EM_tokPos, "float type is not supported yet");}
-?[0-9]+(\.[0-9]+)? {adjust(); yylval.dval = atof(yytext); return DOUBLE_V;}
do {adjust(); return DO;}
double {adjust(); return DOUBLE;}
struct {adjust(); return STRUCT;}
union {adjust(); return UNION;}
void {adjust(); return VOID;}
enum {adjust(); return ENUM;}
signed {adjust(); EM_error(EM_tokPos, "The signed type is not supported yet");}
conust {adjust(); return CONUST;}
volatile {adjust(); EM_error(EM_tokPos, "volatile is not supported yet");}
typedef {adjust(); return TYPEDEF;}
auto {adjust(); EM_error(EM_tokPos, "auto is not supported yet");}
register {adjust(); EM_error(EM_tokPos, "register is not supported yet");}
static {adjust(); return STATIC;}
extern {adjust(); return EXTERN;}
break {adjust(); return BREAK;}
case {adjust(); return CASE;}
continue {adjust(); return CONTINUE;}
default {adjust(); return DEFAULT;}
else {adjust(); return ELSE;}
for {adjust(); return FOR;}
goto {adjust(); return GOTO;}
if {adjust(); return IF;}
return {adjust(); return RETURN;}
switch {adjust(); return SWITCH;}
while {adjust(); return WHILE;}
sizeof {adjust(); return SIZEOF;}
[A-Za-z]+\[[0-9]+\] {adjust(); return ARRAY;}
[A-Za-z_]([A-Za-z0-9_])* {adjust(); yylval.sval = yytext; return ID;}
"," {adjust(); return COMMA;}
":" {adjust(); return COLON;}
";" {adjust(); return SEMICOLON;}
"(" {adjust(); return LPAREN;}
")" {adjust(); return RPAREN;}
"[" {adjust(); return LBRACK;}
"]" {adjust(); return RBRACK;}
"{" {adjust(); return LBRACE;}
"}" {adjust(); return RBRACE;}
"." {adjust(); return DOT;}
"+" {adjust(); return PLUS;}
"-" {adjust(); return MINUS;}
"*" {adjust(); return TIMES;}
"/" {adjust(); return DIVIDE;}
"!=" {adjust(); return NEQ;}
"==" {adjust(); return ASSIGN;}
"=" {adjust(); return EQ;}
"<=" {adjust(); return LE;}
"<" {adjust(); return LT;}
">=" {adjust(); return GE;}
">" {adjust(); return GT;}
"&" {adjust(); return AND;}
"|" {adjust(); return OR;}

The first part, the part between %{…%}, contains several includes and declarations used by the rest of the C code in the file.

The second part, that is, the part between %}…%%, contains the abbreviation of the regular expression and the status description. For example, you can write

digits [0-9]+

Then in the third part, {digits} can be used instead of [0-9]+.

The third part, the part after %%, contains regular expressions and actions. Each action returns a value of type int (constant defined in token.h), indicating which word is matched. There are two matching principles to eliminate ambiguity: Rule priority: For a specific longest initial substring, the first matching regular expression determines the word type of this substring; Longest match: After the regular expression is determined by rule priority, the substring takes the longest string that matches the regular expression.

Several variables: yytext is the string matched by the regular expression; yyleng is the length of the matched string; charPos tracks the position of each word and tells EM_tokPos.

Lexical analysis Done.