Phases:
Lexical Analysis- Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A token can be a keyword, an identifier, a constant, an operator, or a separator.
Syntax Analysis- The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical analysis as input and generates a parse tree (or syntax tree).
Semantic Analysis- Semantic analysis checks whether the parse tree follows the rules of the language: for example, that values are assigned between compatible data types (adding a string to an integer would be flagged). The semantic analyzer also keeps track of identifiers, their types, and expressions, and checks whether identifiers are declared before use. Its output is an annotated syntax tree.
Intermediate Code Generation- After semantic analysis, the compiler generates an intermediate code from the source code: a program for an abstract machine, in between the high-level language and the target machine language.
Code Optimization- The next phase optimizes the intermediate code. Optimization removes unnecessary code lines and rearranges the sequence of statements to speed up program execution without wasting resources (CPU, memory).
Code Generation- In this phase, the code generator takes the optimized representation of the intermediate code and maps it to the target machine language.
Symbol Table- A data structure maintained throughout all the phases of a compiler. All identifier names, along with their types, are stored here.
Lexical Analysis:
- The basic element recognized by the compiler is the “token.” A token is source-program text that the compiler does not break down into component elements.
- There are six types of C tokens: identifiers, keywords, constants, operators, string literals, and separators (punctuators).
- Token syntax: <token-name, attribute-value>
- The lexer removes any whitespace and comments from the source code.
- Keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can all be considered tokens.
- Lexical analysis for a modern computer language is typically specified with regular expressions and implemented with finite automata.
https://www.tutorialspoint.com/compiler_design/images/compiler_phases.jpg
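To make the <token-name, attribute-value> pairs concrete, here is a minimal hand-written lexer sketch in C; the sample input string and the single "int" keyword check are illustrative assumptions:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *p = "int count = 42 + rate;";   /* assumed sample input */
    char lexeme[64];
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* strip whitespace */
        int n = 0;
        if (isalpha((unsigned char)*p) || *p == '_') {       /* identifier or keyword */
            while (isalnum((unsigned char)*p) || *p == '_') lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("<%s, %s>\n", strcmp(lexeme, "int") == 0 ? "keyword" : "id", lexeme);
        } else if (isdigit((unsigned char)*p)) {             /* integer constant */
            while (isdigit((unsigned char)*p)) lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("<const, %s>\n", lexeme);
        } else {                                             /* operator or separator */
            printf("<op, %c>\n", *p++);
        }
    }
    return 0;
}

Running it prints the token stream <keyword, int>, <id, count>, <op, =>, <const, 42>, <op, +>, <id, rate>, <op, ;>.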
Regular expression:
Precedence and associativity: * has the highest precedence, then concatenation (.), then | (pipe); concatenation and | are left-associative.
x* means zero or more occurrences of x, i.e., it can generate { ε, x, xx, xxx, xxxx, … }.
x+ means one or more occurrences of x, i.e., it can generate { x, xx, xxx, xxxx, … }; equivalent to x.x*.
x? means at most one occurrence of x, i.e., it can generate either {x} or {ε}.
[a-z] matches any lowercase letter of the English alphabet.
[A-Z] matches any uppercase letter of the English alphabet.
[0-9] matches any decimal digit.
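Putting these shorthands together, the identifier pattern of C-like languages can be written as [A-Za-z_][A-Za-z0-9_]*. A small sketch using the POSIX <regex.h> API (the sample strings are assumptions):

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* letter or underscore, then zero or more letters, digits, or underscores */
    regcomp(&re, "^[A-Za-z_][A-Za-z0-9_]*$", REG_EXTENDED);
    const char *samples[] = { "count", "_tmp1", "9lives" };
    for (int i = 0; i < 3; i++)
        printf("%-7s -> %s\n", samples[i],
               regexec(&re, samples[i], 0, NULL, 0) == 0 ? "identifier" : "rejected");
    regfree(&re);
    return 0;
}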
Finite Automata:
A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly. Finite automata act as recognizers for regular expressions: every regular expression can be converted into an equivalent automaton.
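For instance, the identifier regular expression above corresponds to a two-state DFA; a sketch in C (state names and test strings are assumptions):

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* DFA for [A-Za-z_][A-Za-z0-9_]*: START --letter/_--> IN_ID --letter/digit/_--> IN_ID */
enum state { START, IN_ID, REJECT };

bool accepts(const char *s) {
    enum state st = START;
    for (; *s; s++) {
        if (st == START)
            st = (isalpha((unsigned char)*s) || *s == '_') ? IN_ID : REJECT;
        else if (st == IN_ID)
            st = (isalnum((unsigned char)*s) || *s == '_') ? IN_ID : REJECT;
        else
            return false;            /* dead state: no way back */
    }
    return st == IN_ID;              /* IN_ID is the only accepting state */
}

int main(void) {
    printf("%d %d\n", accepts("x1"), accepts("1x"));   /* prints 1 0 */
    return 0;
}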
Syntax analysis:
The lexical analyzer can identify tokens with the help of regular expressions and pattern rules, but it cannot check the syntax of a given sentence, due to the limitations of regular expressions. For example, regular expressions cannot check for balanced tokens such as parentheses.
This phase uses context-free grammar (CFG), which is
recognized by push-down automata.
CFG is a superset of regular grammar, which means every regular grammar is also context-free.
Context-Free Grammar:
A context-free grammar has four components:
1. A set of non-terminals (V). Non-terminals are syntactic
variables that denote sets of strings.
The non-terminals define sets of strings that help define the
language generated by the grammar.
2. A set of tokens, known as terminal symbols (Σ). Terminals
are the basic symbols from which strings are formed.
3. A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
4. One of the non-terminals is designated as the start symbol (S), from which the derivation begins.
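For instance, for the expression grammar used in the derivation example below, V = {E}, Σ = {id, +, *}, the productions P are E → E + E, E → E * E, and E → id, and the start symbol S is E.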
Syntax Analyzers:
A syntax analyzer or parser takes the input from a lexical analyzer in the form of a token stream. The parser analyzes the source code (token stream) against the production rules to detect any errors in the code. The output of this phase is a parse tree.
Derivation- A derivation is a sequence of production-rule applications that derives the input string from the start symbol. A derivation can be leftmost or rightmost, depending on which non-terminal is rewritten at each step.
Example:
Production rules:
E → E + E
E → E * E
E → id
Input string: id + id * id
- Leftmost derivation (the leftmost non-terminal is rewritten at each step):
E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
- Rightmost derivation (the rightmost non-terminal is rewritten at each step):
E ⇒ E + E
⇒ E + E * E
⇒ E + E * id
⇒ E + id * id
⇒ id + id * id
A parse tree is a graphical depiction of a
derivation.
All leaf nodes are terminals.
All interior nodes are non-terminals.
An in-order traversal gives the original input string.
https://www.tutorialspoint.com/compiler_design/images/parse_tree_step_5.jpg
Ambiguity: A grammar G is said to be ambiguous if it has more than one parse tree (equivalently, more than one leftmost or rightmost derivation) for at least one string.
Associativity: If an operand has operators on both sides, the
side on which the operator takes this operand is decided by the associativity
of those operators.
Precedence: If two different operators share a common
operand, the precedence of operators decides which will take the operand.
Types of Parsing- Syntax analyzers follow production rules defined by means of a context-free grammar. The way the production rules are applied (the derivation order) divides parsing into two types: top-down parsing and bottom-up parsing.
Top Down parsing:
https://www.tutorialspoint.com/compiler_design/images/top_down_parsing.jpg
Recursive descent, with backtracking or without backtracking -> predictive parser, LL parsing (see the sketch below).
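A minimal recursive-descent (predictive) parser sketch in C, assuming the expression grammar is first rewritten without left recursion as E -> T E', E' -> + T E' | ε, T -> F T', T' -> * F T' | ε, F -> id (with id taken to be a single letter):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *p;                      /* cursor into the input */

static void fail(void) { printf("syntax error at '%c'\n", *p); exit(1); }

static void F(void) {                      /* F -> id (a single letter here) */
    if (isalpha((unsigned char)*p)) p++; else fail();
}
static void Tp(void) { if (*p == '*') { p++; F(); Tp(); } }   /* T' -> * F T' | ε */
static void T(void)  { F(); Tp(); }                           /* T  -> F T'       */
static void Ep(void) { if (*p == '+') { p++; T(); Ep(); } }   /* E' -> + T E' | ε */
static void E(void)  { T(); Ep(); }                           /* E  -> T E'       */

int main(void) {
    p = "a+b*c";                           /* assumed sample input */
    E();
    puts(*p == '\0' ? "accepted" : "rejected (trailing input)");
    return 0;
}

Because each procedure decides what to do by looking only at the current token, no backtracking is ever needed; this is what makes the parser predictive.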
Bottom up parsing:
Shift-reduce parsing -> LR parsing -> SLR, canonical LR, and LALR parsers.
Compiler error types -
Lexical: name of some identifier typed incorrectly
Syntactic: missing semicolon or unbalanced parentheses
Semantic: incompatible value assignment
Logical: code not reachable, infinite loop
Panic mode - on an error, the parser discards input symbols until a synchronizing token (such as a semicolon) is found, so that it does not go into an infinite loop.
Statement mode - on an error, the parser tries a local correction (for example, inserting a missing semicolon) so it can continue parsing the statement.
Error productions - Some common errors that may occur in the code are known to compiler designers. Designers can create an augmented grammar with extra productions that generate these erroneous constructs, so the errors are recognized when encountered.
Abstract Syntax Trees - the parse tree is reduced to an abstract syntax tree. A full parse tree carries more information than the compiler needs, which makes it slow to process.
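As an illustration, a minimal AST node type for arithmetic expressions might look like this in C (the node kinds and helper functions are assumptions for a tiny expression language):

#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    enum { NUM, ADD, MUL } kind;
    int value;                      /* meaningful only when kind == NUM */
    struct Node *left, *right;      /* children for ADD and MUL */
} Node;

static Node *num(int v) {
    Node *n = calloc(1, sizeof *n);
    n->kind = NUM; n->value = v;
    return n;
}
static Node *binop(int kind, Node *l, Node *r) {
    Node *n = calloc(1, sizeof *n);
    n->kind = kind; n->left = l; n->right = r;
    return n;
}
static int eval(const Node *n) {    /* fold the tree bottom-up */
    if (n->kind == NUM) return n->value;
    int l = eval(n->left), r = eval(n->right);
    return n->kind == ADD ? l + r : l * r;
}

int main(void) {
    /* AST for 2 + 3 * 4; operator precedence is explicit in the tree shape */
    Node *t = binop(ADD, num(2), binop(MUL, num(3), num(4)));
    printf("%d\n", eval(t));        /* prints 14 */
    return 0;
}

Note that, unlike a parse tree, the AST has no nodes for intermediate non-terminals or punctuation; only the operators and operands remain.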
Semantic Analyzer:
The semantics of a language provide meaning to its constructs, like tokens and syntax structure. Semantics help interpret symbols, their types, and their relations with each other. Semantic analysis judges whether the syntax structure constructed in the source program derives any meaning or not.
CFG + semantic rules = Syntax Directed Definitions
E.g.: int x = "String"; // the syntax tree is fine, but the assignment is semantically wrong (type mismatch).
Tasks: Scope resolution, Type checking, Array-bound checking
Errors: type mismatch, undeclared variable, reserved identifier misuse, multiple declarations of a variable in a scope, accessing an out-of-scope variable, actual vs. formal parameter mismatch. Attribute grammars are used here; for example:
E → E + T { E.value = E.value + T.value }
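A sketch of how such synthesized value attributes can be computed during parsing, folded into a recursive-descent evaluator in C (single-digit operands are an assumption that keeps the lexing trivial):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *p;

static int F(void) {                          /* F -> digit */
    if (!isdigit((unsigned char)*p)) { printf("error at '%c'\n", *p); exit(1); }
    return *p++ - '0';
}
static int T(void) {                          /* T -> T * F | F */
    int v = F();
    while (*p == '*') { p++; v = v * F(); }   /* T.value = T.value * F.value */
    return v;
}
static int E(void) {                          /* E -> E + T | T */
    int v = T();
    while (*p == '+') { p++; v = v + T(); }   /* E.value = E.value + T.value */
    return v;
}

int main(void) {
    p = "2+3*4";
    printf("value = %d\n", E());              /* prints 14 */
    return 0;
}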
Symbol Table:
The symbol table is an important data structure created and maintained by compilers in order to store information about the occurrence of various entities such as variable names, function names, objects, classes, interfaces, etc. It is used by both the analysis and the synthesis parts of a compiler.
Purpose:
- To store the names of all entities in a structured form in one place.
- To verify if a variable has been declared.
- To implement type checking, by verifying that assignments and expressions in the source code are semantically correct.
- To determine the scope of a name (scope resolution).
It maintains an entry for each name in the format <symbol name, type, attribute>.
Example: static int interest; => <interest, int, static>
Symbol table implementation:
Linear (sorted or unsorted) list
Binary Search Tree
Hash table
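A chained hash-table sketch in C for the <symbol name, type, attribute> entries above (the hash function and the fixed bucket count are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

struct symbol {
    char name[32], type[16], attr[16];
    struct symbol *next;                      /* chain for hash collisions */
};

static struct symbol *table[NBUCKETS];

static unsigned hash(const char *s) {         /* simple multiplicative hash */
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

static void insert(const char *name, const char *type, const char *attr) {
    struct symbol *sym = calloc(1, sizeof *sym);
    snprintf(sym->name, sizeof sym->name, "%s", name);
    snprintf(sym->type, sizeof sym->type, "%s", type);
    snprintf(sym->attr, sizeof sym->attr, "%s", attr);
    sym->next = table[hash(name)];
    table[hash(name)] = sym;
}

static struct symbol *lookup(const char *name) {
    for (struct symbol *s = table[hash(name)]; s; s = s->next)
        if (strcmp(s->name, name) == 0) return s;
    return NULL;                              /* not declared */
}

int main(void) {
    insert("interest", "int", "static");      /* static int interest; */
    struct symbol *s = lookup("interest");
    if (s) printf("<%s, %s, %s>\n", s->name, s->type, s->attr);
    return 0;
}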
Intermediate code generator:
Need: without an intermediate representation, a full native compiler is required for each new machine; with one, only the synthesis (back-end) part has to be changed according to the target machine. Optimization can also be done on the intermediate code.
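For example, a three-address-code form of the assignment a = b + c * d (three-address code is one of the IR types listed under code generation below) might be:
t1 = c * d
t2 = b + t1
a = t2
Each instruction has at most one operator on its right-hand side, which keeps later optimization and code generation simple.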
Code generation:
It should carry the exact meaning of the source code, and it should be efficient in terms of CPU usage and memory management.
Things taken into consideration when generating code:
- Target language: some instructions are machine-specific.
- IR type: abstract syntax tree (AST) structure, reverse Polish notation, or three-address code.
- Instruction selection.
- Register allocation and ordering of instructions (see the illustration below).
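For instance, the three-address code for a = b + c * d above might map to a hypothetical load/store register machine as follows (the mnemonics and register names are illustrative assumptions, not a real instruction set):
LD  R1, c        ; load c into register R1
LD  R2, d        ; load d into register R2
MUL R1, R1, R2   ; R1 = c * d        (t1)
LD  R2, b        ; R2 can be reused once d is dead
ADD R1, R2, R1   ; R1 = b + (c * d)  (t2)
ST  a, R1        ; store the result into a
Note how register allocation reuses R2 and how the instruction ordering follows the dependency order of the three-address code.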
Code Optimization:
- It must not change the meaning of the program.
- It must increase performance.
- The optimization itself should be fast (it should not slow down compilation unduly).
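For example, constant folding at the three-address-code level (the snippet is an assumed input):
Before:
t1 = 4 * 2
t2 = a + t1
After constant folding:
t1 = 8
t2 = a + t1
If t1 has no other uses, constant propagation and dead-code elimination can further reduce this to the single instruction t2 = a + 8, without changing the program's meaning.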