Sunday, March 12, 2017

Compiler Design

Phases:

Lexical Analysis- Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A token can be a keyword, an identifier, a constant, an operator, or a punctuation symbol.

Syntax Analysis- The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical analysis as input and generates a parse tree (or syntax tree).

Semantic Analysis- Semantic analysis checks whether the parse tree constructed follows the rules of the language.
For example, it checks that values are assigned between compatible data types, flagging errors such as adding a string to an integer. The semantic analyzer also keeps track of identifiers, their types, and expressions, and whether identifiers are declared before use. Its output is an annotated syntax tree.
Intermediate Code Generation- After semantic analysis, the compiler generates an intermediate representation of the source code, which is later translated into code for the target machine.

Code Optimization- The next phase optimizes the intermediate code. Optimization can be thought of as removing unnecessary code lines and rearranging the sequence of statements to speed up program execution without wasting resources (CPU, memory).

Code Generation - In this phase, the code generator takes the optimized representation of the intermediate code and maps it to the target machine language.

Symbol Table - A data structure maintained throughout all the phases of a compiler. All the identifiers' names, along with their types, are stored here.

Lexical Analysis:

- The basic element recognized by the compiler is the “token.” A token is source-program text that the compiler does not break down into component elements.

- There are 6 types of C tokens: keywords, identifiers, constants, string literals, operators, and separators.
Syntax - <token-name, attribute-value>

The lexer also removes any whitespace and comments from the source code.
Keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can all be considered tokens.
Lexical analysis for a modern computer language is typically implemented with finite automata.
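A minimal tokenizer sketch in Python (the token categories and regular expressions here are illustrative, not a complete C lexer):

import re

# Illustrative token specification: (token-name, regular-expression pattern).
# Order matters: the keyword rule must be tried before the identifier rule.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|if|else|while|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"\d+"),
    ("OPERATOR",   r"[+\-*/=<>]"),
    ("SEPARATOR",  r"[();{},]"),
    ("SKIP",       r"\s+"),   # whitespace is removed, not emitted as a token
]

def tokenize(source):
    """Yield <token-name, attribute-value> pairs for the input string."""
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            match = re.match(pattern, source[pos:])
            if match:
                if name != "SKIP":
                    yield (name, match.group())
                pos += match.end()
                break
        else:
            raise SyntaxError("lexical error at position %d" % pos)

print(list(tokenize("int count = 42;")))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'count'), ('OPERATOR', '='),
#  ('NUMBER', '42'), ('SEPARATOR', ';')]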

 https://www.tutorialspoint.com/compiler_design/images/compiler_phases.jpg

Regular expression:

Precedence and associativity: *, concatenation (.), and | (pipe) are left associative; * has the highest precedence, followed by concatenation, then |.

x* means zero or more occurrences of x,
====> i.e., it can generate { ε, x, xx, xxx, xxxx, … }
x+ means one or more occurrences of x,
====> i.e., it can generate { x, xx, xxx, xxxx, … }; equivalently, x+ = x.x*
x? means at most one occurrence of x,
====> i.e., it can generate either {x} or {ε}.
[a-z] matches all lower-case letters of the English alphabet.
[A-Z] matches all upper-case letters of the English alphabet.
[0-9] matches all decimal digits.
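A quick check of these operators with Python's re module (re.fullmatch accepts a string only if the whole string matches the pattern):

import re

# x* : zero or more, x+ : one or more, x? : at most one occurrence of x
for pattern in ("x*", "x+", "x?"):
    accepted = [s for s in ("", "x", "xx", "xxx") if re.fullmatch(pattern, s)]
    print(pattern, "accepts:", accepted)
# x* accepts: ['', 'x', 'xx', 'xxx']
# x+ accepts: ['x', 'xx', 'xxx']
# x? accepts: ['', 'x']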

Finite Automata:

A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly. Finite automata serve as recognizers for regular expressions.
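A minimal DFA sketch as a transition table; the automaton below recognizes the textbook regular expression (a|b)*abb:

# DFA for (a|b)*abb: states 0-3, accepting state 3.
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START, ACCEPTING = 0, {3}

def accepts(string):
    """Run the DFA over the input; accept if it ends in an accepting state."""
    state = START
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:          # symbol outside the alphabet {a, b}
            return False
    return state in ACCEPTING

print(accepts("aabb"))   # True  -- ends with 'abb'
print(accepts("abab"))   # False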

Syntax analysis:

The lexical analyzer can identify tokens with the help of regular expressions and pattern rules,
but it cannot check the syntax of a given sentence, due to the limitations of regular expressions.
E.g., regular expressions cannot check for balanced tokens, such as parentheses.

This phase uses context-free grammar (CFG), which is recognized by push-down automata.
CFG is a superset of regular grammar, which means every regular grammar is also context-free.

Context-Free Grammar:

A context-free grammar has four components:

1. A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings.
The non-terminals define sets of strings that help define the language generated by the grammar.

2. A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which strings are formed.

3. A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings.

Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.

4. One of the non-terminals is designated as the start symbol (S), from where the derivation begins.
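The four components can be written down directly as data; a sketch using the expression grammar from the derivation example below:

# V: non-terminals, Σ: terminals, P: productions, S: start symbol
non_terminals = {"E"}
terminals     = {"id", "+", "*"}
productions   = [
    ("E", ["E", "+", "E"]),
    ("E", ["E", "*", "E"]),
    ("E", ["id"]),
]
start_symbol  = "E"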

Syntax Analyzers:

A syntax analyzer or parser takes the input from a lexical analyzer in the form of a token stream. The parser analyzes the source code (token stream) against the production rules to detect any errors in the code. The output of this phase is a parse tree.

Derivation- A derivation is a sequence of production rule applications used to get the input string.

Leftmost and rightmost derivation:

Example:

Production rules:
E → E + E
E → E * E
E → id
Input string: id + id * id

- Leftmost derivation:
E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
- Rightmost derivation:
E ⇒ E + E
⇒ E + E * E
⇒ E + E * id
⇒ E + id * id
⇒ id + id * id
A parse tree is a graphical depiction of a derivation.
All leaf nodes are terminals.
All interior nodes are non-terminals.
An in-order traversal gives the original input string.

          https://www.tutorialspoint.com/compiler_design/images/parse_tree_step_5.jpg

Ambiguity: A grammar G is said to be ambiguous if it has more than one parse tree (left or right derivation) for at least one string.
Associativity: If an operand has operators on both sides, the side on which the operator takes this operand is decided by the associativity of those operators.
Precedence: If two different operators share a common operand, the precedence of operators decides which will take the operand.

Types of Parsing- Syntax analyzers follow production rules defined by means of a context-free grammar. The way the production rules are implemented (derivation) divides parsing into two types: top-down parsing and bottom-up parsing.

Top-down parsing:
 https://www.tutorialspoint.com/compiler_design/images/top_down_parsing.jpg

Recursive descent, with backtracking or without backtracking -> predictive parser, LL parsing.
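A minimal recursive-descent (predictive) parser sketch. Note the grammar E → E + E above is left-recursive and would loop forever in a naive top-down parser, so this sketch assumes an equivalent grammar with the left recursion removed:

# Grammar with left recursion removed:
#   E -> T ("+" T)*      T -> F ("*" F)*      F -> "id"
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(token):
        nonlocal pos
        if peek() != token:
            raise SyntaxError("expected %r, got %r" % (token, peek()))
        pos += 1

    def E():                      # E -> T ("+" T)*
        node = T()
        while peek() == "+":
            expect("+")
            node = ("+", node, T())
        return node

    def T():                      # T -> F ("*" F)*
        node = F()
        while peek() == "*":
            expect("*")
            node = ("*", node, F())
        return node

    def F():                      # F -> "id"
        expect("id")
        return "id"

    tree = E()
    if peek() is not None:
        raise SyntaxError("unexpected trailing input")
    return tree

print(parse(["id", "+", "id", "*", "id"]))
# ('+', 'id', ('*', 'id', 'id'))  -- * binds tighter than +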

Bottom-up parsing:

Shift-reduce parsing -> LR -> SLR, canonical LR, and LALR parsers.
Compiler error types -
Lexical: the name of some identifier typed incorrectly
Syntactical: a missing semicolon or unbalanced parentheses
Semantical: an incompatible value assignment
Logical: unreachable code, an infinite loop
Error recovery strategies:
Panic mode - on an error, the parser ignores input until a synchronizing token is found, so that it does not go into an infinite loop.
Statement mode - the parser tries to correct the erroneous statement so that compilation can continue, e.g. by inserting a missing semicolon.
Error productions - some common errors that may occur in the code are known to compiler designers. Designers can create an augmented grammar with productions that generate the erroneous constructs, so these errors are recognized when encountered.
Abstract Syntax Trees - the parse tree is condensed into an abstract syntax tree. A parse tree carries more information than the compiler needs, which makes it slow to process.
Semantic Analyzer:
Semantics of a language provides meaning to its constructs, like tokens and syntax structure. Semantics help interpret symbols, their types, and their relations with each other. Semantic analysis judges whether the syntax structure constructed in the source program derives any meaning or not.
CFG + semantic rules = Syntax Directed Definitions
E.g.: int x = "String"; // the syntax tree is fine, but the assignment is semantically wrong.
Tasks: scope resolution, type checking, array-bound checking
Errors: type mismatch, undeclared variable, reserved identifier misuse, multiple declaration of a variable in a scope, accessing an out-of-scope variable, actual/formal parameter mismatch. Attribute grammars are used here, e.g.:
E → E + T { E.value = E.value + T.value }
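A sketch of evaluating such a synthesized value attribute bottom-up, over the tuple-shaped trees produced by the parser sketch above (the rule mirrors E.value = E.value + T.value):

def value(node, env):
    """Compute the synthesized 'value' attribute by a bottom-up walk."""
    if node == "id":                        # leaf: look up the identifier
        return env["id"]
    op, left, right = node
    lv, rv = value(left, env), value(right, env)
    return lv + rv if op == "+" else lv * rv

tree = ("+", "id", ("*", "id", "id"))       # id + id * id
print(value(tree, {"id": 2}))               # 2 + 2 * 2 = 6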

Symbol Table:
Symbol table is an important data structure created and maintained by compilers in order to store information about the occurrence of various entities such as variable names, function names, objects, classes, interfaces, etc.
Symbol table is used by both the analysis and the synthesis parts of a compiler.
Purpose: To store the names of all entities in a structured form at one place.
To verify if a variable has been declared.
To implement type checking, by verifying assignments and expressions in the source code are semantically correct.
To determine the scope of a name (scope resolution)
It maintains an entry for each name:
<symbol name, type, attribute>
Example: static int interest; => <interest, int, static>
Symbol table implementation:  
Linear (sorted or unsorted) list
Binary Search Tree
Hash table
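A hash-table-based sketch with nested scopes (Python dicts are hash tables; the interface below is illustrative):

class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                   # the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, type_, attribute=None):
        scope = self.scopes[-1]
        if name in scope:                    # multiple declaration in a scope
            raise NameError("'%s' already declared in this scope" % name)
        scope[name] = (type_, attribute)

    def lookup(self, name):
        for scope in reversed(self.scopes):  # innermost scope wins
            if name in scope:
                return scope[name]
        raise NameError("'%s' is undeclared" % name)   # undeclared variable

table = SymbolTable()
table.declare("interest", "int", "static")   # static int interest;
print(table.lookup("interest"))              # ('int', 'static')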

Intermediate code generator:
Need: without an intermediate representation, a full native compiler would be required for each new machine; with one, only the synthesis part (the back end) changes according to the target machine.
Optimization can be done on intermediate code.
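A worked example, assuming 3-address code as the IR: the statement a = b + c * d could be translated to

t1 = c * d
t2 = b + t1
a  = t2

where t1 and t2 are compiler-generated temporaries; the same IR can then be optimized and retargeted to different machines.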

Code generation:
It should carry the exact meaning of the source code.
It should be efficient in terms of CPU usage and memory management.
Things taken into consideration when generating code -
Target language: some machine-specific instructions.
IR type: Abstract Syntax Tree (AST) structure, Reverse Polish Notation, or 3-address code.
Instruction selection
Register allocation and ordering of instructions
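A sketch of what the code generator might emit for the three-address code example above, using illustrative register-machine mnemonics (not a real instruction set):

LOAD  R1, c      ; R1 = c
LOAD  R2, d      ; R2 = d
MUL   R1, R2     ; R1 = c * d       (t1)
LOAD  R2, b      ; R2 = b
ADD   R2, R1     ; R2 = b + c * d   (t2)
STORE R2, a      ; a = t2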
Code Optimization:
It must not change the meaning of the program.
It should increase the performance.

Optimization itself should be fast.
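A minimal sketch of one such optimization, constant folding, on the tuple-shaped expression trees used above; subtrees whose operands are all constants are evaluated at compile time, which speeds up the program without changing its meaning:

def fold(node):
    """Constant folding: evaluate all-constant subtrees at compile time."""
    if not isinstance(node, tuple):
        return node                          # leaf: a constant or a name
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return left + right if op == "+" else left * right
    return (op, left, right)

print(fold(("+", "x", ("*", 2, 3))))         # ('+', 'x', 6)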
