Introduction of Lexical Analysis

Last Updated : 24 Apr, 2026

Lexical analysis, also known as scanning, is the first phase of a compiler. In this phase, the compiler reads the source code character by character from left to right and groups them into meaningful units called tokens. These tokens are then passed to the next phase of compilation, known as syntax analysis.

Lexical Analysis

**Token**

A token is a category of lexical unit, such as a keyword or an identifier, that represents a basic unit of meaning in a programming language. Token patterns are defined by the lexical rules of the language, and the parser uses the resulting tokens to understand program structure.

Categories of Tokens

Keywords

Reserved words that have predefined meanings in a programming language.

Identifiers

Identifiers are names given to variables, functions, arrays, or other user-defined elements.

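**Rules:** In C-like languages, an identifier typically begins with a letter or an underscore, continues with letters, digits, or underscores, and must not be a keyword.

**Examples:** count, _sum, totalValue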

Constants

Fixed values that do not change during program execution.

**Examples:** 10, 0, 273

Operators

Symbols that perform operations on operands.

**Examples:** =, +, -, >=

Special Symbols

Punctuation symbols used to structure programs.

**Examples:** ( ) { } , ;

**Lexeme**

A lexeme is the actual sequence of characters in the source code that matches the pattern of a token.

**Example:** for the statement float abs_zero_Kelvin = 273; the lexemes and their tokens are:

float → KEYWORD

abs_zero_Kelvin → IDENTIFIER

= → OPERATOR

273 → INTEGER

; → SEMICOLON

**Lexemes and Tokens Representation**

| Lexeme | Token | Lexeme (continued) | Token (continued) |
|---|---|---|---|
| while | WHILE | a | IDENTIFIER |
| ( | LPAREN | = | ASSIGNMENT |
| a | IDENTIFIER | a | IDENTIFIER |
| >= | COMPARISON | - | ARITHMETIC |
| b | IDENTIFIER | 2 | INTEGER |
| ) | RPAREN | ; | SEMICOLON |
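
The lexeme-to-token mapping in the table can be reproduced with a small regex-based scanner. The following is a minimal Python sketch, not a definitive implementation: the token names come from the table, while the patterns, the `tokenize` helper, and the input statement (read off the table's lexemes in order) are assumptions made for illustration.

```python
import re

# Token names follow the table; the patterns themselves are assumptions.
# Order matters: WHILE is tried before IDENTIFIER, COMPARISON before ASSIGNMENT.
TOKEN_SPEC = [
    ("WHILE",      r"\bwhile\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("INTEGER",    r"\d+"),
    ("COMPARISON", r">=|<=|==|!=|>|<"),
    ("ASSIGNMENT", r"="),
    ("ARITHMETIC", r"[-+*/]"),
    ("LPAREN",     r"\("),
    ("RPAREN",     r"\)"),
    ("SEMICOLON",  r";"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return (lexeme, token) pairs, dropping whitespace."""
    pairs = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            pairs.append((match.group(), match.lastgroup))
        pos = match.end()
    return pairs

pairs = tokenize("while (a >= b) a = a - 2;")
for lexeme, token in pairs:
    print(f"{lexeme:6} {token}")
```

Listing `COMPARISON` before `ASSIGNMENT` is a design choice of this sketch: it makes `>=` match as one lexeme instead of `>` followed by `=`.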

**How the Lexical Analyzer Works**

Tokens in a programming language are defined using regular expressions. The lexical analyzer (scanner) uses a Deterministic Finite Automaton (DFA) to recognize these tokens, because a DFA decides whether a string belongs to a regular language in a single left-to-right pass over the input.

Each final state of the DFA represents a specific token type. The process of converting regular expressions into a DFA can be automated, making token recognition fast and systematic.
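
To make the idea of final states concrete, here is a small hand-written, table-driven DFA in Python. It is only a sketch under stated assumptions: the state names, the `char_class` and `classify` helpers, and the choice of two token types (identifiers and integers) are hypothetical, and the function classifies one complete lexeme rather than scanning a whole input stream.

```python
# Hypothetical states for a tiny DFA; names are illustrative only.
START, IN_ID, IN_INT = "start", "in_id", "in_int"

# Each final (accepting) state maps to the token type it recognizes.
FINAL_STATES = {IN_ID: "IDENTIFIER", IN_INT: "INTEGER"}

def char_class(ch):
    """Collapse characters into the input classes the DFA distinguishes."""
    if ch.isascii() and (ch.isalpha() or ch == "_"):
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

# Transition table: (current state, input class) -> next state.
TRANSITIONS = {
    (START, "letter"): IN_ID,
    (START, "digit"): IN_INT,
    (IN_ID, "letter"): IN_ID,
    (IN_ID, "digit"): IN_ID,
    (IN_INT, "digit"): IN_INT,
}

def classify(lexeme):
    """Run the DFA over the whole lexeme; return its token type or None."""
    state = START
    for ch in lexeme:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:  # no transition defined: the DFA rejects
            return None
    return FINAL_STATES.get(state)  # None if we stopped in a non-final state

for lexeme in ["abs_zero_Kelvin", "273", "2abc"]:
    print(lexeme, "->", classify(lexeme))
```

With this table, `abs_zero_Kelvin` ends in the identifier state, `273` ends in the integer state, and `2abc` is rejected because no transition leaves the integer state on a letter.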

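The lexical analyzer can also detect lexical errors such as illegal characters, unterminated string literals or comments, and malformed numeric constants. It reports each error along with its line number and column number.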
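
As an illustration of reporting an error position, the following Python sketch flags any character that cannot start a token and records its line and column. The token pattern, the `find_lexical_errors` helper, and the sample input are assumptions for a tiny C-like subset, not part of any real compiler.

```python
import re

# Assumed token patterns for a tiny C-like subset; anything else is a lexical error.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[-+*/=;(){},]")

def find_lexical_errors(source):
    """Return (line, column, character) for each character that starts no token."""
    errors = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        col = 0
        while col < len(line):
            if line[col].isspace():
                col += 1
                continue
            match = TOKEN.match(line, col)
            if match:
                col = match.end()
            else:
                errors.append((line_no, col + 1, line[col]))  # columns are 1-based
                col += 1  # skip the bad character and keep scanning
    return errors

errors = find_lexical_errors("int a = 10;\nint b = a @ 2;\n")
for line_no, col, ch in errors:
    print(f"line {line_no}, column {col}: unexpected character {ch!r}")
```

Skipping the offending character and continuing is one simple recovery strategy; it lets the scanner report several errors in a single run.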

Example 1

Input:

a = b + c;

Token sequence:

id = id + id ;

Each id refers to an entry in the symbol table containing details about the variable.
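
The link between an id token and its symbol table entry might look like the Python sketch below. The `tokenize_with_symbols` helper, the table layout (a list of dictionaries), and the use of a list index as the token attribute are all assumptions chosen for illustration; real compilers store much more per entry.

```python
import re

# Assumed lexical rules: identifiers plus the three symbols used in Example 1.
TOKEN = re.compile(r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<sym>[=+;]))")

def tokenize_with_symbols(source):
    """Return (tokens, symbol_table); each id token carries its table index."""
    symbol_table = []  # one entry per distinct name, in order of first appearance
    index_of = {}      # name -> position in symbol_table
    tokens = []
    pos = 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if match.lastgroup == "id":
            name = match.group("id")
            if name not in index_of:
                index_of[name] = len(symbol_table)
                # Later phases would add details such as type and scope.
                symbol_table.append({"name": name})
            tokens.append(("id", index_of[name]))
        else:
            tokens.append((match.group("sym"), None))
        pos = match.end()
    return tokens, symbol_table

tokens, symbol_table = tokenize_with_symbols("a = b + c;")
print(tokens)
print(symbol_table)
```

Here every identifier becomes the same token type, id, and only the attribute (the table index) distinguishes a from b and c.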

Example 2

Program:

int main()
{
    int a, b;
    a = 10;
    return 0;
}

Valid tokens:

'int' 'main' '(' ')' '{'
'int' 'a' ',' 'b' ';'
'a' '=' '10' ';'
'return' '0' ';'
'}'

Comments and extra spaces are ignored by the lexical analyzer. Only meaningful tokens are identified and passed to the next phase of compilation.
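
One way to see how comments and whitespace drop out is to match them like any other pattern and then discard them. The Python sketch below assumes C-style comments (`/* ... */` and `//`); the pattern and the `meaningful_tokens` helper are hypothetical and cover only a small subset of C.

```python
import re

# Assumed C-style comment syntax: /* ... */ and // to end of line.
PATTERN = re.compile(r"""
    (?P<COMMENT>/\*.*?\*/|//[^\n]*)              # block comment or line comment
  | (?P<SPACE>\s+)                               # spaces, tabs, newlines
  | (?P<TOKEN>[A-Za-z_]\w*|\d+|[-+*/=;(){},])    # any meaningful lexeme
""", re.VERBOSE | re.DOTALL)

def meaningful_tokens(source):
    """Return lexemes only; comments and whitespace are matched, then dropped."""
    lexemes = []
    pos = 0
    while pos < len(source):
        match = PATTERN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup == "TOKEN":
            lexemes.append(match.group())
        pos = match.end()
    return lexemes

lexemes = meaningful_tokens("a = 10;   /* set a */\n// done\nreturn 0;")
print(lexemes)
```

The comment alternative is listed first so that `/*` and `//` are consumed as comments before `/` can be read as a division operator.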

**Exercise 1:** Count the number of tokens:

int main()
{
    int a = 10, b = 20;
    printf("sum is:%d", a+b);
    return 0;
}

Answer: Total number of tokens: 27.

**Exercise 2:** Count the number of tokens:

int max(int i);

Answer: Total number of tokens: 7. The tokens are int, max, (, int, i, ), and ;
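
Both answers can be checked mechanically with a short token counter. This Python sketch is an assumption-laden helper, not a full C lexer: the `count_tokens` function and its patterns handle only the constructs that appear in the two exercises, and a string literal is counted as a single token.

```python
import re

# Assumed lexeme patterns for the small C subset used in the exercises.
LEXEME = re.compile(r"""
    "(?:\\.|[^"\\])*"     # string literal, counted as one token
  | [A-Za-z_]\w*          # keyword or identifier
  | \d+                   # integer constant
  | [-+*/=;(){},]         # operator or special symbol
""", re.VERBOSE)

def count_tokens(source):
    """Count tokens; whitespace between lexemes is skipped by findall."""
    return len(LEXEME.findall(source))

EXERCISE_1 = '''
int main()
{
    int a = 10, b = 20;
    printf("sum is:%d", a+b);
    return 0;
}
'''
EXERCISE_2 = "int max(int i);"

count_1 = count_tokens(EXERCISE_1)
count_2 = count_tokens(EXERCISE_2)
print(count_1, count_2)
```

Under these assumptions the counter agrees with the answers above: 27 tokens for Exercise 1 and 7 for Exercise 2.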