The build process in software development involves several stages that transform source code into an executable program. This process is especially detailed and crucial in systems programming, such as kernel development. Below, we'll go through the stages of the build process, including compiling, assembling, and linking, with a focus on how these stages apply in a typical kernel development environment.
1 Overview of the Build Process
- Preprocessing: Handles macro substitution, file inclusion, and conditional compilation.
- Compilation: Converts preprocessed source code into assembly language.
- Assembly: Translates assembly code into machine code, producing object files.
- Linking: Combines object files and libraries into a single executable or binary.
- Loading: Places the executable in memory and starts its execution.
Detailed Stages of the Build Process
1 Preprocessing:
The preprocessing stage handles the directives in the source code that start with # (e.g., #include, #define). The preprocessor performs the following tasks:
- File Inclusion: Replaces #include directives with the contents of the included files.
- Macro Expansion: Replaces macro names with their definitions.
- Conditional Compilation: Includes or excludes parts of the code based on #ifdef, #ifndef, and similar directives.
Example source code (main.c):
#include <stdio.h>
#define MESSAGE "Hello, World!"
int main(void) {
    printf("%s\n", MESSAGE);
    return 0;
}
Preprocessed output (main.i), simplified to the one declaration from stdio.h that matters here:
extern int printf(const char *, ...);
int main(void) {
    printf("%s\n", "Hello, World!");
    return 0;
}
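Conditional compilation, the third task listed above, can be sketched in the same way. The macro DEBUG, the macro LOG_BUFFER_SIZE, and the function log_buffer_size are assumptions chosen for illustration; the point is only that one branch survives preprocessing and the other never reaches the compiler.

```c
#include <assert.h>

/* DEBUG is a hypothetical configuration macro. It could equally be
 * supplied on the compiler command line instead of being defined here. */
#define DEBUG 1

#ifdef DEBUG
#define LOG_BUFFER_SIZE 256   /* larger buffer for verbose debug builds */
#else
#define LOG_BUFFER_SIZE 64    /* smaller buffer for release builds */
#endif

/* After preprocessing, the body refers to the literal 256, because the
 * #else branch was removed before compilation started. */
int log_buffer_size(void)
{
    assert(LOG_BUFFER_SIZE > 0);  /* sanity check on the selected value */
    return LOG_BUFFER_SIZE;
}
```

Removing the #define DEBUG line would make the same function return 64 without any change to its body.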
2 Compilation
The compilation stage converts preprocessed source code into assembly language. Each source file is compiled independently into an assembly file.
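Example assembly output (main.s), in AT&T syntax for 32-bit x86; the exact output depends on the compiler version and flags:
    .file   "main.c"
    .section .rodata
.LC0:
    .string "Hello, World!"
    .text
    .globl  main
    .type   main, @function
main:
    pushl   %ebp            # save the caller's frame pointer
    movl    %esp, %ebp      # set up this function's frame
    subl    $8, %esp        # reserve stack space for the argument
    movl    $.LC0, (%esp)   # pass the address of the string
    call    puts            # GCC typically lowers printf("%s\n", s) to puts(s)
    movl    $0, %eax        # return value 0
    leave
    ret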
3 Assembly
The assembly stage translates the assembly code into machine code, producing object files. An assembler takes the assembly code and generates binary instructions for the target architecture.
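The resulting object file (main.o) is binary rather than text, so it cannot be shown directly; a disassembler such as objdump can display its contents.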
4 Linking
Purpose: To combine object files into a single executable, resolving symbol references and arranging sections in memory.
Tools: Linker (e.g., ld)
Steps:
- Combine multiple object files.
- Resolve external symbols (e.g., function calls between object files).
- Arrange code and data in memory according to the linker script.
- Generate the final executable or binary file.
5 Loading
Purpose: To load the executable into memory and start its execution.
Steps:
- Load the executable into the appropriate memory location.
- Set up the initial execution context (stack, registers).
- Jump to the entry point of the executable (e.g., start in the linker script).
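For an ELF executable, the entry point mentioned in the last step is stored in the file header. The following is a minimal sketch of how a loader might read it, assuming a little-endian 32-bit ELF image already in memory; the function name elf32_entry_point is hypothetical, while the offsets follow the ELF32 header layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets within the ELF32 file header. */
enum {
    EI_CLASS_OFFSET  = 4,   /* e_ident[EI_CLASS]: 1 means 32-bit */
    EI_DATA_OFFSET   = 5,   /* e_ident[EI_DATA]: 1 means little-endian */
    E_ENTRY_OFFSET   = 24,  /* e_entry: address of the entry point */
    ELF32_HEADER_LEN = 52   /* total size of the ELF32 header */
};

/* Read a little-endian 32-bit value one byte at a time. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 and stores the entry address if the image looks like a
 * little-endian ELF32 file; returns 0 otherwise. */
int elf32_entry_point(const uint8_t *image, size_t size, uint32_t *entry)
{
    assert(image != NULL && entry != NULL);
    if (size < ELF32_HEADER_LEN)
        return 0;
    if (image[0] != 0x7f || image[1] != 'E' ||
        image[2] != 'L' || image[3] != 'F')
        return 0;                      /* missing ELF magic number */
    if (image[EI_CLASS_OFFSET] != 1 || image[EI_DATA_OFFSET] != 1)
        return 0;                      /* not 32-bit little-endian */
    *entry = read_le32(image + E_ENTRY_OFFSET);
    return 1;
}
```

A real bootloader would also walk the program headers to copy each loadable segment into place before transferring control to the returned address.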
2 Overview of the Linking Process
The linking process can be divided into several key tasks:
- Symbol Resolution
- Relocation
- Section Merging
- Memory Layout and Address Assignment
- Generation of Executable or Library
1 Symbol Resolution
Purpose: To resolve all symbols (functions, variables) referenced in the code but defined in different modules or libraries.
Steps:
- The linker collects all object files and libraries specified in the link command.
- It builds a symbol table from these files, mapping symbol names to their addresses.
- If a symbol is referenced in one object file but defined in another, the linker ensures it knows where to find the definition.
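The steps above can be sketched with a toy symbol table. This is a simplified model, not how any particular linker is implemented; struct symbol, find_symbol, and count_unresolved are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One entry of a simplified global symbol table. */
struct symbol {
    const char *name;
    uint32_t    address;  /* meaningful only once the symbol is defined */
    int         defined;  /* 0 while the symbol is only referenced */
};

/* Look a symbol up by name; returns NULL if it has never been seen. */
struct symbol *find_symbol(struct symbol *table, size_t count, const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

/* Count symbols that are referenced but defined nowhere. A real linker
 * reports each of these as an undefined reference and fails the link. */
size_t count_unresolved(const struct symbol *table, size_t count)
{
    size_t unresolved = 0;
    for (size_t i = 0; i < count; i++)
        if (!table[i].defined)
            unresolved++;
    return unresolved;
}
```

A linear search keeps the sketch short; production linkers typically use hash tables because symbol tables can hold a very large number of entries.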
2 Relocation
Purpose: To adjust addresses within the code and data sections to reflect their actual locations in memory.
Steps:
- Each object file contains relocation entries that tell the linker where to adjust addresses.
- The linker calculates the final memory addresses for code and data sections.
- It updates the addresses in the code and data sections based on the calculated addresses.
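As a rough illustration of these steps, the sketch below patches a 32-bit field using the two most common i386 calculations: absolute (S + A, as in R_386_32) and PC-relative (S + A - P, as in R_386_PC32). The types and the function apply_reloc are hypothetical, and the code assumes the host byte order matches the target's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum reloc_type {
    RELOC_ABS32,  /* value = S + A      (like R_386_32)   */
    RELOC_PC32    /* value = S + A - P  (like R_386_PC32) */
};

struct reloc {
    size_t          offset;       /* where in the section to patch */
    enum reloc_type type;
    uint32_t        symbol_addr;  /* S: final address of the target symbol */
};

/* Patch one 32-bit field. section_addr is the final address of section[0].
 * The addend A is the value already stored in the field. */
void apply_reloc(uint8_t *section, size_t section_size,
                 uint32_t section_addr, const struct reloc *r)
{
    uint32_t addend, value;

    assert(r->offset + sizeof addend <= section_size);
    memcpy(&addend, section + r->offset, sizeof addend);
    value = r->symbol_addr + addend;
    if (r->type == RELOC_PC32)
        value -= section_addr + (uint32_t)r->offset;  /* subtract P */
    memcpy(section + r->offset, &value, sizeof value);
}
```

For a call instruction at 0x00100000 whose operand holds the addend -4 and whose target is at 0x00100020, the PC-relative result is 0x1B, the distance from the end of the 5-byte instruction to the target.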
3 Section Merging
Purpose: To combine sections of the same type (e.g., .text, .data) from different object files into single sections.
Steps:
- The linker merges all .text sections from different object files into a single .text section.
- Similarly, it merges .data sections, .bss sections, and other relevant sections.
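Merging can be pictured as laying the input sections of one kind end to end and recording where each one starts. The sketch below does exactly that; struct input_section and merge_sections are hypothetical names, and the per-section alignment that real linkers also honor is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One section of a given kind (e.g. .text) from one object file. */
struct input_section {
    const char *object_file;  /* hypothetical file name, for readability */
    uint32_t    size;         /* bytes contributed by this object file */
    uint32_t    offset;       /* filled in: offset inside the merged section */
};

/* Place the inputs back to back in link order and return the merged size. */
uint32_t merge_sections(struct input_section *inputs, size_t count)
{
    uint32_t total = 0;

    assert(inputs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        inputs[i].offset = total;   /* this input starts where the last ended */
        total += inputs[i].size;
    }
    return total;
}
```

The recorded offsets are what later allow the linker to turn "symbol X at offset N of this object's .text" into a final address.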
4 Memory Layout and Address Assignment
Purpose: To define the memory layout of the final executable, specifying where each section should be loaded into memory.
Steps:
- The linker script (if used) provides detailed instructions on how to arrange sections in memory.
- The linker uses this script to assign starting addresses to each section.
- It ensures that sections are aligned correctly and follow the specified memory layout.
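The address assignment described above is mostly rounding arithmetic. Here is a minimal sketch, assuming three sections placed one after another with each start rounded up to a page boundary; struct layout, align_up, and assign_addresses are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (a power of two). */
uint32_t align_up(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

struct layout {
    uint32_t text_start, data_start, bss_start, end;
};

/* Place .text at base, then .data and .bss, each on a page boundary. */
struct layout assign_addresses(uint32_t base, uint32_t text_size,
                               uint32_t data_size, uint32_t bss_size,
                               uint32_t page_size)
{
    struct layout l;

    l.text_start = base;
    l.data_start = align_up(l.text_start + text_size, page_size);
    l.bss_start  = align_up(l.data_start + data_size, page_size);
    l.end        = align_up(l.bss_start + bss_size, page_size);
    return l;
}
```

With a base of 0x00100000, 0x54 bytes of code, and 4 bytes each of data and bss, the sections start at 0x00100000, 0x00101000, and 0x00102000, and the layout ends at 0x00103000.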
5 Generation of Executable or Library
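Purpose: To create the final output file (an executable or a shared library) that can be loaded and executed by the operating system. Static libraries are not produced by the linker; they are archives of object files built with an archiver such as ar.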
Steps:
- The linker writes the combined and relocated sections to the output file.
- It generates necessary headers and tables (e.g., symbol table, relocation table) required by the operating system.
3 Example: Linking Process with Linker Script
To illustrate these concepts, let's consider an example using a linker script to create an executable for a simple kernel.
Linker Script Example (link.ld)
OUTPUT_FORMAT(elf32-i386)
ENTRY(start)
phys = 0x00100000;
SECTIONS
{
    .text phys : AT(phys) {
        code = .;
        *(.text)
        *(.rodata)
        . = ALIGN(4096);
    }
    .data : AT(phys + (data - code)) {
        data = .;
        *(.data)
        . = ALIGN(4096);
    }
    .bss : AT(phys + (bss - code)) {
        bss = .;
        *(.bss)
        . = ALIGN(4096);
    }
    end = .;
    /DISCARD/ : {
        *(.comment)
        *(.eh_frame)
        *(.note.gnu.build-id)
    }
}
Breakdown of the Linker Script:
- OUTPUT_FORMAT(elf32-i386): Specifies the output format as ELF for a 32-bit Intel architecture.
- ENTRY(start): Defines the entry point of the executable as the symbol start.
  - When the OS or bootloader loads this executable, it will begin execution at the address associated with the start symbol.
  - The start symbol is typically defined in one of the source files, usually in assembly or C, and marks the beginning of the program's execution flow.
  - ENTRY only records the address of start; it does not move it. The start symbol is located at the physical address 0x00100000 only when its code is the first thing placed in the .text section.
- phys = 0x00100000;: Defines a base physical address for the sections in the executable, which is where the code should be loaded.
- SECTIONS { ... }: Defines the memory layout of the executable.
Sections Defined:
- .text: Contains the code and read-only data (*(.text) and *(.rodata)). It starts at the address specified by phys, which is 0x00100000.
- .data: Contains initialized data (*(.data)), positioned right after the .text section. Its load address is calculated as phys + (data - code), ensuring that it follows the .text section.
- .bss: Contains uninitialized data, positioned after the .data section. Its load address is calculated as phys + (bss - code), ensuring that it follows the .data section.
- end = .;: Marks the end address of the entire memory layout.
- /DISCARD/: Excludes specific sections that are not needed in the final executable.
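The load-address expressions phys + (data - code) and phys + (bss - code) are plain offset arithmetic: take a section's distance from the start of .text and add it to the physical base. A small sketch follows, with load_address as a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Load address of a section, mirroring AT(phys + (section - code)):
 * the section's offset from the start of .text, added to the physical base. */
uint32_t load_address(uint32_t phys, uint32_t code, uint32_t section_start)
{
    assert(section_start >= code);  /* sections follow .text in this layout */
    return phys + (section_start - code);
}
```

In this script .text is linked at phys itself, so each section's load address equals its link address. The formula starts to matter if the kernel were linked at a different virtual address: with a hypothetical code address of 0xC0100000 and data at 0xC0101000, the data section would still be loaded at 0x00101000.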
Linking Process Steps:
- Symbol Resolution: The linker collects all object files (e.g., kernel.o) and libraries specified in the link command. It builds a symbol table mapping symbol names to their addresses.
- Relocation: The linker reads relocation entries in the object files and adjusts addresses based on the final memory layout.
- Section Merging: It merges .text, .data, and .bss sections from different object files into single sections as specified in the linker script.
- Memory Layout and Address Assignment: Using the linker script, the linker assigns starting addresses to each section and ensures proper alignment.
- Generation of Executable: The linker writes the combined and relocated sections to the output file (e.g., kernel.elf) and generates the headers and necessary tables.
Practical Outcome
Every time the linker script is used:
1 Placement of start:
- The start symbol, which is typically the first instruction of your kernel or program, will be located at the physical address 0x00100000, provided its code is the first thing placed in the .text section.
2 Loading:
- When a bootloader loads the ELF file generated by this linker script, it will load the .text section (and thus the start symbol) into memory starting at 0x00100000.
3 Execution:
- The bootloader will then jump to the entry point recorded in the ELF file, which is the address of start (0x00100000 in this layout), to start executing the code.
Example:
    .section .text
    .globl start
start:
    cli                     # Disable maskable interrupts
    hlt                     # Halt the CPU
    .section .rodata
message:
    .ascii "Hello, kernel!\n"
    .section .data
my_data:
    .long 0xdeadbeef
    .section .bss
    .lcomm my_bss, 4        # Reserve 4 zero-initialized bytes
When you link this code using the provided linker script:
- The .text section, containing the start label and its instructions, is placed at 0x00100000.
- The .data section is placed after .text.
- The .bss section is placed after .data.
Loading and Execution by the Bootloader:
- The bootloader reads the ELF file and loads the .text section at 0x00100000.
- It then jumps to 0x00100000, starting execution from the start label.
4 Assumptions in Code
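Let's assume you have the following C code. The start function is defined before print so that a compiler emitting functions in source order places it first in .text:
int print(void);   /* declared here, defined after start */
int start(void) {
    // Function implementation
    return 0;
}
int print(void) {
    // Function implementation
    return 0;
}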
4.1 Compilation and Linking
1 Compilation:
- When the source files are compiled, the print and start functions will be placed in the .text section of the compiled object files.
2 Linking:
- During linking, the .text sections of all object files are merged together and placed in the .text section defined in the linker script.
4.2 Address Calculation
The linker script specifies that the .text section starts at 0x00100000:
.text phys : AT(phys) {
    code = .;
    *(.text)
    *(.rodata)
    . = ALIGN(4096);
}
- phys is defined as 0x00100000.
- The .text section starts at phys, which is 0x00100000.
Address of start
- The start function is marked as the entry point with ENTRY(start).
- Given that .text starts at 0x00100000, start will be the first symbol in the .text section if its code appears first among the .text input sections during linking.
- Therefore, under that condition, start will be at address 0x00100000.
Address of print
- The print function will follow in the .text section after start.
- The exact address of print depends on the size of the start function.
- If start is, for example, 0x20 bytes long, then print will be located at 0x00100020.
4.3 Example Memory Layout
Assuming the following:
- The start function is placed first and is 32 bytes long (0x20).
- The print function follows immediately after start.
The memory layout would be:
- start at 0x00100000
- print at 0x00100020 (assuming start is 32 bytes)
4.4 Graphical Representation
Here's a graphical representation of the memory layout:
0x00100000 --> start function
+-------------------+
| start() code |
| (32 bytes) |
+-------------------+
0x00100020 --> print function
+-------------------+
| print() code |
| (next function) |
+-------------------+
| ... |
+-------------------+
5 Linker Script Overview
OUTPUT_FORMAT(elf32-i386)
ENTRY(start)
phys = 0x00100000;
SECTIONS
{
    .text phys : AT(phys) {
        code = .;
        *(.text)
        *(.rodata)
        . = ALIGN(4096);
    }
    .data : AT(phys + (data - code)) {
        data = .;
        *(.data)
        . = ALIGN(4096);
    }
    .bss : AT(phys + (bss - code)) {
        bss = .;
        *(.bss)
        . = ALIGN(4096);
    }
    end = .;
    /DISCARD/ : {
        *(.comment)
        *(.eh_frame)
        *(.note.gnu.build-id)
    }
}
5.1 Linker Script Breakdown
1. OUTPUT_FORMAT(elf32-i386)
- Purpose: Specifies the format of the output file.
- Usage: OUTPUT_FORMAT(elf32-i386)
- Explanation: This tells the linker to produce an ELF (Executable and Linkable Format) file for the 32-bit x86 architecture.
2. ENTRY(start)
- Purpose: Defines the entry point of the program.
- Usage: ENTRY(start)
- Explanation: This specifies that the symbol start is the entry point of the executable. When the program starts executing, it begins at the start symbol.
3. phys = 0x00100000;
- Purpose: Defines a variable representing a physical address.
- Usage: phys = 0x00100000;
- Explanation: Sets the phys variable to 0x00100000 (1 MB). This is a common address for loading a kernel on the x86 architecture, because it lies just above the first megabyte of memory, which is used by the BIOS and other system functions.
4. SECTIONS { ... }
- Purpose: Defines the memory layout and sections of the output file.
- Usage: SECTIONS { ... }
- Explanation: The SECTIONS command defines how different sections of the input files should be mapped into the output file.
5. .text phys : AT(phys) { ... }
- Purpose: Specifies the .text section's location and content.
- Usage: .text phys : AT(phys) { ... }
- Explanation:
  - .text phys specifies that the .text section should start at the address phys, which is 0x00100000. This is the address the code is linked to run at.
  - AT(phys) tells the linker that the load address for this section is also 0x00100000.
  - Inside the braces {}, the code = .; line sets the symbol code to the current address, marking the start of the .text section.
  - * wildcard pattern: *(.text) and *(.rodata) include all input sections named .text and .rodata from the input object files.
  - . = ALIGN(4096); pads the section to a 4096-byte boundary, which ensures the next section starts at a 4096-byte aligned address.
6. .data : AT(phys + (data - code)) { ... }
- Purpose: Specifies the .data section's location and content.
- Usage: .data : AT(phys + (data - code)) { ... }
- Explanation:
  - .data specifies the start of the .data section.
  - AT(phys + (data - code)) calculates the load address for the .data section as the physical address plus the offset between the data and code symbols.
  - Inside the braces {}, data = .; sets the symbol data to the current address, marking the start of the .data section.
  - * wildcard pattern: *(.data) includes all input sections named .data from the input object files.
  - . = ALIGN(4096); ensures the next section starts at a 4096-byte aligned address.
7. .bss : AT(phys + (bss - code)) { ... }
- Purpose: Specifies the .bss section's location and content.
- Usage: .bss : AT(phys + (bss - code)) { ... }
- Explanation:
  - .bss specifies the start of the .bss section.
  - AT(phys + (bss - code)) calculates the load address for the .bss section as the physical address plus the offset between the bss and code symbols.
  - Inside the braces {}, bss = .; sets the symbol bss to the current address, marking the start of the .bss section.
  - * wildcard pattern: *(.bss) includes all input sections named .bss from the input object files.
  - . = ALIGN(4096); ensures the next section starts at a 4096-byte aligned address.
8. end = .;
- Purpose: Defines a symbol marking the end of the last section.
- Usage: end = .;
- Explanation: Sets the end symbol to the current address, which marks the end of all sections defined so far. This is useful for calculating the size of the program or for placing data after all sections.
9. /DISCARD/ : { ... }
- Purpose: Discards unwanted sections.
- Usage: /DISCARD/ : { ... }
- Explanation: Sections listed within the /DISCARD/ block are not included in the final output file. This is typically used to remove debugging or unnecessary sections.
  - * wildcard pattern: The patterns match all input sections named .comment, .eh_frame, and .note.gnu.build-id from the input object files and discard them.
6 Linker Script
A linker script is a text file, read by the linker, that explains how the different sections of the object files should be merged to create an output file.
It controls the layout and organization of the output file by specifying how the linker should place the sections of the input files in the output file, how to handle memory regions, how to define symbols, and more.
- GNU linker scripts conventionally use the file extension .ld.
- A linker script specifies how different sections of code and data should be placed in memory.
- You supply your own linker script to the linker at the linking phase using the -T option; without it, GNU ld falls back to a built-in default script.
Key Fields and Directives in Linker Scripts
- OUTPUT_FORMAT
- ENTRY
- SECTIONS
- MEMORY
- PHDRS
- SYMBOLS
- ASSERT
- INCLUDE
- STARTUP
- OUTPUT
- REGION_ALIAS
- SEARCH_DIR
- EXTERN
- FORCE_COMMON_ALLOCATION
- FORCE_SHARED_ALLOCATION
1 OUTPUT_FORMAT
Specifies the format of the output file.
Usage:
OUTPUT_FORMAT(format)
Example:
OUTPUT_FORMAT(elf32-i386)
This specifies that the output file should be in the ELF format for a 32-bit x86 architecture.
2 ENTRY
Defines the entry point of the program where execution starts.
Usage:
ENTRY(symbol)
Example:
ENTRY(start)
This specifies that the start symbol is the entry point of the executable.
3 SECTIONS
Describes the layout of the output file in memory, specifying where each section should be placed.
Usage:
SECTIONS
{
...
}
Example (the RAM region used here must be defined with the MEMORY command, described in the next section):
SECTIONS
{
    .text : {
        *(.text)
        *(.rodata)
        . = ALIGN(4096);
    } > RAM
    .data : {
        *(.data)
        . = ALIGN(4096);
    } > RAM
    .bss : {
        *(.bss)
        . = ALIGN(4096);
    } > RAM
    /DISCARD/ : {
        *(.comment)
        *(.eh_frame)
    }
}
Key Terms:
- Section Name: The name of the section, such as .text, .data, or .bss.
- Address Assignment: Specifies the start address of a section.
  - Example: .text 0x00100000 : { ... }
- Wildcard Patterns: Used to match section names from input files.
  - Example: *(.text), *(.data)
- Alignment: Ensures sections start at aligned addresses.
  - Example: . = ALIGN(4096);
4 MEMORY
Defines memory regions for placing sections.
Usage:
MEMORY
{
name (attr) : ORIGIN = origin, LENGTH = length
...
}
Example:
MEMORY
{
    RAM (wx) : ORIGIN = 0x00100000, LENGTH = 0x400000
    ROM (rx) : ORIGIN = 0x00000000, LENGTH = 0x10000
}
Key Terms:
- Name: The name of the memory region.
- Attributes: Specify the permissions, such as r (read), w (write), x (execute).
- ORIGIN: The start address of the memory region.
- LENGTH: The size of the memory region.
5 PHDRS
Describes the program headers for the output file.
Usage:
PHDRS
{
    name type [attributes] ;
    ...
}
Example:
PHDRS
{
    text PT_LOAD FILEHDR PHDRS;
    data PT_LOAD;
}
Key Terms:
- Name: The name of the program header.
- Type: The type of the segment, such as PT_LOAD.
- Attributes: Additional attributes, like FILEHDR and PHDRS, which make the segment include the ELF file header and the program headers themselves.
6 SYMBOLS
Defines symbols and assigns values to them. SYMBOLS is not itself a linker script keyword; symbols are defined with plain assignments.
Usage:
symbol = expression;
Example:
_start = 0x100000;
This defines the _start symbol with the value 0x100000.
Advanced Usage:
- PROVIDE: Defines a symbol only if it is referenced and not already defined by an input file.
  - Example: PROVIDE(_stack = 0x200000);
- ASSERT: Ensures certain conditions are met.
  - Example: ASSERT(_stack >= 0x200000, "Stack too low!");
7 ASSERT
Ensures certain conditions are met during linking.
Usage:
ASSERT(condition, message)
Example:
ASSERT(_stack >= 0x200000, "Stack too low!");
This asserts that the _stack symbol is at least 0x200000; otherwise the linker stops with the error message Stack too low!.
8 INCLUDE
Includes another linker script within the current script.
Usage:
INCLUDE "filename"
Example:
INCLUDE "common.ld"
This includes the contents of common.ld into the current linker script.
9 STARTUP
Specifies the startup file to be linked first.
Usage:
STARTUP(filename)
Example:
STARTUP(startup.o)
This ensures that startup.o is linked first.
10 OUTPUT
Specifies the name of the output file.
Usage:
OUTPUT(filename)
Example:
OUTPUT("kernel.bin")
This sets the output file name to kernel.bin.
11 REGION_ALIAS
Defines an alias for a memory region.
Usage:
REGION_ALIAS(alias, region)
Example:
REGION_ALIAS(RAM_ALIAS, RAM)
This defines RAM_ALIAS as an alias for the RAM memory region.
12 SEARCH_DIR
Adds a directory to the search path for libraries and object files.
Usage:
SEARCH_DIR("directory")
Example:
SEARCH_DIR("/usr/local/lib")
This adds /usr/local/lib to the search path.
13 EXTERN
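Forces a symbol to be entered in the output file as an undefined symbol, which can cause the library member that defines it to be linked in. It is equivalent to the -u command-line option.
Usage:
EXTERN(symbol)
Example:
EXTERN(_start)
This enters _start in the symbol table as an undefined symbol, so that a definition for it is pulled into the link even if no input file refers to it.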
14 FORCE_COMMON_ALLOCATION
Forces the linker to assign space to common symbols even when it is producing relocatable output (-r). It is equivalent to the -d command-line option.
Usage:
FORCE_COMMON_ALLOCATION
Example:
FORCE_COMMON_ALLOCATION
This ensures that common symbols are allocated even in a relocatable link.
15 FORCE_SHARED_ALLOCATION
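Forces allocation of shared symbols. Note that this command does not appear among those documented in the GNU ld manual, which lists FORCE_COMMON_ALLOCATION, INHIBIT_COMMON_ALLOCATION, and FORCE_GROUP_ALLOCATION, so check your linker's documentation before relying on it.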
Usage:
FORCE_SHARED_ALLOCATION
Example:
FORCE_SHARED_ALLOCATION
This ensures that shared symbols are allocated.
7 Various Symbols and Commands
1 . (Dot):
Represents the current location counter within the memory layout. It's used to specify the current memory address.
2 Wildcards:
Wildcards such as * (any run of characters), ? (any single character), and [chars] (any one of the listed characters) are used to match multiple input files or sections with similar names. For example, *(.text) matches all sections named .text in all input files.
3 Comments:
Comments in a linker script use the C style, /* ... */, and are used to provide explanations and annotations within the script.
