Operating System Concepts for Cloud Engineers

This document is a comprehensive guide to the Linux operating system internals that every cloud engineer must understand. It covers process management, memory architecture, file descriptors and I/O models, inter-process communication, systemd, the Linux networking stack, filesystem hierarchy, and the user/group permission model. By the end, you will be able to reason about why a container is OOM-killed, trace a hanging process with strace, debug networking inside a namespace, and connect OS internals to the data structures that power them. The target audience is engineers who can already use a Linux terminal but want to understand what is happening beneath the commands they type.


Table of Contents

  1. Why This Matters
  2. Mental Models
  3. Core Concepts
  4. Practical Use Cases
  5. Worked Examples
  6. Common Pitfalls & Misconceptions
  7. Summary & Key Takeaways
  8. Quick Reference Cheat Sheet
  9. DSA Connections
  10. Further Reading

Why This Matters

When a Kubernetes container is terminated with the reason OOMKilled, the root cause is not “Kubernetes decided to kill it.” The root cause is that the Linux kernel’s OOM killer selected a process in that container because the container’s memory cgroup reached its limit and the kernel could not reclaim enough pages to stay under it. The OOMKilled status you see in kubectl describe pod is a downstream symptom of an operating system decision.

Cloud engineering is, at its foundation, Linux systems engineering. Every container is a Linux process. Every network policy is an iptables or eBPF rule. Every volume mount is a bind mount in a mount namespace. Every CPU limit is a cgroup throttle. If you do not understand the OS layer, you are building on abstractions you cannot debug.

Here is why each topic in this document matters directly to your daily work:

  • Process lifecycle — Container runtimes use fork/exec/wait to launch your application. Zombie processes inside containers are a real production issue when PID 1 does not reap children.
  • Threads vs processes — Choosing between a multi-threaded server and a multi-process model (like Gunicorn workers) determines your failure isolation and memory footprint.
  • Memory management — Understanding virtual memory, RSS, swap, and the OOM killer is the difference between setting sane memory limits and getting paged at 3 AM.
  • File descriptors and epoll — Nginx, Node.js, and every high-performance proxy rely on epoll. When you see “too many open files,” you need to know what file descriptors are.
  • Signals and IPC: The choice between SIGTERM and SIGKILL determines whether your application gets a graceful shutdown in Kubernetes. Pipes and sockets underpin every sidecar pattern.
  • Systemd — On VM-based infrastructure, systemd is how services start, restart, and log. Even on Kubernetes nodes, kubelet is a systemd unit.
  • Networking stack — Every ClusterIP, NodePort, and DNS resolution in Kubernetes is built on top of the Linux networking stack. Debugging requires ip, ss, and namespace awareness.
  • Filesystem hierarchy — Knowing where configs live (/etc), where logs accumulate (/var/log), and what /proc exposes is essential for troubleshooting any Linux system.
  • Permissions — Containers running as root vs non-root, setuid binaries, and sticky bits on /tmp are all permission model concepts that show up in security audits.

Mental Models

Before diving into specifics, internalize these four mental models. They will serve as your conceptual scaffolding for everything that follows.

Model 1: Processes as Apartments, Threads as Roommates

Imagine an apartment building. Each apartment (process) has its own front door, its own kitchen, its own bathroom, and its own lease. If one apartment catches fire, the others are unaffected — the walls provide isolation.

Now imagine roommates sharing one apartment (threads within a process). They share the kitchen (heap memory), the bathroom (file descriptors), and the living room (global variables). This is efficient — no duplicate furniture — but if one roommate trashes the kitchen, everyone suffers. There are no walls between roommates.

Key insight: Use processes when you need fault isolation (a crash in one must not affect another). Use threads when you need shared state and low overhead (all threads see the same heap).

Model 2: Virtual Memory as “Every Process Gets Its Own Map”

Imagine a city with physical streets and buildings. Every process is given its own personal map of the city. On Process A’s map, “123 Main Street” might correspond to a real building at physical address 0x7FFF. On Process B’s map, “123 Main Street” points to a completely different physical building at address 0x3A00.

The MMU (Memory Management Unit) is the translator standing between each process and the real city. When Process A says “take me to 123 Main Street,” the MMU looks up the real physical address and routes the request.

Key insight: Two processes can both use virtual address 0x400000 without conflict because the MMU maps them to different physical frames. This is how process isolation works at the memory level.

Model 3: File Descriptors as Numbered Tickets

Walk into a deli and take a ticket. Your ticket says “37.” You do not need to know which shelf your order is sitting on or which employee is preparing it. You just hold your ticket, and when you call your number, the deli knows exactly which order to hand you.

File descriptors work the same way. When a process opens a file, a socket, or a pipe, the kernel hands back a small integer — your ticket number. The process does not need to know the physical disk block, the inode number, or the network buffer address. It just says “read from ticket 5” or “write to ticket 7,” and the kernel resolves the rest.

Every process starts with three tickets already issued: 0 (stdin), 1 (stdout), 2 (stderr). Everything else — files, sockets, pipes, devices — gets the next available number.

Model 4: epoll as a Receptionist

Imagine you are waiting for packages from 10,000 different couriers. You could check each courier one by one, every second, asking “is my package here yet?” That is polling — it works but wastes enormous time.

Or, you could hire a receptionist (epoll). You tell the receptionist: “Here are the 10,000 couriers I am expecting packages from. Notify me only when one of them actually arrives.” You sit and wait. The receptionist taps you on the shoulder only when there is a package ready. No wasted effort, no busy-waiting.

This is exactly how Nginx handles 10,000 concurrent connections on a single thread. It does not create 10,000 threads. It registers 10,000 sockets with epoll and processes only the ones that have data.


Core Concepts

3.1 Process Lifecycle: fork, exec, wait

Every process in Linux is created by an existing process. The only exception is the init process (PID 1), which the kernel creates at boot. Every other user-space process descends from it, forming a tree. (Kernel threads form a separate subtree under kthreadd, PID 2.)

The fork/exec Pattern

The two fundamental system calls for process creation are fork() and exec(). They are deliberately separate because UNIX philosophy says each should do one thing well.

fork() creates an exact copy of the calling process. The child gets a new PID but starts with a copy of the parent’s memory, file descriptors, and program counter. Immediately after fork, parent and child are running the same code at the same point.

exec() replaces the current process’s code and data with a new program. The PID stays the same, but everything else changes — the executable, the stack, the heap.

Together, the pattern is: fork to create a new process, then exec in the child to run a different program.

// fork_exec_demo.c — demonstrates the fork/exec/wait lifecycle
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
 
int main() {
    printf("Parent PID: %d\n", getpid());
 
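    pid_t child = fork();  // Create a copy of this process
 
    if (child < 0) {
        // fork failed: no child was created, so there is nothing to wait for
        perror("fork failed");
        exit(1);
    } else if (child == 0) {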
        // This block runs in the CHILD process
        printf("Child PID: %d, Parent PID: %d\n", getpid(), getppid());
        // Replace child's program with "ls -la"
        execlp("ls", "ls", "-la", "/tmp", NULL);
        // If exec succeeds, this line NEVER runs
        perror("exec failed");
        exit(1);
    } else {
        // This block runs in the PARENT process
        int status;
        waitpid(child, &status, 0);  // Block until child exits
        if (WIFEXITED(status)) {
            printf("Child exited with code: %d\n", WEXITSTATUS(status));
        }
    }
    return 0;
}

Output:

Parent PID: 1234
Child PID: 1235, Parent PID: 1234
total 24
drwxrwxrwt 6 root root 4096 May 21 10:00 .
...
Child exited with code: 0

Zombie Processes

A zombie process is a child that has exited but whose parent has not yet called wait() to collect its exit status. The process is dead — it is not using CPU or memory — but its entry remains in the process table because the kernel is preserving its exit code for the parent.

In cloud contexts, this matters inside containers. If your container’s PID 1 is a shell script that spawns children but never calls wait, you accumulate zombies. They fill the PID table. This is why many container images use tini or dumb-init as the entrypoint — they act as a proper init process that reaps zombies.
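
The zombie state is easy to observe directly. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper names read_state and zombie_demo are illustrative and do not come from any tool mentioned above. The parent forks a child that exits immediately, watches it show up with state Z, and then reaps it with waitpid.

```python
import os
import time


def read_state(pid):
    """Return the one-letter state of a process from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state letter is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def zombie_demo():
    """Fork a child that exits at once, observe it as a zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately with a recognisable code

    # Parent: the child stays a zombie ('Z') until we call waitpid().
    state = read_state(pid)
    deadline = time.time() + 5
    while state != "Z" and time.time() < deadline:
        time.sleep(0.01)
        state = read_state(pid)

    _, status = os.waitpid(pid, 0)  # reap: the process-table entry is released
    still_listed = os.path.exists(f"/proc/{pid}")
    return state, os.WEXITSTATUS(status), still_listed


if __name__ == "__main__":
    print(zombie_demo())
```

An init such as tini performs the same waitpid step in a loop for every child it inherits, which is why it keeps the process table clean.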

Orphan Processes

An orphan process is a child whose parent exits first. The kernel re-parents orphans to PID 1 (init/systemd), which is responsible for reaping them. Orphans are generally harmless — they are adopted and cleaned up. Zombies are the problematic ones because their parent is alive but negligent.

Inspection Commands

# Show all processes with full detail
ps aux                  # a=all users, u=user-oriented, x=include processes without a TTY
 
# Show the process tree — visualize parent-child relationships
pstree -p               # -p shows PIDs alongside process names
 
# Trace system calls made by a running process
strace -p <pid>         # Attach to a live process and print every syscall
                        # Use -f to follow child processes after fork
 
# List all open file descriptors for a process
lsof -p <pid>           # Shows files, sockets, pipes — everything the process has open
 
# Find zombie processes specifically
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'   # Match only rows whose STAT starts with 'Z'
                                              # (a plain grep 'Z' also matches any name containing Z)

3.2 Threads vs Processes

Now that you understand processes, the question becomes: when should you use multiple processes, and when should you use multiple threads within a single process?

What a Thread Actually Is

A thread is an independent flow of execution within a process. Threads share the process’s address space (heap, global variables, file descriptors) but each thread has its own stack and register set.

In Linux, threads are implemented as “lightweight processes” via the clone() system call. A thread is really a process that happens to share most of its resources with another process. This is why the kernel scheduler treats threads and processes almost identically.
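
The difference between sharing an address space and owning a copy of one can be shown in a few lines. This is a small sketch, assuming a Unix-like system where os.fork is available; the function name thread_vs_process is illustrative. A thread and a forked child each append to the same list, and the parent then checks which change it can see.

```python
import os
import threading


def thread_vs_process():
    """Mutate a list from a thread and from a forked child; report what the parent sees."""
    shared = []

    # A thread runs in the same address space, so its append is visible here.
    t = threading.Thread(target=shared.append, args=("from-thread",))
    t.start()
    t.join()
    after_thread = list(shared)

    # A forked child works on its own copy-on-write copy of the address space.
    pid = os.fork()
    if pid == 0:
        shared.append("from-child")  # changes only the child's copy
        os._exit(0)
    os.waitpid(pid, 0)
    after_fork = list(shared)

    return after_thread, after_fork


if __name__ == "__main__":
    print(thread_vs_process())
```

The child's append never reaches the parent, which is the isolation property in practice; to get data back from a process you need one of the IPC mechanisms.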

When to Use Which

Criterion           | Multiple Processes                                        | Multiple Threads
--------------------|-----------------------------------------------------------|--------------------------------------------------
Fault isolation     | Excellent: a crash in one does not affect others          | Poor: one thread’s segfault kills all threads
Memory overhead     | Higher: each process has its own address space            | Lower: threads share the address space
Communication       | Requires IPC (pipes, sockets, shared memory)              | Direct memory access (but needs synchronization)
Security            | Separate privilege boundaries possible                    | All threads share the same UID/permissions
Context switch cost | Higher (TLB flush, page table swap)                       | Lower (same address space, no TLB flush)
Real-world example  | Gunicorn workers, Nginx workers, Chrome renderer processes | Java thread pools, Go goroutines (multiplexed onto OS threads)

Thread Pools

Creating a new thread for every request is expensive. A thread pool pre-creates a fixed number of worker threads that pull tasks from a shared queue. This amortizes thread creation cost and bounds resource usage.

# thread_pool_demo.py — Python thread pool serving simulated requests
from concurrent.futures import ThreadPoolExecutor
import time
import threading
 
def handle_request(request_id):
    """Simulate processing a request."""
    thread_name = threading.current_thread().name
    print(f"[{thread_name}] Processing request {request_id}")
    time.sleep(0.5)  # Simulate I/O-bound work (DB query, API call)
    return f"Result for request {request_id}"
 
# Create a pool of 4 worker threads
# Submitting 10 tasks — only 4 run concurrently, rest wait in queue
with ThreadPoolExecutor(max_workers=4, thread_name_prefix="Worker") as pool:
    futures = [pool.submit(handle_request, i) for i in range(10)]
    for future in futures:
        print(future.result())  # Blocks until each result is ready

Output:

[Worker_0] Processing request 0
[Worker_1] Processing request 1
[Worker_2] Processing request 2
[Worker_3] Processing request 3
[Worker_0] Processing request 4
...
Result for request 0
Result for request 1
...

Cloud relevance: When you set --workers 4 in Gunicorn, you are choosing a multi-process model. When a Java application uses Executors.newFixedThreadPool(200), it is using a thread pool. Understanding this choice is critical for setting CPU and memory limits in Kubernetes — processes consume more memory but isolate failures; threads are lighter but share a fate.

3.3 Memory Management

Memory management is where the operating system performs its most impressive magic trick: making every process believe it has the entire machine to itself.

Virtual Memory

Virtual memory is the abstraction that gives each process its own private, contiguous address space, regardless of how much physical RAM exists or how it is actually laid out. The process works with virtual addresses; the hardware MMU translates them to physical addresses at runtime.

Why does this matter? Three reasons:

  1. Isolation — Process A cannot read Process B’s memory because their virtual addresses map to different physical frames.
  2. Overcommit — The OS can promise more memory than physically exists, backing it with disk (swap) when needed.
  3. Simplicity — Every process sees a clean, linear address space starting at 0, regardless of physical fragmentation.

Paging

Physical memory is divided into fixed-size chunks called frames (typically 4 KB). Virtual memory is divided into same-sized chunks called pages. The page table maps virtual pages to physical frames.
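
The arithmetic behind a page-table lookup is short enough to write out. The sketch below assumes 4 KB pages as stated above and uses a made-up page table with two entries; the frame numbers are hypothetical and chosen only for illustration. A virtual address splits into a page number and an offset, the page number is swapped for a frame number, and the offset is carried over unchanged.

```python
PAGE_SIZE = 4096  # 4 KB pages, as in the text

# Hypothetical page table for one process: virtual page number -> physical frame number
page_table = {0x400: 0x1A3, 0x401: 0x07F}


def translate(vaddr):
    """Translate a virtual address using the toy page table above."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    if page not in page_table:
        raise LookupError(f"page fault: virtual page {page:#x} is not mapped")
    return page_table[page] * PAGE_SIZE + offset


if __name__ == "__main__":
    print(hex(translate(0x400123)))  # page 0x400, offset 0x123
```

Real page tables are multi-level trees walked by the MMU, but the split into page number and offset is the same.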

Demand Paging

The kernel does not load an entire program into RAM when you launch it. Instead, it uses demand paging: pages are loaded into physical memory only when the process actually accesses them. If a process touches a page that is not in RAM, the MMU raises a page fault, the kernel loads the page from disk, and execution continues.

This is why a program with a 2 GB binary can start almost instantly — most of those 2 GB are never touched and never loaded.
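
Demand paging can be observed from user space by comparing resident memory before and after touching a mapping. This is a rough sketch, assuming Linux (it reads /proc/self/statm) and a 32 MB region picked arbitrarily; the function names are illustrative. Mapping the region should barely move the resident set size, while writing one byte per page should grow it by roughly the size of the region.

```python
import mmap
import os

PAGE = os.sysconf("SC_PAGE_SIZE")


def rss_bytes():
    """Resident set size of this process, from /proc/self/statm (Linux only)."""
    with open("/proc/self/statm") as f:
        return int(f.read().split()[1]) * PAGE


def demand_paging_demo(size=32 * 1024 * 1024):
    """Map `size` bytes, then touch every page; return RSS growth after each step."""
    base = rss_bytes()
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    after_map = rss_bytes() - base      # mapping alone only reserves address space
    for off in range(0, size, PAGE):
        region[off] = 1                 # first write to a page triggers a page fault
    after_touch = rss_bytes() - base
    region.close()
    return after_map, after_touch


if __name__ == "__main__":
    mapped, touched = demand_paging_demo()
    print(f"RSS growth after mmap: {mapped} bytes, after touching: {touched} bytes")
```

The same gap between mapped and resident memory is what separates VmSize from VmRSS in /proc/<pid>/status.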

Swap

When physical RAM is full, the kernel needs to make room for new pages. It evicts least recently used pages to a designated area on disk called swap space. If the evicted page is later accessed, a page fault brings it back into RAM (swapping it in) while potentially evicting another page (swapping it out).

Warning for cloud engineers: In Kubernetes, swap is traditionally disabled on nodes because the scheduler assumes memory limits correspond to physical RAM. Kubernetes 1.22+ added alpha support for swap, but in most production clusters, swap is off. This means the OOM killer is your only safety net — there is no “overflow to disk.”

The OOM Killer

When the system runs out of both physical RAM and swap, the kernel’s Out-Of-Memory (OOM) killer activates. It scores every process based on memory usage and other heuristics, then kills the highest-scoring process to free memory.

# View current memory state
free -h                     # -h = human-readable sizes
                            # Shows total, used, free, shared, buff/cache, available
 
# Watch memory stats in real time (1-second interval)
vmstat 1                    # Columns: r(run queue), b(blocked), swpd(swap used),
                            # free, buff, cache, si(swap in), so(swap out)
 
# Detailed memory breakdown from the kernel
cat /proc/meminfo           # MemTotal, MemFree, MemAvailable, Buffers, Cached,
                            # SwapTotal, SwapFree, and dozens more fields
 
# Check a specific process's memory usage
cat /proc/<pid>/status      # Look for VmRSS (resident set size = actual RAM used)
                            # and VmSize (total virtual memory mapped)
 
# View OOM scores (higher = more likely to be killed)
cat /proc/<pid>/oom_score       # Current OOM score (0-1000+)
cat /proc/<pid>/oom_score_adj   # Adjustment factor (-1000 to 1000)
                                # -1000 means "never OOM-kill this process"

Cloud relevance: When you set resources.limits.memory: 512Mi in a Kubernetes pod spec, you are configuring a cgroup memory limit. The cgroup is charged for the container’s RSS plus its page cache; if usage reaches the limit and the kernel cannot reclaim enough cache to get back under it, the kernel OOM-kills a process in the container. Understanding RSS vs virtual memory vs cache is essential: a process with 2 GB virtual size but 100 MB RSS is fine under a 512 Mi limit.

3.4 File Descriptors, I/O, and epoll

Remember our “numbered tickets at a deli counter” mental model? Let us build on that with the actual mechanics.

File Descriptors

A file descriptor (FD) is a non-negative integer that the kernel uses as an index into a per-process table of open I/O resources. “Everything is a file” in Linux — regular files, directories, sockets, pipes, devices, and even /proc entries are all accessed through file descriptors.

Every process starts with three:

  • FD 0: stdin (standard input)
  • FD 1: stdout (standard output)
  • FD 2: stderr (standard error)

When you open a new file or socket, the kernel assigns the lowest available FD number.
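
The lowest-available rule can be checked with a few calls. The sketch below assumes a Unix-like system with /dev/null and a single-threaded process, so nothing else opens or closes descriptors in between; the function name is illustrative. It opens three descriptors, closes the middle one, and opens another to see which number comes back.

```python
import os


def lowest_fd_demo():
    """Open three FDs, close the middle one, and see which number is reused."""
    a = os.open("/dev/null", os.O_RDONLY)
    b = os.open("/dev/null", os.O_RDONLY)
    c = os.open("/dev/null", os.O_RDONLY)
    os.close(b)                              # frees a slot below c
    d = os.open("/dev/null", os.O_RDONLY)    # kernel picks the lowest free number
    for fd in (a, c, d):
        os.close(fd)
    return b, d


if __name__ == "__main__":
    closed, reopened = lowest_fd_demo()
    print(f"closed FD {closed}, next open returned FD {reopened}")
```

This reuse is why a stale FD number held after close() is dangerous: it may silently refer to a different file or socket.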

# See all file descriptors for a process
ls -la /proc/<pid>/fd       # Each entry is a symlink showing what the FD points to
                            # e.g., 0 -> /dev/pts/0, 3 -> socket:[12345]
 
# More detailed view with lsof
lsof -p <pid>               # Lists every open FD with type, device, size, node
 
# Check system-wide FD limit
cat /proc/sys/fs/file-max   # Maximum FDs the kernel will allocate system-wide
 
# Check per-process FD limit
ulimit -n                   # Soft limit for current shell (commonly 1024)
ulimit -Hn                  # Hard limit (maximum the soft limit can be raised to)

I/O Models: Blocking, Non-blocking, and Event-Driven

When a process calls read() on a socket, what happens if no data is available yet?

Blocking I/O (default): The process goes to sleep and is woken up when data arrives. Simple, but the thread is stuck doing nothing while waiting.

Non-blocking I/O: The read() call returns immediately with EAGAIN if no data is available. The process must retry (“poll”). Better than blocking, but polling wastes CPU cycles.

Event-driven I/O (epoll): The process tells the kernel “watch these 10,000 FDs for me and tell me which ones are ready.” The kernel maintains an internal data structure and returns only the ready FDs. No wasted cycles, no per-FD polling.

  Blocking I/O:
  Thread 1: read(fd3) ████████████░░░░░ (blocked, waiting)
  Thread 2: read(fd4) ░░░████████████░░ (blocked, waiting)
  Thread 3: read(fd5) ░░░░░░░██████████ (blocked, waiting)
  → Need 1 thread per connection. 10,000 connections = 10,000 threads.

  epoll (event-driven):
  Thread 1: epoll_wait() → fd3 ready → read(fd3)
                         → fd5 ready → read(fd5)
                         → fd4 ready → read(fd4)
  → 1 thread handles all connections. Ready FDs processed as they arrive.
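
The non-blocking case can be reproduced with a pipe instead of a socket. This is a minimal sketch; the function name is illustrative, and it relies on Python raising BlockingIOError when the underlying read() returns EAGAIN. The first read finds no data and fails immediately instead of sleeping, and the second succeeds once data has been written.

```python
import os


def nonblocking_read_demo():
    """Read from an empty non-blocking pipe, then again after data arrives."""
    r, w = os.pipe()
    os.set_blocking(r, False)        # set O_NONBLOCK on the read end
    try:
        os.read(r, 1024)
        first = "data"
    except BlockingIOError:          # Python's wrapper for EAGAIN / EWOULDBLOCK
        first = "EAGAIN"
    os.write(w, b"hello")
    second = os.read(r, 1024)        # data is ready now, so this returns at once
    os.close(r)
    os.close(w)
    return first, second


if __name__ == "__main__":
    print(nonblocking_read_demo())
```

On its own this forces the caller to retry in a loop, which is the CPU waste that event-driven I/O removes by telling the caller when a retry will succeed.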

epoll in Detail

epoll is Linux’s scalable I/O event notification mechanism. It consists of three system calls:

  1. epoll_create1() (or the older epoll_create()): Create an epoll instance (a kernel object that tracks FDs).
  2. epoll_ctl(): Add, modify, or remove FDs from the epoll instance.
  3. epoll_wait(): Block until one or more registered FDs are ready, then return only those ready FDs.

// epoll_echo_server.c: a minimal epoll-based TCP echo server
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
 
#define MAX_EVENTS 64
#define BUF_SIZE 1024
 
int main() {
    // Create a TCP socket
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
 
    // Bind to port 8080
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(8080),
        .sin_addr.s_addr = INADDR_ANY
    };
    bind(server_fd, (struct sockaddr*)&addr, sizeof(addr));
    listen(server_fd, 128);  // Backlog of 128 pending connections
 
    // Create an epoll instance
    int epoll_fd = epoll_create1(0);
 
    // Register the server socket with epoll (watch for incoming connections)
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = server_fd };
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, server_fd, &ev);
 
    struct epoll_event events[MAX_EVENTS];
    char buf[BUF_SIZE];
 
    while (1) {
        // Block until at least one FD is ready — THIS is the magic
        int nready = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
 
        for (int i = 0; i < nready; i++) {
            if (events[i].data.fd == server_fd) {
                // New connection — accept it and register with epoll
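                int client_fd = accept(server_fd, NULL, NULL);
                if (client_fd < 0) continue;  // accept failed; skip this event
                // Add O_NONBLOCK while preserving the flags already set on the socket
                fcntl(client_fd, F_SETFL, fcntl(client_fd, F_GETFL, 0) | O_NONBLOCK);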
                ev.events = EPOLLIN;
                ev.data.fd = client_fd;
                epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client_fd, &ev);
            } else {
                // Existing connection has data — read and echo back
                int n = read(events[i].data.fd, buf, BUF_SIZE);
                if (n <= 0) {
                    close(events[i].data.fd);  // Client disconnected
                } else {
                    write(events[i].data.fd, buf, n);  // Echo back
                }
            }
        }
    }
    return 0;
}

Cloud relevance: This is exactly how Nginx, HAProxy, and Node.js work under the hood. When you see Nginx handle 10,000 concurrent connections with 4 worker processes, it is because each worker uses epoll to multiplex thousands of sockets on a single thread.
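
The same register-then-wait pattern is available from Python through select.epoll, which wraps the three system calls listed earlier. The sketch below is Linux only and uses socket pairs as stand-ins for client connections; the function name and the choice of five pairs are illustrative. Only one of the registered sockets receives data, and epoll reports only that one.

```python
import select
import socket


def epoll_demo(n_pairs=5, active=3):
    """Register n_pairs sockets with epoll; make one readable; return the result."""
    pairs = [socket.socketpair() for _ in range(n_pairs)]
    ep = select.epoll()
    for reader, _ in pairs:
        ep.register(reader.fileno(), select.EPOLLIN)

    pairs[active][1].send(b"ping")          # only this connection has data
    ready = [fd for fd, _ in ep.poll(timeout=1)]
    expected_fd = pairs[active][0].fileno()
    payload = pairs[active][0].recv(16)

    ep.close()
    for a, b in pairs:
        a.close()
        b.close()
    return ready == [expected_fd], payload


if __name__ == "__main__":
    print(epoll_demo())
```

The cost of the poll call depends on how many descriptors are ready, not on how many are registered, which is what lets one thread watch thousands of mostly idle connections.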

3.5 Signals and Inter-Process Communication

Processes need to communicate. Sometimes it is a simple notification (“please shut down”), and sometimes it is a high-bandwidth data stream. Linux provides a spectrum of IPC mechanisms.

Signals

A signal is an asynchronous notification sent to a process. Signals are the simplest form of IPC — they carry no data payload, just a signal number.

Signal  | Number | Default Action                     | Cloud Relevance
--------|--------|------------------------------------|----------------------------------------------------------
SIGTERM | 15     | Terminate (graceful)               | Kubernetes sends this first during pod termination
SIGKILL | 9      | Terminate (immediate, uncatchable) | Kubernetes sends this after terminationGracePeriodSeconds
SIGHUP  | 1      | Terminate (hangup)                 | Often used to tell daemons to reload config
SIGINT  | 2      | Terminate (interrupt)              | Ctrl+C in terminal
SIGCHLD | 17     | Ignore                             | Sent to parent when child exits; key to reaping zombies
SIGSTOP | 19     | Stop (uncatchable)                 | Freezes the process
SIGCONT | 18     | Continue                           | Resumes a stopped process

# Send SIGTERM to a process (graceful shutdown request)
kill <pid>               # Default signal is SIGTERM (15)
 
# Send SIGKILL (cannot be caught or ignored)
kill -9 <pid>            # Force kill — last resort, no cleanup possible
 
# Send SIGHUP to reload configuration
kill -HUP <pid>          # Many daemons (Nginx, Apache) reload config on SIGHUP
 
# List all available signals
kill -l                  # Shows all signal names and numbers

Critical for Kubernetes: When Kubernetes terminates a pod, it sends SIGTERM to PID 1 in the container, waits terminationGracePeriodSeconds (default 30s), then sends SIGKILL. If your application does not handle SIGTERM, it gets no graceful shutdown — connections are dropped, transactions are lost. Always install a SIGTERM handler.
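
A SIGTERM handler does not need to be elaborate. The following is a minimal sketch of one common shape, not a prescribed pattern: the handler only sets a flag and the main loop notices it and cleans up. The names handle_sigterm, serve, and demo are hypothetical, and the process sends SIGTERM to itself as a stand-in for the signal Kubernetes would deliver.

```python
import os
import signal
import time

shutdown_requested = False


def handle_sigterm(signum, frame):
    """Only record the request; the main loop does the actual cleanup."""
    global shutdown_requested
    shutdown_requested = True


def serve(max_seconds=5.0):
    """Hypothetical main loop: work until SIGTERM arrives or max_seconds pass."""
    deadline = time.monotonic() + max_seconds
    while not shutdown_requested and time.monotonic() < deadline:
        time.sleep(0.05)          # stand-in for handling requests
    # A real service would stop accepting work and flush state here.
    return "graceful" if shutdown_requested else "timeout"


def demo():
    signal.signal(signal.SIGTERM, handle_sigterm)   # must run in the main thread
    os.kill(os.getpid(), signal.SIGTERM)            # stand-in for the signal Kubernetes sends
    return serve()


if __name__ == "__main__":
    print(demo())
```

Keeping the handler to a single assignment avoids doing complex work in signal context; the cleanup itself must still finish within the grace period, because SIGKILL cannot be handled.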

Pipes

A pipe is a unidirectional byte stream between two processes. The writer writes to one end; the reader reads from the other. Pipes are created by the pipe() syscall or the | operator in the shell.

# Shell pipe: stdout of ps becomes stdin of grep
ps aux | grep nginx      # The shell creates a pipe, connects ps's stdout to grep's stdin
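
What the shell does for | can be reproduced with the pipe and fork calls directly. This is a small sketch, assuming a Unix-like system; the function name and message are illustrative. The child writes into one end, the parent reads from the other, and each side closes the end it does not use.

```python
import os


def pipe_demo():
    """Send bytes from a child process to its parent through a pipe."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(r)                       # child only writes
        os.write(w, b"hello from child")
        os._exit(0)
    os.close(w)                           # parent only reads; closing w lets read() see EOF
    chunks = []
    while True:
        chunk = os.read(r, 1024)
        if not chunk:                     # empty read = every write end is closed
            break
        chunks.append(chunk)
    os.close(r)
    os.waitpid(pid, 0)
    return b"".join(chunks)


if __name__ == "__main__":
    print(pipe_demo())
```

Forgetting to close the unused write end in the reader is a classic bug: the reader never sees end-of-file and blocks forever.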

Unix Domain Sockets

For local communication (same machine), Unix domain sockets are faster than TCP sockets because they bypass the entire networking stack. They appear as files in the filesystem.

# Docker uses a Unix domain socket
ls -la /var/run/docker.sock     # Docker daemon listens here
                                # Docker CLI connects to this socket
 
# Container runtimes (containerd) also use Unix sockets
ls -la /run/containerd/containerd.sock
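
A Unix domain socket uses the same socket API as TCP, with a filesystem path in place of an IP address and port. The sketch below is illustrative only: the path demo.sock lives in a temporary directory and is not related to the Docker or containerd sockets above. Server and client run in one process to keep the example short.

```python
import os
import socket
import tempfile


def unix_socket_demo():
    """Exchange one message over a Unix domain socket in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")   # hypothetical socket path
        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        server.bind(path)                        # creates the socket file
        server.listen(1)
        is_socket_file = os.path.exists(path)

        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)                     # queued in the listen backlog
        client.sendall(b"ping")

        conn, _ = server.accept()
        data = conn.recv(16)
        for s in (conn, client, server):
            s.close()
    return is_socket_file, data


if __name__ == "__main__":
    print(unix_socket_demo())
```

Because the endpoint is a file, ordinary file permissions decide who may connect, which is why access to /var/run/docker.sock is treated as equivalent to root access.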

Shared Memory

Shared memory is the fastest IPC mechanism — two processes map the same physical memory frames into their respective virtual address spaces. No copying, no kernel mediation for reads/writes. The downside: you must handle synchronization yourself (mutexes, semaphores).
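
One simple way to get two processes onto the same physical frames is an anonymous shared mapping created before fork. This is a minimal sketch under that assumption, not the only approach (POSIX shm_open and System V segments are alternatives); the function name is illustrative. The child writes an integer into the mapping and the parent reads it back without any copy through a pipe or socket.

```python
import mmap
import os
import struct


def shared_memory_demo():
    """Share one anonymous mapping between parent and child across fork()."""
    # fileno=-1 gives an anonymous mapping; MAP_SHARED is the Unix default for mmap.mmap.
    region = mmap.mmap(-1, mmap.PAGESIZE)
    struct.pack_into("i", region, 0, 0)        # counter starts at 0

    pid = os.fork()
    if pid == 0:
        struct.pack_into("i", region, 0, 42)   # child writes directly into the shared frame
        os._exit(0)

    os.waitpid(pid, 0)                         # waiting is the only synchronization here
    (value,) = struct.unpack_from("i", region, 0)
    region.close()
    return value


if __name__ == "__main__":
    print(shared_memory_demo())
```

Here waitpid stands in for real synchronization; with two processes writing concurrently you would need a lock or semaphore around the shared bytes.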

Message Queues

Message queues provide structured, prioritized message passing between processes. POSIX message queues (mq_open, mq_send, mq_receive) let processes exchange discrete messages rather than byte streams.

3.6 Systemd

Systemd is the init system and service manager for most modern Linux distributions. It is PID 1 — the first process the kernel starts. It manages service startup, dependency ordering, logging, socket activation, and much more.

Units

Everything systemd manages is a unit. A unit is described by a unit file (a configuration file in INI format). The most common unit types:

Unit Type | Extension | Purpose
----------|-----------|----------------------------------------------------------
Service   | .service  | A daemon or one-shot process
Socket    | .socket   | Socket activation (start service when connection arrives)
Timer     | .timer    | Schedule service execution (cron replacement)
Target    | .target   | Group of units (like runlevels)
Mount     | .mount    | Filesystem mount point
Path      | .path     | Watch filesystem path for changes

Managing Services

# Check the status of a service
systemctl status kubelet          # Shows: loaded/active/running, PID, memory, recent logs
 
# Start, stop, restart a service
systemctl start nginx             # Start the service now
systemctl stop nginx              # Stop the service now
systemctl restart nginx           # Stop then start (brief downtime)
systemctl reload nginx            # Reload config without stopping (if supported)
 
# Enable/disable at boot
systemctl enable kubelet          # Create symlinks so it starts on boot
systemctl disable kubelet         # Remove boot symlinks
 
# List all active services
systemctl list-units --type=service --state=running
 
# Show the unit file (the configuration)
systemctl cat nginx.service       # Prints the .service file contents

The Journal (journald)

Systemd includes its own logging system, journald, which collects logs from all services, the kernel, and stdout/stderr of managed processes.

# View logs for a specific service
journalctl -u nginx -f            # -u = unit name, -f = follow (like tail -f)
 
# View logs since last boot
journalctl -b                     # All logs from current boot
 
# View logs in a time range
journalctl --since "2026-05-21 09:00" --until "2026-05-21 10:00"
 
# View kernel messages (equivalent to dmesg)
journalctl -k                     # Kernel ring buffer messages
 
# See logs for a specific PID
journalctl _PID=1234              # All log entries from PID 1234
 
# Check disk usage of the journal
journalctl --disk-usage           # How much space logs consume

A Typical Service Unit File

# /etc/systemd/system/myapp.service
# Note: systemd does not support trailing comments, so each comment sits on its own line.
[Unit]
# Human-readable description
Description=My Application Server
# Start after these units are up
After=network.target postgresql.service
# Fail if postgresql cannot start
Requires=postgresql.service

[Service]
# Process stays in foreground
Type=simple
# Run as this user and group (not root)
User=appuser
Group=appgroup
# cd here before starting
WorkingDirectory=/opt/myapp
# The actual command
ExecStart=/opt/myapp/bin/server --port 8080
# How to reload config
ExecReload=/bin/kill -HUP $MAINPID
# Restart if exit code != 0
Restart=on-failure
# Wait 5 seconds before restarting
RestartSec=5
# Max open file descriptors
LimitNOFILE=65535

[Install]
# Enable in multi-user (normal) boot
WantedBy=multi-user.target

Cloud relevance: On Kubernetes nodes, kubelet itself is a systemd service. When a node is “NotReady,” the first thing you check is systemctl status kubelet and journalctl -u kubelet. Understanding systemd is essential for node-level debugging.

3.7 Linux Networking Stack

Every network packet in a Kubernetes cluster traverses the Linux networking stack. Understanding this stack is non-negotiable for debugging service connectivity issues.

Network Interfaces

A network interface is an endpoint for sending and receiving packets. Interfaces can be physical (a NIC) or virtual (created by software).

# List all network interfaces with their addresses
ip addr                          # Shows interface name, state, MAC, IPv4, IPv6
                                 # lo = loopback, eth0 = first ethernet
 
# Show just interface names and states
ip link show                     # Shows UP/DOWN state, MTU, MAC address

Routing

The routing table tells the kernel where to send packets based on their destination IP address.

# Show the routing table
ip route show                    # Each line: destination network → via gateway → dev interface
                                 # "default via 10.0.0.1 dev eth0" = default gateway
 
# Example output:
# default via 10.0.0.1 dev eth0 proto dhcp metric 100
# 10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.5
# 172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
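
The lookup the kernel performs on this table is a longest-prefix match: among all routes whose network contains the destination, the most specific one wins. The sketch below applies that rule to the three example routes above using Python's ipaddress module. It is a simplification that ignores metrics, policy routing, and multiple tables; the ROUTES list and lookup function are illustrative names.

```python
import ipaddress

# The example routing table from above: (destination, gateway or None, device)
ROUTES = [
    ("0.0.0.0/0", "10.0.0.1", "eth0"),      # "default" is shorthand for 0.0.0.0/0
    ("10.0.0.0/24", None, "eth0"),
    ("172.17.0.0/16", None, "docker0"),
]


def lookup(dest):
    """Return (gateway, device) for the most specific route that contains dest."""
    addr = ipaddress.ip_address(dest)
    matches = [
        (ipaddress.ip_network(net), gw, dev)
        for net, gw, dev in ROUTES
        if addr in ipaddress.ip_network(net)
    ]
    net, gw, dev = max(matches, key=lambda m: m[0].prefixlen)
    return gw, dev


if __name__ == "__main__":
    for ip in ("10.0.0.7", "172.17.0.5", "8.8.8.8"):
        print(ip, "->", lookup(ip))
```

A gateway of None means the destination is on a directly connected network, so the packet goes straight out of that device. On a real host, ip route get <address> reports the kernel's actual decision.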

DNS Resolution

When a process calls getaddrinfo() for a name such as google.com, the resolver library consults three files:

  1. /etc/nsswitch.conf: Defines the lookup order on its hosts line (typically files dns, meaning check /etc/hosts first, then DNS).
  2. /etc/hosts: Static name-to-IP mappings. Checked first under the typical files dns order.
  3. /etc/resolv.conf: Lists the DNS nameservers and search domains used when the lookup falls through to DNS.

# Check DNS configuration
cat /etc/resolv.conf             # nameserver 10.96.0.10 (in Kubernetes, this is CoreDNS)
                                 # search default.svc.cluster.local svc.cluster.local
 
# Check static host mappings
cat /etc/hosts                   # 127.0.0.1 localhost
                                 # Custom entries for local overrides
 
# Check name resolution order
cat /etc/nsswitch.conf           # hosts: files dns
                                 # "files" = /etc/hosts, "dns" = resolv.conf nameservers

Cloud relevance: In Kubernetes pods, /etc/resolv.conf is automatically configured to point to CoreDNS. The search directive is why you can say my-service instead of my-service.default.svc.cluster.local. Understanding this chain is essential when debugging “DNS not resolving” issues in pods.
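
The effect of the search directive can be sketched as a pure function that lists the names a resolver tries, in order. This is a simplified model of resolver behavior, not a DNS client. The ndots parameter does not appear in the text above; it is the resolv.conf option that decides whether a name is tried as-is before or after the search list, and the default of 5 used here is an assumption based on what Kubernetes normally writes for pods.

```python
def candidate_names(name, search, ndots=5):
    """List the names a resolver would try, in order, for a given search list.

    ndots=5 is an assumption: it is the value Kubernetes normally writes
    into the options line of a pod's resolv.conf.
    """
    if name.endswith("."):
        return [name]                        # trailing dot = already fully qualified
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded             # enough dots: try the name as-is first
    return expanded + [name]                 # otherwise walk the search list first


if __name__ == "__main__":
    search = ["default.svc.cluster.local", "svc.cluster.local"]
    for candidate in candidate_names("my-service", search):
        print(candidate)
```

The same expansion explains why external names with few dots can trigger several failed cluster-local lookups before the real query is sent.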

Inspecting Network Connections

# List all listening TCP/UDP sockets with process info
ss -tulpn                        # -t=TCP, -u=UDP, -l=listening, -p=process, -n=numeric
                                 # Shows: port, PID, process name
 
# Example output:
# LISTEN  0  128  0.0.0.0:80    0.0.0.0:*  users:(("nginx",pid=1234,fd=6))
 
# Inspect a container's network namespace
nsenter --net=/proc/<pid>/ns/net ip addr    # Enter the container's network namespace
                                            # and run ip addr to see its interfaces
                                            # This is how you debug container networking
                                            # without exec-ing into the container
 
nsenter --net=/proc/<pid>/ns/net ss -tulpn  # See listening ports inside the container
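
The kernel also exposes its TCP socket table as text in /proc/net/tcp, and /proc/<pid>/net/tcp shows the table for that process's network namespace. The sketch below decodes that format to find listening sockets. It assumes a little-endian host (such as x86-64) for the address byte order, and the sample rows, including their inode numbers, are made up for illustration.

```python
import socket
import struct

TCP_LISTEN = "0A"  # state code the kernel prints for listening sockets


def decode_endpoint(hex_endpoint):
    """Turn '0100007F:0050' into ('127.0.0.1', 80). Assumes a little-endian host."""
    hex_ip, hex_port = hex_endpoint.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)


def listening_sockets(proc_net_tcp_text):
    """Return (ip, port) for every row whose state column is LISTEN."""
    result = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3 and fields[3] == TCP_LISTEN:
            result.append(decode_endpoint(fields[1]))
    return result


# Hypothetical sample in the /proc/net/tcp layout
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0050 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345 1 0000000000000000 100 0 0 10 0
   1: 0500000A:A1B2 0A00000A:01BB 01 00000000:00000000 00:00000000 00000000  1000        0 23456 1 0000000000000000 20 4 30 10 -1
"""
print(listening_sockets(SAMPLE))
```

This can help on minimal hosts where ss is not installed, since it needs nothing beyond read access to /proc.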

3.8 File System Hierarchy

The Linux filesystem is a single tree rooted at /. Every file, device, process pseudo-file, and mount point hangs off this tree. Understanding the hierarchy is like knowing the layout of a city — you can find what you need without a map.

02-os diagram 13

/proc — The Kernel’s Live Dashboard

/proc is not a real filesystem — it does not exist on disk. It is a virtual filesystem generated by the kernel on the fly, exposing process information and kernel tunables as “files.”

# View process status
cat /proc/<pid>/status         # Name, State, Pid, PPid, VmRSS, Threads, etc.
 
# View process memory map
cat /proc/<pid>/maps           # Every mapped region: address range, permissions,
                               # backing file or [heap]/[stack]/[anon]
 
# View process namespaces (containers!)
ls -la /proc/<pid>/ns/         # Shows: mnt, net, pid, user, ipc, uts, cgroup
                               # These are the namespace files that containers use
 
# Tune kernel parameters at runtime
cat /proc/sys/net/ipv4/ip_forward          # Is IP forwarding enabled? (1 = yes)
echo 1 > /proc/sys/net/ipv4/ip_forward    # Enable IP forwarding (needed for routing)
 
# View system uptime
cat /proc/uptime               # Two numbers: seconds since boot, idle seconds summed over all CPUs

Cloud relevance: Container runtimes inspect /proc/<pid>/ns/ to manage namespaces. Kubernetes cAdvisor reads /proc to collect container metrics. When you kubectl exec into a pod, you can explore /proc to understand the container’s view of the system.
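
Because /proc entries are plain text, they are easy to consume from a script. As a small sketch, the parser below reads one line in the /proc/<pid>/maps format (address range, permissions, offset, device, inode, optional path). The Mapping class and parse_maps_line function are hypothetical names, and the sample lines are invented.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    start: int
    end: int
    perms: str
    path: str

    @property
    def size(self):
        """Size of the mapped region in bytes."""
        return self.end - self.start


def parse_maps_line(line):
    # Format: address-range perms offset dev inode [pathname]
    fields = line.split(maxsplit=5)
    start_hex, end_hex = fields[0].split("-")
    path = fields[5].strip() if len(fields) > 5 else ""  # anonymous mappings have no path
    return Mapping(int(start_hex, 16), int(end_hex, 16), fields[1], path)


# Hypothetical sample lines in the /proc/<pid>/maps layout
text_segment = parse_maps_line("00400000-0040b000 r-xp 00000000 08:01 1234 /usr/bin/cat")
anonymous = parse_maps_line("7ffd5000-7ffd7000 rw-p 00000000 00:00 0")
print(text_segment.path, text_segment.size, repr(anonymous.path))
```

Summing size over all lines of a real maps file gives a figure close to the virtual size reported by ps, which is a useful sanity check when reasoning about VSZ.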

3.9 User/Group Model and Permissions

Linux is a multi-user system. Every file has an owner, a group, and a set of permissions that control who can read, write, or execute it. In cloud contexts, this model is the foundation of container security.

Users and Groups

Every process runs as a user (identified by a UID) and belongs to one or more groups (identified by GIDs). The root user (UID 0) bypasses most permission checks.

# View current user and groups
id                              # uid=1000(shadab) gid=1000(shadab) groups=...
 
# View all users
cat /etc/passwd                 # username:x:UID:GID:comment:home:shell
                                # "x" means password is in /etc/shadow
 
# View all groups
cat /etc/group                  # groupname:x:GID:member1,member2
 
# View a file's ownership and permissions
ls -la /etc/passwd              # -rw-r--r-- 1 root root 2345 May 21 10:00 /etc/passwd

Permission Bits

Every file has three sets of three permission bits:

02-os diagram 14

Symbol  Octal  Meaning for Files     Meaning for Directories
r       4      Read contents         List contents
w       2      Modify contents       Create/delete files within
x       1      Execute as program    Enter (cd into) the directory

# Change permissions using octal notation
chmod 755 script.sh             # Owner: rwx (7), Group: r-x (5), Others: r-x (5)
chmod 644 config.yaml           # Owner: rw- (6), Group: r-- (4), Others: r-- (4)
 
# Change permissions using symbolic notation
chmod u+x script.sh             # Add execute permission for the owner (u=user/owner)
chmod g-w secret.txt            # Remove write permission for the group
 
# Change ownership
chown appuser:appgroup file.txt # Change owner to appuser, group to appgroup
chown -R appuser:appgroup /opt/myapp  # Recursive — apply to all files and subdirs

Special Permission Bits

Three special bits extend the basic permission model:

Setuid (SUID) — When set on an executable, the program runs with the permissions of the file’s owner, not the user who launched it. This is how passwd (owned by root) can modify /etc/shadow.

# Find setuid binaries (security audit essential)
find / -perm -4000 -type f 2>/dev/null    # 4000 = setuid bit
# Common setuid binaries: /usr/bin/passwd, /usr/bin/sudo, and on older distributions /usr/bin/ping

Setgid (SGID) — On an executable, runs with the file’s group. On a directory, files created inside inherit the directory’s group (instead of the creator’s primary group). Useful for shared project directories.

Sticky Bit — On a directory, prevents users from deleting files they do not own, even if they have write permission on the directory. The classic example is /tmp:

ls -ld /tmp                     # drwxrwxrwt — the 't' at the end is the sticky bit
                                # Everyone can write to /tmp, but you can only delete
                                # YOUR files, not other users' files

# Set special bits
chmod 4755 myprogram            # 4 = setuid + 755 = rwxr-xr-x → runs as owner
chmod 2755 /shared/dir          # 2 = setgid → new files inherit directory's group
chmod 1777 /tmp                 # 1 = sticky bit → users can't delete others' files
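
To connect the octal digits to what ls prints, the sketch below renders a numeric mode, including the three special bits, as the familiar nine-character string. The function name mode_to_string is hypothetical. Python's standard library offers stat.filemode for real use; this version spells the logic out for illustration.

```python
import stat


def mode_to_string(mode):
    """Render permission bits such as 0o4755 as 'rwsr-xr-x'."""
    parts = []
    # (shift for the rwx triple, special bit for that position, letter shown for it)
    for shift, special_bit, letter in (
        (6, stat.S_ISUID, "s"),   # owner triple, setuid
        (3, stat.S_ISGID, "s"),   # group triple, setgid
        (0, stat.S_ISVTX, "t"),   # others triple, sticky bit
    ):
        triple = (mode >> shift) & 0o7
        read = "r" if triple & 4 else "-"
        write = "w" if triple & 2 else "-"
        executable = bool(triple & 1)
        if mode & special_bit:
            # Lowercase when execute is also set, uppercase when it is not
            execute = letter if executable else letter.upper()
        else:
            execute = "x" if executable else "-"
        parts.append(read + write + execute)
    return "".join(parts)


for example in (0o755, 0o644, 0o4755, 0o2755, 0o1777):
    print(oct(example), mode_to_string(example))
```

An uppercase S or T in the output means the special bit is set without the matching execute bit, which is usually a misconfiguration worth checking.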

Cloud relevance: Kubernetes SecurityContext lets you set runAsUser, runAsGroup, and fsGroup on pods and containers. The readOnlyRootFilesystem: true setting prevents writes to the container filesystem. Understanding Linux permissions is essential for writing security policies that actually work without breaking your application.


Practical Use Cases

Use Case 1: Debugging an OOM-Killed Container

A container in your pod was terminated with reason OOMKilled. Here is how to investigate:

# 1. Check the pod's last state
kubectl describe pod myapp-xyz   # Look for "Last State: Terminated, Reason: OOMKilled"
 
# 2. Check node-level memory at the time (if you have node access)
free -h                          # Was the node itself under memory pressure?
cat /proc/meminfo                # Look at MemAvailable vs MemTotal
 
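# 3. Check the container's memory cgroup limit (cgroup v1 paths shown)
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
# On cgroup v2 hosts, the equivalent files in the container's cgroup
# directory are memory.max and memory.current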
 
# 4. Check which process the OOM killer selected
dmesg | grep -i "oom"           # Kernel OOM killer logs appear in dmesg
journalctl -k | grep -i "oom"  # Same information via journald
 
# 5. Profile memory usage before the next OOM
# Inside the container:
cat /proc/1/status | grep VmRSS  # Resident set size of PID 1 (your app)
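
Step 3 can be scripted so that it works on both cgroup versions. The sketch below is one possible approach: it looks for the cgroup v2 file names first and falls back to the v1 names. The function name read_memory_usage is hypothetical, and the caller must supply the cgroup directory, because its location depends on the cgroup version and the container runtime.

```python
import os

# (usage file, limit file) pairs: cgroup v2 names first, then cgroup v1 names
CANDIDATES = [
    ("memory.current", "memory.max"),
    ("memory.usage_in_bytes", "memory.limit_in_bytes"),
]


def read_memory_usage(cgroup_dir):
    """Return (usage_bytes, limit_bytes); limit is None when cgroup v2 reports 'max'."""
    for usage_name, limit_name in CANDIDATES:
        usage_path = os.path.join(cgroup_dir, usage_name)
        limit_path = os.path.join(cgroup_dir, limit_name)
        if os.path.exists(usage_path) and os.path.exists(limit_path):
            with open(usage_path) as handle:
                usage = int(handle.read().strip())
            with open(limit_path) as handle:
                raw_limit = handle.read().strip()
            # cgroup v2 writes the literal string "max" for "no limit";
            # cgroup v1 writes a very large number instead, returned as-is here
            limit = None if raw_limit == "max" else int(raw_limit)
            return usage, limit
    raise FileNotFoundError(f"no memory cgroup files found in {cgroup_dir}")
```

Polling this function and logging usage as a fraction of the limit gives an early warning before the kernel steps in.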

Use Case 2: Tracing a Hanging Process

A service is unresponsive. It is running (PID exists) but not handling requests.

# 1. Check if the process is alive and what state it is in
ps aux | grep <process-name>     # Look at STAT column: S=sleeping, D=disk sleep,
                                 # R=running, Z=zombie, T=stopped
 
# 2. Trace system calls in real time
strace -p <pid> -f               # -p attaches to PID, -f follows child threads
                                 # If you see it stuck on futex() → it's waiting on a lock
                                 # If stuck on read() → waiting for data from an FD
                                 # If stuck on epoll_wait() → normal for event loops
 
# 3. Check what it's waiting on
cat /proc/<pid>/wchan             # Name of the kernel function where the process is sleeping
 
# 4. Check its open file descriptors for clues
lsof -p <pid>                    # Is it connected to a remote host that's down?
                                 # Is it waiting on a pipe that nobody's writing to?
 
# 5. Check thread state if multi-threaded
ls /proc/<pid>/task/             # Each subdirectory is a thread (LWP)
cat /proc/<pid>/task/<tid>/status  # Check each thread's state

Use Case 3: Inspecting Container Network Namespaces

A container cannot reach a service. You need to see networking from the container’s perspective without kubectl exec.

# 1. Find the container's PID on the host
docker inspect <container-id> --format '{{.State.Pid}}'
# Or for containerd:
crictl inspect <container-id> | grep pid
 
# 2. Enter the container's network namespace and inspect
nsenter --net=/proc/<pid>/ns/net ip addr        # See the container's interfaces
nsenter --net=/proc/<pid>/ns/net ip route show   # See the container's routing table
nsenter --net=/proc/<pid>/ns/net ss -tulpn       # See listening ports inside container
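cat /proc/<pid>/root/etc/resolv.conf             # DNS config as the container sees it.
                                                 # nsenter --net alone keeps the host's mount
                                                 # namespace, so cat /etc/resolv.conf would
                                                 # print the host's file instead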
 
# 3. Test connectivity from the container's perspective
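nsenter --net=/proc/<pid>/ns/net ping 10.96.0.1  # Can it reach the Kubernetes API Service IP?
                                                 # Caveat: ClusterIPs are virtual and often do not
                                                 # answer ICMP, so a failed ping is not conclusive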

Worked Examples

Example 1: Building a Process Supervisor That Reaps Zombies

This is a minimal init process suitable for running as PID 1 in a container. It forwards signals to the child and reaps zombies.

#!/usr/bin/env python3
# mini_init.py — A minimal PID 1 for containers that reaps zombies
# Usage: python3 mini_init.py <command> [args...]
 
import os
import sys
import signal
 
child_pid = None
 
def forward_signal(signum, frame):
    """Forward received signal to the child process."""
    if child_pid:
        os.kill(child_pid, signum)  # Pass signal through to child
 
def main():
    if len(sys.argv) < 2:
        print("Usage: mini_init.py <command> [args...]", file=sys.stderr)
        sys.exit(1)
 
    # Register signal handlers — forward SIGTERM and SIGINT to child
    signal.signal(signal.SIGTERM, forward_signal)
    signal.signal(signal.SIGINT, forward_signal)
 
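    # Declare the global up front so the signal handler sees the child's PID
    global child_pid

    # Fork to create child process
    child_pid_local = os.fork()

    if child_pid_local == 0:
        # CHILD: replace ourselves with the target command
        try:
            os.execvp(sys.argv[1], sys.argv[1:])
        except OSError as exc:
            # execvp never returns on success and raises OSError on failure
            print(f"Failed to exec {sys.argv[1]}: {exc}", file=sys.stderr)
            os._exit(127)  # 127 is the shell convention for "command not found"
    else:
        # PARENT: we are PID 1, our job is to wait and reap
        child_pid = child_pid_local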
        print(f"[init] Started child PID {child_pid}: {' '.join(sys.argv[1:])}")
 
        while True:
            try:
                # Wait for ANY child to exit (reaps zombies!)
                pid, status = os.waitpid(-1, 0)  # -1 = any child, 0 = block
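                if os.WIFEXITED(status):
                    exit_code = os.WEXITSTATUS(status)
                else:
                    # Killed by a signal: follow the shell convention of 128 + signal number
                    exit_code = 128 + os.WTERMSIG(status)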
 
                if pid == child_pid:
                    # Our main child exited — we should exit too
                    print(f"[init] Child {pid} exited with code {exit_code}")
                    sys.exit(exit_code)
                else:
                    # Some other adopted orphan exited — just reap it
                    print(f"[init] Reaped orphan {pid} (exit code {exit_code})")
 
            except ChildProcessError:
                # No more children — this shouldn't happen if main child is alive
                break
 
if __name__ == "__main__":
    main()

Output (when used in a container):

[init] Started child PID 7: /usr/bin/my-server --port 8080
[init] Reaped orphan 15 (exit code 0)
[init] Reaped orphan 22 (exit code 0)
[init] Child 7 exited with code 0

Example 2: Monitoring File Descriptor Leaks

A process is slowly running out of file descriptors. This script monitors FD count over time.

#!/bin/bash
# fd_monitor.sh — Monitor file descriptor count for a process
# Usage: ./fd_monitor.sh <pid> [interval_seconds]
 
PID=${1:?Usage: fd_monitor.sh <pid> [interval]}   # First arg is PID (required)
INTERVAL=${2:-5}                                    # Second arg is interval (default 5s)
 
if [ ! -d "/proc/$PID" ]; then                     # Check if PID exists
    echo "Error: Process $PID not found"
    exit 1
fi
 
PROCESS_NAME=$(cat /proc/$PID/comm)                 # Get process name from /proc
echo "Monitoring FDs for $PROCESS_NAME (PID $PID) every ${INTERVAL}s"
echo "Time                  FD_Count  FD_Limit"
echo "----                  --------  --------"
 
while [ -d "/proc/$PID" ]; do                       # Loop while process exists
    FD_COUNT=$(ls /proc/$PID/fd 2>/dev/null | wc -l)   # Count open FDs
    FD_LIMIT=$(grep "Max open files" /proc/$PID/limits | awk '{print $4}')  # Read limit
    TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
    printf "%-22s %-9s %s\n" "$TIMESTAMP" "$FD_COUNT" "$FD_LIMIT"
 
    if [ "$FD_COUNT" -gt $((FD_LIMIT * 90 / 100)) ]; then  # Alert at 90% usage
        echo "WARNING: FD usage above 90%!"
    fi
 
    sleep "$INTERVAL"
done
 
echo "Process $PID has exited."

Output:

Monitoring FDs for nginx (PID 1234) every 5s
Time                  FD_Count  FD_Limit
----                  --------  --------
2026-05-21 10:00:00   42        65535
2026-05-21 10:00:05   43        65535
2026-05-21 10:00:10   45        65535
2026-05-21 10:00:15   44        65535
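
The same counting technique can be used from inside a process to confirm a suspected leak. The sketch below deliberately opens files without closing them and watches the count move. It assumes Linux, with a fallback to /dev/fd on systems that lack /proc/self/fd, and the function names are hypothetical.

```python
import os

# /proc/self/fd on Linux; /dev/fd is a fallback for other Unix systems
FD_DIR = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"


def open_fd_count():
    """Count this process's open file descriptors.

    Listing the directory briefly opens one extra descriptor, so the
    absolute number is off by one, but differences between calls are exact.
    """
    return len(os.listdir(FD_DIR))


def leak_demo(leaked_files=5):
    before = open_fd_count()
    handles = [open(os.devnull) for _ in range(leaked_files)]  # opened, never closed: the "leak"
    during = open_fd_count()
    for handle in handles:
        handle.close()                                         # the fix: close what you open
    after = open_fd_count()
    return before, during, after


before, during, after = leak_demo()
print(f"before={before} during={during} after={after}")
```

A count that only ever grows between requests points at a missing close, often on an error path.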

Example 3: Systemd Unit for a Custom Service

Create a production-ready systemd unit for a Go web service with common hardening options.

# /etc/systemd/system/mywebapp.service
# Note: systemd only treats a line as a comment when it starts with # or ;
# so every comment sits on its own line, never after a value.
[Unit]
Description=My Go Web Application
Documentation=https://internal.docs/mywebapp
# Order after, and softly depend on, the network being fully up
After=network-online.target
Wants=network-online.target
# Rate limit: at most 5 starts in 300 seconds, to prevent restart loops
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
# Process stays in the foreground
Type=simple
# Run as a non-root user and group
User=webapp
Group=webapp
WorkingDirectory=/opt/mywebapp

# Environment configuration: load variables (DB_URL, etc.) from a file
EnvironmentFile=/etc/mywebapp/env
# Let at most 4 OS threads execute Go code at the same time
Environment=GOMAXPROCS=4

# Start command, and reload by sending SIGHUP
ExecStart=/opt/mywebapp/bin/server
ExecReload=/bin/kill -HUP $MAINPID

# Restart on unclean exits (non-zero code, fatal signal, timeout), 5s apart
Restart=on-failure
RestartSec=5
# Fail if not started within 30s; SIGKILL if not stopped within 30s
TimeoutStartSec=30
TimeoutStopSec=30

# Resource limits: open file descriptors, processes/threads,
# hard memory limit (cgroup), and CPU time equal to 2 full cores
LimitNOFILE=65535
LimitNPROC=4096
MemoryMax=1G
CPUQuota=200%

# Security hardening
# Prevent privilege escalation
NoNewPrivileges=true
# Mount the whole filesystem hierarchy read-only (except /dev, /proc, /sys)
ProtectSystem=strict
# Hide /home, /root, /run/user
ProtectHome=true
# Paths that stay writable despite ProtectSystem
ReadWritePaths=/var/lib/mywebapp /var/log/mywebapp
# Isolated /tmp
PrivateTmp=true
# Make /proc/sys and /sys read-only; block kernel module loading
ProtectKernelTunables=true
ProtectKernelModules=true

[Install]
# Start in multi-user (normal) mode
WantedBy=multi-user.target

# Deploy and activate the service
sudo systemctl daemon-reload                 # Reload unit files after creating/modifying
sudo systemctl enable mywebapp.service       # Enable at boot
sudo systemctl start mywebapp.service        # Start now
sudo systemctl status mywebapp.service       # Verify it's running
journalctl -u mywebapp -f                    # Watch logs in real time

Common Pitfalls & Misconceptions

Pitfall 1: Confusing Virtual Memory Size with Actual RAM Usage

The misconception: “My process is using 4 GB of memory!” — said after looking at VIRT or VSZ in top/ps.

The reality: Virtual memory size (VIRT/VSZ) includes everything the process has mapped, not what it is actively using. A Java process might map 4 GB of virtual address space but only have 500 MB of pages actually resident in physical RAM (RSS/RES).

# What to look at:
ps -o pid,rss,vsz,comm -p <pid>
# RSS = Resident Set Size = actual physical RAM being used
# VSZ = Virtual Size = total virtual address space mapped (includes mmap'd files, etc.)

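When setting Kubernetes memory limits, resident memory is what counts, not virtual size. The memory cgroup charges RSS plus the page cache and kernel memory attributed to the container. The kernel reclaims cache first, and the container is OOM-killed only when usage still cannot be kept under the limit. A container with resources.limits.memory: 1Gi is therefore killed when its non-reclaimable usage reaches 1 GiB, no matter how large its virtual size is.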
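
The gap between virtual size and resident memory can be observed directly. The sketch below maps an anonymous region, then touches every page, and reports how VmSize and VmRSS from /proc/self/status move at each step. It assumes Linux, and the helper names are hypothetical. Exact numbers vary by kernel and allocator, so only the rough shape of the result matters.

```python
import mmap


def vm_kb(field):
    """Read a kB-valued field such as VmSize or VmRSS from /proc/self/status."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)


def map_then_touch(size=16 * 1024 * 1024):
    vsz_start, rss_start = vm_kb("VmSize"), vm_kb("VmRSS")

    region = mmap.mmap(-1, size)  # anonymous mapping: address space only, no pages yet
    after_map = (vm_kb("VmSize") - vsz_start, vm_kb("VmRSS") - rss_start)

    for offset in range(0, size, mmap.PAGESIZE):
        region[offset:offset + 1] = b"\x01"  # first write to a page makes the kernel back it with RAM
    after_touch = (vm_kb("VmSize") - vsz_start, vm_kb("VmRSS") - rss_start)

    region.close()
    return after_map, after_touch


after_map, after_touch = map_then_touch()
print(f"after mmap:  VmSize +{after_map[0]} kB, VmRSS +{after_map[1]} kB")
print(f"after touch: VmSize +{after_touch[0]} kB, VmRSS +{after_touch[1]} kB")
```

VmSize should jump as soon as the mapping exists, while VmRSS should rise only after the pages are written. This is demand paging at work, and it is why VSZ overstates real usage.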

Pitfall 2: Assuming SIGTERM Always Stops a Process

The misconception: “kill <pid> kills the process.”

The reality: kill sends SIGTERM, which is a request to terminate. The process can catch SIGTERM, ignore it, or handle it any way it wants. A process stuck in uninterruptible disk sleep (D state) will not even see the signal until it returns from the kernel.

If kill <pid> does not work, the process is either:

  • Ignoring/catching SIGTERM (use kill -9 / SIGKILL, which cannot be caught)
  • In uninterruptible sleep (D state), meaning it is waiting on I/O that is not completing (NFS hang, disk failure). Even SIGKILL takes effect only once that wait ends.

SIGKILL is the absolute last resort. It cannot be caught, blocked, or ignored. But it also gives the process no chance to clean up (flush buffers, close connections, release locks).
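
Since SIGTERM is only a request, a well-behaved service has to cooperate with it. A minimal sketch of that pattern follows: the handler only sets a flag, and the main loop notices the flag and exits cleanly. The names serve and handle_sigterm are hypothetical, and the sleep stands in for real work.

```python
import signal
import time

shutdown_requested = False


def handle_sigterm(signum, frame):
    """Record the request; real cleanup happens in the main loop, not in the handler."""
    global shutdown_requested
    shutdown_requested = True


def serve(max_seconds=5.0):
    signal.signal(signal.SIGTERM, handle_sigterm)
    deadline = time.monotonic() + max_seconds
    while not shutdown_requested and time.monotonic() < deadline:
        time.sleep(0.01)  # stand-in for handling one unit of work
    # A real service would flush buffers and close connections here
    return "clean shutdown" if shutdown_requested else "timed out"
```

Keeping the handler this small avoids doing complex work at an arbitrary interruption point. In Kubernetes, the cleanup also has to finish within the pod's termination grace period, after which SIGKILL follows.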

Pitfall 3: Running Containers as Root

The misconception: “The container is isolated, so running as root inside it is fine.”

The reality: Container isolation is not perfect. Kernel vulnerabilities, misconfigured mounts, and escape techniques can give a root-inside-container process root-on-the-host access. Always run containers as a non-root user:

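# Kubernetes container-level securityContext (under spec.containers[]);
# readOnlyRootFilesystem and allowPrivilegeEscalation exist only at this level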
securityContext:
  runAsNonRoot: true            # Refuse to start if image runs as root
  runAsUser: 1000               # Run as UID 1000
  runAsGroup: 1000              # Run as GID 1000
  readOnlyRootFilesystem: true  # Prevent writes to container filesystem
  allowPrivilegeEscalation: false  # Prevent setuid from elevating privileges

Pitfall 4: Ignoring File Descriptor Limits

The misconception: “I opened a bunch of files, it’s fine.”

The reality: The default ulimit -n is often 1024. A busy server opening connections for each request can hit this limit quickly, resulting in “Too many open files” errors. Network sockets are file descriptors too — every TCP connection consumes one.

# Check current limits
ulimit -n                       # Soft limit (default: often 1024)
ulimit -Hn                      # Hard limit (maximum soft can be raised to)
 
# In production, set in the systemd unit file:
# LimitNOFILE=65535
 
# Or in /etc/security/limits.conf:
# webapp  soft  nofile  65535
# webapp  hard  nofile  65535
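
A process can also inspect and raise its own soft limit at startup, up to the hard limit, without any configuration change. The sketch below uses Python's standard resource module. The function name raise_nofile_soft_limit and the target value are illustrative assumptions.

```python
import resource


def raise_nofile_soft_limit(desired):
    """Raise the soft RLIMIT_NOFILE toward `desired`, never above the hard limit.

    Returns the (soft, hard) pair in effect afterwards. The limit is never lowered.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        new_soft = desired
    else:
        new_soft = min(desired, hard)  # an unprivileged process cannot exceed the hard limit
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


print(raise_nofile_soft_limit(4096))
```

This only helps when the hard limit is already high enough. Raising the hard limit itself still requires the systemd or limits.conf settings shown above.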

Pitfall 5: Misunderstanding /proc/meminfo “Available” vs “Free”

The misconception: “The server only has 200 MB free, it’s almost out of memory!”

The reality: Linux aggressively uses “free” memory for disk caches (page cache). MemFree is memory that is truly unused. MemAvailable is the kernel’s estimate of how much memory new workloads could use without swapping (free memory plus reclaimable caches). A healthy server might show 200 MB MemFree but 8 GB MemAvailable.

# Read it correctly:
cat /proc/meminfo | head -5
# MemTotal:       16384000 kB     ← Total physical RAM
# MemFree:          204800 kB     ← Truly unused (looks scary!)
# MemAvailable:    8192000 kB     ← What's actually available (this is fine)
# Buffers:          102400 kB     ← Kernel buffer cache
# Cached:          7884800 kB     ← Page cache (reclaimable)
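
For monitoring scripts, the practical consequence is to alert on MemAvailable rather than MemFree. A small sketch, using the sample values above as input; the function name parse_meminfo is hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key.strip()] = int(rest.split()[0])
    return values


# Same numbers as the example output above
SAMPLE = """\
MemTotal:       16384000 kB
MemFree:          204800 kB
MemAvailable:    8192000 kB
Buffers:          102400 kB
Cached:          7884800 kB
"""

info = parse_meminfo(SAMPLE)
free_pct = 100 * info["MemFree"] / info["MemTotal"]
available_pct = 100 * info["MemAvailable"] / info["MemTotal"]
print(f"free: {free_pct:.2f}%  available: {available_pct:.2f}%")
# prints: free: 1.25%  available: 50.00%
```

An alert keyed on the 1.25% figure would fire constantly on a healthy host, while the 50% figure reflects the real headroom.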

Pitfall 6: Not Reaping Zombie Processes in Containers

The misconception: “Zombie processes are harmless.”

The reality: Individual zombies are small (they only occupy a process table entry), but they accumulate. Each zombie holds a PID. If you exhaust the PID space (default max is 32768 on many systems, configurable via /proc/sys/kernel/pid_max), no new processes can be created. In a long-running container that spawns many short-lived children, zombie accumulation is a real failure mode.

Solution: use a proper init process as PID 1 (tini, dumb-init, or your own like the example in Worked Examples above).
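
A zombie is easy to produce and observe on purpose, which helps when explaining the problem to a team. The sketch below forks a child that exits at once, reads the child's state from /proc before reaping it, and then reaps it. It assumes Linux with /proc mounted, and the function names are hypothetical.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (for example S, R, or Z)."""
    with open(f"/proc/{pid}/stat") as stat_file:
        data = stat_file.read()
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so locate the state relative to the LAST closing parenthesis.
    return data[data.rfind(")") + 2]


def zombie_demo(timeout=5.0):
    pid = os.fork()
    if pid == 0:
        os._exit(0)                      # child exits immediately

    # Parent: the child stays a zombie until we call wait on it
    deadline = time.monotonic() + timeout
    state = process_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = process_state(pid)

    os.waitpid(pid, 0)                   # reap: the kernel can now release the PID
    reaped = not os.path.exists(f"/proc/{pid}")
    return state, reaped
```

Calling zombie_demo() should report state Z before the wait and a vanished /proc entry after it. Reaping is the only thing that frees the process table slot.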


Summary & Key Takeaways

Process Lifecycle:

  • Every process is created by fork() (copy parent) + exec() (replace with new program).
  • Parents must wait() on children to avoid zombies. Orphans are re-parented to PID 1.
  • In containers, PID 1 must be a proper init that reaps children.

Threads vs Processes:

  • Processes are isolated apartments; threads are roommates sharing a flat.
  • Use processes for fault isolation, threads for shared state and low overhead.
  • Thread pools amortize creation cost and bound resource usage.

Memory Management:

  • Virtual memory gives each process its own address space. The MMU translates virtual to physical.
  • Demand paging loads pages only when accessed. Swap extends RAM to disk.
  • RSS is actual RAM used; VSZ is virtual space mapped. Set Kubernetes limits based on RSS.
  • The OOM killer selects and kills a process when the system runs out of RAM + swap, or when a memory cgroup (such as a container) hits its limit.

File Descriptors and epoll:

  • FDs are integer handles to I/O resources. Everything is a file.
  • epoll enables event-driven I/O: register FDs, get notified only when they are ready.
  • This is how Nginx and Node.js handle thousands of connections per thread.

Signals and IPC:

  • SIGTERM is a polite request; SIGKILL is a forceful demand. Handle SIGTERM for graceful shutdown.
  • IPC spans a spectrum: signals (notification only) to shared memory (fastest, most complex).
  • Unix domain sockets bypass the network stack for local communication.

Systemd:

  • Units are the building blocks. Services, sockets, timers, and targets.
  • systemctl manages lifecycle; journalctl manages logs.
  • Kubelet is a systemd service — node debugging starts here.

Networking Stack:

  • Interfaces (lo, eth0, veth, bridges), routing tables, and DNS resolution.
  • ss, ip, and nsenter are your debugging tools.
  • Kubernetes DNS (CoreDNS) is configured via /etc/resolv.conf in pods.

Filesystem Hierarchy:

  • /proc is a live window into the kernel. /sys exposes device and cgroup info.
  • /etc for configs, /var for variable data, /tmp for ephemeral files.

Permissions:

  • Owner/group/others, read/write/execute. Setuid, setgid, sticky bit.
  • Run containers as non-root. Use SecurityContext in Kubernetes.

You should now be able to:

  • Trace a process’s system calls and open file descriptors
  • Explain why a container was OOM-killed and set correct memory limits
  • Debug networking inside a container’s network namespace
  • Write a systemd service unit with proper restart and security hardening
  • Audit file permissions and understand setuid/sticky bit implications
  • Choose between processes and threads for a given workload
  • Explain how epoll enables high-concurrency servers

Quick Reference Cheat Sheet

Process Inspection

ps aux                                  # All processes, full detail
pstree -p                               # Process tree with PIDs
strace -p <pid> -f                      # Trace syscalls (follow threads)
lsof -p <pid>                           # Open file descriptors
cat /proc/<pid>/status                  # Process status (VmRSS, threads, state)
cat /proc/<pid>/wchan                   # What kernel function it's sleeping in
kill -l                                 # List all signals

Memory

free -h                                 # Memory overview (total/used/free/available)
vmstat 1                                # Memory and CPU stats, 1-second interval
cat /proc/meminfo                       # Detailed kernel memory stats
cat /proc/<pid>/status | grep VmRSS     # Resident set size of a process
cat /proc/<pid>/maps                    # Virtual memory map
slabtop                                 # Kernel slab allocator usage

Networking

ip addr                                 # All interfaces and their IP addresses
ip route show                           # Routing table
ss -tulpn                               # Listening TCP/UDP sockets with process info
nsenter --net=/proc/<pid>/ns/net <cmd>  # Run command in container's network namespace
cat /etc/resolv.conf                    # DNS nameservers and search domains
cat /etc/hosts                          # Static host-to-IP mappings
dig <hostname>                          # DNS query tool

Systemd

systemctl status <service>              # Service status, PID, memory, recent logs
systemctl start|stop|restart <service>  # Lifecycle control
systemctl enable|disable <service>      # Boot-time auto-start control
systemctl list-units --type=service     # All loaded service units
journalctl -u <service> -f             # Follow logs for a service
journalctl -k                          # Kernel messages
systemctl daemon-reload                 # Reload unit files after editing

Filesystem and Permissions

ls -la <path>                           # Permissions, owner, group, size
chmod 755 <file>                        # Set permissions (octal)
chown user:group <file>                 # Change ownership
find / -perm -4000 -type f             # Find setuid binaries
stat <file>                             # Detailed file metadata
df -h                                   # Disk usage by filesystem
du -sh <dir>                            # Total size of a directory
lsblk                                   # Block device tree

DSA Connections

The Linux kernel is one of the largest real-world applications of data structures and algorithms. Here are five direct connections between OS internals and classic DSA concepts.

1. Page Table — Radix Tree (Trie)

The page table that translates virtual addresses to physical addresses is implemented as a multi-level radix tree (also called a trie). On x86-64, it is a 4-level tree (PGD → PUD → PMD → PTE), where each level indexes into the next using a portion of the virtual address bits.

02-os diagram 15

Why a radix tree? Because the virtual address space is sparse — a process might use addresses near 0 and addresses near the top (stack), with a vast empty region in between. A flat array would waste enormous memory. A radix tree allocates table pages only for regions that are actually mapped, making it memory-efficient for sparse address spaces. Lookup is O(levels) = O(4), which is effectively O(1).
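
The way each level "uses a portion of the virtual address bits" can be shown with plain bit arithmetic. The sketch below assumes standard x86-64 4-level paging with 4 KiB pages: a 12-bit page offset and 9 index bits per level. The function name split_virtual_address is hypothetical.

```python
LEVELS = ("PGD", "PUD", "PMD", "PTE")
INDEX_BITS = 9     # 2**9 = 512 entries per table
OFFSET_BITS = 12   # 2**12 = 4096-byte pages


def split_virtual_address(vaddr):
    """Split a 48-bit virtual address into per-level table indices and a page offset."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    indices = {}
    for depth, level in enumerate(LEVELS):
        # The top level uses the highest 9 bits; each later level uses the next 9 down
        shift = OFFSET_BITS + INDEX_BITS * (len(LEVELS) - 1 - depth)
        indices[level] = (vaddr >> shift) & ((1 << INDEX_BITS) - 1)
    return indices, offset


# Build an address from known indices so the result is easy to check by eye
address = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_virtual_address(address))
```

Each index selects one of 512 slots in that level's table. This is the trie walk: four array lookups keyed by successive chunks of the address.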

2. CPU Scheduler — Red-Black Tree (CFS Run Queue)

The Completely Fair Scheduler (CFS), the default Linux CPU scheduler from kernel 2.6.23 until EEVDF replaced it in 6.6, uses a red-black tree (a self-balancing binary search tree) as its run queue. Each runnable task is a node in the tree, keyed by its virtual runtime (vruntime), a measure of how much CPU time the task has received, scaled by its priority weight.

02-os diagram 16

The scheduler always picks the leftmost node (smallest vruntime), which is the task that has received the least CPU time. After running, the task’s vruntime increases and it is reinserted into the tree. This gives every task a fair share of CPU time.

Why a red-black tree? Insert, delete, and find-minimum are all O(log n). The leftmost pointer is cached for O(1) access to the next task. A simple sorted list would make insertion O(n); a heap would make deletion of arbitrary nodes O(n). The red-black tree balances all three operations.
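
The pick-the-smallest-vruntime loop can be simulated in a few lines. Note the substitution: this sketch uses Python's heapq (a binary heap) as a stand-in for the red-black tree, which is enough here because the simulation only ever removes the minimum. The weight of 1024 for a default-priority task matches the kernel's nice-0 weight. The time slice length and task names are arbitrary assumptions.

```python
import heapq

NICE_0_WEIGHT = 1024  # weight of a default-priority task


def simulate(task_weights, slices, slice_ns=1_000_000):
    """Run `slices` scheduling decisions; return CPU time received per task in ns."""
    # Queue entries are (vruntime, name); ties are broken by name
    queue = [(0.0, name) for name in sorted(task_weights)]
    heapq.heapify(queue)
    cpu_time = {name: 0 for name in task_weights}

    for _ in range(slices):
        vruntime, name = heapq.heappop(queue)       # "leftmost node": smallest vruntime
        cpu_time[name] += slice_ns
        # Heavier tasks accumulate vruntime more slowly, so they get picked more often
        vruntime += slice_ns * NICE_0_WEIGHT / task_weights[name]
        heapq.heappush(queue, (vruntime, name))     # reinsert with the updated key
    return cpu_time


print(simulate({"a": 1024, "b": 1024}, slices=100))
print(simulate({"a": 2048, "b": 1024}, slices=300))
```

With equal weights the two tasks split the CPU evenly. Doubling one task's weight gives it two thirds of the slices, without any explicit quota logic.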

3. epoll — Red-Black Tree + Ready List

The epoll instance internally uses two data structures:

  1. A red-black tree of all monitored file descriptors (for O(log n) add/remove)
  2. A linked list of FDs that are currently ready (for O(1) retrieval of ready events)

02-os diagram 17

When a device driver signals that data is available on an FD, the kernel invokes the wakeup callback that epoll registered when the FD was added. The callback already holds a pointer to that FD’s epoll entry, so it links the entry onto the ready list without searching the tree. When epoll_wait() is called, it copies the ready list out to user space. This is why epoll_wait() costs O(ready) instead of O(total): it only touches FDs that have events, regardless of how many are registered.
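
The "only ready FDs come back" behavior is visible from user space. The sketch below registers many sockets with one epoll instance, makes exactly one of them readable, and checks what poll returns. It assumes Linux, since select.epoll is not available elsewhere, and the function name and pair counts are arbitrary.

```python
import select
import socket


def ready_after_write(total_pairs=50, writer_index=17):
    """Register many sockets, write to one, and report what epoll returns."""
    pairs = [socket.socketpair() for _ in range(total_pairs)]
    poller = select.epoll()
    try:
        for reader, _writer in pairs:
            poller.register(reader.fileno(), select.EPOLLIN)   # add to the interest set

        pairs[writer_index][1].send(b"ping")                   # make exactly one reader ready
        events = poller.poll(1.0)                              # list of (fd, event mask)

        expected_fd = pairs[writer_index][0].fileno()
        matches = bool(events) and events[0][0] == expected_fd
        return len(events), matches
    finally:
        poller.close()
        for reader, writer in pairs:
            reader.close()
            writer.close()


print(ready_after_write())
```

The call should report a single event for the one readable socket. The other 49 registered sockets cost nothing at poll time, which is the property that lets one thread watch thousands of connections.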

4. VFS Dentry Cache — LRU Hash Map

The Virtual File System (VFS) maintains a dentry cache (directory entry cache) that maps file paths to inodes, avoiding repeated disk lookups. It is implemented as a hash table with LRU eviction.

02-os diagram 18

Why LRU hash map? The hash table provides O(1) average lookup for cached paths. The LRU list ensures that when memory pressure forces eviction, the least recently accessed dentries are removed first. Frequently accessed paths (like /var/log/ in a logging-heavy system) stay hot in cache. This is the same pattern used in application-level caches (Redis LRU, Memcached).
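
The pattern itself fits in a few lines. The sketch below is a generic LRU cache built on Python's OrderedDict, which combines a hash table with an ordering list. It illustrates the idea only: the kernel's dentry cache is considerably more involved, and the paths and inode numbers here are made up.

```python
from collections import OrderedDict


class LRUCache:
    """Hash map with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # oldest entry first, most recent last

    def get(self, key):
        if key not in self.entries:
            return None                # cache miss: the caller would go to disk
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry


# Hypothetical path-to-inode entries
cache = LRUCache(capacity=2)
cache.put("/var/log/app.log", 1001)
cache.put("/etc/hosts", 1002)
cache.get("/var/log/app.log")          # touch: this path is now the hottest
cache.put("/tmp/scratch", 1003)        # over capacity: /etc/hosts is evicted
print(list(cache.entries))
```

The frequently read log path survives while the untouched entry is dropped, matching the hot-path behavior described above.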

5. Inode Table — Hash Map

The inode table maps inode numbers to inode structures (which contain file metadata: permissions, size, block pointers, timestamps). It is implemented as a hash table keyed by (device number, inode number).

02-os diagram 19

Why a hash map? Inode lookups happen on every file operation — open, read, write, stat. They must be as fast as possible. A hash table provides O(1) average-case lookup. The (device, inode) composite key is necessary because inode numbers are only unique within a single filesystem — two different disks can both have inode 2.
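
The composite key is visible from user space through stat, which reports both the device and the inode number. The sketch below builds a small dictionary keyed by (st_dev, st_ino) and shows that two hard-linked names land on the same key. The file names are arbitrary, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def inode_key(path):
    """Return the (device, inode) pair that uniquely identifies a file on this system."""
    info = os.stat(path)
    return (info.st_dev, info.st_ino)


def hard_link_demo():
    with tempfile.TemporaryDirectory() as directory:
        original = os.path.join(directory, "data.txt")
        alias = os.path.join(directory, "alias.txt")
        other = os.path.join(directory, "other.txt")
        for path in (original, other):
            with open(path, "w") as handle:
                handle.write("hello")
        os.link(original, alias)             # a second name for the same inode

        table = {}                           # (st_dev, st_ino) -> names that refer to it
        for path in (original, alias, other):
            table.setdefault(inode_key(path), []).append(os.path.basename(path))
        return sorted(sorted(names) for names in table.values())


print(hard_link_demo())
# prints: [['alias.txt', 'data.txt'], ['other.txt']]
```

Three names map to only two keys, because a hard link adds a directory entry without creating a new inode.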

The takeaway: The Linux kernel is a masterclass in applied data structures. Page tables use radix trees for sparse mapping. The scheduler uses red-black trees for balanced priority queues. epoll uses red-black trees plus linked lists for event notification. The VFS uses LRU hash maps for caching. Every choice is driven by the specific access pattern and performance requirement of that subsystem.


Further Reading

  • “Linux Kernel Development” by Robert Love — The most accessible deep dive into kernel internals. Covers scheduling, memory management, VFS, and process management with clarity and precision. Best for engineers who want to understand the “why” behind kernel design decisions.

  • “The Linux Programming Interface” by Michael Kerrisk — The definitive reference for Linux system programming. Covers every system call, signal, IPC mechanism, and file operation with extensive examples. At 1,500+ pages, it is encyclopedic. Best used as a reference when you need the exact semantics of a specific syscall.

  • “Operating Systems: Three Easy Pieces” (OSTEP) by Remzi and Andrea Arpaci-Dusseau — Free online at ostep.org. The best introductory OS textbook available. Covers virtualization (CPU, memory), concurrency, and persistence with excellent clarity. Best for engineers who want a rigorous but readable foundation in OS theory.

  • “Systems Performance” by Brendan Gregg — Covers Linux performance analysis methodology, tools, and internals. Deep treatment of CPU scheduling, memory, file systems, and networking from a performance perspective. Essential for anyone doing production performance debugging.

  • Brendan Gregg’s blog and BPF tools (brendangregg.com) — Practical performance analysis recipes, flame graphs, and modern eBPF-based tracing tools. Best for learning to instrument and analyze production Linux systems.

  • The Linux kernel source code documentation (kernel.org/doc) — Official documentation for kernel subsystems. Start with Documentation/admin-guide/ for sysctl tunables, Documentation/scheduler/ for CFS details, and Documentation/vm/ for memory management. Best when you need the authoritative answer on how a specific kernel feature works.

  • “Container Security” by Liz Rice — Covers Linux primitives that containers are built on: namespaces, cgroups, capabilities, seccomp. Essential for understanding what “container isolation” actually means at the OS level.

  • Julia Evans’ zines (jvns.ca) — Visual, concise explanations of Linux concepts like networking, strace, file descriptors, and DNS. Best for quick conceptual reinforcement and building intuition. The “Bite Size Linux” and “Bite Size Networking” zines are particularly relevant to this document’s topics.