Unit 6

Posted May 21, 2025 Updated May 21, 2025

By Ugyen Kinley Phuntshok

5 min read

Unit 6

Lesson 14

Learning Outcomes

Evaluate different types of indexing
Implement multilevel indexing
Understand the B-tree family

What are Indexes?

Indexes are special lookup tables or data structures that speed up data retrieval in databases. Without indexes, queries would scan entire tables, which is inefficient especially for large datasets.

Types of Indices

Ordered Indices: Store search keys in sorted order.
- Clustering (Primary) Index: Records stored in the same sorted order as the index.
- Non-clustering (Secondary) Index: Index entries point to records stored in arbitrary order.
Hashed Indices: Use hash functions to distribute values uniformly across buckets.

Ordered Index Subtypes

Dense Index: Contains an index entry for every search key value.
Sparse Index: Contains entries for only some search keys; requires the data file to be sorted by the search key.

Multilevel Indexing

For very large indexes that cannot fit into memory, multilevel indexing uses an outer index (in memory) that points to blocks of the main index on disk, enabling efficient search with minimal disk reads.

Secondary Indices

Created on non-key attributes to speed up queries involving those attributes.

B-Tree Family Overview

B-trees and B+ trees are balanced tree structures optimized for disk storage.
B-trees store keys and values in all nodes; B+ trees store values only in leaf nodes, using internal nodes for guiding searches.
Both allow efficient insertion, deletion, and search operations with O(log n) complexity.

B+ Tree Properties

Perfectly balanced with all leaves at the same depth
Nodes are at least half-full
Inner nodes have one more child than keys

Why Use B-Trees?

They minimize disk accesses by storing multiple keys per node, reducing tree height and improving performance on large datasets.

Lesson 15

Learning Outcomes

Understand query processing
Evaluate steps in query processing
Calculate query evaluation costs
Evaluate query costs in various operations

What is Query Processing?

Activities involved in extracting data from a database
Includes translation, optimization, and evaluation of queries
Converts high-level queries to physical-level operations

Key Steps

Parsing and Translation
- Convert SQL to relational algebra
- Create query evaluation plans
- Handle views and validate relations
Evaluation
- Execute operations using chosen algorithms
- Consider pipelining for efficiency
Optimization
- Choose least-cost evaluation plan
- Based on statistical data and operation costs

Query Execution Methods

Iterator Interface: Open(), get_next(), close()
Pipelined Execution: Process tuples as they’re produced
Materialization: Store intermediate results temporarily

Cost Measurement

Primary Factors:
- Disk I/O (block transfers, random accesses)
- CPU costs (becoming more significant with SSDs)
Cost Formula: b * tT + S * tS
- Where b=blocks, tT=transfer time, S=seeks, tS=seek time

Operation Costs

Selection Operations

File scan vs. index usage
Clustering vs. secondary indices
Complex selections (conjunctions, disjunctions)

Sorting

External sort-merge algorithm
Two phases: creating sorted runs and merging
Cost depends on memory buffer size

Join Operations

Nested-loop: Simple but expensive
Block nested-loop: More efficient
Indexed nested-loop: Uses indexes
Merge join: Requires sorted inputs

Other Operations

Duplicate elimination (sorting/hashing)
Projection
Set operations
Outer join
Aggregation

Advanced Techniques

Pipelining: Combine operations to avoid temp relations
Cache-conscious algorithms: Optimize for CPU cache
Query compilation: Convert plans to machine code

Key Considerations

Balance between response time and resource consumption
Modern hardware changes cost considerations (SSDs vs HDDs)
Memory access patterns becoming increasingly important

Lesson 16

Key Concepts

Query Optimization: Process of selecting the most efficient query evaluation plan from multiple possible strategies.
Goal: Minimize query execution cost (I/O, CPU, memory).

Equivalence Rules

Selection Operations:
- Conjunctive: σθ1∧θ2(E) ≡ σθ1(σθ2(E))
- Commutative: σθ1(σθ2(E)) ≡ σθ2(σθ1(E))
Join Operations:
- Theta joins commutative: E1 ⋈θ E2 ≡ E2 ⋈θ E1
- Natural joins associative: (E1 ⋈ E2) ⋈ E3 ≡ E1 ⋈ (E2 ⋈ E3)
Distribution:
- Selection over join: σθ1(E1 ⋈θ E2) ≡ (σθ1(E1)) ⋈θ E2

Optimization Techniques

Expression Transformation:
- Rewrite queries to push selections/projections earlier
- Example: Apply filters before joins to reduce dataset size
Join Ordering:
- Critical for performance (affects intermediate result sizes)
- Optimal order: Filter small tables first
Cost-Based Optimization:
- Estimates costs using statistics (table sizes, indexes)
- Considers:
  - Join algorithms (nested loop, hash, merge)
  - Access methods (index scans, full scans)
Heuristics:
- Perform selections early
- Use left-deep join trees for pipelining
- Cache execution plans for similar queries
Subquery Handling:
- Decorrelate nested subqueries when possible
- Convert EXISTS to semi-joins

Practical Implications

Well-optimized queries can be orders of magnitude faster
Requires accurate database statistics
Modern DBMSs handle most optimization automatically
Developers should write clear, logical queries

#Unit 6 Conclusion

Key Takeaways

Indexing (Lesson 14)

Purpose: Speed up data retrieval through specialized data structures
Types:
- Ordered (clustering/non-clustering) vs. Hashed indices
- Dense (all keys) vs. Sparse (selected keys) indexing
B-Tree Family:
- Balanced structures optimized for disk access
- B+ trees store data only in leaves, improving range queries
- Critical for large-scale database performance

Query Processing (Lesson 15)

Three-phase workflow:
1. Parsing → Relational algebra conversion
2. Optimization → Cost-based plan selection
3. Execution → Physical operations
Cost Factors:
- Disk I/O dominates (block transfers + seeks)
- Sorting and joining are most expensive operations
Join Algorithms:
- Nested-loop (simple), Merge (sorted), Hash (large datasets)
- Block nested-loop reduces disk accesses

Query Optimization (Lesson 16)

Core Principle: Find equivalent but more efficient execution plans
Techniques:
- Push selections/projections earlier
- Optimal join ordering (small→large)
- Decorrelate nested subqueries
Modern Systems:
- Use cost-based optimization with statistics
- Employ heuristics + dynamic programming

Conclusion

Effective database performance relies on three pillars:

Proper Indexing: Creating the right access paths for frequent queries
Efficient Processing: Minimizing I/O through smart algorithm selection
Intelligent Optimization: Transforming queries for maximal execution speed

Modern DBMSs automate much of this process, but developers must:

Design schemas with query patterns in mind
Write queries that allow optimization
Maintain accurate statistics through regular maintenance

The techniques covered in these lessons form the foundation for building high-performance database applications that scale efficiently with data growth.

DBS101, Unit5

This post is licensed under CC BY 4.0 by the author.

Lesson 14

Learning Outcomes

What are Indexes?

Types of Indices

Ordered Index Subtypes

Multilevel Indexing

Secondary Indices

B-Tree Family Overview

B+ Tree Properties

Why Use B-Trees?

Lesson 15

Learning Outcomes

What is Query Processing?

Key Steps

Query Execution Methods

Cost Measurement

Operation Costs

Selection Operations

Sorting

Join Operations

Other Operations

Advanced Techniques

Key Considerations

Lesson 16

Key Concepts

Equivalence Rules

Optimization Techniques

Practical Implications

Key Takeaways

Indexing (Lesson 14)

Query Processing (Lesson 15)

Query Optimization (Lesson 16)

Conclusion

Trending Tags