Post

Unit 6

Unit 6

Lesson 14

Learning Outcomes

  • Evaluate different types of indexing
  • Implement multilevel indexing
  • Understand the B-tree family

What are Indexes?

Indexes are special lookup tables or data structures that speed up data retrieval in databases. Without indexes, queries would scan entire tables, which is inefficient especially for large datasets.

Types of Indices

  • Ordered Indices: Store search keys in sorted order.
    • Clustering (Primary) Index: Records stored in the same sorted order as the index.
    • Non-clustering (Secondary) Index: Index entries point to records stored in arbitrary order.
  • Hashed Indices: Use hash functions to distribute values uniformly across buckets.

Ordered Index Subtypes

  • Dense Index: Contains an index entry for every search key value.
  • Sparse Index: Contains entries for only some search keys; requires the data file to be sorted by the search key.

Multilevel Indexing

For very large indexes that cannot fit into memory, multilevel indexing uses an outer index (in memory) that points to blocks of the main index on disk, enabling efficient search with minimal disk reads.

Secondary Indices

Created on non-key attributes to speed up queries involving those attributes.

B-Tree Family Overview

  • B-trees and B+ trees are balanced tree structures optimized for disk storage.
  • B-trees store keys and values in all nodes; B+ trees store values only in leaf nodes, using internal nodes for guiding searches.
  • Both allow efficient insertion, deletion, and search operations with O(log n) complexity.

B+ Tree Properties

  • Perfectly balanced with all leaves at the same depth
  • Nodes are at least half-full
  • Inner nodes have one more child than keys

Why Use B-Trees?

They minimize disk accesses by storing multiple keys per node, reducing tree height and improving performance on large datasets.

Lesson 15

Learning Outcomes

  1. Understand query processing
  2. Evaluate steps in query processing
  3. Calculate query evaluation costs
  4. Evaluate query costs in various operations

What is Query Processing?

  • Activities involved in extracting data from a database
  • Includes translation, optimization, and evaluation of queries
  • Converts high-level queries to physical-level operations

Key Steps

  1. Parsing and Translation
    • Convert SQL to relational algebra
    • Create query evaluation plans
    • Handle views and validate relations
  2. Evaluation
    • Execute operations using chosen algorithms
    • Consider pipelining for efficiency
  3. Optimization
    • Choose least-cost evaluation plan
    • Based on statistical data and operation costs

Query Execution Methods

  • Iterator Interface: Open(), get_next(), close()
  • Pipelined Execution: Process tuples as they’re produced
  • Materialization: Store intermediate results temporarily

Cost Measurement

  • Primary Factors:
    • Disk I/O (block transfers, random accesses)
    • CPU costs (becoming more significant with SSDs)
  • Cost Formula: b * tT + S * tS
    • Where b=blocks, tT=transfer time, S=seeks, tS=seek time

Operation Costs

Selection Operations

  • File scan vs. index usage
  • Clustering vs. secondary indices
  • Complex selections (conjunctions, disjunctions)

Sorting

  • External sort-merge algorithm
  • Two phases: creating sorted runs and merging
  • Cost depends on memory buffer size

Join Operations

  • Nested-loop: Simple but expensive
  • Block nested-loop: More efficient
  • Indexed nested-loop: Uses indexes
  • Merge join: Requires sorted inputs

Other Operations

  • Duplicate elimination (sorting/hashing)
  • Projection
  • Set operations
  • Outer join
  • Aggregation

Advanced Techniques

  • Pipelining: Combine operations to avoid temp relations
  • Cache-conscious algorithms: Optimize for CPU cache
  • Query compilation: Convert plans to machine code

Key Considerations

  • Balance between response time and resource consumption
  • Modern hardware changes cost considerations (SSDs vs HDDs)
  • Memory access patterns becoming increasingly important

Lesson 16

Key Concepts

  • Query Optimization: Process of selecting the most efficient query evaluation plan from multiple possible strategies.
  • Goal: Minimize query execution cost (I/O, CPU, memory).

Equivalence Rules

  1. Selection Operations:
    • Conjunctive: σθ1∧θ2(E) ≡ σθ1(σθ2(E))
    • Commutative: σθ1(σθ2(E)) ≡ σθ2(σθ1(E))
  2. Join Operations:
    • Theta joins commutative: E1 ⋈θ E2 ≡ E2 ⋈θ E1
    • Natural joins associative: (E1 ⋈ E2) ⋈ E3 ≡ E1 ⋈ (E2 ⋈ E3)
  3. Distribution:
    • Selection over join: σθ1(E1 ⋈θ E2) ≡ (σθ1(E1)) ⋈θ E2

Optimization Techniques

  1. Expression Transformation:
    • Rewrite queries to push selections/projections earlier
    • Example: Apply filters before joins to reduce dataset size
  2. Join Ordering:
    • Critical for performance (affects intermediate result sizes)
    • Optimal order: Filter small tables first
  3. Cost-Based Optimization:
    • Estimates costs using statistics (table sizes, indexes)
    • Considers:
      • Join algorithms (nested loop, hash, merge)
      • Access methods (index scans, full scans)
  4. Heuristics:
    • Perform selections early
    • Use left-deep join trees for pipelining
    • Cache execution plans for similar queries
  5. Subquery Handling:
    • Decorrelate nested subqueries when possible
    • Convert EXISTS to semi-joins

Practical Implications

  • Well-optimized queries can be orders of magnitude faster
  • Requires accurate database statistics
  • Modern DBMSs handle most optimization automatically
  • Developers should write clear, logical queries

#Unit 6 Conclusion

Key Takeaways

Indexing (Lesson 14)

  • Purpose: Speed up data retrieval through specialized data structures
  • Types:
    • Ordered (clustering/non-clustering) vs. Hashed indices
    • Dense (all keys) vs. Sparse (selected keys) indexing
  • B-Tree Family:
    • Balanced structures optimized for disk access
    • B+ trees store data only in leaves, improving range queries
    • Critical for large-scale database performance

Query Processing (Lesson 15)

  • Three-phase workflow:
    1. Parsing → Relational algebra conversion
    2. Optimization → Cost-based plan selection
    3. Execution → Physical operations
  • Cost Factors:
    • Disk I/O dominates (block transfers + seeks)
    • Sorting and joining are most expensive operations
  • Join Algorithms:
    • Nested-loop (simple), Merge (sorted), Hash (large datasets)
    • Block nested-loop reduces disk accesses

Query Optimization (Lesson 16)

  • Core Principle: Find equivalent but more efficient execution plans
  • Techniques:
    • Push selections/projections earlier
    • Optimal join ordering (small→large)
    • Decorrelate nested subqueries
  • Modern Systems:
    • Use cost-based optimization with statistics
    • Employ heuristics + dynamic programming

Conclusion

Effective database performance relies on three pillars:

  1. Proper Indexing: Creating the right access paths for frequent queries
  2. Efficient Processing: Minimizing I/O through smart algorithm selection
  3. Intelligent Optimization: Transforming queries for maximal execution speed

Modern DBMSs automate much of this process, but developers must:

  • Design schemas with query patterns in mind
  • Write queries that allow optimization
  • Maintain accurate statistics through regular maintenance

The techniques covered in these lessons form the foundation for building high-performance database applications that scale efficiently with data growth.

This post is licensed under CC BY 4.0 by the author.