Data Engineering Pipeline
End-to-end ELT pipeline project covering batch processing with Apache Spark, real-time streaming with Kafka, orchestration with Airflow, and storage in PostgreSQL, all containerized with Docker Compose.
A structured exploration of the modern data engineering stack: building ELT pipelines that move, transform, and serve data at scale using the same tools used in industry.
Architecture
The project follows a typical data platform pattern: raw data lands in a source database, is extracted and transformed by batch and streaming jobs, and is written to an analytical store.
Source DB (PostgreSQL)
          ↓
Airflow DAG (orchestration)
          ↓
Batch: Apache Spark      Stream: Apache Kafka
          ↓                        ↓
Transformed data → Analytical PostgreSQL
All components run in Docker Compose, which means the entire platform can be brought up or torn down with a single command.
Components
SQL & PostgreSQL fundamentals
- Database design: users, films, actors with proper relational modeling (FK constraints, normalised tables)
- Advanced SQL: INNER JOIN, LEFT JOIN, subqueries, GROUP BY aggregations, UNION / UNION ALL
- Multi-table joins across films, film_actors, actors for complex queries
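To make the join types above concrete, here is a minimal sketch that can be run without any containers. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the schema and sample rows are assumptions for illustration, not the project's actual DDL or data.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here; the schema is a guess at
# the films / actors / film_actors layout, not the project's actual DDL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE films  (film_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE actors (actor_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE film_actors (
    film_id  INTEGER NOT NULL REFERENCES films(film_id),
    actor_id INTEGER NOT NULL REFERENCES actors(actor_id),
    PRIMARY KEY (film_id, actor_id)
);
INSERT INTO films  VALUES (1, 'Film A'), (2, 'Film B'), (3, 'Film C');
INSERT INTO actors VALUES (1, 'Actor X'), (2, 'Actor Y');
INSERT INTO film_actors VALUES (1, 1), (1, 2), (2, 1);
""")

# LEFT JOIN keeps films with no cast; COUNT(a.actor_id) skips the NULLs
# those rows produce, so such films report a cast size of 0.
cast_sizes = conn.execute("""
    SELECT f.title, COUNT(a.actor_id) AS cast_size
    FROM films AS f
    LEFT JOIN film_actors AS fa ON fa.film_id = f.film_id
    LEFT JOIN actors AS a ON a.actor_id = fa.actor_id
    GROUP BY f.film_id, f.title
    ORDER BY f.film_id
""").fetchall()

# INNER JOIN drops films that have no matching film_actors rows.
films_with_cast = conn.execute("""
    SELECT DISTINCT f.title
    FROM films AS f
    INNER JOIN film_actors AS fa ON fa.film_id = f.film_id
    ORDER BY f.title
""").fetchall()

print(cast_sizes)
print(films_with_cast)
```

The same SELECT statements should carry over to PostgreSQL unchanged; only the connection code and the column types in the DDL would differ.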
Apache Airflow (orchestration)
- DAG-based workflow definition in Python
- Scheduling batch extraction jobs to run on a cadence
- Airflow 3 metadata database migration, user creation, and web UI
- Docker-based Airflow stack (webserver + scheduler + metadata DB)
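The core idea behind a DAG-based workflow is that tasks declare what they depend on and the scheduler derives a valid run order. The sketch below illustrates that idea with the standard library's graphlib module. It is not Airflow's API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract_from_source": set(),
    "spark_transform": {"extract_from_source"},
    "load_to_analytical_store": {"spark_transform"},
    "data_quality_check": {"load_to_analytical_store"},
}


def execution_order(deps):
    """Return one valid run order.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is why a workflow has to be acyclic to be schedulable.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

In an Airflow DAG file, the same chain would typically be declared with the `>>` operator between task objects, and the scheduler would handle ordering, retries, and the run cadence.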
Apache Spark (batch processing)
- Reading from PostgreSQL source via JDBC
- DataFrame transformations: filtering, aggregation, schema enforcement
- Writing results to the analytical store
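Reading from and writing to PostgreSQL over JDBC mostly comes down to passing the right options to Spark. The helper below is a minimal sketch of those options. The function name, host names, database names, table names, and credentials are all placeholders, not the project's real configuration, and the Spark calls are shown as comments because they need a running SparkSession and the PostgreSQL JDBC driver jar.

```python
def postgres_jdbc_options(host, port, database, table, user, password):
    """Build the option dict Spark's JDBC data source expects for PostgreSQL."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


# Placeholder connection values (assumptions for illustration only).
source_opts = postgres_jdbc_options(
    "source-db", 5432, "films_db", "public.films", "etl_user", "change-me"
)
target_opts = postgres_jdbc_options(
    "analytics-db", 5432, "analytics", "public.films_clean", "etl_user", "change-me"
)

# With a SparkSession named `spark`, the batch job would look roughly like:
#   films = spark.read.format("jdbc").options(**source_opts).load()
#   recent = films.filter(films.release_year >= 2000)  # hypothetical column
#   recent.write.format("jdbc").options(**target_opts).mode("append").save()
print(source_opts["url"])
```

Keeping the options in one helper makes it easy to point the same job at a different host when the service name in Docker Compose changes.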
Apache Kafka (streaming)
- Producer pushing events in real time
- Consumer group reading and processing events
- Kafka running in Docker with ZooKeeper
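Two ideas explain most of how producers and consumer groups interact: a keyed event always maps to the same partition, and each partition is read by exactly one consumer in a group. The sketch below is a simplified in-memory model of both, not a Kafka client. CRC32 and round-robin assignment are stand-ins chosen for readability (the Java client's default partitioner hashes keys with murmur2), and the consumer names are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition; CRC32 is a stand-in for Kafka's real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(consumers, num_partitions):
    """Spread partitions round-robin so each one has exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Hypothetical group with two members reading a three-partition topic.
assignment = assign_partitions(["consumer-1", "consumer-2"], 3)
print(assignment)

# The same key lands on the same partition every time, which is what
# preserves per-key ordering.
print(partition_for("user-42", 3) == partition_for("user-42", 3))
```

Adding a third consumer to this group would give each member one partition; adding a fourth would leave one consumer idle, since a partition is never shared within a group.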
Key takeaway
Setting up Airflow alone requires understanding Docker networking, database initialization order (depends_on with health checks), and Airflow’s internal component model. Infrastructure complexity is often underestimated, and this project makes it visible.
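The initialization-order problem comes down to "started" not meaning "ready": a database container can be running before it accepts connections. A health check closes that gap by polling until a probe succeeds. The sketch below shows that polling loop in plain Python. The function name and parameters are illustrative assumptions, and in Docker Compose the equivalent behavior comes from a service healthcheck combined with depends_on using the service_healthy condition.

```python
import time


def wait_until_healthy(check, timeout_s=30.0, interval_s=1.0):
    """Poll `check` until it returns True or the timeout elapses.

    `check` is any zero-argument callable, for example one that tries to
    open a database connection and returns False on failure.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Simulated dependency that only becomes ready on the third probe.
attempts = {"count": 0}


def fake_db_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until_healthy(fake_db_ready, timeout_s=5.0, interval_s=0.0))
```

For PostgreSQL, the probe inside a container healthcheck is commonly the pg_isready utility, so dependent services such as the Airflow scheduler only start once the metadata database answers.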