Apache Flink for Beginners: Architecture, Concepts, and Stream Processing
A comprehensive introduction to Apache Flink, covering big data architectures, stream processing fundamentals, and Flink’s core features.
Apache Flink for Beginners: Architecture, Concepts, and Stream Processing
A comprehensive introduction to Apache Flink, covering big data architectures, stream processing fundamentals, and Flink’s core features.
Big Data Application Scenarios
Is big data a recent invention?
Have people only recently discovered its value?
Not really.
Decades ago, mathematical analysis was already being applied in the financial industry to build predictive models. Today, with the rise of the internet, massive amounts of data are generated every second, including:
- User browsing and search behavior
- Travel and mobility data
- Purchase and payment records
- Agricultural growth monitoring
- Medical and healthcare data
These data sources drive value across industries.
Typical application scenarios include:
- Healthcare analytics for faster diagnosis
- E-commerce personalization and precision marketing
- Retail customer profiling and recommendations
- Financial modeling and stock prediction
- Smart transportation and traffic optimization
Big Data Processing Architectures
Handling massive data volumes requires computers, not manual processing. Just like microservices in traditional web systems, big data systems rely on distributed architectures.
Key requirements of a big data architecture include:
- Fault tolerance and resilience
- Low latency
- Horizontal scalability
- Extensibility
- Efficient querying
- Ease of maintenance
Two classic architectural models dominate big data system design:
- Lambda Architecture
- Kappa Architecture
These are architectural philosophies rather than concrete products.
Lambda Architecture
Lambda Architecture, proposed by Nathan Marz, divides data processing into three layers:
-
Batch Layer
Processes large volumes of historical data using distributed systems like Hadoop. Suitable for offline computation. -
Speed Layer
Processes real-time incremental data using stream processing frameworks such as Storm, Spark Streaming, or Flink. -
Serving Layer
Merges batch and real-time views to serve query results.
Pros
- Supports both real-time and batch processing
- Balances latency and accuracy
Cons
- Requires maintaining two codebases
- Batch processing may not complete within required time windows
- Potential inconsistencies between batch and real-time results
Kappa Architecture
Kappa Architecture simplifies Lambda Architecture by removing the batch layer.
Core principles:
- Persist historical data (e.g., Kafka)
- Reprocess data via stream processing when needed
- Replace old processing results with new ones
Lambda vs Kappa
| Aspect | Lambda | Kappa |
|---|---|---|
| Historical data | Excellent | Limited |
| Resource cost | High | Lower |
| Complexity | High | Lower |
| Operations | Complex | Simpler |
Why Apache Flink
Popular stream processing frameworks include:
- Storm
- Spark Streaming
- Flink
Flink stands out due to:
- True stream processing (not micro-batching)
- Low latency and high throughput
- Exactly-once semantics
- Stateful computation
- Event-time processing
What Is Apache Flink
Apache Flink is an open-source distributed stream processing framework developed by the Apache Software Foundation.
Key features:
- Unified batch and stream processing
- Event-time semantics and watermarks
- Fault tolerance with exactly-once guarantees
- Rich connector ecosystem (Kafka, HDFS, Elasticsearch, Cassandra)
Flink Architecture
Flink clusters consist of two main process types:
-
JobManager
Coordinates job scheduling, resource management, and checkpoints. -
TaskManager
Executes tasks and manages task slots.
A running Flink cluster contains at least one JobManager and one TaskManager.
Programming Model
Flink programs follow a dataflow model:
Source → Transformations → SinkThe dataflow forms a directed acyclic graph (DAG).
Common transformations include:
- Map
- Filter
- KeyBy
- Window
Time Semantics in Flink
Flink supports three notions of time:
- Event Time: when the event actually occurred
- Processing Time: when the event is processed
- Ingestion Time: when the event enters the system
Event time is crucial for correctness and reproducibility.
Windows
Flink supports multiple window types:
Time Windows
- Tumbling windows
- Sliding windows
stream.timeWindow(Time.minutes(1));
stream.timeWindow(Time.minutes(60), Time.minutes(1));Count Windows
stream.countWindow(100);
stream.countWindow(100, 10);Session Windows
stream.window(SessionWindows.withGap(Time.minutes(5)));Watermarks
Watermarks track progress in event time.
A watermark with timestamp t indicates that no events with timestamps ≤ t are expected.
This mechanism enables correct window closure and late event handling.
Stateful Processing and Exactly-Once
Stateful stream processing introduces correctness challenges.
Flink supports exactly-once semantics through:
- Checkpoints (automatic)
- Savepoints (manual)
State Backends
- MemoryStateBackend
- FsStateBackend
- RocksDBStateBackend
Conclusion
Apache Flink is a powerful stream processing engine suitable for both Lambda and Kappa architectures.
Its strengths in low latency, high throughput, event-time processing, and fault tolerance make it an excellent choice for building modern, real-time data platforms.