Saurabh's Image

Hi, I'm Saurabh! 👋

🔭 Building an Analytics Tool using Angular, NestJs, Java, Scala, Flink, MySQL & Snowflake ...

🌱 Currently learning Flink & AI/ML.

👯 Looking to collaborate on Angular, NestJs Open Source Projects.

🤔 Seeking help with AWS & Networking.

💬 Ask me about any tech-related stuff.

📫 Connect on Twitter: @saurabh47g

😄 Pronouns: He/His

âš¡ Fun fact: NULL

🔗 GitHub: github.com/saurabh47

Apache Flink

Introduction to Stream Processing

Apache Flink Logo

What is Apache Flink?

Apache Flink is a distributed stream processing framework for real-time and batch data processing.

  • Designed for high performance and scalability
  • Supports event-driven applications
  • Fault-tolerant and stateful computations
Flink Parallel Processing

Main Building Blocks

Flink's core architecture consists of:

  • Source: Reads data from external systems
  • Process: Transforms data using operators
  • Sink: Outputs data to external systems

Supported languages: SQL, Java, Python

Flink Architecture

Flink Cluster

Flink Architecture

Watermarking

Watermarks help manage event time processing by tracking lateness.

  • Defines when events should be processed
  • Handles out-of-order event streams
  • Ensures correct time-based aggregations
Watermarking in Flink

Windowing

Windows group data streams into finite sets for processing.

  • Time-based windows (Tumbling, Sliding, Session)
  • Custom windowing strategies
Tumbling Window
Tumbling Window
Sliding Window
Sliding Window
Session Window
Session Window

Flink SQL - (Source & Sink)

-- Source Table / Topic
CREATE TABLE sensor (
	id STRING,
	timestamp TIMESTAMP(3),
	temperature DOUBLE,
	WATERMARK FOR timestamp AS timestamp - INTERVAL '5' SECOND
) WITH (
	'connector' = 'kafka',
	'topic' = 'sensor-data',
	'format' = 'json',
	'properties.bootstrap.servers' = 'localhost:9092'
);
-- Sink Table / Topic
CREATE TABLE aggregated_sensor (
	id STRING,
	window_start TIMESTAMP(3),
	window_end TIMESTAMP(3),
	avg_temp DOUBLE
) WITH (
	'connector' = 'kafka',
	'topic' = 'aggregated-sensor-data',
	'format' = 'json',
	'properties.bootstrap.servers' = 'localhost:9092'
);
                

Processing Logic

INSERT INTO aggregated_sensor
SELECT id, 
	TUMBLE_START(timestamp, INTERVAL '10' SECOND) AS window_start,
	TUMBLE_END(timestamp, INTERVAL '10' SECOND) AS window_end,
	AVG(temperature) AS avg_temp
FROM sensor
GROUP BY id, TUMBLE(timestamp, INTERVAL '10' SECOND);