Santander Stream Processing with Apache Flink

Stream Processing with Apache Flink® Flink Overview 101 +

01 02 03 04 05 Intro : The world has changed..some news Understanding the importance of stream processing Why Apache Flink is becoming the de facto standard Enhancing Apache Flink as a cloud-native service Questions Agenda

Intro about conﬂuent community

Some things you should know about us https://discover.confluent.io/santander-and-confluent-better-together Our Landing page Aun hay mas!!! Coming soon GEN IA & Conﬂuent webinar 7 May 2024 Apuntate ya !!!!

Some - Quick update from Kafka summit london

Apache Flink en Confluent Cloud entra en modo General Available: Anunciamos la disponibilidad general de nuestro servicio totalmente gestionado de Confluent Cloud para Apache Flink®, proporcionando a los clientes una verdadera solución multicloud para implementar procesamiento de streams de eventos en cualquiera de los tres principales CSP's donde residan los datos y aplicaciones. Podéis ver más detalles en el blog post. Presentación de Tableflow: Presentamos nuestra visión para Tableflow, una nueva función en la plataforma de streaming de eventos de Confluent Cloud que permite a los usuarios convertir tópicos de Apache Kafka y sus esquemas asociados en tablas de Apache Iceberg® con un solo clic para abastecer mejor a los data lakes y data warehouses. Tableflow estará disponible en modo early access privado próximamente. Kora, el motor de Confluent Cloud, es más rápido que nunca: Revelamos que Kora, el motor que alimenta el servicio de Kafka nativo cloud de Confluent Cloud, ahora es 16 veces más rápido que Kafka opensource. Mejora de conectores : Se presentaron nuevas mejoras al portafolio de conectores totalmente gestionados de Confluent, que incluyen DNS Forwarding y puntos de acceso de Egress privados. Además se anuncia un SLA mejorado del 99.99%. Mejora de Stream Governance : El servicio de Stream Governance en Confluent sigue evolucionando y ahora está habilitado de forma predeterminada en todos los entornos con un SLA de tiempo de actividad del 99.99% para Schema Registry. Además, anunciamos que la cobertura regional para Stream Governance se expandirá a todas las regiones de Confluent Cloud, proporcionando un acceso más fluido a las funcionalidades exclusivas de gobernanza.

Why …. Why if we can uniﬁed both worlds ?

Understanding the importance of stream processing

Enable frictionless access to up-to-date trustworthy data products Share Reimagine data streaming everywhere, on-prem and in every major public cloud Stream Make data in motion self-service, secure, compliant and trustworthy Govern Drive greater data reuse with always-on stream processing Process Make it easy to on-ramp and off-ramp data from existing systems and apps Connect Stream processing is a critical part of data streaming

DATA IN MOTION Streaming Applications Apache Flink Apache Kafka DATA AT REST Application Layer Compute Layer Storage Layer Traditional Databases File Systems Web Applications Stream processing acts as the compute layer to Kafka, powering real-time applications & pipelines

Processing Kafka Custom apps 3rd party apps Databases Database Data Warehouse SaaS app Queries Analytics Interactions Processing Processing Processing down stream of Kafka increases latency, adds costs and redundancy, and inhibits data reuse Increased complexity from redundant processing Data systems & applications built on stale data Expensive & inefﬁcient to clean and enrich data multiple times

Custom apps 3rd party apps Databases Database Data Warehouse SaaS app Queries Analytics Interactions Processing data at ingest improves latency, data portability, and cost effectiveness Maximized data reusability & consistency Improved cost-efﬁciency from cleaning & enriching data once Real-time apps & data systems reﬂect current state Kafka Storage Flink Compute Stream Processing Process your data once, process your data right

Heatmap service Payment service Supply chain systems Watch lists Profile mgmt Incident mgmt Customer profile data ITSM systems Central log systems Fraud & SIEM systems Alerting systems AI/ML engines Visualization apps Threat vector Transactions Payments Mainframe data Inventory Weather Telemetry IoT data Notification engine Payroll systems CRM systems Mobile application Personalization Web application Clickstreams Customer loyalty Change logs Customer data Recommendation engine Stream processing enables users to filter, join, and enrich streams on-the-fly to drive greater data reuse

Why Apache Flink is becoming the de facto standard

0 50,000 100,000 150,000 2020 2021 2022 2016 2017 2018 Flink Kafka Two Apache Projects, Born a Few Years Apart Monthly Unique Users Flink growth has mirrored the growth of Kafka, the de facto standard for streaming data >75% of the Fortune 500 estimated to be using Kafka >100,000+ orgs using Kafka >41,000 Kafka meetup attendees >750 Kafka Improvement Proposals >12,000 Jiras for Apache Kafka

Innovative companies have adopted both Kafka & Flink

Digital natives leverage Flink to disrupt markets and gain competitive advantage UBER: Real-time Pricing NETFLIX: Personalized Recs STRIPE: Real-time Fraud Detection

Lets have some fun !!! Who is who? 463 M 30B Gaming ● WHY ● SQL drive 90% uses cases with less than 50% TTM ● SQL make stream analytics accessible for all data scientist that allows to create insights connected to revenue ● Keep & Play with statesto deep dive on business questions ● Eg: Revenue x level ● Game user counts RT ● Extremely auto-scalable ● Low latency ● Control temporal de eventos ● Wide variety of handling Script window ● Fault tolerance at the style :Check points & save points

Scalability and Performance Fault Tolerance Flink is a top 5 Apache project and boasts a robust developer community Uniﬁed Processing Flink is capable of supporting stream processing workloads at tremendous scale Language Flexibility Flink's fault tolerance mechanisms ensure it can handle failures effectively and provide high availability Flink supports Java, Python, & SQL with 150+ built-in functions, enabling devs to work in their language of choice Flink supports stream processing, batch processing, and ad-hoc analytics through one technology SO basically Customers choose Flink because of its performance and rich feature set

Processing “Shift Left” with Stream Processing Expensive & Inefﬁcient The same data is cleaned, transformed and enriched multiple locations (often using legacy technologies) Varying Degree of Staleness Every downstream system receives and processes the same data at different times/different time intervals and provides slightly different semantics. This leads to inconsistencies and degraded customer experience downstream. Complex & Error Prone The same business logic needs to be maintain at multiple places, multiple processing engines need to be maintained and operated. Databases Custom Apps SaaS Processing Processing Database DWH Data Lake Queries Analytics Interactions

Processing “Shift Left” with Stream Processing Expensive & Inefﬁcient The same data is cleaned, transformed and enriched multiple locations (often using legacy technologies) Varying Degree of Staleness Every downstream system receives and processes the same data at different times/different time intervals and provides slightly different semantics. This leads to inconsistencies and degraded customer experience downstream. Complex & Error Prone The same business logic needs to be maintain at multiple places, multiple processing engines need to be maintained and operated. Databases Custom Apps SaaS Processing Processing Database DWH Data Lake Queries Analytics Interactions Shift Left Processing

Processing Pitch “Shift Left” with Stream Processing Cost Efﬁcient Data is only processed once in a single place and it is processed continuously spreading the work over time. Fresh Data Everywhere All application are supplied with equally fresh data and represent the current state of your business. Reusable & Consistent Data is processed continuously and meets the latency requirements of the most demanding consumers which further increases reusability. Databases Custom Apps SaaS Database DWH Data Lake Queries Analytics Interactions Processing

What can I do with Flink? 26 Data Exploration Data Pipelines Real-time Apps Engineers and Analysts both need to be able to simply read and understand the event streams stored in Kafka ● Metadata discovery ● Throughput analysis ● Data sampling ● Interactive query Data pipelines are used to enrich, curate, and transform events streams, creating new derived event streams ● Filtering ● Joins ● Projections ● Aggregations ● Flattening ● Enrichment Whole ecosystems of apps feed on event streams automating action in real-time ● Account360 ● Next Best Call ● Quality of Service ● Fraud detection ● Intelligent routing ● Abandoned Shopping Cart

Build streaming data pipelines to inform real-time decision making Create new enriched and curated streams of higher value using: ● Data transformations ● Streaming joins, temporal joins, lookup joins, and versioned joins ● Fan out queries, multi-cluster queries 27 t1, 21.5 USD t3, 55 EUR t5, 35.3 EUR t0, EUR:USD=1.00 t2, EUR:USD=1.05 t4: EUR:USD=1.10 t1, 21.5 USD t3, 57.75 USD t5, 38.83 USD Currency rate Orders STREAMING DATA PIPELINES

Recognize patterns and react to events in a timely manner C price>lag(price) D price<lag(price) C price>lag(price) B price<lag(price) A Double Bottom Period & Volume Price Develop applications using ﬁne-grained control over how time progresses and data is grouped together using: ● Hopping, tumbling, session windows ● OVER aggregations ● Pattern matching with MATCH_RECOGNIZE Pattern Detections SELECT * FROM Ticker MATCH_RECOGNIZE ( PARTITION BY symbol ORDER BY rowtime MEASURES START_ROW.rowtime AS start_tstamp, LAST(PRICE_DOWN.rowtime) AS bottom_tstamp, LAST(PRICE_UP.rowtime) AS end_tstamp ONE ROW PER MATCH AFTER MATCH SKIP TO LAST PRICE_UP PATTERN (START_ROW PRICE_DOWN+ PRICE_UP) DEFINE PRICE_DOWN AS (LAST(PRICE_DOWN.price, 1) IS NULL AND PRICE_DOWN.price < START_ROW.price) OR PRICE_DOWN.price < LAST(PRICE_DOWN.price, 1), PRICE_UP AS PRICE_UP.price > LAST(PRICE_DOWN.price, 1) ) MR;

Analyze real-time data streams to generate important business insights Get up-to-date results to power dashboards or applications requiring continuous updates using: ● Materialized views ● Temporal analytic functions ● Interactive queries Account Balance A $15 B $2 C $15 Account A, +$10 Account B, +$12 Account C, +$5 Account B, -$10 Account C, +$10 Account A, -$5 Account A, +$10 Time REAL-TIME ANALYTICS SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS row_num FROM ShopSales) WHERE row_num <= 5

Scalability and Performance Fault Tolerance Flink is a top 5 Apache project and boasts a robust developer community Uniﬁed Processing Flink is capable of supporting stream processing workloads at tremendous scale Language Flexibility Flink's fault tolerance mechanisms ensure it can handle failures effectively and provide high availability Flink supports Java, Python, & SQL with 150+ built-in functions, enabling devs to work in their language of choice Flink supports stream processing, batch processing, and ad-hoc analytics through one technology Developers choose Flink because of its performance and rich feature set

Flink’s powerful runtime offers limitless scalability Job Manager Client . . . . . . Task Slot . . . . . . Task Slot . . . . . . Task Slot . . . . . . Task Slot Data Streams Deploy, Stop, Cancel Tasks Trigger Checkpoints Submit Job Results Applications are parallelized into possibly thousands of tasks that are distributed and concurrently executed in a cluster

Leverage in-memory performance . . . Durable Storage Logic State Logic State Logic State Input Tasks Output In-Memory or On-Disk State Local State Access Periodic, Asynchronous, Incremental Snapshots Stateful Flink applications are optimized for fast access to local state by maintaining task state in memory or on-disk data structures, resulting in low latency processing.

Flink checkpoints and savepoints enable fault tolerance and stateful processing CHECKPOINTS SAVEPOINTS Automatic snapshot created by Flink periodically ● Used to recover from failures ● Optimized for quick recovery ● Automatically created and managed by Flink User-triggered snapshot at a speciﬁc point in time ● Enables manual operational tasks, such as upgrades ● Optimized for operational ﬂexibility ● Created and managed by the user

Flink recovers from failures in a timely and efﬁcient manner Job Manager Client . . . . . . Task Slot . . . . . . Task Slot . . . . . . Task Slot . . . . . . Task Slot Data Streams Deploy, Stop, Cancel Tasks Trigger Checkpoints Submit Job Results If a task managers fails, the job manager will detect the failure and arrange for the job to be restarted from the most recent state snapshot X

Flink offers layered APIs at different levels of of abstraction to handle both common and specialized use cases Flink SQL Table API DataStream API ProcessFunction Apache Flink Runtime Low-level Stream Operator API DataStream API ProcessFunction Table / SQL API Optimizer / Planner Level of Abstraction How the Code is Organized Flink SQL High-level, declarative API that allows you to write SQL queries to process data streams and batch data as dynamic tables Table API Programmatic equivalent of Flink SQL, allowing you to deﬁne your business logic in either Java or Python, or combine it with SQL DataStream API Low-level, expressive API that exposes the building blocks for stream processing, giving you direct access to things like state and timers ProcessFunction The most low-level API, allowing for ﬁne-grained processing of individual elements for complex event-driven processing logic and state management

Flink SQL is an ANSI-compliant SQL engine that can deﬁne both simple and complex queries, making it well-suited for most stream processing use cases, particularly building real-time data products and pipelines. GROUP BY color events results COUNT WHERE color <> orange 4 3 Streaming made simple.

INSERT INTO enriched_reviews SELECT id , review , invoke_openai(prompt,review) as score FROM product_reviews ; K N B Kate 4 hours ago This was the worst decision ever. Nikola 1 day ago Not bad. Could have been cheaper. Brian 3 days ago Amazing! Game Changer! K N B Kate ★★★★★ 4 hours ago This was the worst decision ever. Nikola ★★★★★ 1 day ago Not bad. Could have been cheaper. Brian ★★★★★ 3 days ago Amazing! Game Changer! The Prompt “Score the following text on a scale of 1 and 5 where 1 is negative and 5 is positive returning only the number” DATA STREAMING PLATFORM Enrich real-time data streams with Generative AI directly from Flink SQL COMING SOON

Flink supports uniﬁed stream and batch processing ● Entire pipeline must always be running ● Execution proceeds in stages, running as needed ● Input must be processed as it arrives ● Input may be pre-sorted by time and key ● Results are reported as they become ready ● Results are reported at the end of the job ● Failure recovery resumes from a recent snapshot ● Failure recovery does a reset and full restart ● Flink guarantees effectively exactly-once results despite out-of-order data and restarts due to failures, etc. ● Effectively exactly-once guarantees are more straightforward

Flink SQL operators work across both stream and batch processing modes STREAMING AND BATCH BATCH ONLY • SELECT FROM [WHERE] • GROUP BY [HAVING] (includes time-based windowing) • OVER aggregations (including Top-N and Deduplication queries) • INNER + OUTER JOINs • MATCH_RECOGNIZE (pattern matching) • Set Operations • User-Deﬁned Functions • Statement Sets STREAMING ONLY • ORDER BY time ascending only • INNER JOIN with Temporal (versioned) table External lookup table • ORDER BY anything

Enhancing Apache Flink as a cloud-native service

Deployment Complexity Setting up Flink requires a deep understanding of resource allocation and management Management & Monitoring Identifying relevant metrics can be overwhelming for DevOps teams Incomplete Ecosystem OSS Flink lacks pre-built integrations with observability, metadata management, data governance, and security tooling Cost & Risk Self-supporting Flink incurs signiﬁcant costs & resources in terms of infra footprint and Dev & Ops FTEs However, operating Flink on your own (along with Kafka) is difﬁcult

Months Minutes Weeks Open Source Apache Flink In-house development and maintenance without support Cloud-hosted Flink services Manual Day 2 operations with basic tooling and/or support Apache Flink on Conﬂuent Cloud Fully managed, elastic, and automated product capabilities with zero overhead Go from zero to production in minutes versus months

Real-time processing Power low-latency applications and pipelines that react to real-time events and provide timely insights Data reusability Share consistent and reusable data streams widely with downstream applications and systems Data enrichment Curate, filter, and augment data on-the-fly with additional context to improve completeness, accuracy, & compliance Efficiency Improve resource utilization and cost-effectiveness by avoiding redundant processing across silos Effortlessly filter, join, and enrich your data streams with Apache Flink “With Confluent’s fully managed Flink offering, we can access, aggregate, and enrich data from IoT sensors, smart cameras, and Wi-Fi analytics, to swiftly take action on potential threats in real time, such as intrusion detection. This enables us to process sensor data as soon as the events occur, allowing for faster detection and response to security incidents without any added operational burden.”

"When used in combination, Apache Flink & Apache Kafka can enable data reusability and avoid redundant downstream processing. The delivery of Flink & Kafka as fully managed services delivers stream processing without the complexities of infrastructure management, enabling teams to focus on building real-time streaming applications & pipelines that differentiate the business." Enterprise-grade security Secure stream processing with built-in identity and access management, RBAC, and audit logs Stream governance Enforce data policies and avoid metadata duplication leveraging native integration with Stream Governance Monitoring Ensure the health and uptime of your Flink queries in the Confluent UI or via 3rd party monitoring services Connectors Ensure the health and uptime of your Flink queries in the Confluent UI or via 3rd party monitoring services Experience Kafka and Flink seamlessly integrated as a unified platform Monitoring Connectors Enterprise-grade Security Stream Governance

Fully managed Easily develop Flink applications with a serverless, SaaS- based experience instantly available & without ops burden Elastic scalability Automatically scale up or down to meet the demands of the most complex workloads without overprovisioning Usage-based billing Pay only for resources used instead of infrastructure provisioned, with scale-to-zero pricing Continuous, no touch updates Build using an always up-to-date platform with declarative, versionless APIs and interfaces Enable high-performance and efﬁcient stream processing at any scale Throughput Over Time Capacity Demand "When used in combination, Apache Flink & Apache Kafka can enable data reusability and avoid redundant downstream processing. The delivery of Flink & Kafka as fully managed services delivers stream processing without the complexities of infrastructure management, enabling teams to focus on building real-time streaming applications & pipelines that differentiate the business."

SQL client in Conﬂuent Cloud CLI Different teams with different skills and needs can access stream processing using the interface of their choice Rich SQL editing user interface Tap into a next-generation, serverless SQL experience …

Select region(s) to create a compute pool 1 Role bindings automatically created for you 2 Start processing in Flink 3 …automatically provisioned and instantly available

Santander Stream Processing with Apache Flink

Santander Stream Processing with Apache Flink

More Related Content

What's hot

Similar to Santander Stream Processing with Apache Flink

More from confluent

Recently uploaded

In this document

Santander Stream Processing with Apache Flink