Why Big Data and application platforms converge

Application and Big Data architectures have evolved significantly over the last few years. Modern applications are moving to event-driven architectures, while Big Data has moved to cloud-based lambda architectures and streaming services.

What still connects the two, however, are complex point-to-point connections between databases and exceptionally difficult-to-manage ETL tasks.


Platform teams say maintaining availability of data pipelines is enormously complex and costly.

And there are further challenges: pressure on Big Data teams to move from batch to streaming; the work of data science teams rarely making it into production, and when it does, often carrying high technical debt; and a disconnect between the schemas of application data and the data held in the data lake.
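One way to narrow that schema gap is to have application producers and the data-lake ingestion share a single, versioned event contract. A minimal sketch in Python (the field names and the `validate` helper are illustrative assumptions, not details from this article):

```python
# A single event contract shared by the application producer and the
# data-lake ingestion job (field names are illustrative, not from the article).
ORDER_SCHEMA = {
    "order_id": str,
    "amount": float,
    "region": str,
}

def validate(event: dict, schema: dict = ORDER_SCHEMA) -> bool:
    """True if the event carries exactly the contracted fields and types."""
    return set(event) == set(schema) and all(
        isinstance(event[field], expected) for field, expected in schema.items()
    )
```

Because both sides import the same contract, a schema change breaks loudly at the producer rather than silently corrupting the lake.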

What we’re noticing, however, is that Apache Kafka is a rare technology: it’s appreciated equally by application platform engineering teams, as a way to facilitate communication between microservices, and by data engineers, as a platform for building scalable and reliable data pipelines and stream processing applications.

It’s both a messaging system and a fantastic stream processing framework. This makes it an ideal cloud-agnostic bridge between application and Big Data architectures.
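That dual role can be sketched with a minimal producer: a microservice publishes an event to a topic that a data pipeline can consume from the same cluster. This is a hedged illustration only; the `orders` topic, the broker address and the kafka-python client are assumptions, not details from this article:

```python
import json

def encode_event(event: dict) -> bytes:
    """Serialize an application event to JSON bytes for Kafka."""
    return json.dumps(event, sort_keys=True).encode("utf-8")

def decode_event(raw: bytes) -> dict:
    """Deserialize an event on the consuming (data engineering) side."""
    return json.loads(raw.decode("utf-8"))

def publish_order(event: dict) -> None:
    """Publish a microservice event to a topic that data pipelines can
    also read (assumes kafka-python installed and a broker on localhost)."""
    from kafka import KafkaProducer  # third-party client, assumed available
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("orders", encode_event(event))
    producer.flush()
```

The same bytes on the topic serve both audiences: microservices react to individual events, while pipeline jobs replay the topic in bulk.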


Kafka’s other benefits include:

- it has proven to scale in exceptionally large multi-cloud environments;
- it provides mature frameworks and connectors for getting data in and out (to Spark, Cassandra, etc.);
- it’s backed by a mature community;
- it offers high-level abstractions for building stream processing applications (such as KSQL) and supports data-intensive computation.
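As an illustration of that high-level abstraction, a couple of KSQL statements can turn a raw topic into a continuously updated table. This is a sketch; the `orders` topic and its columns are made-up names for the example:

```sql
-- Declare a stream over an existing Kafka topic (illustrative names)
CREATE STREAM orders (order_id VARCHAR, amount DOUBLE, region VARCHAR)
  WITH (KAFKA_TOPIC = 'orders', VALUE_FORMAT = 'JSON');

-- A continuous query: running order totals per region
CREATE TABLE revenue_by_region AS
  SELECT region, SUM(amount) AS total
  FROM orders
  GROUP BY region;
```

The second statement runs indefinitely, keeping the aggregate up to date as new events arrive, with no separate batch job to schedule.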

Allowing analytics, data science and engineering teams to work off the same converged environment can bring great benefits to organisations that want to accelerate building data-rich applications.

Of course, this isn’t to say that the approach doesn’t create new problems: data pipelines start to become application-critical, requiring extra monitoring, and bring new compliance and governance challenges.

Follow The Data Difference for more blogs