Optimizing Cloud-Native Apache Kafka Performance ft. Alok Nikhil and Adithya Chandra

Streaming Audio: Apache Kafka® & Real-Time Data - A podcast by Confluent, founded by the original creators of Apache Kafka®

Maximizing cloud Apache Kafka® performance isn't just about running data processes on cloud instances. A lot of engineering work is required to set and maintain a high standard for speed and availability. Alok Nikhil (Senior Software Engineer, Confluent) and Adithya Chandra (Staff Software Engineer II, Confluent) share how they optimize Kafka on Confluent Cloud and the three guiding principles they follow, whether you are self-managing Kafka or building a cloud-native system:

1. Know your users and plan for their workloads
2. Infrastructure matters for performance as well as cost efficiency
3. Effective observability: you can't improve what you don't see

A large part of setting and achieving performance standards is understanding that workloads vary and come with unique requirements. Performance has many dimensions, such as the number of partitions and the number of connections. Alok and Adithya suggest starting by identifying the workload patterns most important to your business objectives, simulating and reproducing them, and using the results to optimize the software (a minimal workload-simulation sketch appears after the episode links below).

When identifying workloads, it's essential to determine the infrastructure you'll need to support them economically. Infrastructure optimization is as important as performance optimization: it's best practice to know the infrastructure available to you and choose the appropriate hardware, operating system, and JVM, allocating processes so that workloads run efficiently.

With the necessary infrastructure patterns in place, it's crucial to monitor metrics to ensure that your application runs as expected, consistently with every release. Having the right observability metrics and logs allows you to identify and troubleshoot issues quickly. Profiling and request sampling help you dive deeper into performance issues, particularly during incidents. Alok and Adithya's team uses tooling such as async-profiler to profile CPU cycles, heap allocations, and lock contention (a second sketch after the episode links shows one way to read client-side metrics).

Alok and Adithya summarize the learnings and processes they use to optimize managed Kafka as a service, which you can apply to your own cloud-native applications. You can also read more about their journey on the Confluent blog.

EPISODE LINKS
- Speed, Scale, Storage: Our Journey from Apache Kafka to Performance in Confluent Cloud
- Cloud-Native Apache Kafka
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Intro to Event-Driven Microservices with Confluent
- Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
- Watch the video version of this podcast
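
For readers who want to experiment with the workload-simulation idea, here is a minimal sketch (not code from the episode): a small Java producer that sends fixed-size records at a steady target rate so a candidate cluster or instance type can be observed under a known load. The broker address, topic name, record size, rate, and duration are illustrative assumptions.

import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

// Minimal workload simulator: fixed-size records at a target rate.
// All values below (broker, topic, size, rate, duration) are assumptions for illustration.
public class WorkloadSimulator {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        int recordSizeBytes = 1024;        // 1 KiB records
        int targetRecordsPerSecond = 500;  // target throughput for this run
        int runSeconds = 60;               // length of the simulated workload
        byte[] payload = new byte[recordSizeBytes];

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            long intervalNanos = TimeUnit.SECONDS.toNanos(1) / targetRecordsPerSecond;
            long next = System.nanoTime();
            for (long i = 0; i < (long) runSeconds * targetRecordsPerSecond; i++) {
                producer.send(new ProducerRecord<>("perf-test-topic", payload)); // assumed topic name
                next += intervalNanos;
                long sleepNanos = next - System.nanoTime();
                if (sleepNanos > 0) {
                    TimeUnit.NANOSECONDS.sleep(sleepNanos);
                }
            }
            producer.flush();
        }
    }
}

Varying the record size, rate, and partition count across runs is one way to approximate the "identify and reproduce your most important workload patterns" step described above; Kafka also ships command-line perf-test tools that serve the same purpose without writing code.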
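On the observability point, the sketch below shows one way to read a Kafka Java client's built-in metrics (such as send rate and request latency) while a workload runs. It is a client-side illustration only, not the broker-side metrics pipeline or async-profiler workflow the team describes; the metric names come from the standard "producer-metrics" group of the Java client.

import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

// Prints a few built-in producer metrics; call periodically while a workload is running.
public class ClientMetricsProbe {
    static void printProducerMetrics(KafkaProducer<byte[], byte[]> producer) {
        Map<MetricName, ? extends Metric> metrics = producer.metrics();
        for (Map.Entry<MetricName, ? extends Metric> entry : metrics.entrySet()) {
            MetricName name = entry.getKey();
            // "producer-metrics" is the group for whole-producer metrics in the Java client.
            if ("producer-metrics".equals(name.group())
                    && (name.name().equals("record-send-rate")
                        || name.name().equals("request-latency-avg")
                        || name.name().equals("batch-size-avg"))) {
                System.out.printf("%s = %s%n", name.name(), entry.getValue().metricValue());
            }
        }
    }
}

In a production or managed setting, these signals would typically be collected via JMX or a metrics reporter rather than printed, and deeper questions about CPU cycles, heap allocations, and lock contention are where a profiler such as async-profiler, as mentioned in the episode, comes in.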