PubNub with Stephen Blum

Rust in Production - A podcast by Matthias Endler - Thursdays

In this episode, we are joined by Stephen Blum, the CTO of PubNub, a company that runs an edge messaging network with over a billion connected devices. Stephen explains that while message buses like Kafka or RabbitMQ work well at smaller scales, PubNub focuses on the challenge of connecting mobile devices and laptops at web scale. The goal is instant signal delivery at massive scale, with low latency as the priority for a seamless user experience. To achieve this, PubNub's system is globally distributed: it runs on AWS with Kubernetes clusters spread across Amazon's zones, and GeoDNS routes each user to the closest region for the lowest possible latency.

Stephen goes on to discuss the challenges they faced in building the system, particularly around memory management and cleanup. Segmentation faults and memory leaks caused runtime problems, outages, and potential data loss; PubNub had to provision additional memory to compensate for the leaks and spend engineering time finding and fixing them. C was efficient, but it came with significant engineering costs. As a remedy, PubNub started adopting Rust, and when they replaced one service with a Rust implementation they observed a 5x improvement in memory usage and performance.

Stephen also talks about choosing programming languages for the platform and the difficulty of finding and retaining C experts. Java was ruled out because of its perceived academic nature, and Go did not make the shortlist at the time. They do now run Go services in production, but a rewrite of part of their PubSub bus in Go performed poorly compared to the existing C system. As a result, Rust has become their language of choice for new services, citing its popularity and the impressive results so far.

The conversation then turns to performance with Python and the use of PyPy as a just-in-time compiler. PyPy improved performance, but it also required a lot of memory, which could be expensive. Rust, by contrast, delivered a significant boost in both memory usage and performance, making it the more attractive choice for PubNub. On provisioning, they work within a budget and aim to stay as close as possible to what they actually need, using Kubernetes and Horizontal Pod Autoscaling (HPA) to adjust resources dynamically based on usage.

Integrating new services into PubNub's infrastructure involves both API-based communication and event-driven approaches. They use frameworks like Axum for API-based services and Kafka with Protobuf for event sourcing, with JSON still used in some cases (both approaches are sketched in the examples below). Stephen explains that they chose Protobuf for high-traffic topics and wherever stability is crucial. The primary customer-facing API is JSON-based, but PubNub leans on Protobuf's superior performance in certain cases, especially for compacting values such as booleans that JSON would otherwise spell out as character strings, and they benefit from the compression they enable alongside Protobuf.

The team reflects on the philosophy behind exploring Rust's potential for profit and its use in infrastructure and on devices such as IoT hardware. Rust's ability to produce small, optimized binaries is highlighted, and PubNub sees it as their top choice for reliability and performance; they are developing a Rust SDK for customers building on IoT devices.
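As a rough illustration of the API-based side of that integration story, here is a minimal sketch of an Axum service exposing a single JSON endpoint. The crate choices (axum 0.7, tokio, serde), the route, and the payload are assumptions for illustration, not PubNub's actual service code:

```rust
use axum::{routing::get, Json, Router};
use serde::Serialize;

// Hypothetical health-check payload; not PubNub's real API surface.
#[derive(Serialize)]
struct Health {
    status: &'static str,
}

// A handler returning JSON, as a typical Axum-based internal service might.
async fn health() -> Json<Health> {
    Json(Health { status: "ok" })
}

#[tokio::main]
async fn main() {
    // One route, served over a plain TCP listener.
    let app = Router::new().route("/health", get(health));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```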
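To make the Protobuf-versus-JSON trade-off concrete, the following sketch encodes the same small event with both formats and compares the payload sizes, assuming the prost and serde_json crates; the event type, field names, and tags are purely hypothetical:

```rust
use prost::Message;
use serde::Serialize;

// Hypothetical event type; field names and tags are illustrative only.
#[derive(Clone, PartialEq, Message, Serialize)]
struct PresenceEvent {
    #[prost(string, tag = "1")]
    channel: String,
    #[prost(bool, tag = "2")]
    online: bool,
}

fn main() {
    let event = PresenceEvent {
        channel: "device-42".to_string(),
        online: true,
    };

    // Protobuf: the boolean travels as a single tagged byte pair on the wire.
    let proto_bytes = event.encode_to_vec();

    // JSON: the same boolean is spelled out as the field name plus the literal `true`.
    let json_bytes = serde_json::to_vec(&event).unwrap();

    println!(
        "protobuf: {} bytes, json: {} bytes",
        proto_bytes.len(),
        json_bytes.len()
    );
}
```

The size gap between the two encodings is the kind of saving Stephen describes for high-traffic topics, before any compression is even applied.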
The open-source nature of Rust, its ease of integration into existing projects, and its role in developing open standards also draw praise. While acknowledging downsides such as occasional instability and longer compilation times, Stephen remains impressed with Rust's capabilities. On stability and safety, he expresses confidence in the compiler's ability to keep even alpha-stage software and packages in check, and relying on Rust's native concurrency primitives adds to that confidence (see the sketch at the end of these notes). The ecosystem provides adequate coverage, although pre-1.0 packages, such as those wrapping librdkafka, can be challenging to set up or deploy.

Stephen emphasizes simplicity in code and avoiding excessive abstraction, while acknowledging the benefits of features like generics and traits. He points to resources such as David MacLeod's book, which focuses on learning Rust without overwhelming complexity.

Expanding on knowledge sharing within the team, Stephen discusses how internal Rust advocates have encouraged its adoption and the possibilities it opens up for AI infrastructure platforms: Rust could improve performance and reduce latency, particularly for CPU-bound tasks in AI. He also notes Rust's growing adoption in data science, for example around the Parquet data format. He highlights the importance of better tooling, strict standards, and eliminating unsafe code, and would like a linter that enforces a simplified subset of Rust to improve readability, maintainability, and testability. On the balance between functional and object-oriented programming in Rust, he suggests object-oriented structure for larger-scale code organization and functional paradigms within functions.

Onboarding Rust engineers is also addressed: should PubNub prioritize candidates with prior Rust experience, or train engineers who are already skilled in another language on the job? Recognizing the shortage of Rust engineers, Stephen encourages anyone interested in Rust to consider a career at PubNub, pointing to their website and LinkedIn page for tutorials and videos. He closes by emphasizing the importance of latency in their edge messaging technology and inviting listeners to try it out.
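Picking up the point about native concurrency primitives: below is a minimal sketch of the standard-library approach (threads plus Arc and Mutex) whose correct use the compiler enforces at compile time, which is the kind of guarantee Stephen credits for his confidence. The counter and the thread count are arbitrary and only serve the illustration:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared counter guarded by standard-library primitives; the compiler
    // rejects any attempt to share it across threads without Arc + Mutex.
    let delivered = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let delivered = Arc::clone(&delivered);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    // Lock, increment, and release on each iteration.
                    *delivered.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    println!("messages counted: {}", *delivered.lock().unwrap());
}
```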