
Every Company is Becoming a Software Company


In 2011, Marc Andreessen wrote an article called Why Software Is Eating the World. The central idea is that any process that can be moved into software, will be. This has become a kind of shorthand for the investment thesis behind Silicon Valley’s current wave of unicorn startups. It’s also a unifying idea behind the larger set of technology trends we see today, such as machine learning, IoT, ubiquitous mobile connectivity, SaaS, and cloud computing. These trends are all, in different ways, making software more plentiful and capable and are expanding its reach within companies.

I believe that there is an accompanying change—one that is easy to miss, but equally essential. It isn’t just that businesses use more software, but that, increasingly, a business is defined in software. That is, the core processes a business executes—from how it produces a product, to how it interacts with customers, to how it delivers services—are increasingly specified, monitored, and executed in software. This is already true of most Silicon Valley tech companies, but it is a transition that is spreading to all kinds of companies, regardless of the product or service they provide.

To make clear what I mean, let’s look at an example: the loan approval process at a consumer bank. This business process predates computers entirely. Traditionally, it was a multi-week effort in which individuals such as a bank agent, mortgage officer, and credit officer collaborated manually. Today, the process tends to be executed in a semi-automated fashion: each of these functions has independent software applications that help the humans carry out their actions more efficiently. In many banks, however, this is now becoming a fully automated process in which credit software, risk software, and CRM software communicate with each other and produce a decision in seconds. Here, the bank’s loan business has essentially become software. Of course, this is not to imply that companies will become only software (there are still plenty of people in even the most software-centric companies), just that the full scope of the business is captured in an integrated, software-defined process.

Old Way: Using Software vs. New Way: Becoming Software

Companies aren’t just using software as a productivity aid for human processes, they are building out whole parts of the business in code.

This transition has many significant implications, but my focus will be on what it means for the role and design of the software itself. The purpose of an application, in this emerging world, is much less likely to be serving a UI to aid a human in carrying out the activity of the business, and far more likely to be triggering actions or reacting to other pieces of software to carry out the business directly. And, while this is fairly simple to comprehend, it raises a big question:

Are traditional database architectures a good fit for this emerging world?

Databases, after all, have been the most successful infrastructure layer in application development. However, virtually all databases, from the most established relational DBs to the newest key-value stores, follow a paradigm in which data is passively stored and the database waits for commands to retrieve or modify it. What’s often forgotten is that the rise of this paradigm was driven by a particular type of human-facing application in which a user looks at a UI and initiates actions that are translated into database queries. In our example above, it’s clear that a relational database is built to support applications that aid the humans in that 1–2 week loan approval process; but is it the right fit for bringing together the full set of software that would comprise a real-time loan approval process built on continuous queries on top of ever-changing data?

Indeed, it’s worth noting that of the applications that came to prominence with the rise of the RDBMS (CRM, HRIS, ERP, etc.), virtually all began life in the era of software as a human productivity aid. The CRM application made the sales team more effective, the HRIS made the HR team more effective, etc. These applications are all what software engineers call “CRUD” apps: they help users create, read, update, and delete records, and manage business logic on top of that process. Inherent in this is the assumption that there is a human on the other end driving and executing the business process. The goal of these applications is to show something to a human, who will look at the result, enter more data into the application, and then carry out some action in the larger company outside the scope of the software.

This model matched how companies adopted software: in bits and pieces to augment organizations and processes that were carried out by people. But the data infrastructure itself had no notion of how to interconnect or react to things happening elsewhere in the company. This led to all types of ad hoc solutions built up around databases, including integration layers, ETL products, messaging systems, and lots and lots of special-purpose glue code that is the hallmark of large-scale software integration.

Messy Interconnection Between Systems

Because databases don’t model the flow of data, the interconnection between systems in a company is a giant mess.

Emergence of event streams

So what new capabilities do our data platforms need when the primary role of software is not to serve a UI but to directly trigger actions or react to other pieces of software?

I believe the answer starts with the concept of events and event streams. What is an event? Anything that happens—a user login attempt, a purchase, a change in price, etc. What is an event stream? A continually updating series of events, representing what happened in the past and what is happening now.

Event streams present a very different paradigm for thinking about data from traditional databases. A system built on events no longer passively stores a dataset and waits to receive commands from a UI-driven application. Instead, it is designed to support the flow of data throughout a business and the real-time reaction and processing that happens in response to each event that occurs in the business. This may seem far from the domain of a database, but I’ll argue that the common conception of databases is too narrow for what lies ahead.
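To make this concrete, here is a minimal sketch (in Python, using the confluent-kafka client) of what publishing a business event to a stream looks like. The topic name and event fields are illustrative assumptions, not a prescribed schema.

import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# An event records a fact about something that happened.
event = {
    "type": "loan_application_submitted",
    "customer_id": "c-42",
    "amount": 250000,
    "timestamp": time.time(),
}

# Keying by customer keeps all of one customer's events in a single
# partition, so they are processed in order.
producer.produce("loan-events", key=event["customer_id"], value=json.dumps(event))
producer.flush()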

Apache Kafka® and its uses

I’ll sketch out some of the use cases for events by sharing my own background with these ideas. The founders of Confluent originally created the open source project Apache Kafka while working at LinkedIn, and over recent years Kafka has become a foundational technology in the movement to event streaming. Our motivation was simple: though all of LinkedIn’s data was generated continuously, 24 hours a day, by processes that never stopped, the infrastructure for harnessing that data was limited to big, slow, batch data dumps at the end of the day and simplistic lookups. The concept of “end-of-the-day batch processing” seemed to be some legacy of a bygone era of punch cards and mainframes. Indeed, for a global business, the day doesn’t end.

It was clear as we dove into this challenge that there was no off-the-shelf solution to this problem. Furthermore, having built the NoSQL databases that powered the live website, we knew that the emerging renaissance of distributed systems research and techniques gave us a set of tools to solve this problem in a way that wasn’t possible before. We were aware of the academic literature on “stream processing,” an area of research that extended the storage and data processing techniques of databases beyond static tables to apply them to the kind of continuous, never-ending streams of data that were the core of a digital business like LinkedIn.

ETA: Batch vs. Stream Processing

We’re all familiar with the age-old question: “Are we there yet?” The traditional database is a bit like a child and has no way to answer this other than just asking over and over again. With stream processing, the ETA becomes a continuous calculation always in sync with the position of the car.

In a social network, an event could represent a click, an email, a login, a new connection, or a profile update. Treating this data as an ever-occurring stream made it accessible to all the other systems LinkedIn had.

Our early use cases involved populating data for LinkedIn’s social graph, search, and Hadoop and data warehouse environments, as well as user-facing applications like recommendation systems, newsfeeds, ad systems, and other product features. Over time, the use of Kafka spread to security systems, low-level application monitoring, email, newsfeeds, and hundreds of other applications. This all happened in a context that required massive scale, with trillions of messages flowing through Kafka each day, and thousands of engineers building around it.

After we released Kafka as open source, it started to spread well outside LinkedIn, with similar architectures showing up at Netflix, Uber, Pinterest, Twitter, Airbnb, and others.

As we left LinkedIn to establish Confluent in 2014, Kafka and event streams had begun to garner interest well beyond the Silicon Valley tech companies, and moved from simple data pipelines to directly powering real-time applications.

Some of the largest banks in the world now use Kafka and Confluent for fraud detection, payment systems, risk systems, and microservices architectures. Kafka is at the heart of Euronext’s next-generation stock exchange platform, processing billions of trades in the European markets.

In retail, companies like Walmart, Target, and Nordstrom have adopted Kafka. Projects include real-time inventory management, in addition to integration of ecommerce and brick-and-mortar facilities. Retail had traditionally been built around slow, daily batch processing, but competition from ecommerce has created a push to become integrated and real time.

A number of car companies, including Tesla and Audi, have built out the IoT platform for their next-generation connected cars, modeling car data as real-time streams of events. And they’re not the only ones doing this. Trains, boats, warehouses, and heavy machinery are starting to be captured in event streams as well.

What started as a Silicon Valley phenomenon is now quite mainstream, with hundreds of thousands of organizations using Kafka, including over 60% of the Fortune 100.

Event streams as the central nervous system

In most of these companies, Kafka was initially adopted to enable a single, scalable, real-time application or data pipeline for one particular use case. This initial usage tends to spread rapidly within a company to other applications.

The reason for this rapid spread is that event streams are inherently multi-reader: an event stream can have any number of “subscribers” that process, react, or respond to it.

To take retail as an example, a retailer might begin by capturing the stream of sales that occur in stores for a single use case, say, speeding up inventory management. Each event might represent the data associated with one sale: which products sold, what store they sold in, etc. But though usage might start with a single application, this same stream of sales is critical for systems that do pricing, reporting, discounting, and dozens of other use cases. Indeed, in a global retail business there are hundreds or even thousands of software systems that manage the core processes of the business, from inventory management and warehouse operations to shipments, price changes, analytics, and purchasing. How many of these core processes are impacted by the simple event of a product being sold? The answer is many or most of them, as selling a product is one of the most fundamental activities in retail.

This is a virtuous cycle of adoption. The first application brings with it critical data streams. New applications join the platform to get access to those data streams, and bring with them their own streams. Streams bring applications, which in turn bring more streams.

The core idea is that an event stream can be treated as a record of what has happened, and any system or application can tap into this in real time to react, respond, or process the data stream.
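As a rough sketch of what multi-reader means in practice (Python, confluent-kafka; topic and group names are hypothetical), two applications can subscribe to the same sales stream under different consumer group IDs, and each gets its own complete copy of the events:

from confluent_kafka import Consumer

def make_consumer(group_id):
    # Each consumer group tracks its own offsets, so groups never
    # consume events "away" from one another.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["sales"])
    return consumer

inventory_consumer = make_consumer("inventory-management")
pricing_consumer = make_consumer("dynamic-pricing")

# Each application runs its own poll loop over the same stream.
msg = inventory_consumer.poll(timeout=1.0)
if msg is not None and msg.error() is None:
    print("inventory app saw sale:", msg.value())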

This has very significant implications. Internally, companies are often a spaghetti-mess of interconnected systems, with each application jury-rigged to every other. This is an incredibly costly, slow approach. Event streams offer an alternative: there can be a central platform supporting real-time processing, querying, and computation. Each application publishes the streams related to its part of the business and relies on other streams, in a fully decoupled manner.

Event Streaming Platform

In driving interconnection, the event streaming platform acts as the central nervous system for the emerging software-defined company. We can think of the individual, disconnected UI-centric applications as a kind of single-celled organism of the software world. One doesn’t get to an integrated digital company by simply stacking many of these on top of one another, any more than a dog could be created from a pile of undifferentiated amoebas. A multicellular animal has a nervous system that coordinates all the individual cells into a single entity that can respond, plan, and act instantly on whatever it experiences in any of its constituent parts. A digital company needs a software equivalent to this nervous system that connects all its systems, applications, and processes.

This is what makes us believe this emerging event streaming platform will be the single most strategic data platform in a modern company.

The event streaming platform: Databases and streams must join together

Doing this right isn’t just a matter of productizing the duct tape companies have built for ad hoc integrations. That is insufficient for the current state, let alone the emerging trends. What is needed is a real-time data platform that incorporates the full storage and query processing capabilities of a database into a modern, horizontally scalable, data platform.

And the needs for this platform are more than just simply reading and writing to these streams of events. An event stream is not just a transient, lossy spray of data about the things happening now—it is the full, reliable, persistent record of what has happened in the business going from past to present. To fully harness event streams, a real-time data platform for storing, querying, processing, and transforming these streams is required, not just a “dumb pipe” of transient event data.

Combining the storage and processing capabilities of a database with real-time data might seem a bit odd. If we think of a database as a kind of container that holds a pile of passive data, then event streams might seem quite far from the domain of databases. But the idea of stream processing turns this on its head. What if we treat the stream of everything that is happening in a company as “data” and allow continuous “queries” that process, respond, and react to it? This leads to a fundamentally different framing of what a database can be.

In a traditional database, the data sits passively and waits for an application or person to issue queries that are responded to. In stream processing, this is inverted: the data is a continuous, active stream of events, fed to passive queries that simply react to and process that stream.

Tables | Commit Log

In some ways, databases already exhibit this duality of tables and streams of events in their internal design, if not in their external features. Most databases are built around a commit log that acts as a persistent stream of data modification events. This log is usually nothing more than an implementation detail in traditional databases, not accessible to queries. However, in the event streaming world, the log needs to become a first-class citizen along with the tables it populates.

The case for integrating these two things is based on more than database internals, though. Applications are fundamentally about reacting to events that occur in the world using data stored in tables. In our retail example, a sale event impacts the inventory on hand (state in a database table), which impacts the need to reorder (another event!). Whole business processes can form from these daisy chains of application and database interactions, creating new events while also changing the state of the world at the same time (reducing stock counts, updating balances, etc.).
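A minimal sketch of such a daisy chain (Python, confluent-kafka) might look like the following: a sale event updates table-like state (stock on hand) and, when a threshold is crossed, emits a new reorder event. The topic names, fields, and threshold are hypothetical.

import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "inventory-service",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["sales"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

stock = {"sku-123": 40}      # the "table": current stock per product
REORDER_THRESHOLD = 10

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    sale = json.loads(msg.value())                      # event in
    sku = sale["sku"]
    stock[sku] = stock.get(sku, 0) - sale["quantity"]   # state change
    if stock[sku] < REORDER_THRESHOLD:                  # new event out
        reorder = {"sku": sku, "quantity": 100}
        producer.produce("reorders", key=sku, value=json.dumps(reorder))
        producer.flush()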

Traditional databases only supported half of this problem and left the other half embedded in application code. Modern stream processing systems like KSQL are bringing these abstractions together to start to complete what a database needs to do across both events and traditional tables of stored data, but the unification of events with databases is a movement that is just beginning.

Confluent’s mission

Confluent’s mission is to build this event streaming platform and help companies begin to re-architect themselves around it. Confluent’s founders and a number of its early employees have been working on this project since even before the company was born.

Our approach to building this platform is from the bottom up. We started by helping to make the core of Kafka reliably store, read, and write event streams at massive scale. Then we moved up the stack to connectors and KSQL to make using event streams easy and approachable, which we think is critical to building a new mainstream development paradigm.

Confluent Icon

We’ve made this stack available as both a software product as well as a fully managed cloud service across the major public cloud providers. This allows the event streaming platform to span the full set of environments a company operates in and integrate data and events across them all.

There are huge opportunities for an entire ecosystem to build up on top of this new infrastructure: from real-time monitoring and analytics, to IoT, to microservices, to machine learning pipelines and architectures, to data movement and integration.

As this new platform and ecosystem emerges, we think it can be a major part of the transition companies are going through as they define and execute their business in software. As it grows into this role I think the event streaming platform will come to equal the relational database in both its scope and its strategic importance. Our goal at Confluent is to help make this happen.

Jay Kreps is the CEO of Confluent as well as one of the co-creators of Apache Kafka. He was previously a senior architect at LinkedIn.

I’ll be sharing more about the impact of Apache Kafka and major developments to come during my keynote at Kafka Summit San Francisco, which takes place from September 30th to October 1st. If you’re interested in attending, feel free to register with the code blog19 to get 30% off.


Real-Time Analytics and Monitoring Dashboards with Apache Kafka and Rockset


In the early days, many companies simply used Apache Kafka® for data ingestion into Hadoop or another data lake. However, Apache Kafka is more than just messaging. The significant difference today is that companies use Apache Kafka as an event streaming platform for building mission-critical infrastructures and core operations platforms. Examples include microservice architectures, mainframe integration, instant payment, fraud detection, sensor analytics, real-time monitoring, and many more—all driven by business value, which should always be a key consideration from the start of each new Kafka project:

Confluent Use Cases

Access to massive volumes of event streaming data through Kafka has sparked strong interest in interactive, real-time dashboards and analytics. The idea is similar to what was built on top of batch frameworks like Hadoop in the past using Impala, Presto, or BigQuery: the user wants to ask questions and get answers quickly.

In the most critical use cases, every second counts. Batch processing, with reports arriving after minutes or even hours, is not sufficient. Leveraging Rockset, a scalable SQL search and analytics engine based on RocksDB, in conjunction with BI and analytics tools, we’ll examine a solution that performs interactive, real-time analytics on top of Apache Kafka, and we’ll show a live monitoring dashboard example with Redash. Rockset supports JDBC and integrates with other SQL dashboards like Tableau, Grafana, and Apache Superset. Some Kafka and Rockset users have also built real-time e-commerce applications, for example, using Rockset’s Java, Node.js®, Go, and Python SDKs, where an application can use SQL to query raw data coming from Kafka through an API (but that is a topic for another blog).

Kafka + Rockset

Let’s now dig a little bit deeper into Kafka and Rockset for a concrete example of how to enable real-time interactive queries on large datasets, starting with Kafka.

Apache Kafka as an event streaming platform for real-time analytics

Apache Kafka is an event streaming platform that combines messaging, storage, and data processing. The Apache Kafka project includes two additional components: Kafka Connect for integration and Kafka Streams for stream processing. Kafka’s ecosystem also includes other valuable components, which are used in most mission-critical projects. Among these are Confluent Schema Registry, which ensures the right message structure, and KSQL for continuous stream processing on data streams, such as filtering, transformations, and aggregations using simple SQL commands.
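For a sense of the kind of continuous filtering KSQL expresses in a single SQL statement, here is a hand-rolled equivalent written as a consume-transform-produce loop (Python, confluent-kafka). The topic names and the filter predicate are hypothetical; this is an illustration of the pattern, not KSQL itself.

import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "large-payment-filter",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["payments"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    payment = json.loads(msg.value())
    if payment["amount"] > 10000:       # keep only large payments
        producer.produce("large-payments", key=msg.key(), value=msg.value())
        producer.poll(0)                # serve delivery callbacks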

Kafka often acts as the core foundation of a modern integration layer. The article Apache Kafka vs. Enterprise Service Bus (ESB) – Friends, Enemies or Frenemies? explains in more detail why many new integration architectures leverage Apache Kafka instead of legacy tools like RabbitMQ, ETL, and ESB.

Not only can Kafka be used for both real-time and batch applications, but it can also integrate with non-event-streaming communication paradigms like files, REST, and JDBC. In addition, it is often used for smaller datasets (e.g., bank transactions) to ensure reliable messaging and processing with high availability, exactly-once semantics, and zero data loss.

Kafka Connect is a core component in event streaming architecture. It enables easy, scalable, and reliable integration with all sources and sinks, as can be seen with the real-time Twitter feed in our upcoming example—and the same applies if mainframes, databases, logs, or sensor data are involved in your use case. The ingested data is stored in a Kafka topic. Kafka Connect acts as a sink to consume the data in real time and ingest it into Rockset.

Regardless of whether your data is coming from edge devices, on-premises datacenters, or cloud applications, you can integrate it with a self-managed Kafka cluster or with Confluent Cloud (https://www.confluent.io/confluent-cloud), which provides serverless Kafka, mission-critical SLAs, consumption-based pricing, and zero effort on your part to manage the cluster.

Complementary to the Kafka ecosystem and Confluent Platform is Rockset, which likewise serves as a great fit for interactive analysis of event streaming data.

Overview of Rockset technology

Rockset is a serverless search and analytics engine that can continuously ingest data streams from Kafka without the need for a fixed schema and serve fast SQL queries on that data. The Rockset Kafka Connector is a Confluent-verified Gold Kafka connector sink plugin that takes every event in the topics being watched and sends it to a collection of documents in Rockset. Users can then build real-time dashboards or data APIs on top of the data in Rockset.

Rockset employs converged indexing, where every document is indexed multiple ways in document, search, and columnar indexes, to provide low-latency queries for real-time analytics. This use of indexing to speed performance is akin to the approach taken by search engines, like Elasticsearch, except that users can query Rockset using standard SQL and do joins across different datasets. Other SQL engines, like Presto and Impala, are optimized for high throughput more so than low latency and rely less on indexing.

Rockset is designed to take full advantage of cloud elasticity for distributed query processing, which ensures reliable performance at scale without managing shards or servers. You can either do interactive queries using SQL in your user interface or command line, or provide developers with real-time data APIs for building applications to automate the queries.

Real-time decision-making and live dashboards using Kafka and Rockset

Let’s walk through a step-by-step example for creating a real-time monitoring dashboard on a Twitter JSON feed in Kafka, without going through any ETL to schematize the data upfront. Because Rockset continuously syncs data from Kafka, new tweets can show up in the real-time dashboard in a matter of seconds, giving users an up-to-date view of what’s going on in Twitter. While Twitter is nice for demos (and some real use cases), you can, of course, integrate with any other event streaming data from your business applications the same way.
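For orientation, the Kafka side of this pipeline is just raw JSON documents in a topic. The sketch below (Python, confluent-kafka) produces a tweet-shaped document by hand; the topic name and payload are illustrative, and in the walkthrough the data actually arrives from a live Twitter feed via Kafka Connect.

import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Nested JSON is fine as-is: Rockset ingests the raw documents and
# indexes them without a fixed schema.
tweet = {
    "id": 1178000000000000000,
    "text": "Watching $AAPL and $MSFT today",
    "user": {"screen_name": "example_user", "followers_count": 1234},
    "entities": {"symbols": [{"text": "AAPL"}, {"text": "MSFT"}]},
}

producer.produce("twitter_feed", value=json.dumps(tweet))
producer.flush()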

Connecting Kafka to Rockset

To connect Kafka to Rockset, use the Rockset Kafka Connector available on Confluent Hub, and follow the setup instructions in the documentation.

Next, create a new integration to allow the Kafka Connect plugin to forward documents for specific Kafka topics. You can do so by specifying Kafka as the integration type in the Rockset console.

Integrate with an External Service

Figure 1. Click to create a new Apache Kafka integration.

Select the data format and add the names of topics you wish to forward to Rockset from Kafka Connect. Once you create the integration, you will be presented with configuration options to be used with Kafka Connect for forwarding the Twitter data to Rockset.

Creating a collection from a Kafka data source

To complete your setup, create a new collection to ingest documents from the Kafka Twitter stream, using the integration you previously set up. If you are only interested in tweets from the past few months, you can configure a retention policy that drops documents from the collection after “n” days or months.

New Collection

Figure 2. Select Apache Kafka as the data source for the new collection, using the previously created integration.

Querying Twitter data from Kafka

With Kafka data flowing into your Rockset collection, you can run a query to better understand the content of the Twitter feed. The JSON from the Twitter feed shows multiple levels of nesting and arrays, and even though you didn’t perform any data transformation, you can run SQL directly on the raw Twitter stream.

If you are particularly interested in the subset of tweets that contain stock symbols (sometimes referred to as cashtags), you can write a SQL query to find those tweets and unnest the JSON array where the stock symbols are specified.

Rockset-Kafka Query 1

Figure 3. Find a sample of stock symbols in the Twitter feed.

Joining with other datasets

If you want to match the stock symbols with actual company information (e.g., company name and market cap), you can join your collection from the Kafka Twitter stream with more detailed company information from Nasdaq. Here, your query returns the stocks with the most mentions in the past day.

Rockset-Kafka Query 2

Figure 4. Find the most mentioned stocks in the past day, along with more detailed company information.

Generating a real-time monitoring dashboard on Kafka data

Now that you have joined the Kafka stream with stock market data and made it queryable using SQL, connect Rockset to Redash. Note that Rockset supports other dashboarding tools as well, including Grafana, Superset, and Tableau via JDBC. Aside from standard visualization tools, you also have the option to build custom dashboards and applications using SQL SDKs for Python, Java, Node.js, and Go.

Now, let’s generate a real-time monitoring dashboard on the incoming tweets, in which the dashboard is populated with the latest tweets whenever it is refreshed.

Redash

Figure 5. A live dashboard for monitoring spikes in stock symbol mentions in the Twitter stream.

By plugging Kafka into Rockset, you were able to start from a Twitter JSON stream, join different datasets, and create a real-time dashboard using a standard BI tool running SQL queries. No ETL is required, and new data in the Kafka stream shows up in the dashboard within seconds.

Interactive analytics on scalable, event streaming data to act while data is hot

In most projects, data streams are not consumed by just one application but by several different applications. Since Kafka is not just a messaging system, but also stores data and decouples each consumer and producer from one another, each application can process the data feed whenever, and at whatever speed, it needs to.

In the e-commerce example mentioned above, one consumer could process orders in real time using KSQL and Rockset for SQL analytics in the backend. Another consumer could be a CRM system like Salesforce, which saves relevant customer interactions and loyalty information for long-term customer management. And a third consumer could respond to consumer behavior as it happens to recommend additional items or provide a coupon if the user is about to leave the online shop, which can be implemented easily as shown above.

With Confluent Platform and Rockset, you can process and analyze large streams of data in real time using SQL queries, whether it’s through human interaction on a command line or a custom user interface, integrated into the standard BI tool of your company, or automated within a Kafka application.

Interested in more?

Learn more about Rockset, and feel free to stop by the Rockset booth (S24) at Kafka Summit San Francisco next week. If you haven’t already, you can register using the code blog19 to get 30% off. Plus, there will be swag waiting for you… 🙂

Shruti Bhat is SVP of products at Rockset. Prior to Rockset, Shruti led product management for Oracle Cloud, with a focus on AI, IoT, and blockchain. Previously, Shruti was VP of marketing at Ravello Systems, where she drove the startup’s rapid growth from pre-launch to hundreds of customers and a successful acquisition. Prior to that, she was responsible for launching VMware’s vSAN and has led engineering teams at HP and IBM.

Kai Waehner works as a technology evangelist at Confluent. Kai’s main area of expertise lies within the fields of big data analytics, machine learning, integration, microservices, Internet of Things, stream processing, and blockchain. He is a regular speaker at international conferences, such as JavaOne, O’Reilly Software Architecture, and ApacheCon, writes articles for professional journals, and enjoys writing about his experiences with new technologies.

Schema Validation with Confluent 5.4-preview


Robust data governance support through Schema Validation on write is now available in Confluent Platform 5.4-preview. This gives operators a centralized location to enforce data format correctness within Confluent Platform. Enforcing data correctness on write is the first step towards enabling centralized policy enforcement and data governance within your event streaming platform.

Why centralized data governance is important

Data governance ensures that an organization’s data assets are formally and properly managed throughout the enterprise to secure accountability and transferability: different teams and projects within the organization can collaborate on the same contract of how data is generated, transmitted, and interpreted. Once an architectural luxury, data governance has become a necessity for the modern enterprise across the entire stack. It represents a mature set of well-established data management disciplines from the database world, but with event streaming systems, it takes on some new nuances:

  • Any application or other producer sending new messages to the event streaming platform can “speak one language” that all others can understand at any time
  • New types of messages conform to organizational policies, such as a prohibition on personally identifiable information (PII)
  • All clients connecting to the cluster use recent (and efficient) protocol versions to avoid extra costs for protocol upgrade/downgrade at the server side

In a large organization with lots of teams and products all leveraging the platform to build their real-time business logic, trying to enforce data governance policies client by client is extremely difficult. It is therefore important to enforce them in a single place, and the best place is inside the event streaming platform itself, so that we don’t have to audit each client to make sure its application code respects all the rules.

Take schemas as an example. Today, nearly everyone uses standard data formats like Avro, JSON, and Protobuf to define how they will communicate information between services within an organization, either synchronously through RPC calls or asynchronously through Apache Kafka® messages.

For Kafka, all producers and consumers are required to agree on those data schemas to serialize and deserialize messages. In practice, a schema registry service such as the Confluent Schema Registry is used to manage all the schemas associated with the Kafka topics, and all clients talk to this service to register and fetch schemas. Using a schema registry service makes it easier to enforce agreements between clients while ensuring data compatibility and preventing data corruption.
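As a reminder of what that client-side agreement looks like in practice, here is a minimal sketch (Python, confluent-kafka) of a producer that serializes its records with Avro via Schema Registry; the topic name, schema, and URLs are placeholders. Under the default TopicNameStrategy, the serializer registers or looks up the schema under the subject "movies-value" and embeds the schema ID in each message.

from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

schema_str = """
{
  "type": "record",
  "name": "Movie",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "release_year", "type": "int"}
  ]
}
"""

schema_registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
avro_serializer = AvroSerializer(schema_registry, schema_str)

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "value.serializer": avro_serializer,
})

# The record is checked against the registered schema before it is sent.
producer.produce(topic="movies", value={"title": "Metropolis", "release_year": 1927})
producer.flush()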

However, these schemas are only enforced as “agreement” between the clients and are totally agnostic to brokers, which still see all messages as entirely untyped byte arrays. In other words, we cannot prevent unformatted data from being published to and stored in Kafka servers. Today, there is no programmatic way of enforcing that producers talk to a schema registry service to serialize their data according to the defined schema before sending them to Kafka.

Although the schema contracts between clients can at least prevent consumers from returning unformatted messages to users, a mature data governance mechanism requires that we enforce schema validation on the broker itself.

Schema Validation: How hard is it?

To allow Schema Validation on write, Confluent Server must be schema aware.

confluent.value.schema.validation=true

Confluent Server is a component of the Confluent Platform that includes Kafka and additional cloud-native and enterprise-level features. Confluent Server is fully compatible with Kafka, and users can migrate in place between Confluent Server and Kafka.

For Confluent Server to become schema aware, the broker needed a direct interface to the Confluent Schema Registry, just as schema-managing clients have always had. We also need to watch out for potentially significant overhead, as schemas must be validated on a message-by-message basis.

The first step to checking every message’s schema is to add the confluent.schema.registry.url configuration parameter at the broker level—similar to what has been in use on the client side—to let brokers find the Confluent Schema Registry servers and fetch schemas from them. Then we allow users to turn on Schema Validation at the topic level with confluent.key.schema.validation and confluent.value.schema.validation. Setting these configurations to “true” indicates that schema IDs encoded in the keys and values of messages inbound to this Kafka topic will be validated against the Schema Registry service. We also extended the producer protocol to allow brokers to indicate which messages within the batch are rejected for schema validation reasons. Now when a producer gets an error indication in the producer response, the invalid messages will be dropped from the batch, and the callback indicates the error. For more details, please feel free to read KIP-467.

Get started in five minutes

To enable Schema Validation, set confluent.schema.registry.url in your server.properties file.

For example:

confluent.schema.registry.url=http://schema-registry:8081 

By default, Confluent Server uses the TopicNameStrategy to map topics with schemas in Schema Registry. This can be changed for both the key and value fields via confluent.key.subject.name.strategy and confluent.value.subject.name.strategy within the broker properties.

To enable Schema Validation on a topic, set confluent.value.schema.validation=true and confluent.key.schema.validation=true.

For example:

kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 \
--partitions 1 --topic movies \
--config confluent.value.schema.validation=true

That’s it! If a message is produced to the topic movies that doesn’t have a valid schema registered in the Schema Registry, the client will receive an error back from the broker.
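Conversely, here is a rough sketch (Python, confluent-kafka) of what a client that bypasses Schema Registry sees once validation is enabled on the movies topic: the plain JSON bytes are rejected by the broker, and the error surfaces in the producer’s delivery callback rather than landing in the topic. The exact error code reported depends on the broker and client versions.

import json

from confluent_kafka import Producer

def on_delivery(err, msg):
    if err is not None:
        # The broker rejected the record: it is not framed with a valid schema ID.
        print("Message rejected by broker:", err)
    else:
        print("Delivered to", msg.topic(), msg.partition())

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce(
    "movies",
    value=json.dumps({"title": "Metropolis", "release_year": 1927}).encode("utf-8"),
    on_delivery=on_delivery,
)
producer.flush()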

Conclusion

Schema Validation lays the foundation for data governance in Confluent Platform. With just one server configuration parameter, a Confluent Platform operator can now have better control over the data being written to the system down to the topic level. This is just the beginning of a series of data governance features to come.


This work could not have been done without my colleagues Tu Tran, Robert Yokota, Addison Huddy, and Tushar Thole.

Guozhang Wang is a PMC member of Apache Kafka, and also a tech lead at Confluent leading the Streams team. He received his PhD from Cornell University as part of the Cornell Database Group, where he worked on scaling iterative data-driven applications. Prior to Confluent, Guozhang was a senior software engineer at LinkedIn, developing and maintaining its backbone event streaming infrastructure with Apache Kafka and Apache Samza.

Free Apache Kafka as a Service with Confluent Cloud


Go from zero to production on Apache Kafka® without talking to sales reps or building infrastructure

Apache Kafka is the standard for event-driven applications. But it’s not without its challenges, and the ops burden can be heavy. Organizations that successfully build and run their own Kafka environment must make significant investments in engineering and operations to account for failover and security. And they spend hundreds of thousands of dollars on capacity that is often idle but necessary to handle unexpected bursts or spikes in their data. These are the typical reasons for running in the cloud.

There is fantastic news if you’re just coming up to speed on Kafka: we are eliminating these challenges and lowering the entry barrier for Kafka by making Kafka serverless and offering Confluent Cloud for free*.

Free Kafka as a service

As we are announcing today at Kafka Summit San Francisco, you can get started with Confluent Cloud on the cloud of your choice for free. Sign up on the Confluent Cloud landing page, and we’ll give you up to $50 USD off each month for the first three months. We’ve simplified pricing, so $50 goes a long way, and you can easily calculate what you would pay beyond that if you go over.

Kafka made serverless

Confluent Cloud provides a serverless experience for Apache Kafka on your cloud of choice, including Google Cloud Platform (GCP), Microsoft Azure, and Amazon Web Services (AWS). Kafka made serverless means:

  1. Think outcomes, not clusters—you no longer need to worry about pre-provisioning or managing a cluster
  2. Grow as you go based on load, as your underlying Kafka infrastructure scales up and down transparently
  3. Pay precisely for what you use with consumption-based pricing

1. Think outcomes, not clusters

Kafka made serverless means you never need to worry about configuring or managing clusters—all you have to think about is the problem you are trying to solve. In other words, that means you focus on building apps, not infrastructure.

2. Grow as you go with elastic scaling

One of the biggest challenges in building a Kafka environment can be sizing for the future, including unexpected peaks or bursts. If you do not pre-provision this capacity, or expand the cluster fast enough to accommodate an increase or spike in traffic, you run the risk of downtime or data loss.

With Kafka made serverless in Confluent Cloud, you get true elastic scaling without having to change so much as a single configuration parameter in your account. Your environment grows seamlessly from development to testing to production loads without you spending time manually sizing the cluster for peaks or future expansions.

3. Pay precisely for what you use

With consumption-based pricing, you only pay for what you use now (data in, data out, and data retained), and with elastic scaling, your environment is scale proofed. Don’t pay for capacity you don’t need today or idle capacity that you don’t use most of the time. We have always said this is what the cloud is supposed to do, and now in the case of Kafka, it does.

There are three dimensions to Confluent Cloud’s consumption-based pricing:

  1. Per GB of data in (starts at $0.11/GB)
  2. Per GB of data out (starts at $0.11/GB)
  3. Per GB-month of data retained (starts at $0.10/GB, excluding replication)

There is no charge for fixed infrastructure components like brokers or ZooKeeper, so the consumption-based charges scale to zero when your usage does. You never see a minimum charge, and you don’t have to make any commitments beyond the next message you produce or consume.

If you were to stream 1 GB of data in, retain that GB, and do nothing else, it would cost you exactly $0.11 for data in plus $0.30 for storage (1 GB-month at $0.10/GB with 3x replication), for a total of $0.41 on your bill that month. As an example development use case, let’s say you streamed in 50 GB of data, stored all of it, and had two consumers, so you streamed out 100 GB. That translates to $31.50 for the month ($5.50 in, $11.00 out, and $15.00 for storage).
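For the curious, the arithmetic behind those numbers is easy to reproduce. This small sketch uses the starting rates listed above and assumes 3x replication for stored data; actual rates vary by cloud and region.

RATE_IN = 0.11        # $ per GB of data in
RATE_OUT = 0.11       # $ per GB of data out
RATE_STORED = 0.10    # $ per GB-month retained, excluding replication
REPLICATION = 3

def monthly_cost(gb_in, gb_out, gb_retained):
    return (gb_in * RATE_IN
            + gb_out * RATE_OUT
            + gb_retained * REPLICATION * RATE_STORED)

print(round(monthly_cost(1, 0, 1), 2))      # 0.41 -> the 1 GB example above
print(round(monthly_cost(50, 100, 50), 2))  # 31.5 -> the development use case above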

Support from the experts

Another concern you might have as you’re building out a system with Kafka is downtime resulting from running out of file descriptors, connection and authentication storms, running out of disk space, bad topic/broker configurations, or any other administrative headache you might think of. With Confluent Cloud, these become things of the past. Our service-level agreement (SLA) guarantees an uptime of 99.95%. All you have to do is write applications.

You can also leverage the Confluent Cloud community for support, or add on a Confluent Cloud support plan based on the scope of your project. With three tiers in addition to free community support, it’s easy to find the right level you need. Developer support starts at just $29 per month, so you can leverage world-class support even at the earliest stages of development or production.

Choose how and where to deploy

When it comes to Confluent Cloud, you don’t have to make all-or-nothing decisions. As you prove the value of Kafka with your first few event streaming applications built on Confluent Cloud, adoption can be incremental. As the value becomes apparent throughout your organization, Kafka adoption will grow, and your environment will scale effortlessly.

Now that you can leverage all the benefits of Kafka and serverless, get started today on any cloud of your choice. Confluent Cloud is now available on Microsoft Azure in addition to Google Cloud Platform (GCP) and Amazon Web Services (AWS).

You can start with simple consumption-based billing (charged monthly), add the appropriate tier of support when you need it, then make an annual commitment for discounts as your event streaming needs mature. When you get to a steady state for your event streams and know what commitment makes sense, you can take advantage of significant discounts.

With Confluent Cloud Enterprise, we can help you create a custom setup if the needs of your mission-critical apps are no longer met by the standard setup, for example, if you are bursting to more than 100 MBps or need a more sophisticated networking architecture such as VPC peering or a transit gateway.

Ultimately, your event streaming platform will become the central nervous system of your business, powering applications and transporting all your enterprise events. I encourage you to try the serverless experience of Confluent Cloud today—for free*—and take the first step of the journey.

*With the try free promotion for Confluent Cloud, receive up to $50 USD off your bill each calendar month for the first three months. New signups only. Offer ends December 31, 2019.

Priya Shivakumar is the senior director of product marketing at Confluent, where she focuses on product marketing and go-to-market strategy for Confluent Cloud, a fully managed event streaming service that extends Apache Kafka.

Kafka Summit San Francisco 2019: Day 1 Recap


Day 1 of the event, summarized for your convenience.

They say you never forget your first Kafka Summit. Mine was in New York City in 2017, and it had, what, 300 people? Today we welcomed nearly 2,000 to a giant ballroom in San Francisco. There were laser beams. There were two Tony Stark allusions in the first 60 seconds. And most importantly, there was a vibrant and ever-growing community present.

Kafka Summit SF 2019

Let me recap the day for you in case you couldn’t make it, or if you could make it and you just want to relive it instead of going to the afterparty. I mean, it’s your call.

Keynote by Jun Rao

Jun Rao, one of the original co-creators of Apache Kafka® and one of the co-founders of Confluent, walked on the stage to Daft Punk’s Harder, Better, Faster, Stronger and proceeded to take us to school on the basics. He recounted the motivations that originally prompted him, his co-founders Jay Kreps and Neha Narkhede, and other colleagues back at LinkedIn to create Kafka in the first place. They were, like so many open source creators, scratching an architectural itch of their own, the result of which has now grown into the second most active Apache project, in use by more than one-third of Fortune 500 companies.

That is, like Ron Burgundy, kind of a big deal.

Jun Rao Keynote at Kafka Summit SF 2019

Jun led Chris Kasten, VP of Walmart Cloud, through a Q&A session that I personally found to be downright informative. This fact is not precisely tech-related, but I learned that Walmart has a physical location within 10 miles of 90% of the population of the United States. Chris explained how some of Walmart’s digital transformation investments are designed to make use of that fact, in ways like grocery delivery and pickup. And of course, we got to see that awesome commercial with every iconic automobile from the entire panoply of 80s pop culture. We might say that they were speaking my language.

Chris Kasten

A recording, for your viewing pleasure, is provided at the end of this post.

How I spent my breaks

It’s fair to say I had very productive morning and afternoon breaks…

Checked out the on-demand donut station.

And, I saw fresh Kafka swag in the making.

I even stopped by Build-A-Bear at lunchtime with the inaugural class of Confluent Community Catalysts! These are people (see the link for more details) who have made outstanding contributions to the Confluent Community in the past year. Get to know them if you can.

And the day’s not over yet!

Stay tuned

On the agenda for day two: a keynote from Confluent CEO Jay Kreps, a Q&A with Lyft, more sessions and, I am told, more on-demand donuts…

Tim Berglund is a teacher, author and technology leader with Confluent, where he serves as the senior director of developer experience. He can frequently be found speaking at conferences in the U.S. and all over the world. He is the co-presenter of various O’Reilly training videos on topics ranging from Git to distributed systems, and is the author of Gradle Beyond the Basics. He tweets as @tlberglund and lives in Littleton, CO, U.S., with the wife of his youth and their youngest child, the other two having mostly grown up.

Kafka Summit San Francisco 2019: Day 2 Recap


If you looked at the Kafka Summits I’ve been a part of as a sequence of immutable events (and they are, unless you know something about time I don’t), it would look like this: New York City 2017, San Francisco 2017, London 2018, San Francisco 2018, New York City 2019, London 2019, San Francisco 2019. That makes this the seventh Summit I’ve attended.

Yes, you read that right. I’ve officially been to seven Kafka Summits in my career. I started going well before I worked for Confluent. The passionate community, fascinating use cases, and high-quality sessions kept me coming back year after year. It’s grown so much since the early days, but Summit still manages to stay true to the committers and developers who make everything we do at Confluent possible.

Book Signing at Kafka Summit San Francisco

I can’t bear it

Did I mention we went to Build-A-Bear yesterday with the 2020 class of Confluent Community Catalysts? Well, I wanted to share the experience with everyone present at Summit and not just blog readers, so I brought my stuffed bear (I named him “Bear”) onto the stage to get things kicked off. I think he was well received.

Tim Berglund with Bear

Keynote by Jay Kreps

Then Confluent’s somewhat more self-respecting CEO, Jay Kreps, opened the morning with a slight remix of Marc Andreessen’s famous truism that software is eating the world. It is not just that software is eating the world, but that companies themselves are becoming software; that is, not only are businesses using more software—that by itself is a somewhat vacuous observation—but more business processes are being executed end to end by software, asynchronously, without the need for human interaction.

We tend to think of software as a support system for a user interface, even as that interface has changed over the past 50 years from teletype to terminal to GUI to the web to mobile apps. In all cases, a person does a thing to the interface and waits synchronously for a response. Databases have grown up as the optimal tools for managing the data of systems like these, but the more pressing asynchronous needs of the emerging class of applications calls for something different.

Reached for comment, Bear described the streaming database concept as “an exciting development in data infrastructure purpose-built for today’s asynchronous back ends.” I’ve only known him for a day, but I’ve already come to know this sort of perspicacity to be downright typical for him.

Keynote by Jay Kreps

Then Lyft’s Engineering Manager of Streaming Platforms, Dev Tagare, sat down with Jay to talk about how Lyft uses Apache Kafka®. One memorable example from their conversation: the little moving car you see when you’re waiting for your ride? Those movement events flow through Kafka in real time. I find this kind of thing very helpful when I’m explaining to non-tech people what I do. Everybody uses Kafka all day every day. You can’t avoid it, even if you don’t know it’s happening.

Q&A with Dev Tagare from Lyft

Next up, my esteemed co-worker Priya Shivakumar took the stage to talk about Confluent Cloud. She has had a pivotal role in shaping the product, so it was good to hear from her where it’s going and why. As she explained in her blog post yesterday, Confluent Cloud presents us with Kafka made serverless—no brokers, no scaling of clusters—and up to $50 off your bill each month for the first three months after you sign up. Barriers to cloud Kafka usage, be gone.

Priya Shivakumar

Until next year, Kafka Summit!

Of course I’ve said very little about the sessions themselves, which are the backbone of this event. I’ll summarize those for you when the videos come out in a few weeks. And after that, I’ll be telling you about the 2020 London Summit before you know it.

Tim Berglund is a teacher, author and technology leader with Confluent, where he serves as the senior director of developer experience. He can frequently be found speaking at conferences in the U.S. and all over the world. He is the co-presenter of various O’Reilly training videos on topics ranging from Git to distributed systems, and is the author of Gradle Beyond the Basics. He tweets as @tlberglund and lives in Littleton, CO, U.S., with the wife of his youth and their youngest child, the other two having mostly grown up.

Why Scrapinghub’s AutoExtract Chose Confluent Cloud for Their Apache Kafka Needs


We recently launched a new artificial intelligence (AI) data extraction API called Scrapinghub AutoExtract, which turns article and product pages into structured data. At Scrapinghub, we specialize in web data extraction, and our products empower everyone from programmers to CEOs to extract web data quickly and effectively.

Article Extraction Example

Example of article extraction on Introducing a Cloud-Native Experience for Apache Kafka® in Confluent Cloud

As part of our journey, we moved Scrapinghub AutoExtract to the cloud with Confluent Cloud. Being a small team, we wanted to offload as much of the infrastructure work to managed cloud services for ease and cost reasons.

Our use case

Scrapinghub's Use Case

Simplified system overview

The goal of Scrapinghub AutoExtract is to enable users to extract content from a given URL without having to write any custom code. This means users don’t need to worry about site changes or their ability to scale their content extraction from various websites.

Our system receives a URL as an input from a user. This URL needs to be fetched, rendered, and screenshotted. However, this can be a complex procedure—is rendering required? Does JavaScript need to be evaluated? Does a proxy need to be used? How fast does the site load? Do pop-ups appear? As you can imagine, this can become a bottleneck for the system. By using Kafka, we partition the workload, and each partition is processed in parallel by a separate instance of our downloader component.
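A rough sketch of this partitioning (Python, confluent-kafka) is shown below. Keying submitted URLs by domain—so that requests for the same site land on the same partition—is an illustrative assumption rather than a description of our exact production setup, and the topic name is hypothetical.

import json
from urllib.parse import urlparse

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def submit_url(url):
    # The partition key groups requests for the same site together.
    domain = urlparse(url).netloc
    producer.produce("urls-to-fetch", key=domain, value=json.dumps({"url": url}))

submit_url("https://example.com/product/123")
producer.flush()

# Each downloader instance runs a Consumer in the same consumer group
# subscribed to "urls-to-fetch"; Kafka spreads the partitions across the
# instances, and adding instances rebalances and scales the fetching.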

Once the page content has been acquired, it needs to be understood and transformed into structured data. Our AI-enabled data extraction engine is responsible for this, and it is a resource-intensive process. Fortunately, Confluent Cloud allows the system to scale by distributing the workload across several instances of our AI-enabled data extraction engine.

System overview showing how Kafka is used to scale and distribute requests

Why move to the cloud?

Moving the entire Scrapinghub AutoExtract stack to Google Cloud Platform (GCP) with Google Kubernetes Engine (GKE) lets us make use of on-demand instances to quickly and easily scale the system to meet customer demand. Additionally, using cloud services enables us to offload many responsibilities such as running a database, Kafka, or Kubernetes to a cloud provider, and allows our small team to focus on the product.

Choosing Confluent Cloud

For Scrapinghub, using Confluent Cloud for our Kafka needs has allowed us to offload the responsibility of running Kafka. It provides all the benefits of Kafka without requiring us to tune our cluster, manage broker upgrades, worry about encrypting data at rest and in transit, or figure out the best way to run Kafka and ZooKeeper on Kubernetes, among many other things.

In addition, Confluent Cloud offers much more than reduced management overhead. The clever consumption-based pricing model enables us to incur costs based only on stored data and system activity. Confluent Cloud also provides vendor independence, as the offering is based on open source Apache Kafka and is available on all major cloud providers (Google Cloud Platform, Microsoft Azure, and Amazon Web Services).

Why Confluent Cloud?

Reasons to choose Confluent Cloud

Before moving to Confluent Cloud, we looked at some alternatives, but none of them seemed to meet all of our needs.

Running Kafka on Kubernetes ourselves

Confluent provides Helm Charts, which make getting Kafka on Kubernetes up and running straightforward. We chose not to progress with this option as we wanted reliability and low maintenance overhead.

However, for times when we want more control over our Kafka brokers, such as testing how our application responds to a broker outage, we do make use of Helm Charts.

Amazon Managed Streaming for Apache Kafka (Amazon MSK)

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a managed Kafka offering that comes as part of the AWS suite. We chose not to go with it for the following reasons:

  • Vendor lock-in: Kafka clients must run on AWS infrastructure to access the Kafka cluster, so you do not have vendor independence
  • Pricing: AWS has provisioned-based pricing, meaning you will be charged even if the cluster isn’t in use
  • Lack of important features: Amazon MSK only uses open source Apache Kafka, which doesn’t have the features Confluent includes in its platform
  • Support: AWS does not have the same Apache Kafka expertise as Confluent
  • Basic Kafka: Amazon MSK does not offer the latest Apache Kafka version

Getting started with Confluent Cloud

After choosing Confluent Cloud, we first got started by getting our staging environment to operate successfully against Kafka. This process was simple. We created a cluster using the Confluent Cloud UI and updated our application with the configuration provided. No other work was involved.

Nothing changed about how we created our Kafka topics. Using the above configuration, we were able to use the kafka-topics script to create topics, just like we used to do before with our own Kafka cluster.
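As a concrete illustration (the cluster endpoint, API key, and topic name below are placeholders, not our real values), the client configuration for Confluent Cloud is just a small properties file, and topic creation looks exactly as it would against a self-managed cluster:

# ccloud.properties -- placeholder values, supplied by the Confluent Cloud UI
bootstrap.servers=pkc-xxxxx.europe-west1.gcp.confluent.cloud:9092
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<API_KEY>" password="<API_SECRET>";
ssl.endpoint.identification.algorithm=https

# Create a topic over the same endpoint (Confluent Cloud requires replication factor 3)
kafka-topics --bootstrap-server pkc-xxxxx.europe-west1.gcp.confluent.cloud:9092 \
  --command-config ccloud.properties \
  --create --topic pages-to-fetch --partitions 12 --replication-factor 3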

As our final test, we load tested our system with maximum capacity for 24 hours. Confluent Cloud held up just fine—we didn’t experience any latency issues, and our throughput didn’t go beyond the provided 100 MBps.

To move our production environment over, all we needed to do was create another Kafka cluster on the Confluent Cloud UI and change the application configuration.

As with all managed services, there are always some tradeoffs to be made. Given that Confluent Cloud is a multi-tenant system, access to ZooKeeper isn’t provided. This doesn’t prove to be a large issue as most of the Kafka tooling no longer requires ZooKeeper. Besides, compared to running ZooKeeper yourself, this is not a bad compromise. Still, there are some popular tools which have yet to be updated, such as Kafka Manager and Burrow.

For our team, the Confluent Cloud UI provides enough functionality that not having Kafka Manager isn’t a large issue. Initially, the absence of Burrow caused an issue as it was used to feed consumer metrics into our Prometheus-based monitoring system. We switched to a consumer metrics exporter that uses the Kafka API to work around this.

Consumer Metrics Exporter

Although our system can generate some large messages, this didn’t prove to be an issue while running on Confluent Cloud. In our experience, the system limits are very generous.

Interested in more?

If you’re interested in experiencing hands-off Kafka, you can sign up for Confluent Cloud today with no commitments and only pay for what you use. If web scraping also piques your interest, Scrapinghub offers a 14-day trial of AutoExtract.

Ian Duffy is a DevOps engineer on the AutoExtract team at Scrapinghub, where he works on providing the necessary infrastructure and automation to ensure the product runs smoothly.

How to Deploy Confluent Platform on Pivotal Container Service (PKS) with Confluent Operator


This tutorial describes how to set up an Apache Kafka® cluster on Enterprise Pivotal Container Service (Enterprise PKS) using Confluent Operator, which allows you to deploy and run Confluent Platform at scale on virtually any Kubernetes platform, including Pivotal Container Service (PKS). With Enterprise PKS, you can deploy, scale, patch, and upgrade all the Kubernetes clusters in your system without downtime.

Confluent Operator | Kubernetes | Pivotal Container Service

You’ll start by creating a Kafka cluster in Enterprise PKS using Confluent Operator. Then, you’ll configure it to expose external endpoints so the cluster will be available for use outside of the Enterprise PKS environment. This is useful in cases where you are deploying a Pivotal Application Service (PAS) that produces and/or consumes to Kafka running in Enterprise PKS.

Let’s begin!

Requirements

  1. Access to PKS: for this tutorial, a PKS cluster (PKS 1.4.0) with one master node and six worker nodes was used. Your PKS environment URL, username, and password are needed.
  2. Ensure that your PKS environment has access to Docker because Confluent Operator packages are released as container images.
  3. Install the PKS Command Line Interface (CLI) on your laptop. You may need to create an account on Pivotal Network in order to sign in and download.
  4. A sample PKS Helm Chart file, such as this file we are using
  5. The Kubernetes command line tool kubectl
  6. Install Helm 2.9+.
  7. Download and expand Confluent Platform (.tar or .zip). (Note: this is technically not required for deployment, but you need access to the scripts in the expanded bin/ directory for the external verification section of this tutorial.)
  8. Download and expand the Confluent Helm bundle tarball, as seen in step 1 of the documentation.

Operator and Kafka Setup Tasks

Run all the following command line tasks in a terminal unless explicitly noted otherwise.

  1. Run pks login -a https://api.pks.example.cf-app.com:9021 -u confluent -p confluent-password -k using the URL, username, and password from the first requirement in the section above.
  2. Run pks get-credentials confluent-cluster.
  3. Issue kubectl config use-context confluent-cluster to point kubectl to the Enterprise PKS cluster by default.
  4. Configure Tiller to use a Helm Chart with Enterprise PKS.
  5. Create a pks.yaml Helm Chart file in the helm/providers directory wherever your Confluent Operator distribution was expanded. For this example, you can expand the Confluent Operator tarball to ~/dev/confluent-operator-20190726-v0.65.0 and create the file in ~/dev/confluent-operator-20190726-v0.65.0/helm/providers.
  6. In the helm/ directory in your terminal, run the command below:
    helm install \
         -f ./providers/pks.yaml \
         --name operator \
         --namespace operator \
         --set operator.enabled=true \
         ./confluent-operator
    
  7. Validate that Confluent Operator is running with kubectl get pods -n operator. You should see something similar to the following after issuing this command:
    NAME                          READY   STATUS              RESTARTS   AGE
    cc-manager-7bf99846cc-qx2hb   0/1     ContainerCreating   1          6m40s
    cc-operator-798b87b77-lx962   0/1     ContainerCreating   0          6m40s
    

    Wait until the status changes from ContainerCreating to Running.

  8. Install ZooKeeper with the following command, similar to how you previously installed Operator:
    helm install \
        -f ./providers/pks.yaml \
        --name zookeeper \
        --namespace operator \
        --set zookeeper.enabled=true \
        ./confluent-operator
    
  9. Check on the status via kubectl get pods -n operator; sample output:
    kubectl get pods -n operator
    
    NAME                          READY   STATUS              RESTARTS   AGE
    cc-manager-7bf99846cc-qx2hb   1/1     Running             1          6m40s
    cc-operator-798b87b77-lx962   1/1     Running             0          6m40s
    zookeeper-0                   0/1     ContainerCreating   0          15s
    zookeeper-1                   0/1     ContainerCreating   0          15s
    zookeeper-2                   0/1     Pending             0          15s
    

    Wait until the ZooKeeper pods are in Running status before proceeding to the next step. It takes approximately one minute.

  10. Install Kafka with the following command:
    helm install \
        -f ./providers/pks.yaml \
        --name kafka \
        --namespace operator \
        --set kafka.enabled=true \
        ./confluent-operator
    

Issue kubectl get pods -n operator and wait for the status of the kafka-0, kafka-1, and kafka-2 pods to be Running.

At this point, the setup is complete and you are ready to verify that the installation is successful.

Before beginning verification, some of you may be wondering about the rest of the Confluent Platform, such as Schema Registry, KSQL, Control Center, etc. Based on the steps you just went through, I suspect you already know the answer. Nevertheless, here’s an example of deploying Schema Registry:

helm install -f ./providers/pks.yaml --name schemaregistry --namespace operator --set schemaregistry.enabled=true ./confluent-operator

Verification of the Kafka cluster setup

In this tutorial, verification involves both internal and external checks, because you configured your pks.yaml file to expose external endpoints.

Internal verification

Internal verification involves connecting to one of the nodes in the cluster and ensuring there is communication amongst the nodes by executing various Kafka scripts.

  1. Run kubectl -n operator exec -it kafka-0 bash
  2. There are no editors on Kafka cluster nodes, so create a properties file with cat via:
    cat << EOF > kafka.properties
    sasl.mechanism=PLAIN
    sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username=test password=test123;
    bootstrap.servers=kafka:9071
    security.protocol=SASL_PLAINTEXT
    EOF
    
  3. Then, the following command should return successfully: kafka-broker-api-versions --command-config kafka.properties --bootstrap-server kafka:9071

This is what running through these three steps looks like:

Internal Verification of Kafka Cluster

External validation

For external validation, you can interact with your Kafka cluster from the outside, such as performing these steps on your laptop. Make sure to have Confluent Platform downloaded and extracted, and have the scripts available in the bin/ directory.

  1. Determine the external IP addresses from the kubectl get services -n operator “EXTERNAL-IP” column:
    NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP
    kafka                  ClusterIP      None                         
    kafka-0-internal       ClusterIP      10.100.200.223              
    kafka-0-lb             LoadBalancer   10.100.200.231   35.192.236.163   
    kafka-1-internal       ClusterIP      10.100.200.213              
    kafka-1-lb             LoadBalancer   10.100.200.200   35.202.102.184   
    kafka-2-internal       ClusterIP      10.100.200.130               
    kafka-2-lb             LoadBalancer   10.100.200.224   104.154.134.167   
    kafka-bootstrap-lb     LoadBalancer   10.100.200.6     35.225.209.199
    
  2. Next, you need to set the name resolution for the external IP. Long term, you’d probably create DNS entries, but in this tutorial, simply update your local /etc/hosts file. As an example, here are some specific entries you can make to your local /etc/hosts file:
    35.192.236.163 b0.supergloo.com b0
    35.202.102.184 b1.supergloo.com b1
    104.154.134.167 b2.supergloo.com b2
    35.225.209.199 kafka.supergloo.com kafka
    

    It is critical to map b0, b1, and b2 hosts to their corresponding kafka-0, kafka-1, and kafka-2 external IPs. The same goes for the bootstrap mapping. (Note: the domain supergloo.com was configured in the pks.yaml file.)

  3. For a quick canary test, ping one of the entries to ensure that name resolution is working correctly, such as ping kafka.supergloo.com.
  4. In the directory where you expanded your Confluent Platform download, create a kafka.properties file with the following content:
    sasl.mechanism=PLAIN
    sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username=test password=test123;
    bootstrap.servers=kafka.supergloo.com:9092
    security.protocol=SASL_PLAINTEXT
    
  5. Run some commands to ensure external connectivity, such as the command for creating a topic:
    bin/kafka-topics --create --command-config kafka.properties --bootstrap-server kafka.supergloo.com:9092 --replication-factor 3 --partitions 1 --topic example
    
  6. List the newly created topic via:
    bin/kafka-topics --list --command-config kafka.properties --bootstrap-server kafka.supergloo.com:9092
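As an optional extra check beyond the original steps, you can also produce and consume a test message over the external listener using the same kafka.properties file (the flags shown are those of the Kafka 2.3-era console scripts bundled with Confluent Platform):

    echo "hello from outside PKS" | bin/kafka-console-producer \
        --broker-list kafka.supergloo.com:9092 \
        --producer.config kafka.properties \
        --topic example

    bin/kafka-console-consumer \
        --bootstrap-server kafka.supergloo.com:9092 \
        --consumer.config kafka.properties \
        --topic example --from-beginning --max-messages 1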

Congratulations!

You’ve completed the tutorial and deployed a Kafka cluster to Enterprise PKS using Confluent Operator!

Next up, we’ll walk through how to deploy a sample Spring Boot application to PAS and configure it to produce to and consume from the Kafka cluster created in this tutorial. Keep an eye out for part 2.

Want more?

For more, check out Kafka Tutorials and find full code examples using Kafka, Kafka Streams, and KSQL.

Todd McGrath is a partner solution engineer at Confluent where he assists partners who are designing, developing, and embedding the Confluent Platform in their customer solutions. Todd has held a variety of roles and responsibilities over many years in software, including hands-on development, entrepreneurship, business development, engineering management, and pre-sales.


How to Run Apache Kafka with Spring Boot on Pivotal Application Service (PAS)


This tutorial describes how to set up a sample Spring Boot application in Pivotal Application Service (PAS), which consumes and produces events to an Apache Kafka® cluster running in Pivotal Container Service (PKS). With this tutorial, you can set up your PAS and PKS configurations so that they work with Kafka.

For a tutorial on how to set up a Kafka cluster in PKS, please see How to Deploy Confluent Platform on Pivotal Container Service (PKS) with Confluent Operator.

If you’d like more background on working with Kafka from Spring Boot, you can also check out How to Work with Apache Kafka in your Spring Boot Application.

Methodology

Starting with the requirements, this tutorial will then go through the specific tasks required to connect PAS applications to Kafka. The sample Spring Boot app is pre-configured to make the setup steps as streamlined as possible.

You’ll review the configuration settings that streamline the deployment so you know what to change for your environment. Afterward, the tutorial will run through some ways to verify that your PAS app is talking to Kafka in your PKS setup.

Requirements

  1. Run a Kafka cluster in Enterprise PKS. To set up Kafka in PKS via Confluent Operator and expose external endpoints, you can refer to part 1.
    • Especially note the exposing external endpoints and proper DNS setup explained in part 1. External endpoint exposure with public DNS is required for this tutorial. Here’s a screenshot of my DNS setup for the domain name used in part 1.
      DNS

      The /etc/hosts trick mentioned in part 1 will not work now because we don’t have access to the hosts file in PAS containers. Therefore, we need our Spring Boot app to be able to resolve DNS to our Kafka cluster running in PKS.
  2. Running and accessible Confluent Schema Registry, which was mentioned in part 1, in PKS.
  3. Install Maven. (Note: familiarity with Git and building Java applications with Maven is presumed.)
  4. Access the springboot-kafka-avro repo.
  5. Install the Cloud Foundry (cf) CLI.
  6. Your PAS environment username, password, and fully qualified domain name (FQDN). At the time of this writing, you can obtain a PAS environment if you sign up for a free Pivotal Web Services account.

Cloud Foundry (cf) CLI prerequisites

If this is your first time deploying an application to PAS, you’ll need to do the following in order to perform the later steps. If you have already set up your PAS environment or are familiar with PAS, feel free to adjust accordingly.

Performing the following steps will create a ~/.cf/config.json file if you don’t have one created already.

  1. Log in with cf login -a <my-env> -u <my-username> -p <my-password>.
  2. Exit and execute the commands below. Substitute <my-*> with settings that are appropriate for your PAS environment. For example, based on my Pivotal Web Services account setup, I used api.run.pivotal.io for the <my-env>:
    • cf create-org confluent
    • cf target -o confluent
    • cf create-space dev
    • cf target -s dev

The commands in step 2 are optional depending on how you like to keep things organized. In any case, you should be all set at this point with a ~/.cf/config.json file and may proceed to setting up the sample PAS app with Kafka in PKS.

For more details on the cf CLI, see the documentation.

Deploy a Sample Spring Boot Microservice App with Kafka to Pivotal Application Service (PAS)

Run all command line tasks in a terminal unless explicitly stated otherwise.

  1. Clone springboot-kafka-avro and enter the directory. For example:
    git clone https://github.com/confluentinc/springboot-kafka-avro && cd springboot-kafka-avro.

     

  2. Create a Pivotal user-provided service instance (USPI) with the following command:
    cf create-user-provided-service cp -p '{"brokers":"kafka.supergloo.com:9092","jaasconfig":"org.apache.kafka.common.security.plain.PlainLoginModule required username=\"test\" password=\"test123\";","sr":"http://schemaregistry.supergloo.com:8081"}'

    This USPI delivers dynamic configuration values to our sample application upon startup. USPI is an example of the aforementioned PAS-specific requirements. The username and password values of test and test123 used above were the defaults used in the Helm Chart from part 1. These settings might depend on your environment so adjust accordingly.

  3. Also, note the brokers and sr values in the command above and the corresponding brokers and sr placeholders in the src/main/resources/application-paas.yaml file. Again, these settings are defaults from part 1, so you may need to adjust them for your environment. I’ll explain this in more detail later on, but for now, focus on getting your example running.
  4. Push the sample Spring Boot microservice app to PAS with:
    mvn clean package -DskipTests=true && cf push --no-start

    Notice that the --no-start option is passed, as the previously created USPI service has not yet been bound and attempting to start the application would result in failure.

    You should see something similar to the following. Pay attention to the routes output which you’ll need in later steps. In the following example, my routes output was spring-kafka-avro-fluent-hyrax.cfapps.io, but yours will look different.
    `routes` output

  5. Next, as you probably already guessed, perform the binding: cf bind-service spring-kafka-avro cp. This command binds the cp service to the spring-kafka-avro app that was deployed earlier. You should see something similar to the following in the Pivotal console under your cp service settings:

    cp SERVICE: User Provided
  6. Then perform cf start spring-kafka-avro. After about 30 seconds, the spring-kafka-avro state should be running.

    spring-kafka-avro

At this point, your setup is complete. Now, you are ready to verify the installation is successful.

Verification

  1. Determine the external URL of your newly deployed app with cf apps. Look to the urls column. As previously noted, mine is spring-kafka-avro-fluent-hyrax.cfapps.io.
  2. The sample app code shows one available REST endpoint in KafkaController.java. You can post to this endpoint with different age and name parameters such as:
    curl -X POST -d 'name=vik&age=33' spring-kafka-avro-fluent-hyrax.cfapps.io/user/publish
    

    Or, change up the name and age values:

    curl -X POST -d 'name=todd&age=22' \
    spring-kafka-avro-fluent-hyrax.cfapps.io/user/publish
    

    Or, to flex your Schema Registry integration, notice what happens when you attempt to send values that are not appropriate for the user schema (see src/main/avro/user.avsc):

    curl -X POST -d 'name=todd&age=much_younger_than_vik_gotogym' \
    spring-kafka-avro-fluent-hyrax.cfapps.io/user/publish
    
  3. Check out any topics created by the sample app with bin/kafka-topics --list --command-config kafka.properties --bootstrap-server kafka.supergloo.com:9092. (As shown, you need access to both the kafka-topics script and a kafka.properties file which was described in the section from part 1 on external validation.)
  4. If you wish, you can consume the users topic via a command similar to:
    kafka-avro-console-consumer --bootstrap-server kafka.supergloo.com:9092 --consumer.config kafka.properties --topic users --from-beginning --property schema.registry.url=http://schemaregistry.supergloo.com:8081
    

Noteworthy configuration and source code

Now that you’ve verified your app is up and running and communicating with Kafka (and Schema Registry), let’s examine the configuration and source code by breaking down the setup steps above.

How does your PAS app know which Kafka cluster to use and how to authorize? How does the app know which Schema Registry to use?

First, look to the manifest.yaml file for the env stanza setting of SPRING_PROFILES_ACTIVE: paas.

This will force Spring Boot to reference src/main/resources/application-paas.yaml for environment configuration settings. In application-paas.yaml, the values for brokers, sr, and jaasconfig appear to be dynamically set, e.g., ${vcap.services.cp.credentials.brokers}. So if you’re thinking there must be some string interpolation happening somehow, I say loudly, “You are correct!” (That was my poor attempt at a Phil Hartman impersonation, by the way.) The interpolation magic happens on app startup via the USPI that we created in step 2 and bound to our app in step 5 above.
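For illustration, the two pieces fit together roughly like this; everything beyond SPRING_PROFILES_ACTIVE and the vcap placeholders mentioned above is an assumption rather than a copy of the repo:

# manifest.yaml (sketch)
applications:
  - name: spring-kafka-avro
    path: target/spring-kafka-avro-0.0.1-SNAPSHOT.jar   # artifact name assumed
    env:
      SPRING_PROFILES_ACTIVE: paas

# application-paas.yaml (sketch of the placeholder style)
spring:
  kafka:
    bootstrap-servers: ${vcap.services.cp.credentials.brokers}
    properties:
      sasl.jaas.config: ${vcap.services.cp.credentials.jaasconfig}
      schema.registry.url: ${vcap.services.cp.credentials.sr}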

But why does your POST attempt fail when you send an age value that isn’t a number? It isn’t obvious from the Java code how or where this is enforced.

This is due to the schema.registry.url property setting in application-paas.yaml. For more information on Schema Registry, check out How to Use Schema Registry and Avro in Spring Boot Applications.
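For context, the user schema itself is what constrains the age field to an integer, so values that don’t fit the schema are rejected. A schema like src/main/avro/user.avsc typically looks something like this (namespace and exact field layout assumed, not copied from the repo):

{
  "namespace": "io.confluent.developer",
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "age",  "type": "int" }
  ]
}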

Tutorial completed ✅

This tutorial covered how to deploy a Spring Boot microservice app to PAS that produces and consumes from a Kafka cluster running in Pivotal PKS.

And finally, from the “credit-where-credit-is-due” department, much thanks to Sridhar Vennela from Pivotal for his support. I’d also like to thank Viktor Gamov, my colleague at Confluent, for developing the sample application used in this tutorial, and Victoria Yu as well for making me appear more precise and interesting than I usually am.

More tutorials, please!

For more, check out Kafka Tutorials and find full code examples using Kafka, Kafka Streams, and KSQL.

If you haven’t already, you can also view part 1 of this series to learn how to deploy Confluent Platform on Pivotal Container Service with Confluent Operator.

Todd McGrath is a partner solution engineer at Confluent where he assists partners who are designing, developing, and embedding the Confluent Platform in their customer solutions. Todd has held a variety of roles and responsibilities over many years in software, including hands-on development, entrepreneurship, business development, engineering management, and pre-sales.

Building a Real-Time, Event-Driven Stock Platform at Euronext


As the head of global customer marketing at Confluent, I tell people I have the best job. As we provide a complete event streaming platform that is radically changing how companies handle data, I get to work with customers in almost every industry, partner closely with our sales teams, and learn from and be inspired by the event streaming community.

Collectively, we have a great opportunity to leverage the community that has formed around interesting stories and to celebrate its successes. Starting today, we’ll use the blog to occasionally do just that.

Euronext, one of the largest stock exchanges in the world (spanning Belgium, France, Ireland, the Netherlands, Norway, Portugal, and the UK), has built a brand new market infrastructure and event-driven trading platform called Optiq with Confluent Platform at the core.

They’re a Customer Project of the Year finalist for 2019’s Computing Technology Product Awards, and we couldn’t be more proud.

As a company that can trace its roots back to 1602, reinventing the way it does business was a major undertaking. For mission-critical platforms that support the market capitalization of six countries, it’s important to ensure that everyone has access to the same data at the same time—performance and reliability are non-negotiable (no pressure at all!).

That’s why Euronext turned to Confluent. Using Confluent Platform, Euronext easily leveraged the power of Apache Kafka® to implement a reliable, scalable persistence layer for market orders that supports millisecond latencies. Euronext was able to replace its market data gateway with one that handles billions of messages per day, sending market data to vendors, as well as Euronext’s trading members that use the information in their trading strategies. Confluent Platform also enables them to build applications that interface with clearinghouses, monitor market latency, perform replication for disaster recovery, and store records in a data warehouse in compliance with regulatory requirements.

Euronext

With Confluent Platform as the backbone of their persistence engine, Euronext is now running Optiq in production on all its cash markets. The results are impressive: Optiq provides a tenfold increase in capacity to ingest messages and an average performance latency as low as 15 microseconds.

According to Alain Courbebaisse, CIO at Euronext, “We have stringent requirements for real-time performance and reliability, and we have confirmed—from proof of concept to deployment of a cutting-edge production trading platform—that we made the right decision.”

Help support Euronext’s project of the year and vote! #TechProductAwards

To learn more about Euronext and their event streaming infrastructure, watch the video and read the full case study.

Angela Burk runs Confluent’s global customer marketing function. She joined Confluent from ServiceNow where she built and led a global customer advocacy function for seven years. She’s also led marketing and communications functions at NetApp, Jive, Interwoven (part of HP), Clarify (part of Amdocs), Octel Communications (acquired by Lucent), and NEC. Angela is a California native and graduate of San Jose State University.

Kafka Summit San Francisco 2019 Session Videos


Last week, the Kafka Summit hosted nearly 2,000 people from 40 different countries and 595 companies—the largest Summit yet. By the numbers, we got to enjoy four keynote speakers, 56 sessions, 75 speakers, 38 sponsors, and one big party, including the classic Apache Kafka® ice sculpture, per the traditions handed down to us. (I guess we handed those traditions down to ourselves in this case, but still.)

Kafka Summit SF 2019 Party

Apache Kafka Ice Sculpture

Every time I open the Summit, I remind the people attending that they don’t need to worry about missing sessions, because every one of them is being recorded. And if you weren’t there at all, this is an even better thing. With that, permit me to announce that the session videos and slides are now available—and in record time.

Top 10 Highest-Rated Sessions

Kafka Summit SF 2019 Session

With so many sessions to choose from, perhaps you’re wondering where to start. Based on the votes of Summit attendees from within the Kafka Summit mobile app, here are the top-rated talks:

  1. Building Stream Processing Applications with Apache Kafka Using KSQL by Robin Moffatt of Confluent
  2. Kafka on Kubernetes: Keeping It Simple by Nikki Thean of Etsy
  3. Why Stop the World When You Can Change It? Design and Implementation of Incremental Cooperative Rebalancing by Konstantine Karantasis of Confluent
  4. How Kroger Embraced a “Schema-First” Philosophy in Building Real-Time Data Pipelines by Rob Hoeting, Rob Hammonds, and Lauren McDonald of Systems Evolution
  5. KSQL Performance Tuning for Fun and Profit by Nick Dearden of Confluent
  6. Mission-Critical, Real-Time Fault Detection for NASA’s Deep Space Network Using Apache Kafka by Rishi Verma of NASA Jet Propulsion Laboratory
  7. Please Upgrade Apache Kafka. Now. by Gwen Shapira of Confluent
  8. Event Sourcing, Stream Processing, and Serverless by Ben Stopford of Confluent
  9. Kafka Needs No Keeper by Jason Gustafson and Colin McCabe of Confluent
  10. Using Kafka to Discover Events Hidden in Your Database by Anna McDonald of SAS Institute

Develop /her

The sessions aren’t the only thing that made this year’s Kafka Summit San Francisco extra special. We also hosted our first ever, sold out Girl Geek Dinner with Neha Narkhede (Co-founder and CPO, Confluent), Bret Scofield (UX Researcher, Confluent), Elizabeth Bennett (Software Engineer, Confluent), and Priya Shivakumar (Senior Director of Product Marketing, Confluent) as speakers. The event was emceed delightfully by Confluent’s own Dani Traphagen (Solutions Engineer, Western U.S.).

Girl Geek Dinner: Elizabeth Bennett and Neha Narkhede

develop /her

In addition to the Girl Geek Dinner and the Summit proper, there were plenty of other opportunities to learn. One hundred of you participated in the tutorial on Sunday, and 175 of you showed up after the Summit for three days of training. A number of you were also certified and attended the breakfast on Tuesday morning, where we talked about how to market your certification. Congratulations to everyone who completed the curriculum!

On Tuesday night after the crew started disassembling the booths and taking down the ubiquitous Kafka signage from the hotel, those of us who just couldn’t let it end stayed behind for a meetup at Confluent’s San Francisco office. We had three talks that night—let it never be said that we lack enthusiasm—led by Frank Greco of Confluent; Lei Chen of Bloomberg LP; and Qianqian Zhong, Xu Zhang, and Zuofei Wang of Airbnb. Frank talked about how to build a cloud Kafka service, Lei talked about how to run Kafka Streams without having Kafka underneath it (yes, this is possible), and the Airbnb folks talked about how they built the two major Kafka-powered systems in use there.

Post Kafka Summit SF 2019 Meetup

Photo Credit: Derek C.

View the videos and slides!

I thought I was over the post-Summit blues, but writing this post has brought it all back. I think the best remedy is to watch the videos. You know what must be done.

Tim Berglund is a teacher, author, and technology leader with Confluent, where he serves as the senior director of developer experience. He can frequently be found speaking at conferences in the U.S. and all over the world. He is the co-presenter of various O’Reilly training videos on topics ranging from Git to distributed systems, and is the author of Gradle Beyond the Basics. He tweets as @tlberglund and lives in Littleton, CO, U.S., with the wife of his youth and their youngest child, the other two having mostly grown up.

Internet of Things (IoT) and Event Streaming at Scale with Apache Kafka and MQTT


The Internet of Things (IoT) is getting more and more traction as valuable use cases come to light. A key challenge, however, is integrating devices and machines to process the data in real time and at scale. Apache Kafka® and its surrounding ecosystem, which includes Kafka Connect, Kafka Streams, and KSQL, have become the technology of choice for integrating and processing these kinds of datasets.

Kafka-native options to note for MQTT integration beyond Kafka client APIs like Java, Python, .NET, and C/C++ are:

  • Kafka Connect source and sink connectors, which integrate with MQTT brokers in both directions
  • Confluent MQTT Proxy, which ingests data from IoT devices without needing an MQTT broker
  • Confluent REST Proxy for a simple but powerful HTTP-based integration

Before I discuss these in more detail, let’s take a look at some common use cases where Confluent Platform and Confluent Cloud are used for IoT projects today.

Use cases for IoT technologies and an event streaming platform

Confluent Platform and Confluent Cloud are already used in many IoT deployments, both in Consumer IoT and Industrial IoT (IIoT). Most scenarios require a reliable, scalable, and secure end-to-end integration that enables bidirectional communication and data processing in real time. Some specific use cases are:

  • Connected car infrastructure: Cars communicate with each other and with the remote datacenter or cloud to perform real-time traffic recommendations, predictive maintenance, or personalized services.
  • Smart cities and smart homes: Buildings, traffic lights, parking lots, and many other things are connected to each other in order to enable greater efficiency and provide a more comfortable lifestyle. Energy providers connect houses to buy or sell their own solar energy and provide additional digital services.
  • Smart retail and customer 360: Real-time integration between mobile apps of customers and backend services like CRMs, loyalty systems, geolocation, and weather information creates a context-specific customer view and allows for better cross-selling, promotions, and other customer-facing services.
  • Intelligent manufacturing: Industrial companies integrate machines and robots to optimize their business processes and reduce costs, such as scrapping parts early or predictive maintenance to replace machine parts before they break. Digital services and subscriptions are provided to customers instead of just selling them products.

Machine learning plays a huge role in many of these use cases, regardless of the industry, and you can read Using Apache Kafka to Drive Cutting-Edge Machine Learning for more insights.

Let’s now take a look at the 10,000-foot view of a robust IoT integration architecture.

End-to-end enterprise integration architecture

IoT integration architectures need to integrate the edge (devices, machines, cars, etc.) with the datacenter (on premises, cloud, and hybrid) to be able to process IoT data.

Edge | Datacenter/Cloud

Requirements and challenges of IoT integration architectures

To be flexible and future ready, an IoT integration architecture should meet the following requirements:

  • Scalable data movement and processing: handles backpressure and can process increasing throughput
  • Agile development and loose coupling: different sources and sinks should be their own decoupled domains. Different teams can develop, maintain, and change integration to devices and machines without being dependent on other sources or the sink systems that process and analyze the data. Microservices, Apache Kafka, and Domain-Driven Design (DDD) covers this in more detail.
  • Innovative development: new and different technologies and concepts can be used depending on the flexibility and requirements of a specific use case. For instance, one application might already send data to an MQTT broker so that you can consume from there while another project does not use an MQTT broker at all, and you just want to push the data into the event streaming platform directly for further processing.

But several challenges increase the complexity of IoT integration architectures:

  • Complex infrastructure and operations that often cannot be changed—despite the need to integrate with existing machines, you are unable to add code to the machine itself easily
  • Integration with many different technologies like MQTT or OPC Unified Architecture (OPC UA) while also adhering to legacy and proprietary standards
  • Unstable communication due to bad IoT networks, resulting in high cost and investment in the edge

Given these requirements and challenges, let’s take a look now at how MQTT and other IoT standards help integrate datacenters and the edge.

IoT standards and technologies: MQTT, OPC UA, Siemens S7, and PROFINET

There are many IoT standards and technologies available on the market. If we had to choose, these are the most common options for implementing IoT integrations:

  • Proprietary interfaces: especially in Industrial IoT (IIoT), this is the most common scenario. Machines provide a large number of usually closed and incompatible protocols in a proprietary format. Examples are S7, PROFINET, Modbus, or an automated dispatch system (ADS). Supervisory control and data acquisition (SCADA) is often used to control and monitor these systems.
  • OPC UA: this is an open and cross-platform, machine-to-machine communication protocol for industrial automation. Every device must be retrofitted with the ability to speak a new protocol and use a common client to speak with these devices. License costs and modification of the existing hardware are required to enable OPC UA.
  • PLC4X: As an Apache framework, it provides a unified API by implementing drivers (similar to JDBC for relational databases) for communicating with most industrial controllers in the protocols they natively understand. No license costs or hardware modifications are required.
  • MQTT: This is built on top of TCP/IP for constrained devices and unreliable networks, applying to many (open source) broker implementations and many client libraries. It contains IoT-specific features for bad network/connectivity, and is widely used (mostly in IoT, but also in web and mobile apps via MQTT over WebSockets).

Unsurprisingly, technical know-how is not evenly distributed across these two worlds. In the IoT environment, a large number of protocols for data exchange have emerged in recent years, and only MQTT is likely to be familiar to an automation technology engineer.

In the same way, industrial protocols are largely a closed book to software engineers. Some industrial protocols may be well suited to a specific IoT solution, just as certain security features of modern IoT protocols would benefit industry, but in practice this crossover rarely happens.

MQTT has become the standard solution for most IoT scenarios today, especially outside of IIoT. Although MQTT is the focus of this blog post, in a future article I will cover MQTT integration with IIoT and its proprietary protocols, like Siemens S7, Modbus, and ADS, through leveraging PLC4X and its Kafka integration. For more details about using Kafka Connect and PLC4X for IIoT integration scenarios, you can check out these slides on flexible and scalable integration in the automation industry and the accompanying video explaining the relation between IIoT, Apache Kafka, and PLC4X.

Based on my conversations with industrial customers—who are pained by the challenges of closed, inflexible interfaces—I noticed that more and more IIoT devices and machines also provide an MQTT interface that can be integrated into modern systems.

Regarding the tradeoffs of MQTT, consider the pros and cons:

Pros

  • Widely adopted
  • Lightweight
  • Has a simple API
  • Built for poor connectivity and high latency scenarios
  • Supports many client connections (tens of thousands per MQTT server)

Cons

  • Just queuing, not stream processing
  • Inability to handle usage surges (no buffering)
  • Most MQTT brokers don’t support high scalability
  • Asynchronous processing (clients are often offline for long periods of time)
  • Lacking a good integration with the rest of the enterprise
  • Single infrastructure (typically somewhere at the edge)
  • Inability to reprocess events

These tradeoffs show that MQTT is built for IoT scenarios but requires help when it comes to integrating into the enterprise architecture of a company. This is where the event streaming platform Apache Kafka and its ecosystem come into play.

Apache Kafka as an event streaming platform

Apache Kafka is an event streaming platform that combines messaging, storage, and processing of data to build highly scalable, reliable, secure, and real-time infrastructure. Those who use Kafka often use Kafka Connect as well to enable integration with any source or sink. Kafka Streams is also useful, because it allows continuous stream processing. From an IoT perspective, Kafka presents the following tradeoffs:

Pros

  • Stream processing, not just queuing
  • High throughput
  • Large scale
  • High availability
  • Long-term storage and buffering
  • Reprocessing of events
  • Good integration with the rest of the enterprise
  • Hybrid, multi-cloud, and global deployments

Cons

  • Not built for tens of thousands of connections
  • Requires a stable network and solid infrastructure
  • Lacks IoT-specific features like Keep Alive and Last Will and Testament

Since Kafka was not built for IoT communication at the edge, the combination of Apache Kafka and MQTT is a match made in heaven for building scalable, reliable, and secure IoT infrastructures.

How do you integrate both?

The following sections demonstrate three Kafka-native options, meaning you generally do not need an additional technology besides MQTT devices/gateways/brokers and Confluent Platform to integrate and process IoT data.

Confluent MQTT connectors (source and sink)

Kafka Connect is a framework included in Apache Kafka that integrates Kafka with other systems. Its purpose is to make it easy to add new systems to scalable and secure event streaming pipelines while leveraging all the features of Apache Kafka, such as high throughput, scalability, and reliability. The easiest way to download and install new source and sink connectors is via Confluent Hub. You can find installation steps, documentation, and even the source code for connectors that are open source.

The Kafka Connect MQTT connector is a plugin for sending data to and receiving data from an MQTT broker.

Kafka Connect MQTT Connector

The MQTT broker is persistent and provides MQTT-specific features. It consumes push data from IoT devices, which Kafka Connect pulls at its own pace, without overwhelming the source or getting overwhelmed by the source. Out-of-the-box scalability and integration features like Kafka Connect Converters and Single Message Transforms (SMTs) are further advantages of using Kafka Connect connectors.

The MQTT connectors are independent of a specific MQTT broker implementation. I have seen several projects start with Mosquitto and then move towards a reliable, scalable broker like HiveMQ during the transition from a pilot project to pre-production.
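To give a feel for what this looks like in practice, here is a sketch of a Kafka Connect MQTT source connector configuration submitted over the Connect REST API; the broker URI, MQTT topic filter, Kafka topic, and connector name are placeholders rather than values from a real deployment:

curl -X PUT -H "Content-Type:application/json" \
    http://localhost:8083/connectors/source-mqtt-sensors-01/config \
    -d '{
    "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
    "mqtt.server.uri": "tcp://mqtt-broker:1883",
    "mqtt.topics": "sensors/+/temperature",
    "kafka.topic": "iot-temperature",
    "confluent.topic.bootstrap.servers": "kafka:9092",
    "confluent.topic.replication.factor": 1
}'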

MQTT Proxy for data ingestion without an MQTT broker

In some scenarios, the main challenge and requirement is to ingest data into Kafka for further processing and analytics in other backend systems. In this case, an MQTT broker is just added complexity, cost, and operational overhead.

Confluent MQTT Proxy delivers a Kafka-native MQTT proxy that allows organizations to eliminate the additional cost and lag of intermediate MQTT brokers. MQTT Proxy accesses, combines, and guarantees that IoT data flows into the business without adding additional layers of complexity.

MQTT Proxy

MQTT Proxy is horizontally scalable, consumes push data from IoT devices, and forwards it to Kafka brokers with low latency. No MQTT broker is required as an intermediary. The Kafka broker is the source of truth responsible for persistence, high availability, and reliability of the IoT data.
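As a rough illustration, the proxy is configured with a small properties file that maps MQTT topic patterns onto Kafka topics; the property names below follow the MQTT Proxy documentation of this era and the values are placeholders, so check the docs for your version:

# kafka-mqtt.properties (sketch)
listeners=0.0.0.0:1883
bootstrap.servers=PLAINTEXT://kafka:9092
topic.regex.list=iot-temperature:sensors/.*/temperature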

However, although everybody thinks about IoT standards like MQTT or OPC UA when integrating IoT devices, oftentimes REST and HTTP(S) are a much simpler solution.

REST Proxy as a “simple” option for an IoT integration

REST Proxy provides a RESTful interface to a Kafka cluster, making it easy to produce and consume messages, view the state of the cluster, and perform administrative actions without using the native Kafka protocol or clients.

REST Proxy

Why might you use HTTP(S) for an IoT integration? Due to various reasons, REST Proxy makes implementation and deployment simpler, faster, and easier compared with IoT-specific technologies:

  • It’s simple and understood
  • HTTP(S) Proxy is push based
  • Security is easier from an organizational and governance perspective—ask your security team!
  • Scalability with a standard load balancer, though it is still synchronous HTTP which is not ideal for high scalability
  • Supports thousands of messages per second
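For instance, an IoT gateway can publish a reading with nothing more than an HTTP POST against REST Proxy’s produce endpoint (the topic name and payload below are illustrative):

curl -X POST -H "Content-Type: application/vnd.kafka.json.v2+json" \
     --data '{"records":[{"key":"machine-42","value":{"temperature":87.2}}]}' \
     http://localhost:8082/topics/iot-temperature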

No matter how you decide to integrate IoT devices, building a reliable end-to-end monitoring infrastructure is essential.

End-to-end monitoring and security

Distributed systems are hard to monitor and secure. A Kafka cluster is not much different—you have to monitor and secure the Kafka brokers, ZooKeeper nodes, client consumer groups (Java, Python, Go, REST, etc.), and Connect and KSQL clusters.

In terms of monitoring your whole Kafka infrastructure end to end, Confluent Control Center delivers insights into the inner workings of your Kafka clusters and the data that flows through them. Control Center gives the administrator monitoring and management capabilities through curated dashboards, so that they can deliver optimal performance and meet SLAs for their clusters. This includes:

  • End-to-end monitoring from producers to brokers to consumers
  • Management of Connect clusters (sources and sinks), no matter if it’s a central infrastructure or if there are domain-driven components in the architecture
  • Role-Based Access Control (RBAC) for secure communication and ensuring compliance
  • Monitoring and alerting for availability, latency, consumption, data loss, etc.

Confluent Control Center: consumer.group.id.001

With security features like Role-Based Access Control (RBAC), you also have the ability to enable simple and standardized authentication and authorization for all components of the Confluent Platform.

Choosing the right components for your IoT integration challenges

The use cases for IoT integration scenarios are always similar: integrate with devices or machines; ingest the event streaming data in real time into the Kafka cluster (on premises or in the cloud); process the data with Kafka Streams and KSQL; and then send the data back to the device or machine, and/or to the other sinks like a database, analytics tool, or any other business application.

With Kafka-native options like clients, MQTT connectors, MQTT Proxy, or REST Proxy, you can integrate IoT technologies and interfaces to establish a powerful but simple architecture without using additional tools. This is especially recommended in 24/7 mission-critical deployments, where each additional component increases complexity, risk, and cost. You have many options, so choose the one that suits your situation best.

If you want to read a complete story about building an end-to-end IoT architecture from edge to cloud, read Enabling Connected Transformation with Apache Kafka and TensorFlow on Google Cloud Platform, which focuses on Google Cloud Platform, Confluent Cloud, and MQTT integration for building a scalable and reliable machine learning infrastructure.

The content of this blog post is also captured in this interactive lightboard recording called End-to-End Integration: IoT Edge to Confluent Cloud.

If you’re encountering similar challenges and use cases in your company, feel free to reach out and I’d be happy to discuss with you further.

Interested in more?

Download the Confluent Platform to get started with the leading distribution of Apache Kafka.

Kai Waehner works as a technology evangelist at Confluent. Kai’s main area of expertise lies within the fields of big data analytics, machine learning, integration, microservices, Internet of Things, stream processing, and blockchain. He is a regular speaker at international conferences, such as JavaOne, O’Reilly Software Architecture, and ApacheCon, writes articles for professional journals, and enjoys writing about his experiences with new technologies.

🚂 On Track with Apache Kafka – Building a Streaming ETL Solution with Rail Data


Trains are an excellent source of streaming data—their movements around the network are an unbounded series of events. Using this data, Apache Kafka® and Confluent Platform can provide the foundations for both event-driven applications as well as an analytical platform. With tools like KSQL and Kafka Connect, the concept of streaming ETL is made accessible to a much wider audience of developers and data engineers. The platform shown in this article is built using just SQL and JSON configuration files—not a scrap of Java code in sight.

Event Streaming Platform

My source of data is a public feed provided by the UK’s Network Rail company through an ActiveMQ interface. Additional data is available over REST as well as static reference data published on web pages. As with any system out there, the data often needs processing before it can be used. In traditional data warehousing, we’d call this ETL, and whilst more “modern” systems might not recognise this term, it’s what most of us end up doing whether we call it pipelines or wrangling or engineering. It’s taking data from one place, getting the right bit of it in the right shape, and then using it or putting it somewhere else.

Data Processing with Apache Kafka, ActiveMQ, Kafka Connect, and KSQL

Once the data is prepared and available through Kafka, it’s used in multiple ways. Rapid analysis and visualisation of the real-time feed—in addition to historical data—is provided through Elasticsearch and Kibana.

Analysis and visualization with Elasticsearch and Kibana

For more advanced analytics work, the data is written to two places: a traditional RDBMS (PostgreSQL) and a cloud object store (Amazon S3).

Advanced Analytics with Postgres and S3

I also used the data in an event-driven “operational” service that sends push notifications whenever a train is delayed at a given station beyond a configured threshold. This implementation of simple SLA monitoring is generally applicable to most applications out there and is a perfect fit for Kafka and its stream processing capabilities.

Rail Alerts

The data

As with any real system, the data has “character.” That is to say, it’s not perfect, it’s not entirely straightforward, and you have to get to know it properly before you can really understand it. Conceptually, there are a handful of entities:

  • Movement: a train departs or arrives from a station
  • Location: a train station, a signal point, sidings, and so on
  • Schedule: the characteristics of a given route, including the time and location of each scheduled stop, the type of train that will run on it, whether reservations are possible, and so on.
  • Activation: an activation ties a schedule to an actual train running on a given day and its movements.

Movement | Activation | Schedule | Location

Ingesting the data

All of the data comes from Network Rail. It’s ingested using a combination of Kafka Connect and CLI producer tools.

ActiveMQ, Kafka Connect, CLI Producer Tools ➝ Apache Kafka

The live stream of train data comes in from an ActiveMQ endpoint. Using Kafka Connect and the ActiveMQ connector, we stream the messages into Kafka for a selection of train companies.

The remaining data comes from an S3 bucket which has a REST endpoint, so we pull that in using curl and kafkacat.

There’s also some static reference data that is published on web pages. After we scrape these manually, they are produced directly into a Kafka topic.

Wrangling the data

With the raw data in Kafka, we can now start to process it. Since we’re using Kafka, we are working on streams of data. As events arrive, they get processed and written back onto a Kafka topic, either for further processing or use downstream.

Source stream ⟶ Apache Kafka ⟶ KSQL ⟶ Apache Kafka

The kind of processing that we need to do includes:

  • Splitting one stream into many. Several of the sources include multiple types of data on a single stream, such as the data from ActiveMQ which includes Movement, Activation, and Cancellation message types all on the same topic.
  • Applying a schema to the data
  • Serialising the data into Avro for easier use downstream
  • Deriving and setting the message key to ensure correct ordering of the data as it is partitioned across the Kafka cluster
  • Enriching columns with conditional concatenation to make the values easier to read
  • Joining data together, such as a movement event to reference information about its location, including full name and coordinates so as to be able to plot it on a map at a later point in the pipeline
  • Resolving codes in events to their full values. An example of this is the train operating company name (EM is the code used in events, which users of the data will expect to see in its full form, East Midlands Trains).

All of this is done using KSQL, in several stages where necessary. This diagram shows a summary of one of the pipelines:

Rail Streaming Pipeline – KSQL

Using the data

With the ingest and transform pipelines running, we get a steady stream of train movement information.

Live Data

This live data can be used for driving alerts as we will see in more detail shortly. It can also be used for streaming to target datastores with Kafka Connect for further use. This might apply to a team who wants the data in their standard platform, such as Postgres, for analytics.

With the Kafka Connect JDBC connector, it’s easy to hook up a Kafka topic to stream to a target database, and from there use the data. This includes simple queries, as well as more complex ones. Here’s an analytical aggregate function to show by train company the number of train movements that were on time, late, or even early:

pgAdmin
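As a sketch of the kind of query involved (using the column names from the KSQL example below rather than the exact Postgres schema), it is a plain GROUP BY over the replicated table:

SELECT toc              AS train_company,
       variation_status AS variation,
       COUNT(*)         AS num_movements
FROM   train_movements
GROUP  BY toc, variation_status
ORDER  BY toc, variation_status;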

This is a classic analytical query that any analyst would run, and an RDBMS seems like an obvious place in which to run it. But what if we could actually do that as part of the pipeline, calculating the aggregates as the data arrives and making the values available for anyone to consume from a Kafka topic? This is possible within KSQL itself:

ksql> SELECT TIMESTAMPTOSTRING(WINDOWSTART(),'yyyy-MM-dd') AS Date, 
      VARIATION_STATUS as Variation,
        SUM(CASE WHEN TOC = 'Arriva Trains Northern'       THEN 1 ELSE 0 END) AS Arriva,
        SUM(CASE WHEN TOC = 'East Midlands Trains'         THEN 1 ELSE 0 END) AS EastMid,
        SUM(CASE WHEN TOC = 'London North Eastern Railway' THEN 1 ELSE 0 END) AS LNER,
        SUM(CASE WHEN TOC = 'TransPennine Express'         THEN 1 ELSE 0 END) AS TPE
 FROM  TRAIN_MOVEMENTS WINDOW TUMBLING (SIZE 1 DAY)
GROUP BY VARIATION_STATUS;
+------------+-------------+--------+----------+------+-------+
| Date       | Variation   | Arriva | East Mid | LNER | TPE   |
+------------+-------------+--------+----------+------+-------+
| 2019-07-02 | OFF ROUTE   | 46     | 78       | 20   | 167   |
| 2019-07-02 | ON TIME     | 19083  | 3568     | 1509 | 2916  |
| 2019-07-02 | LATE        | 30850  | 7953     | 5420 | 9042  |
| 2019-07-02 | EARLY       | 11478  | 3518     | 1349 | 2096  |
| 2019-07-03 | OFF ROUTE   | 79     | 25       | 41   | 213   |
| 2019-07-03 | ON TIME     | 19512  | 4247     | 1915 | 2936  |
| 2019-07-03 | LATE        | 37357  | 8258     | 5342 | 11016 |
| 2019-07-03 | EARLY       | 11825  | 4574     | 1888 | 2094  |

The result of this query can be stored in a Kafka topic, and from there made available to any application or datastore that needs it—all without persisting the data anywhere other than in Kafka, where it is already persisted.

Along with using Postgres (or KSQL as shown above) for analytics, the data can be streamed using Kafka Connect into S3, from where it can serve multiple roles. In S3, it can be seen as the “cold storage”, or the data lake, against which as-yet-unknown applications and processes may be run. It can also be used for answering analytical queries through a layer such as Amazon Athena.

New query 2

Alerting

One of the great things about an event streaming platform is that it can serve multiple purposes. Reacting to events as they happen is one of the keys to responsive applications and happy users. Such event-driven applications differ from the standard way of doing things, because they are—as the name says—driven by events, rather than intermittently polling for them with the associated latency.

Let’s relate this directly to the data at hand. If we want to know about something happening on the train network, we generally want to know sooner than later. Taking the old approach, we would decide to poll a datastore periodically to see if there was data matching the given condition. This is relatively high in latency and has negative performance implications for the datastore. Consider now an event-driven model. We know the condition that we’re looking for, and we simply subscribe to events that match it.

The wonderful thing about events is that they are two things: notification that something happened and state describing what happened. So we’re not getting a notification to go and fetch information from somewhere else; we’re getting a notification along with the state itself that we need.

In this example, one of the things we’d like to alert on is when a train arrives at a given station over a certain number of minutes late. Perhaps we’re monitoring service levels at a particular station. The notifications are pushed out directly to the user. To implement this is actually very straightforward using Confluent Platform.

The conditions are stored in a compacted Kafka topic using the station name as a key and the threshold as a value. That way, different thresholds can be set for different stations. This Kafka topic is then modelled as a table in KSQL and joined to the live stream of movement data. Any train arriving at one of the stations in the configuration topic will satisfy the join, causing a new row to be written to the target topic. This is therefore a topic of all alert events and can be used as any other Kafka topic would be used.

`TRAIN_MOVEMENTS` | `ALERT_CONFIG`

We subscribe to the alert topic using Kafka Connect, which takes messages as they arrive on the topic and pushes them to the user, in this case using the REST API and Telegram.

Rail Alerts

The platform in detail

Let’s now take a look at each stage of implementation in detail.

Ingest

All of the data comes from Network Rail. It’s ingested using a combination of Kafka Connect and CLI producer tools.

ActiveMQ, Kafka Connect, CLI Producer Tools ➝ Apache Kafka

The live stream of train data comes in from an ActiveMQ endpoint. Using Kafka Connect, the ActiveMQ connector, and externally stored credentials, stream the messages into Kafka for a selection of train companies:

curl -X PUT -H "Accept:application/json" \
            -H  "Content-Type:application/json" \
    http://localhost:8083/connectors/source-activemq-networkrail-TRAIN_MVT_EA_TOC-01/config \
    -d '{
    "connector.class": "io.confluent.connect.activemq.ActiveMQSourceConnector",
    "activemq.url": "tcp://datafeeds.networkrail.co.uk:61619",
    "activemq.username": "${file:/data/credentials.properties:NROD_USERNAME}",
    "activemq.password": "${file:/data/credentials.properties:NROD_PASSWORD}",
    "jms.destination.type": "topic",
    "jms.destination.name": "TRAIN_MVT_EA_TOC",
    "kafka.topic": "networkrail_TRAIN_MVT",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "confluent.license": "",
    "confluent.topic.bootstrap.servers": "kafka:29092",
    "confluent.topic.replication.factor": 1
}'

The data is written to the .text field of the payload as a batch of up to 30 messages in escaped JSON form, which looks like this:

Envelope | Payload

To make these into an actual stream of individual events, I use kafkacat and jq to parse the JSON payload and stream it into a new topic after exploding each batch of messages into individual ones:

kafkacat -b localhost:9092 -G tm_explode networkrail_TRAIN_MVT | \
 jq -c '.text|fromjson[]' | \
 kafkacat -b localhost:9092 -t networkrail_TRAIN_MVT_X -T -P

The resulting messages look like this:

Messages

The remaining data comes from an S3 bucket which has a REST endpoint, so we pull that in using curl and again kafkacat:

curl -s -L -u "$NROD_USERNAME:$NROD_PASSWORD" "https://datafeeds.networkrail.co.uk/ntrod/CifFileAuthenticate?type=CIF_EA_TOC_FULL_DAILY&day=toc-full" | \
  gunzip | \
  kafkacat -b localhost -P -t CIF_FULL_DAILY

There’s also some static reference data that is published on web pages. After scraping these manually, they’re produced directly into a Kafka topic:

kafkacat -b localhost:9092 -t canx_reason_code -P -K: <

Data wrangling with stream processing

With the reference data loaded and the live stream of events ingesting continually through Kafka Connect, we can now look at the central part of the data pipeline, in which the data is passed through a series of transformations and written back into Kafka. These transformations make the data usable for both the applications and the analytics that subscribe to the resulting Kafka topics.

Splitting one stream into many

Several of the sources include multiple types of data on a single stream. The data from ActiveMQ, for example, includes Activation, Movement, and Cancellation message types all on the same topic. Each is identifiable through the value of MSG_TYPE.

Train Movement | Train Cancellation | Train Activation

Using KSQL, we can simply read every message as it arrives and route it to the appropriate topic:

CREATE STREAM ACTIVATIONS AS 
	SELECT * FROM SOURCE WHERE MSG_TYPE='0001';

CREATE STREAM MOVEMENTS AS 
	SELECT * FROM SOURCE WHERE MSG_TYPE='0003';

CREATE STREAM CANCELLATIONS AS 
	SELECT * FROM SOURCE WHERE MSG_TYPE='0002';

NETWORKRAIL_TRAIN_MVT_X ➝ TRAIN_ACTIVATIONS | TRAIN_CANCELLATIONS | TRAIN_MOVEMENTS

Schema derivation and Avro reserialisation

All of the data we’re working with here has a schema, and one of the first steps in the pipeline is to reserialise the data from JSON (no declared schema) to Avro (declared schema). Aside from Avro’s smaller message sizes, the key benefit is that we don’t have to declare the schema in subsequent queries and derivations against Kafka topics.

At the beginning of the pipeline, declare the schema once against the JSON data:

CREATE STREAM SCHEDULE_RAW (
      TiplocV1       STRUCT<TRANSACTION_TYPE VARCHAR, TIPLOC_CODE VARCHAR, NALCO VARCHAR,
                            STANOX VARCHAR, CRS_CODE VARCHAR, DESCRIPTION VARCHAR,
                            TPS_DESCRIPTION VARCHAR>)
  WITH (KAFKA_TOPIC='CIF_FULL_DAILY',
        VALUE_FORMAT='JSON');

We then reserialise it to Avro using the WITH (VALUE_FORMAT='AVRO') syntax whilst selecting just the data types appropriate for the entity:

CREATE STREAM TIPLOC
     WITH (VALUE_FORMAT='AVRO') AS
SELECT  *
FROM    SCHEDULE_RAW
WHERE   TiplocV1 IS NOT NULL;

We can also manipulate the schema, flattening it by selecting nested elements directly:

CREATE STREAM TIPLOC_FLAT_KEYED AS 
SELECT  TiplocV1->TRANSACTION_TYPE  AS TRANSACTION_TYPE ,
       TiplocV1->TIPLOC_CODE       AS TIPLOC_CODE ,
       TiplocV1->NALCO             AS NALCO ,
       TiplocV1->STANOX            AS STANOX ,
       TiplocV1->CRS_CODE          AS CRS_CODE ,
       TiplocV1->DESCRIPTION       AS DESCRIPTION ,
       TiplocV1->TPS_DESCRIPTION   AS TPS_DESCRIPTION
FROM    TIPLOC;

The two strategies (reserialise to Avro and flatten schema) can be applied in the same query, but they’re just shown separately here for clarity. While it can flatten a declared STRUCT column, KSQL can also take a VARCHAR with JSON content and extract just particular columns from it. This is very useful if you don’t want to declare every column upfront or have more complex JSON structures to work with, from which you only want a few columns:

CREATE STREAM SCHEDULE_00 AS
	SELECT           extractjsonfield(schedule_segment->schedule_location[0],'$.tiploc_code') as ORIGIN_TIPLOC_CODE,
[…]
FROM JsonScheduleV1;

Deriving and setting the message key and partitioning strategy

The key on a Kafka message is important for two reasons:

  1. Partitioning is a crucial part of the design because partitions are the unit of parallelism. The greater the number of partitions, the more you can parallelise your processing. The message key is used to define the partition in which the data is stored, so all data that should be logically processed together, in order, needs to be located in the same partition and thus have the same message key.
  2. Joins between streams and tables rely on message keys. The key is used to ensure that the data for a given key in the different streams and tables being joined is co-located on the same machine, so that distributed join operations can be performed.

The data at the point of ingest has no key, so we can easily use KSQL to set one. In some cases, we can just use an existing column:
CREATE STREAM LOCATION_KEYED AS 
	SELECT * FROM LOCATION
	PARTITION BY LOCATION_ID;

In other cases, we need to derive the key first as a composite of more than one existing column. An example of this is in the schedule data, which needs to be joined to activation data (and through that to movement) on three columns, which we generate as a key as shown below. In the same statement, we’re declaring four partitions for the target topic:

CREATE STREAM SCHEDULE_00 WITH (PARTITIONS=4) AS 
SELECT CIF_train_uid,
       schedule_start_date,
       CIF_stp_indicator,
       CIF_train_uid + '/' + schedule_start_date + '/' + CIF_stp_indicator AS SCHEDULE_KEY,
       atoc_code,
	[…]
FROM JsonScheduleV1
PARTITION BY SCHEDULE_KEY;

Enriching data with joins and lookups

Much of the data in the primary event stream of train movements and cancellations is in the form of foreign keys that need to be resolved out to other sources in order to be understood. Take location, for example:

Movement | Location

A record may look like this:

{
  "event_type": "ARRIVAL",
  "actual_timestamp": "1567772640000",
  "train_id": "161Y82MG06",
  "variation_status": "LATE",
  "loc_stanox": "54311"
}

We want to resolve the location code (loc_stanox), and we can do so using the location reference data from the CIF data ingested into a separate Kafka topic and modelled as a KSQL table:

    SELECT  EVENT_TYPE, 
            ACTUAL_TIMESTAMP,
            LOC_STANOX,
            S.TPS_DESCRIPTION AS LOCATION_DESCRIPTION
    FROM TRAIN_MOVEMENTS_00 TM 
            LEFT OUTER JOIN STANOX S
            ON TM.LOC_STANOX = S.STANOX;

+------------+-----------------+------------+---------------------+
|EVENT_TYPE  |ACTUAL_TIMESTAMP |LOC_STANOX  |LOCATION_DESCRIPTION |
+------------+-----------------+------------+---------------------+
|ARRIVAL     |1567772640000    |54311       |LONDON KINGS CROSS   |

More complex joins can also be resolved by daisy-chaining queries and streams together. This is useful for resolving the relationship between a train’s movement and the originating schedule, which gives us lots of information about the train (power type, seating data, etc.) and route (planned stops, final destination, etc.).

Movement | Activation | Schedule

To start with we need to join movements to activations:

CREATE STREAM TRAIN_MOVEMENTS_ACTIVATIONS_00 AS
SELECT  *
 FROM  TRAIN_MOVEMENTS_01 TM
       LEFT JOIN
       TRAIN_ACTIVATIONS_01_T TA
       ON TM.TRAIN_ID = TA.TRAIN_ID;

Having done that we can then join the resulting stream to the schedule:

CREATE STREAM TRAIN_MOVEMENTS_ACTIVATIONS_SCHEDULE_00 AS
SELECT *
 FROM  TRAIN_MOVEMENTS_ACTIVATIONS_00 TMA
       LEFT JOIN
       SCHEDULE_02_T S
       ON TMA.SCHEDULE_KEY = S.SCHEDULE_KEY;

Column value enrichment

As well as joining to other topics in Kafka, KSQL can use the powerful CASE statement in several ways to help with data enrichment. Perhaps you want to resolve a code used in the event stream but it’s a value that will never change (famous last words in any data model!), and so you want to hard code it:

SELECT  CASE WHEN TOC_ID = '20' THEN 'TransPennine Express'
              WHEN TOC_ID = '23' THEN 'Arriva Trains Northern'
              WHEN TOC_ID = '28' THEN 'East Midlands Trains'
              WHEN TOC_ID = '61' THEN 'London North Eastern Railway'
             ELSE '<unknown TOC code: ' + TOC_ID + '>'
       END AS TOC,
[…]
  FROM TRAIN_MOVEMENTS_00

What if you want to prefix a column with some fixed text when it has a value, but leave it blank when it doesn’t?

SELECT  CASE WHEN LEN( PLATFORM)> 0 THEN 'Platform' + PLATFORM
            ELSE ''
         END AS PLATFORM,
[…]
  FROM TRAIN_MOVEMENTS_00

Or maybe you want to build a value by conditionally concatenating the contents of one column based on the value of another:

SELECT  CASE WHEN VARIATION_STATUS = 'ON TIME' THEN 'ON TIME'
            WHEN VARIATION_STATUS = 'LATE' THEN TM.TIMETABLE_VARIATION + ' MINS LATE'
            WHEN VARIATION_STATUS='EARLY' THEN TM.TIMETABLE_VARIATION + ' MINS EARLY'
       END AS VARIATION ,
[…]
  FROM TRAIN_MOVEMENTS_00 TM

Handling time

One of the most important elements to capture in any event streaming system is when different things actually happen. What time did a train arrive at the station? If it was cancelled, when was it cancelled? Beyond simply ensuring accuracy in reporting the individual events, time becomes really important when we start aggregating and filtering on it. If we want to know how many trains a given operating company cancelled each hour, it’s no use simply counting the cancellation messages received in that hour. That would just tell us how many cancellation messages we received in the hour, which might be interesting but doesn’t answer the question asked. Instead of using system time, we want to work with event time. Each message has several timestamps in it, and we can tell KSQL which one to use. Here’s an example cancellation message:

{
   "header": {
       "msg_type": "0002",
       "source_dev_id": "V2PY",
       "user_id": "#QRP4246",
       "original_data_source": "SDR",
       "msg_queue_timestamp": "1568048168000",
       "source_system_id": "TRUST"
   },
   "body": {
       "train_file_address": null,
       "train_service_code": "12974820",
       "orig_loc_stanox": "",
       "toc_id": "23",
       "dep_timestamp": "1568034360000",
       "division_code": "23",
       "loc_stanox": "43211",
       "canx_timestamp": "1568051760000",
       "canx_reason_code": "YI",
       "train_id": "435Z851M09",
       "orig_loc_timestamp": "",
       "canx_type": "AT ORIGIN"
   }
}

There are four timestamps, each with different meanings. Along with the timestamps in the message payload, there’s also the timestamp of the Kafka message. We want to tell KSQL to process the data based on the canx_timestamp field, which is when the cancellation was entered into the source system (and thus our closest field for event time):

CREATE STREAM TRAIN_CANCELLATIONS_01
 WITH (TIMESTAMP='CANX_TIMESTAMP') AS
 SELECT *
 FROM   TRAIN_CANCELLATIONS_00 ;

Data sinks

Kafka Connect is used to stream the data from the enriched topics through to the target systems:

Because the enriched data is persisted in Kafka, it can be reloaded to any target as required as well as streamed to additional ones at a future date.

One of the benefits of using Elasticsearch is that it can serve interactive dashboards through Kibana, as shown above. Another benefit is its data profiling tools, which are useful for understanding the data during iterations of pipeline development:

TRAIN_CATEGORY | TRAIN_STATUS

Fast track to streaming ETL

This article has shown how Apache Kafka as part of Confluent Platform can be used to build a powerful data system. Events are ingested from an external system, enriched with other data, transformed, and used to drive both analytics and real-time notification applications.

If you want to try out the code shown in this article you can find it on GitHub and download the Confluent Platform to get started.

Robin Moffatt is a developer advocate at Confluent, as well as an Oracle Groundbreaker Ambassador and ACE Director (alumnus). His career has always involved data, from the old worlds of COBOL and DB2, through the worlds of Oracle and Hadoop and into the current world with Kafka. His particular interests are analytics, systems architecture, performance testing, and optimization. You can follow him on Twitter.

4 Steps to Creating Apache Kafka Connectors with the Kafka Connect API


If you’ve worked with the Apache Kafka® and Confluent ecosystem before, chances are you’ve used a Kafka Connect connector to stream data into Kafka or stream data out of it. While there is an ever-growing list of connectors available—whether Confluent or community supported⏤you still might find yourself needing to integrate with a technology for which no connectors exist. Don’t despair, my friend! You can create a connector with the Kafka Connect API, which provides an easy way to create fault-tolerant Kafka producers or consumers for streaming data in and out of Kafka.

This article will cover the basic concepts and architecture of the Kafka Connect framework. Then, we’ll dive into four steps for being well on your way toward developing a Kafka connector. Our discussion will largely focus on source connectors, but many of the concepts covered will apply to sink connectors as well. We’ll also discuss next steps for learning more about Kafka Connect development best practices, as well as harnessing Confluent’s help in getting your connector verified and published on the Confluent Hub.

What is Kafka Connect?

Kafka Connect specializes in copying data into and out of Kafka. At a high level, a connector is a job that manages tasks and their configuration. Under the covers, Kafka Connect creates fault-tolerant Kafka producers and consumers, tracking the offsets for the Kafka records they’ve written or read.

Beyond that, Kafka connectors provide a number of powerful features. They can be easily configured to route unprocessable or invalid messages to a dead letter queue, apply Single Message Transforms before a message is written to Kafka by a source connector or before it is consumed from Kafka by a sink connector, integrate with Confluent Schema Registry for automatic schema registration and management, and convert data into types such as Avro or JSON. By leveraging existing connectors⏤for example, those listed on the Confluent Hub⏤developers can quickly create fault-tolerant data pipelines that reliably stream data from an external source into records in Kafka topics or from Kafka topics into an external sink, all with mere configuration and no code!
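As an illustration of that configuration-only approach, the dead letter queue behaviour of a sink connector is driven by a handful of properties (the topic name below is just an example):

# Route records that fail conversion or transformation to a dead letter queue
# instead of failing the connector task.
errors.tolerance=all
errors.deadletterqueue.topic.name=dlq-my-sink-connector
errors.deadletterqueue.context.headers.enable=true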

Each connector instance can break down its job into multiple tasks, thereby parallelizing the work of copying data and providing scalability. When a connector instance starts up a task, it passes along the configuration properties that each task will need. The task stores this configuration—as well as the status and the latest offsets for the records it has produced or consumed—externally in Kafka topics. Since the task does not store any state, tasks can be stopped, started, or restarted at any time. Newly started tasks will simply pick up the latest offsets from Kafka and continue on their merry way.

Source ➝ Kafka Connect ➝ Apache Kafka

Kafka connectors can be run in either standalone or distributed mode. In standalone mode, Kafka Connect runs on a single worker⏤that is, a running JVM process that executes the connector and its tasks. In distributed mode, connectors and their tasks are balanced across multiple workers. The general recommendation is to run Kafka Connect in distributed mode, as standalone mode does not provide fault tolerance.

To start a connector in distributed mode, send a POST request to the Kafka Connect REST API, as described in the documentation. This request triggers Kafka Connect to automatically schedule the execution of the connectors and tasks across multiple workers. In the instance that a worker goes down or is added to the group, the workers will automatically coordinate to rebalance the connectors and tasks amongst themselves.
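For example, a connector could be created with a request along these lines (the connector name, class, and configuration properties are illustrative):

curl -X POST -H "Content-Type: application/json" \
     http://localhost:8083/connectors \
     -d '{
  "name": "cloud-storage-source",
  "config": {
    "connector.class": "com.example.CloudStorageSourceConnector",
    "tasks.max": "3",
    "bucket": "my-bucket",
    "prefix.whitelist": "path/to/file/1,path/to/file/2,path/to/file/3",
    "topic": "cloud-storage-topic"
  }
}'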

Getting started with Kafka connectors

Kafka Connect is part of Apache Kafka but does not, in and of itself, include connectors. You can download connectors separately, or you can download the Confluent Platform, which includes both Apache Kafka and a number of connectors, such as JDBC, Elasticsearch, HDFS, S3, and JMS. Starting these connectors is as easy as submitting a POST request to the Kafka Connect REST API with the required configuration properties. For integration with other sources or sinks, you are likely to find a connector that suits your needs on the Confluent Hub.

In case a Kafka connector does not already exist for the technology you want to integrate with, this article will guide you through the first steps toward developing a Kafka connector that does. As we will see, creating a connector is just a matter of implementing several Kafka Connect interfaces. The Kafka Connect framework takes care of the rest so that you can focus on implementing the logic specific to your integration, without getting bogged down by boilerplate code and operational complexities.

Interested in creating a connector? Kafka Connect API to the rescue!

The Kafka Connect API allows you to plug into the power of the Kafka Connect framework by implementing several of the interfaces and abstract classes it provides. A basic source connector, for example, will need to provide extensions of the following three classes: SourceConnector, SourceTask, and AbstractConfig. Together, these define the configuration and runtime behavior of your custom Kafka connector. In the following sections, we’ll cover the essential components that will get you up and running with your new Kafka connector.

Step 1: Define your configuration properties

When connectors are started, they pick up configuration properties that allow the connector and its tasks to communicate with an external sink or source, set the maximum number of parallel tasks, specify the Kafka topic to stream data to or from, and provide any other custom information that may be needed for the connector to do its job.

Configuration values are first provided to the connector as String instances. See, for example, the method signature for Connector#start:

public abstract class Connector implements Versioned {
[...]
	public abstract void start(Map<String, String> props);
[...]
}

Once passed to the connector on startup, the provided properties can be parsed into more appropriate types by passing them to an instance of the AbstractConfig class provided by the Kafka Connect API. The first step in developing your connector is to create a class that extends AbstractConfig, which allows you to define types along with default values, validations, recommenders, and documentation for each property.

Suppose, for example, you are writing a source connector to stream data from a cloud storage provider. Among the configuration properties needed to start such a connector, you may want to include the name of the bucket, a whitelist of key prefixes for the objects to import, and the Kafka topic name to produce records to. Here is an example configuration class you might write:

public class CloudStorageSourceConnectorConfig extends AbstractConfig {

    public CloudStorageSourceConnectorConfig(Map<String, String> originals) {
        super(configDef(), originals);
    }

    protected static ConfigDef configDef() {
        return new ConfigDef()
                .define("bucket",
                        ConfigDef.Type.STRING,
                        ConfigDef.Importance.HIGH,
                        "Name of the bucket to import objects from")
                .define("prefix.whitelist",
                        ConfigDef.Type.LIST,
                        ConfigDef.Importance.HIGH,
                        "Whitelist of object key prefixes")
                .define("topic",
                        ConfigDef.Type.STRING,
                        ConfigDef.Importance.HIGH,
                        "Name of Kafka topic to produce to");
    }
}

Note that in our example, we define the prefix.whitelist property to be of List type. When we pass the map of original values to the parent AbstractConfig class, the configuration properties will be parsed into their appropriate types according to the configuration definition. As a result, we can later grab the prefix.whitelist value as a List from our connector’s configuration instance, even though the value was originally provided to the connector as a comma-delimited String, e.g., “path/to/file/1,path/to/file/2,path/to/file/3”.

At a minimum, each configuration definition will require a configuration key, the configuration value type, a level of importance, a brief description documenting the configuration property, and in most cases, a default value. However, you should also take advantage of more advanced features, such as the ability to define groups of configs, pass in validators that will be invoked on startup, provide recommenders that suggest configuration values to the user, and specify the order of configs or a dependency on other configs. In fact, it’s best practice to include validators, recommenders, groups, and defaults where possible to ensure that your user gets immediate feedback upon misconfiguration and can easily understand the available configuration options and their logical groupings.
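For example, a single definition that adds a default value and a validator might look something like this (the poll.interval.ms property is purely illustrative and not part of the connector above); it chains onto the ConfigDef in the same way as the definitions shown earlier:

        .define("poll.interval.ms",
                ConfigDef.Type.LONG,
                60_000L,                           // default used when the property is omitted
                ConfigDef.Range.atLeast(1_000L),   // validator invoked on startup
                ConfigDef.Importance.MEDIUM,
                "How often to poll the cloud storage bucket for new objects")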

Having made our configuration class, we can now turn our attention to starting the connector. Here’s an example implementation of start in our CloudStorageSourceConnector class:

public class CloudStorageSourceConnector extends SourceConnector {

    private CloudStorageSourceConnectorConfig connectorConfig;
    private Map<String, String> configProps;

    @Override
    public void start(Map<String, String> props) {
        this.connectorConfig = new CloudStorageSourceConnectorConfig(props);
        this.configProps = Collections.unmodifiableMap(props);
    }

   [...]
}

When the connector starts, a new instance of our custom configuration class is created, which provides a configuration definition to the Kafka Connect framework. If any of the required configurations are missing or provided as an incorrect type, validators will automatically cause startup failures with an appropriate error message.

Step 2: Pass configuration properties to tasks

The next step is to implement the Connector#taskConfigs method, which returns a list of maps containing the configuration properties each task will use to stream data into or out of Kafka:

public abstract class Connector implements Versioned {
[...]
	public abstract List<Map<String, String>> taskConfigs(int maxTasks);
[...]
}

The method accepts an int value for the maximum number of tasks to run in parallel, which is pulled from the tasks.max configuration property provided on startup.

Each map in the List returned by taskConfigs corresponds with the configuration properties used by a task. Depending on the kind of work your connector is doing, it may make sense for all tasks to receive the same config properties, or you may want different task instances to get different properties. For example, suppose you want to divide the whitelisted object key prefixes evenly across the running task instances. Given a whitelist of three key prefixes and three tasks, you would provide one key prefix to each task instance. Each task can then focus on streaming data for objects whose keys have a particular prefix, splitting up the work into parallel tasks.

There are several considerations to keep in mind when implementing taskConfig. First, the tasks.max configuration property is provided to allow users the ability to limit the number of tasks to be run in parallel. It provides the upper limit of the size of the list returned by taskConfig. Second, the size of the returned list will determine how many tasks start. With a database connector, for example, you might want each task to pull data from a single table. If your database is relatively simple and only has two tables, then you could have your taskConfigs return a list of size two, even if the maxTasks value passed into the method is greater than two. On the other hand, if you have six tables but a maxTasks value of two, then you will need each task to pull from three tables.

To help perform this grouping, the Kafka Connect API provides the utility method ConnectorUtils#groupPartitions, which splits a target list of elements into a desired number of groups. Similarly, in our cloud storage example, we can implement taskConfig to get the whitelist of object key prefixes, divide that list based on the value of maxTasks or the size of the prefix whitelist, and return a list of configs, with each config containing different object key prefixes for the task to stream objects for. Below is an example implementation:

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        List<String> prefixes = connectorConfig.getList(PREFIX_WHITELIST_CONFIG);
        int numGroups = Math.min(prefixes.size(), maxTasks);
        List<List<String>> groupedPrefixes = ConnectorUtils.groupPartitions(prefixes, numGroups);
        List<Map<String, String>> taskConfigs = new ArrayList<>(groupedPrefixes.size());

        for (List<String> taskPrefixes : groupedPrefixes) {
            Map<String, String> taskProps = new HashMap<>(configProps);
            taskProps.put(TASK_PREFIXES, String.join(",", taskPrefixes));
            taskConfigs.add(taskProps);
        }

        return taskConfigs;
    }

On startup, the Kafka Connect framework will pass each configuration map contained in the list returned by taskConfigs to a task.

The connector will also need additional methods implemented, but the implementation of those methods are relatively straightforward. Connector#stop gives you an opportunity to close any resources that may be open before the connector is stopped. Although simple in what it needs to accomplish, it’s important for Connector#stop not to block the shutdown process for too long. Connector#taskClass returns the class name of your custom task. Connector#config should return the ConfigDef defined in your custom configuration class. Lastly, Connector#version must return the connector’s version.
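To sketch what those look like (assuming a task class named CloudStorageSourceTask and that configDef() is made accessible to the connector), the implementations might be as simple as:

    @Override
    public Class<? extends Task> taskClass() {
        return CloudStorageSourceTask.class;   // the custom task class (hypothetical name)
    }

    @Override
    public ConfigDef config() {
        return CloudStorageSourceConnectorConfig.configDef();   // assumes configDef() is accessible
    }

    @Override
    public void stop() {
        // Close any open resources here; keep it quick so shutdown isn't blocked.
    }

    @Override
    public String version() {
        return "0.1.0";
    }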

Step 3: Task polling

As with the Connector class, Task includes abstract methods for start, stop, and version. Most of the logic for streaming data into Kafka, however, will occur in the poll method, which is continually called by the Kafka Connect framework for each task:

    public abstract List<SourceRecord> poll() throws InterruptedException;

As we can see, the poll method returns a list of SourceRecord instances. A source record is used primarily to store the headers, key, and value of a Connect record, but it also stores metadata such as the source partition and source offset.

Source partitions and source offsets are simply a Map that can be used to keep track of the source data that has already been copied to Kafka. In most cases, the source partition reflects the task configuration that allows the task to focus on importing specific groups of data.

For example, our cloud storage source connector imports objects based on a whitelist of object key prefixes. In the implementation for Task#poll, the imported object is wrapped in a SourceRecord that contains a source partition, which is a Map that has information about where the record came from. The source partition could store the object key prefix that the task used to import the object. SourceRecord instances also contain a source offset, which is used to identify the object that was imported from the source. The source offset could contain identification information about the object in the bucket⏤the full object key name, version ID, last modified timestamp, and other such fields. The source partition and offset can later be used by the task to track the objects that have already been imported.

The Kafka Connect framework automatically commits offsets to the topic configured by the offset.storage.topic property. When a Connect worker or task is restarted, it can use the task’s SourceTaskContext to obtain an OffsetStorageReader, which has an offset method for getting the latest offset recorded for a given source partition. The task can then use the offset and partition information to resume importing data from the source without duplicating or skipping records.
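To make this concrete, here is a rough sketch of how a task might use these APIs to resume where it left off. The StorageObject type, the fetchNewObjects call, the property names, and the omitted field declarations are illustrative stand-ins, not part of the Kafka Connect API:

    @Override
    public void start(Map<String, String> props) {
        prefix = props.get("task.prefixes");                          // hypothetical task config key
        Map<String, Object> lastOffset = context.offsetStorageReader()
                .offset(Collections.singletonMap("prefix", prefix));  // source partition
        // Resume from the last imported object, if one was recorded.
        lastObjectKey = lastOffset == null ? null : (String) lastOffset.get("object.key");
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        List<SourceRecord> records = new ArrayList<>();
        for (StorageObject object : fetchNewObjects(prefix, lastObjectKey)) {  // hypothetical client call
            records.add(new SourceRecord(
                    Collections.singletonMap("prefix", prefix),            // source partition
                    Collections.singletonMap("object.key", object.key()),  // source offset
                    topic,
                    Schema.STRING_SCHEMA,
                    object.contentAsString()));
            lastObjectKey = object.key();
        }
        return records;
    }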

Step 4: Create a monitoring thread

The Kafka Connect REST API includes an endpoint for modifying a Connector’s configuration. Submit a PUT request as described in the documentation, and your connectors and tasks will rebalance across the available workers to ensure that the configuration changes do not prompt an uneven workload across nodes.

However, you may want to design your connector to be able to pick up changes in the source, pick up new configs, and rebalance the workload across available workers without having to manually submit a request to the Kafka Connect API. Connectors monitoring for changes in the external source that may require reconfiguration and automatically reconfigure to accommodate those changes are called dynamic connectors.

To make your connector dynamic, you will need to create a separate thread for monitoring changes and create a new instance of the monitoring thread upon connector startup:

public class MySourceConnector extends SourceConnector {

    private MonitoringThread monitoringThread;

    @Override
    public void start(Map<String, String> props) {
        [...]
        monitoringThread = new MonitoringThread(context);
    }
    [...]
}

Your source connector will also need to pass its ConnectorContext to the monitoring thread. If the monitor detects changes in the external source, requiring reconfiguration, it will call ConnectorContext#requestTaskReconfiguration to trigger the Kafka Connect framework to update its task configuration.

Since updated configuration often means changes to the input partitions, the Kafka Connect framework also rebalances the workload across the available workers. On startup, the source connector can pass a polling interval property to the monitoring thread that can set a wait time on a CountDownLatch. Here is a sample implementation, which waits a certain number of milliseconds before querying the external source again for changes:

public class MonitoringThread extends Thread {

    [...]
    private final Long pollInterval;

    public MonitoringThread(ConnectorContext context, Long pollInterval) {
        [...]
        this.pollInterval = pollInterval;
    }

    @Override
    public void run() {
        while (shutdownLatch.getCount() > 0) {
            if (sourceHasChanged()) {
                context.requestTaskReconfiguration();
            }

            try {
                shutdownLatch.await(pollInterval, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                log.warn("MonitoringThread interrupted: ", e);
            }
        }
    }
    [...]
}

Having implemented a monitoring thread that triggers task reconfiguration when the external source has changed, you now have a dynamic Kafka connector!

Next steps

Although further implementation is required to have a fully running connector, we’ve covered the major components you need to start creating a dynamic source connector. To learn more about Kafka Connect development, see the documentation. Also be sure to check out Robin Moffatt’s awesome talk From Zero to Hero with Kafka Connect, which covers how to use Kafka connectors to create a pipeline for streaming data from a database to Kafka and then to Elasticsearch, including a discussion of common issues that may arise and how to resolve them.

If you’re interested in developing or submitting a connector for distribution on the Confluent Hub, the Confluent Verified Integrations Program is a fantastic resource for getting guidance on developing and verifying your connector. There, you’ll find a verification guide and checklist that contains connector development best practices as well as the connector features and behaviors required to achieve a Confluent-verified Gold status. The verification guide is another great resource for learning more about Kafka Connect development.

Tiffany Chang is a Java developer who loves collaborating and sharing knowledge with others. She is passionate about enabling agile teams to build reliable and fast event-driven data platforms using Spring Cloud enabled microservices and high-performance technologies, such as Apache Kafka and Geode. In her current role as the engineering anchor at Enfuse.io, she partners with client teams to incrementally build out event-driven systems for reliability and scalability. In her spare time, she enjoys nature walks and pasta.

Getting Started with Rust and Apache Kafka


I’ve written an event sourcing bank simulation in Clojure (a Lisp built for the Java Virtual Machine, or JVM) called open-bank-mark, which you are welcome to read about in my previous blog post explaining the story behind this open source example. As a next step, specifically for this article, I’ve added SSL and combined some topics together using the subject name strategy option of Confluent Schema Registry. This makes the setup more production-like, adds security, and makes it possible to put multiple kinds of commands on one topic. We will examine how the application works and what was needed to change one of the components from Clojure to Rust. We’ll also take a look at some performance tests to see if Rust might be a viable alternative to Java for applications using Apache Kafka®.

The bank application

The bank application simulates an actual bank where you can open an account and transfer money. Alternatively, you can get money into the system by simply depositing money with the push of a button. Either way, both are accomplished with event sourcing. In this case, that means a command is created for a particular action, which will be assigned to a Kafka topic specific for that action. Each command will eventually succeed or fail. The result will be put on another topic, where a failed response contains a reason for the failure and a successful response might contain additional information. Processing the commands can cause derived events—events that happened because a command was executed. This happens when money is successfully transferred from one bank account to another: the balances of two accounts have changed, creating a balance-changed event that is put on yet another topic.

Overview

Here, the orange pieces are part of the Confluent Platform, while the gray ones are small Clojure or Rust applications. The blue parts represent PostgreSQL databases, and the turquoise one is an Nginx web server. All messages use a String for the key and Avro for the value. In this setup, the schemas are set with the synchronizer, which also creates the topics. This makes it easy to automate, since all the components of the application are Dockerized, and it contains scripts for easily setting up and testing the application.

GraphQL is used to handle the interaction between frontend and backend. By filling out a form on the frontend, either a ConfirmAccountCreation or ConfirmMoneyTransfer event is sent to the backend, and the feedback is passed to the frontend again. The events are handled by the command handler, which is the part of the system that has been ported to Rust. Later, the changes in the frontend will be used to measure end-to-end latency.

Note that the picture does not display a module called Topology, which is included in all Clojure applications as a dependency and contains a description of both the schemas and the topics, along with information about which schemas are used for which topic. The Synchronizer needs this to set the schemas correctly in the Schema Registry, according to the TopicRecordNameStrategy. It allows for different schemas within one topic and for the Schema Registry to check if new schemas are compatible.

The schemas are also useful for generating specific Java classes. Using Java interop, these classes make working with the data in Clojure easier. They offer several functions that are included in Topology and can be used in different applications, such as getting a Java UUID from the fixed bytes ID in the schema. Also, there are several functions wrapping the Java clients for Kafka. Depending on environment variables and whether SSL is set or not, it takes care of setting the configuration (like the serializers) as well.

A short introduction to Rust

On May 15, 2015, the Rust core team released version 1.0 of Rust. Since then, it’s been possible to build libraries, called crates in Rust, that stay compatible with the latest version of Rust, because Rust strives to be backwards compatible. By building on top of the crates that already exist, crates with a higher level of abstraction can be created. Although Rust is a systems programming language, you can indeed use it to write applications at a level of abstraction roughly on par with that of Java.

Rust is a compiled language, so code needs to be compiled first in order to be executed. Several IDEs with support for Rust exist, but in this blog post, I only use the Rust plugin for IntelliJ IDEA. It behaves a lot like working with Java: it can auto-import, makes it easy to rename functions, and validates the code as you write it. It’s also easy to run Clippy from IntelliJ, a linter that suggests improvements, which is especially useful while learning Rust. Most errors are caught when compiling the program. Debugging from IntelliJ is not possible, so for the remaining errors, you either need helpful log statements or you need to inspect the code.

Clojure command handler

The command handler has two external connections, to Kafka and to PostgreSQL. The first version was written in Clojure, using a separate Kafka consumer and producer with Java interop. Because Clojure is a functional language, it’s easy to define a function that takes the topics along with a handler function for each event. Mostly the default configuration is used, but linger.ms has been set to 100 for higher producer throughput. This is the same value that Kafka Streams uses by default.

The Java classes generated by the topology hold the data. These classes are based on the Avro schemas and are mutable, so they are not idiomatic Clojure, which favors immutable data. This way, though, you can use the Avro serializers from Confluent.

For every command, the handler first checks whether the command was already processed earlier; if so, it should send the same response back. This is what the cmt and cac tables are used for: using the UUID from the command, the handler tries to find an existing entry and, if it finds one, sends back the same kind of response as before.

If the command has not yet been performed, the command handler will make an attempt using the balance table. In the case of a ConfirmAccountCreation, it will generate a random new bank account number. In other cases, it will also generate a token and send a confirmation. With ConfirmMoneyTransfer, it will try to transfer money. Here it is important to validate the token sent with the command and to make sure that the balance won’t drop below the minimum amount. If the money is transferred, it will also create a BalanceChanged event for every balance change.

Command Handler

Similar to the Java main method, Clojure applications have a main function that starts the consumers for both ConfirmAccountCreation and ConfirmMoneyTransfer, which share the same producer. To connect to PostgreSQL, next-jdbc provides low-level access from Clojure to JDBC-based databases.

Rust libraries for Kafka

There are two libraries, or crates, for using Kafka in combination with Rust, which are called kafka and rdkafka. kafka is an implementation of the Kafka protocol in Rust, while rdkafka is a wrapper for librdkafka.

Both libraries offer a consumer and a producer but are a bit more low-level than the Java clients. Since you can’t add serializers, you need to wrap the clients for serialization. This is also helpful for defining how to handle errors. When a potential error is detected, most libraries return the Result<T, E> type. This indicates the possibility of an error, and it’s up to the user to decide what to do. Depending on the error, it will either be logged or the application will be terminated.

Some(security_config) => Consumer::from_hosts(brokers)
   .with_topic(topic.to_string())
   .with_group(group.to_string())
   .with_fallback_offset(FetchOffset::Earliest)
   .with_offset_storage(GroupOffsetStorage::Kafka)
   .with_security(security_config)
   .create()
   .expect("Error creating secure consumer"),

Above is an example of creating a consumer with the kafka library. Although the kafka library works with Apache Kafka 2.3, it does not have all the features from Kafka 0.9 and newer. For example, it’s missing LZ4 compression support, which depending on the Kafka configuration could make it unusable. In addition, the library has not been updated for almost two years. The configuration is easy and well typed, but also limited. Support for SSL is present but requires some additional code, not just setting properties.

let context = CustomContext;
let consumer: ProcessingConsumer = ClientConfig::new()
   .set("group.id", group_id)
   .set("bootstrap.servers", brokers)
   .set("enable.auto.commit", "true")
   .set("enable.auto.offset.store", "false")
   .set("statistics.interval.ms", "0")
   .set("fetch.error.backoff.ms", "1")
   .set("auto.offset.reset", "earliest")
   .set_log_level(RDKafkaLogLevel::Warning)
   .optionally_set_ssl_from_env()
   .create_with_context(context)
   .expect("Consumer creation failed");

Above is an example of creating a consumer using the rdkafka library. The rdkafka library is based on librdkafka 1.0.0. Work is being done to support more features, and recently admin client functions were added. The crate offers three producer APIs and two consumer APIs, which are either synchronous or asynchronous. Because it’s based on librdkafka, the configuration is pretty similar to the Java client.

Using Schema Registry

For Java clients, using Schema Registry is fairly simple. Basically, it involves three parts:

  1. Include the correct Avro serializer depending on the type of client.
  2. Add some configuration to the serializer from the first step, the minimum being the Schema Registry URL.
  3. Most difficult of all, start using Avro objects for the data. One option is to have a separate Java project with the Avro schemas and generate the Java classes from there.

In lieu of ready-made serializers, I wrote some code to translate the bytes created by the Avro serializers for Java into something that Rust would understand, and vice versa. For making REST calls to Schema Registry, I used a Rust library based on curl, and avro-rs for Avro serialization. Initially, I was happy when I got it to work, although the value in Rust would be of the type Value, which is sort of a generic record. Getting the actual values out was therefore harder than it could have been, as seen in the code example below. The values of a record are modelled as a vector, in which each part can be of multiple types, so we have to check that the field is indeed an ID and that the Value matches the expected type Fixed, with 16 bytes. It would be easier if you could just get the ID property.

let id = match cac_values[0] {
   (ref _id, Value::Fixed(16, ref v)) => ("id", Value::Fixed(16, v.clone())),
   _ => panic!("Not a fixed value of 16, while that was expected"),
};

I tried adding support for a more convenient way of handling this with the 1.1.0 release of the schema_registry_converter. This worked fine for simple schemas. However, once you start using more complex ones, the avro-rs library won’t handle serialization of both fixed and enum values correctly. To avoid changing the current schemas to work around the bugs, I instead added some code both in the serializer and in the project itself to handle the bugs.

With the Confluent serializers, you can use specific Avro classes, as long as the classes matching the records are on the classpath. In Rust, it’s not possible to dynamically create data objects this way. That’s why if you want to use the specific class equivalent in Rust, you need a function that takes the raw values and the name, and outputs a specific type.

Command handler in Rust

When rewriting the command handler in Rust, I tried to make it as similar in functionality to the Clojure version as possible, keeping as much of the configuration the same as possible. I also wanted the same producer to be shared by both processes. This is easy on the JVM, as long as the producer is thread safe, but it’s a little harder in Rust, which does not have a garbage collector. On the JVM, an object is periodically checked for garbage collection eligibility, which comes down to whether or not it has non-weak references. In Rust, memory is freed as soon as the object is no longer referenced. For memory safety, the much-feared borrow checker that is part of the compiler ensures that an object is either mutable and not shared, or immutable and shared. In this example, we need a mutable producer that can be shared, which isn’t allowed by the borrow checker. Luckily, there are several ways to work around this.

A multi-producer, single-consumer (mpsc) FIFO queue allows you to use one Kafka producer from multiple threads. This queue is part of the standard library and makes it possible to send the same kind of objects from different threads to one receiver. With this intermediate construct, the actual Kafka producer can live in a single thread but still receive messages from multiple threads: both consumer threads write to the FIFO queue, and the single producer thread reads from it, solving the borrow checker problem. Diesel, an ORM and query builder, is used for the database.
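A minimal, self-contained sketch of the mpsc pattern (not the project’s actual code); the Kafka producer is simulated here with a println!:

use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<String>();
    let tx_cmt = tx.clone();

    // Two handler threads send the records they want produced to the queue.
    let cac_handler = thread::spawn(move || {
        tx.send("account-created response".to_string()).unwrap();
    });
    let cmt_handler = thread::spawn(move || {
        tx_cmt.send("money-transferred response".to_string()).unwrap();
    });

    cac_handler.join().unwrap();
    cmt_handler.join().unwrap();

    // The receiving side is the only place that touches the (mutable) producer,
    // so the borrow checker is satisfied. The loop ends once all senders are dropped.
    for record in rx {
        println!("would produce to Kafka: {}", record);
    }
}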

All three flavors (rdkafka sync, rdkafka async, and the Rust-native kafka library) come with examples that make them easy to use. You won’t have to handle many more error cases than you would with the Java clients. At the same time, getting at the actual data within the records may take some work, especially in the rdkafka library, since each record is wrapped in an Ok(Ok(Record)) to account for both a potential error reading a response from Kafka and a problem with the record itself.

Running applications with Docker

Running applications with Docker provides a nice way to deploy them, even more so now that Kubernetes is popular, as Docker images can easily be run on different platforms.

Creating a Docker image to run a Java application is easy. Once you have compiled a JAR, it can run on any JVM of the corresponding version independent of the platform, though the images can get quite big because you need a JVM.

With Rust, it’s slightly more complicated, since by default it compiles for the current environment. In the case of rdkafka, I used the standard Rust Docker image, based on Debian, to create a production build of the executable, and a slim Debian image for the actual image. For the native one, I used clux/muslrust for building, compiling everything statically, and FROM scratch for the actual image, generating a very small image of 9.18 MB. Despite the risks involved with maintaining a Docker image, this is beneficial not only due to its size but also because it’s more secure than building on top of several other layers, which might have security issues.

Part of the configuration for both the JVM and Rust variants can be set using environment variables. When they are not present, the applications fall back to default values, which makes it easier to run them without Docker.
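A sketch of this fallback in Rust, with an illustrative environment variable name:

use std::env;

// Read the broker list from an environment variable, falling back to a default
// so the application can also run outside Docker.
fn brokers() -> String {
    env::var("KAFKA_BROKERS").unwrap_or_else(|_| "127.0.0.1:9092".to_string())
}

fn main() {
    println!("connecting to {}", brokers());
}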

By keeping the names for the Docker image the same, only a few changes are needed to run the open bank application with Rust instead of Clojure. With Docker builders, you don’t even need to have Rust installed locally. The environment variables for Kafka, ZooKeeper, and SSL are different, but everything else keeps working exactly the same because the script for creating the certificates also creates non-keystore variants for the command handler.

Performance-testing transactions

All tests were run by first executing the prepare.sh script in the branch to test. To start the actual test, I restarted the laptop, and 10 tests were run in succession using loop.sh 10. The laptop used for the benchmarks was an Apple MacBook Pro 2018 with a 2.6 GHz i7 and 16 GB of memory. Docker was configured to have 9 CPUs and 12 GB of memory. To measure CPU and memory, some code was added to lispyclouds/clj-docker-client. The CPU scores in the test are relative to the amount available to Docker, so a value of 11% means about one CPU of the nine available is spending all of its time on that process.

Benchmarking is hard, especially for a whole application. The test focused on end-to-end latency, measured by adding or subtracting money from the opened account and then checking every 20 ms whether the balance was properly updated. This occurs every second, with a timeout of five seconds. If the expected update fails to show up within five seconds five times, the test is stopped.

While testing the latency, the load on the system is increased by letting the heartbeat service produce more messages. The additional load value in the diagrams are only the heartbeat messages. The actual number of messages will be about triple that, because for every heartbeat message generated, there is also a command and a result message. Every 20 seconds, parts of the system are measured for CPU and memory use. The raw data of the performance tests can be found on GitHub or online via the background tab.

End-to-end latency while increasing load with different implementations of the command handler

Which statistic is most important when comparing latencies remains highly debated. While average latency does provide some information, if some of the responses are really slow, the user experience can still suffer a lot, because some users will have to wait a long time to see their results. The 99th percentile latency reveals the time within which 99% of the calls were returned.

Latency Graph

As shown above, there are some spikes, which could be explained by small hiccups in one of the tests bringing the score up. For Rust native, code was added to simulate the behavior of linger.ms for the producer and .poll(ms) for the consumer. The main reason for doing this was that in an earlier, similar kind of test, it would already fail at about 220 messages because it sent every message individually, which was expected. With the refreshed code, you send all available messages, sleep a little, then send all available messages again. This is simple using the FIFO queue, but you do lose some time, because all available messages need to be serialized before being sent.

The two rdkafka clients don’t appear to be different, which makes sense since most of the code is the same, and the advantage of async might only prove relevant for high loads.

x-axis: Additional load (messages/s) | y-axis: Average CPU command handler (% from total)

CPU load command handler

When you look at the CPU load of the command handler, you see a lot of load at the start for Clojure. This makes sense, as the JIT compiler is generating and optimizing bytecode. The async client needs more memory and significantly more CPU than the sync variant. The rdkafka library requires more CPU than both Clojure and Rust native, which might be either because of the C-to-Rust interop or because the Rust rdkafka library is not very efficient. Aside from the initial starting spike for Clojure, all the implementations scale pretty linearly in relation to load.

x-axis: Additional load (messages/s) | y-axis: Average mem command handler (MiB)

Memory needed for the command handler

Here, we see some big differences. While Clojure goes up quickly to 150 MiB, the rdkafka clients only need about 5 MiB, and the native implementation only needs 2 MiB. The big difference is in the small runtime of Rust, which doesn’t use a garbage collector. The Clojure implementation needs relatively little memory. In an earlier test, another JVM implementation with Kotlin and Spring Boot was using about twice as much memory as Clojure.

No Rust on Kafka just yet

I would have liked this blog post to have been a story about Rust being safer, less memory hungry, and a great replacement for Java when working with Kafka, particularly in combination with Schema Registry. The truth is a bit different. The two client libraries work, but the native one in particular might not even be usable, because it’s been almost two years since the last release. If you use LZ4 compression or want to use header values, it’s not possible at the moment. Nevertheless, the rdkafka library is kept up to date, and more functions are being added, such as the interface for the admin client part of librdkafka.

The real concern, however, is the Avro library’s inability to correctly serialize all supported complex types. It would also be nice to generate some additional code from the schemas to more easily deserialize to Rust structs. Fortunately, all the code is open source, so you can copy parts of it as needed and make it Java compatible, at least for enum and fixed types. It’s also possible to build something that fetches the latest schemas from Schema Registry and generates all the structs needed, together with some functions to make serialization possible.

When interoperability with Java and Confluent Schema Registry is not required, there are many other options that can be used for data serialization. One example is Protobuf, which is used about a hundred times more than Avro in Rust based on recent crate downloads.

Pluggable serializers, like those in the Java client, are another option. Because this would be a breaking change, or introduce even more types of clients to the rdkafka library, it would probably be best to do this in a separate library.

Still, using Rust with Kafka instead of Java presents some advantages, such as low memory use and potentially better security, because you need fewer layers to run the application. Of course, there are many more factors to weigh, like speed of development, availability of developers, monitorability, and the language ecosystem.

Ultimately, Rust is not recommended as a replacement for Java applications that use Kafka, but there is hope that the ecosystem will improve over time as Rust becomes more prominent. I personally really enjoyed working with Rust and would like to keep making contributions that help take things forward.

Gerard Klijs is a Java developer at Open Web. He uses his knowledge of Kafka to solve infrastructure and implementation challenges both internally and for clients, the most recent being Payconiq. He previously worked at Axual building a Kafka-based platform for Rabobank, and he is the maintainer of the schema_registry_converter crate, which makes it easy to use the Confluent Schema Registry with Rust.

