Data analytics is often described as one of the biggest challenges associated with big data, but even before that step can happen, data must be ingested and made available to enterprise users. That’s where Apache Kafka comes in. Kafka’s growth is exploding: more than one-third of all Fortune 500 companies use Kafka. These companies include the top ten travel companies, 7 of the top ten banks, 8 of the top ten insurance companies, 9 of the top ten telecom companies, and many more. LinkedIn, Microsoft, and Netflix process "four-comma" message volumes with Kafka, on the order of a trillion messages a day (1,000,000,000,000).
Apache Kafka is a distributed data streaming platform that can publish, subscribe to, store, and process streams of records in real time. It is highly scalable, fast, and fault-tolerant, and it is used both as a messaging system for streaming applications and for processing high volumes of data. Kafka is written in Java and Scala. It is designed to handle data streams from multiple sources and deliver them to multiple consumers. In short, it moves massive amounts of data, not just from point A to B, but from points A to Z and anywhere else you need, all at the same time.
Apache Kafka started out as an internal system developed by LinkedIn, where it has been reported to handle 1.4 trillion messages per day, but it is now an open source data streaming solution with applications for a variety of enterprise needs.
- Apache Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable
- Apache Kafka is designed for distributed high throughput systems
- Apache Kafka tends to work very well as a replacement for a more traditional message broker
- Compared with most traditional messaging systems, Apache Kafka has better throughput, built-in partitioning, replication, and inherent fault tolerance, which makes it a good fit for large-scale message processing applications
- Apache Kafka maintains feeds of messages in topics
- Producers write data to topics and consumers read from topics
- Since Kafka is a distributed system, topics are partitioned and replicated across multiple nodes
- Kafka is very fast and, with replication configured appropriately, is designed to keep running and avoid data loss when individual machines fail
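The relationship between topics, partitions, producers, and consumers described in the list above can be sketched with a small in-memory model. This is only an illustration, not Kafka's client API or storage format: the `ToyTopic` class and its methods are hypothetical names, and CRC32 is used as a stand-in for the hash in Kafka's default partitioner (which uses murmur2 for keyed records). Replication is left out.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a partitioned topic (not Kafka's real storage)."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an append-only list; a record's index is its offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # A stable hash sends every record with the same key to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def produce(self, key, value):
        """Append a record and return its (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def consume(self, partition, offset):
        """Return all records in one partition starting at the given offset."""
        return self.partitions[partition][offset:]


topic = ToyTopic("page-views", num_partitions=3)
for user, page in [("alice", "/home"), ("bob", "/search"), ("alice", "/cart")]:
    partition, offset = topic.produce(user, page)
    print(f"{user} -> partition {partition}, offset {offset}")
```

Because the partition is derived from the key, all of one user's records land in the same partition and keep their relative order, which is the property keyed partitioning is meant to provide.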
Who uses Apache Kafka?
Many large companies that handle a lot of data use Kafka. LinkedIn, where it originated, uses it to track activity data and operational metrics. Twitter uses it as part of Storm to provide a stream processing infrastructure. Square uses Kafka as a bus to move all system events (logs, custom events, metrics, and so on) to various Square data centers, to feed outputs to Splunk and Graphite (dashboards), and to implement Esper-like/CEP alerting systems. It is also used by other companies such as Spotify, Uber, Tumblr, Goldman Sachs, PayPal, Box, Cisco, CloudFlare, Netflix, and many more.
Why is Kafka so Fast?
Kafka relies heavily on the OS kernel to move data around quickly. It relies on the principle of zero copy. Kafka enables you to batch data records into chunks. These batches of data can be seen end to end, from the producer to the file system (the Kafka topic log) to the consumer. Batching allows for more efficient data compression and reduces I/O latency. Kafka writes to an immutable commit log on disk sequentially, and thus avoids random disk access and slow disk seeks. Kafka provides horizontal scale through sharding. It shards a topic log into hundreds, potentially thousands, of partitions spread across thousands of servers. This sharding allows Kafka to handle massive load.
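To see why batching helps compression, the following minimal sketch compares compressing records one at a time with compressing them as a single batch. It uses Python's standard `zlib` module rather than any Kafka code, and the page-view records are made-up sample data, so the exact numbers say nothing about real Kafka workloads.

```python
import json
import zlib


def make_records(n):
    # Hypothetical page-view events; real payloads will compress differently.
    return [
        json.dumps({"user_id": i, "event": "page_view", "page": "/home"}).encode("utf-8")
        for i in range(n)
    ]


def compressed_size_individually(records):
    # Each record is compressed on its own, so shared structure is never reused.
    return sum(len(zlib.compress(r)) for r in records)


def compressed_size_batched(records):
    # One compression pass over the whole batch can exploit repeated field names.
    batch = b"\n".join(records)
    return len(zlib.compress(batch))


records = make_records(1000)
print("raw bytes:            ", sum(len(r) for r in records))
print("compressed one by one:", compressed_size_individually(records))
print("compressed as a batch:", compressed_size_batched(records))
```

Records in the same stream tend to share field names and values, so the compressor finds far more redundancy across a batch than inside any single small record.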
Apache Kafka API:
Apache Kafka is popular with developers because it is easy to pick up and provides a powerful event streaming platform built around four core APIs: Producer, Consumer, Streams, and Connect.
- Producer API: This API lets an application publish a stream of records to one or more topics.
- Consumer API: This API lets an application subscribe to one or more topics and process the stream of records produced to them.
- Streams API: This API consumes input streams from one or more topics and produces output streams to one or more topics, effectively transforming the input streams into output streams.
- Connect API: This API lets you build and run reusable producers and consumers that connect Kafka topics to existing applications or data systems.
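The roles played by the Producer, Consumer, and Streams APIs can be sketched with a toy in-memory broker. None of the names below (`ToyBroker`, `publish`, `poll`, `run_stream_step`) come from Kafka's client libraries; they are hypothetical and only mirror the responsibilities described above.

```python
from collections import defaultdict


class ToyBroker:
    """Holds named topics as append-only lists (illustrative only)."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, record):
        # Producer role: append a record to a topic.
        self.topics[topic].append(record)

    def poll(self, topic, offset):
        # Consumer role: read everything from `offset` onward and
        # return the records together with the next offset to read.
        records = self.topics[topic][offset:]
        return records, offset + len(records)


def run_stream_step(broker, source, sink, transform, offset):
    # Streams role: consume from one topic, transform, produce to another.
    records, new_offset = broker.poll(source, offset)
    for record in records:
        broker.publish(sink, transform(record))
    return new_offset


broker = ToyBroker()
for word in ["kafka", "streams", "connect"]:
    broker.publish("words", word)

next_offset = run_stream_step(broker, "words", "words-upper", str.upper, offset=0)
print(broker.poll("words-upper", 0))
```

The caller keeps track of `next_offset`, so a later call to `run_stream_step` only processes records that arrived after the previous one.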
Need for Apache Kafka:
- Kafka is a unified platform for handling all the real-time data feeds
- Kafka supports low-latency message delivery and provides fault tolerance in the presence of machine failures
- It has the ability to handle a large number of diverse consumers
- Kafka is very fast and can perform around 2 million writes/sec
- Kafka persists all data to disk; in practice, all writes first go to the page cache of the OS (RAM), and the operating system flushes them to disk
- This makes it very efficient to transfer data from page cache to a network socket
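The transfer from page cache to a network socket mentioned above is the kind of work the operating system's `sendfile` facility is built for. The sketch below uses Python's standard `socket.sendfile()`, which relies on `os.sendfile()` where the platform supports it and otherwise falls back to an ordinary read/send loop. It is not Kafka code: the temporary file stands in for a log segment, and `send_file_over_socket` and `demo` are hypothetical helper names.

```python
import os
import socket
import tempfile


def send_file_over_socket(path, sock):
    """Send a file's bytes to a connected socket; returns the byte count."""
    with open(path, "rb") as f:
        # Where supported, the kernel copies file data straight to the socket
        # without passing it through a buffer in this process.
        return sock.sendfile(f)


def demo(payload=b"kafka-log-segment\n" * 100):
    # Write a small stand-in for a log segment file (made-up content).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    sender, receiver = socket.socketpair()
    try:
        sent = send_file_over_socket(path, sender)
        sender.close()  # signals end-of-stream to the receiver
        chunks = []
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
        return sent, b"".join(chunks)
    finally:
        sender.close()
        receiver.close()
        os.remove(path)


sent, received = demo()
print(sent, len(received))
```

The payload is kept small so that it fits in the socket buffer before the receiver starts reading; a larger transfer would need the sender and receiver to run concurrently.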
Apache Kafka – Use Cases:
Kafka can be used in many use cases. Some of them are listed below:
- Metrics: Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
- Twitter: Registered users can read and post tweets, but unregistered users can only read tweets. Twitter uses Storm-Kafka as part of its stream processing infrastructure.
- Netflix: Netflix is an American multinational provider of on-demand Internet streaming media. Netflix uses Kafka for real-time monitoring and event processing.
- Log Aggregation Solution: Kafka can be used across an organization to collect logs from multiple services and make them available in a standard format to multiple consumers.
- LinkedIn: Apache Kafka is used at LinkedIn for activity stream data and operational metrics. The Kafka messaging system supports various LinkedIn products, such as LinkedIn Newsfeed and LinkedIn Today, for online message consumption, in addition to offline analytics systems like Hadoop.
- Stream Processing: Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and write the processed data to a new topic, where it becomes available to users and applications. Kafka’s strong durability is also very useful in the context of stream processing.
- Website activity tracking: The web application sends events such as page views and searches to Kafka, where they become available for real-time processing, dashboards, and offline analytics in Hadoop.
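As a rough illustration of the website activity tracking use case, the sketch below appends page-view and search events to a shared log and lets two independent consumers read the same events for different purposes. A plain Python list stands in for the Kafka topic, and the event fields and the function names `track`, `page_view_counts`, and `search_terms` are assumptions made for this example.

```python
import json
from collections import Counter


def track(log, event_type, **fields):
    """Producer side: append one JSON-encoded event to the activity log."""
    log.append(json.dumps({"type": event_type, **fields}))


def page_view_counts(log):
    """One consumer: aggregate page views per page, as a dashboard might."""
    counts = Counter()
    for raw in log:
        event = json.loads(raw)
        if event["type"] == "page_view":
            counts[event["page"]] += 1
    return counts


def search_terms(log):
    """A second, independent consumer reading the same events."""
    events = (json.loads(raw) for raw in log)
    return [event["query"] for event in events if event["type"] == "search"]


activity = []
track(activity, "page_view", page="/home")
track(activity, "search", query="kafka")
track(activity, "page_view", page="/home")
track(activity, "page_view", page="/pricing")

print(page_view_counts(activity))
print(search_terms(activity))
```

Neither consumer removes events from the log, so adding a third reader (for example, an offline analytics job) would not affect the other two.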