Writing tests for apache beam sliding window based streaming pipeline with late event triggers

I have been working with apache beam for a few years now, from developing the most basic batch pipelines, all the way to real-time, stateful streaming analytics pipelines. In this post, I will show you to control the watermarks and test your sliding window logic while also considering late events. If you are new to windowing in apache beam, I suggest you quickly go through the concepts before continuing with this post. Apache Beam Sliding Window I have developed a simple pipeline, that consumes a […]

Running the LLaMA AI Language Model on a Laptop

LLaMA is an open source large language model built by FAIR team at Meta AI and released to the public. It is quite small in size compared to other similar models like GPT-3, thus with the potential to be run on everyday hardware, atleast for fun, like I did. It is impressive how complex AI models like these can be packaged into files of few gigabytes and can be launched anywhere.The trained model of LLaMA was only made available to researchers, but it leaked online. The data used to train the mod […]

Infrastructure as Code On Google Cloud Platform using Terraform

In the previous posts, we covered development aspects of projects. In this and the coming posts, we will cover how developers can deploy their projects using various DevOps practices such as IaC and CI/CD.Terraform is a tool designed for provisiong infrastructure on various platforms using the Infrastructure as Code (IaC) approach. In this approach, you define configuration files to create resources, on cloud platforms, instead of manually creating them either through the UI or through respective […]

Source Code Analysis Using SonarQube

When it comes to building an application, we always ensure we develop it according to the business need and also write unit tests to check if what we built works as expected.But we developers often tend to not check whether the test cases cover every line of code they have written and also if the source code follows development and security best practices.This is where SonarQube drops in. SonarQube is an open source code analysis tool to inspect code quality, detect bugs, code smells and security […]

Spring Kafka Unit Test with Embedded Kafka

In our previous post, we saw how we can build a telemetry receiver using spring boot and publish the message to kafka using spring kafka.We covered most of the development in there and in this post, we will be concentrating more on writing unit tests for the same, by using an embedded kafka broker within the application.JUnit 5 Jupiter will be our choice of library for unit tests along with Mockito and spring-kafka-test dependency.Test PropertiesTo begin with, we need to provide a properties file […]

Spring Boot based Telemetry Data Receiver API with Kafka Producer

In the bigdata world, a lot of data is generated through IoT sensors and telemetries emitted from various apps, websites and devices. In this post, we will develop a spring kafka based REST API that will act as a cloud receiving endpoint for the telemetry data.We will consume data in json format with abbreviation, convert it to a data transfer object (DTO) while expanding the abbrevated fields and push the data to a kafka topic.The abbreviation helps reduce the amout of data sent from edge system […]

Streaming Data Deduplication using Apache Spark State Store

Data deduplication is a process of eliminating repeated data, which helps in reduced network transfer and makes better utilization of storage. In the long term, it also helps reduce the size of data that needs to be processed to gain insights. Repeated data is usually sent by IoT sensors, website telemetry and other systems, that are usually designed to send data at regular intervals and not when an event occurs (a state change). In most cases, we do not need the repeated data and would only be i […]

Spark Structured Streaming Word Count

Word counts are the hello worlds of bigdata. In this post let us write a streaming word count program using spark structured streaming. As seen in our spark essential guide, spark structured streaming is a streaming capability coming from SparkSQL component, not the spark streaming component. This will be built on top of the batch word count program we already wrote for the spark core. So, head over to that post first, understand it and proceed with the below. We will read the data from the kafka […]

Apache Kafka: The Essential Guide

Apache Kafka is the most popular distributed streaming platform in the bigdata landscape. Initially built at LinkedIn for website telemetry, it is now used by the big names in the industry.Kafka, based on publish-subscribe model, is fast and offers very high throughput, fault tolerance and horizontal scalability. It follows a distributed architecture and thus, runs as a cluster.Kafka is written in scala and hence offers seamless integration with multiple frameworks like Apache Spark, Flink, Beam […]

Apache Spark: The Essential Guide

We have seen how we can install Apache Spark and also how to code a basic word count program. Now it is time to dig deeper into the framework and understand how things work, while also going through certain jargons/terminologies associated with it.Components of Apache SparkApache Spark has a number of componentes that provide various features. Below is a list with details about them.Spark CoreThe core component provides the execution engine of Spark. All other components are built on top of the c […]