End-to-End Data Pipeline (S3 to Delta Tables in Databricks)

Summary

Kinship is a software company that collects canine data through sensors, processes it, and feeds it to its machine learning algorithms for further insights. Fetching data older than 90 days for those algorithms had become a challenge. Folio3 augmented Kinship’s staff with an experienced Data Engineering team for faster deployment, fewer system errors, and smooth integration on the cloud.

About the Customer

Kinship, a division of Mars Petcare, is a platform for brands building the future of pet care, combining insights, products, and services to help people be the best pet parents they can be.

Team composition: 3 members
Client name: Kinship
Expertise used: Data Engineering
Duration: 6 months
Services provided: Data Pipelining, Data Ingestion, Data Parallelisation, PySpark, Team Augmentation
Country: US
Industry: Embedded Software Products

Understanding the Challenge

Kinship (petinsight) collected raw canine accelerometer data from IoT devices. Fetching a single dog’s data for a short period was possible, but scaling to thousands of animals over months or even years was a challenge. Accessing data older than 90 days could take indefinitely long because there was no easy mechanism to dynamically convert the log files into PetInsightTimeData objects.

Solution

Folio3 AI augmented Kinship’s staff with an experienced Data Engineering team that oversaw all aspects of data ingestion and data parallelization and delivered milestones on time with minimal system and logical errors. The resulting solution consisted of multiple steps, which ran concurrently for multiple dogs using multi-threading.
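As a rough illustration of running the steps for many dogs at once, the sketch below fans a per-dog function out over a thread pool from the Python standard library. The names run_for_dogs and process_dog are hypothetical, and the case study does not say which threading mechanism was actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

R = TypeVar("R")


def run_for_dogs(
    dog_ids: Iterable[str],
    process_dog: Callable[[str], R],  # hypothetical per-dog pipeline function
    max_workers: int = 8,
) -> Dict[str, R]:
    """Run process_dog for every dog ID on a thread pool; return results keyed by ID."""
    ids = list(dog_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order; an exception raised in a worker
        # is re-raised here when its result is collected.
        results = list(pool.map(process_dog, ids))
    return dict(zip(ids, results))
```

Threads suit this kind of work because most of the time is spent waiting on network calls rather than on CPU.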

Avoiding Data Duplication

The pulled log file paths were checked against the log file paths already stored in the Delta table. Only paths not yet present were saved to the Delta table, which avoided data duplication.
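The duplicate check can be pictured as a set difference that keeps only unseen paths. The sketch below uses plain Python lists for clarity; the function name new_log_paths is hypothetical, and how the existing paths were read from the Delta table is not described in the case study.

```python
from typing import Iterable, List


def new_log_paths(pulled_paths: Iterable[str], existing_paths: Iterable[str]) -> List[str]:
    """Return pulled paths that are not already stored, without repeats, in input order."""
    seen = set(existing_paths)
    fresh: List[str] = []
    for path in pulled_paths:
        if path not in seen:
            seen.add(path)  # also drops repeats within the pulled batch itself
            fresh.append(path)
    return fresh
```

On Spark DataFrames, a comparable result could be obtained with a left anti join between the pulled paths and the stored paths.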

Pulling data from S3

Data for the newly added file paths was pulled from the S3 bucket. Each file path returned a PetInsightTimeData (PITD) object.
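A minimal sketch of the download step follows, assuming a boto3 S3 client is passed in as s3_client. The parse callable stands in for the conversion of raw log bytes into a PITD object, which the case study does not describe; fetch_logs and the bucket name are likewise hypothetical.

```python
from typing import Any, Callable, Dict, Iterable


def fetch_logs(
    s3_client: Any,                 # e.g. boto3.client("s3")
    bucket: str,                    # hypothetical bucket name
    keys: Iterable[str],
    parse: Callable[[bytes], Any],  # hypothetical bytes -> PITD converter
) -> Dict[str, Any]:
    """Download each key from the bucket and parse its bytes; return {key: parsed object}."""
    parsed: Dict[str, Any] = {}
    for key in keys:
        response = s3_client.get_object(Bucket=bucket, Key=key)
        raw = response["Body"].read()  # Body is a streaming object; read() returns bytes
        parsed[key] = parse(raw)
    return parsed
```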

Querying Log File Paths from DynamoDB

Based on the parameters (dog’s ID, start date, and end date), the log file paths of the S3 files were pulled from DynamoDB.
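The lookup might resemble the sketch below, which uses the low-level boto3 DynamoDB client's query call and follows pagination. The table layout is an assumption: a partition key dog_id, a sort key log_date holding ISO-8601 date strings, and an attribute s3_path. None of these names come from the case study.

```python
from typing import Any, List


def query_log_paths(
    dynamodb_client: Any,  # e.g. boto3.client("dynamodb")
    table_name: str,       # hypothetical table name
    dog_id: str,
    start_date: str,       # ISO-8601 strings compare correctly as plain text
    end_date: str,
) -> List[str]:
    """Return the S3 log paths for one dog between two dates, across all result pages."""
    params = {
        "TableName": table_name,
        # Assumed schema: dog_id (partition key), log_date (sort key).
        "KeyConditionExpression": "dog_id = :dog AND log_date BETWEEN :start AND :end",
        "ExpressionAttributeValues": {
            ":dog": {"S": dog_id},
            ":start": {"S": start_date},
            ":end": {"S": end_date},
        },
    }
    paths: List[str] = []
    while True:
        page = dynamodb_client.query(**params)
        paths.extend(item["s3_path"]["S"] for item in page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return paths
        # More results remain; continue from where the previous page stopped.
        params["ExclusiveStartKey"] = last_key
```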

PITD to Accel Data Conversion

Each PITD object contained multiple pandas DataFrames. The DataFrames from all the PITD objects were traversed and concatenated in batches.
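Batching the concatenation keeps each intermediate result bounded in size. The sketch below shows only the batching logic and takes the concatenation function as a parameter; concat_in_batches and the batch size are hypothetical, and the batch size actually used is not stated in the case study.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def concat_in_batches(
    frames: Iterable[T],
    batch_size: int,
    concat: Callable[[List[T]], T],
) -> Iterator[T]:
    """Yield one concatenated object per group of up to batch_size input frames."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(frames)
    while True:
        batch = list(islice(it, batch_size))  # take the next group of frames
        if not batch:
            return
        yield concat(batch)
```

With pandas, the concat argument could be `lambda batch: pd.concat(batch, ignore_index=True)`, assuming pandas is imported as pd.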

Data Engineering Service

Folio3’s team of Data Engineers made sure all the steps were integrated smoothly according to Kinship's consumer requirements with minimum errors and maximum efficiency.

Deployed on Databricks

All the steps are deployed and functional on the Databricks platform, where they run inside Databricks workflows.

Dataframes Dumped to Delta Table

The concatenated DataFrames were appended to the Delta table for further use by data scientists.
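An append of this kind could be written as below, assuming an active SparkSession with Delta Lake support (as on Databricks) is passed in as spark. The function name append_to_delta and the table name are hypothetical.

```python
from typing import Any


def append_to_delta(spark: Any, pandas_df: Any, table_name: str) -> None:
    """Convert a pandas DataFrame to a Spark DataFrame and append it to a Delta table."""
    spark_df = spark.createDataFrame(pandas_df)
    # mode("append") adds rows and leaves the existing table contents in place.
    spark_df.write.format("delta").mode("append").saveAsTable(table_name)
```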

Result

With the data pipeline developed and integrated by Folio3’s Data Engineering team, all project challenges were met: years of data for a number of animals could be fetched within minutes instead of hours.