End-to-End Data Pipeline (S3 to Delta Tables in Databricks)
Summary
Kinship is a software company that collects canine data through sensors, processes it, and feeds it to machine learning algorithms for further insights. The company had difficulty fetching data older than 90 days for those algorithms. Folio3 augmented Kinship's staff with an experienced Data Engineering team to achieve faster deployment, fewer system errors, and smooth integration in the cloud.
About the Customer
Kinship, a division of Mars Petcare, is a platform for brands building the future of pet care, combining insights, products, and services to help people be the best pet parents they can be.

Team composition: 3 members
Client name: Kinship
Expertise used: Data Engineering
Duration: 6 months
Services provided: Data Pipelining, Data Ingestion, Data Parallelisation, PySpark, Team Augmentation
Country: US
Industry: Embedded Software Products
Solution
Folio3 AI augmented Kinship's staff with an experienced Data Engineering team that oversaw all aspects of data ingestion and data parallelization and delivered milestones on time with minimal system and logical errors. The resulting solution consisted of multiple steps, which ran for multiple dogs at the same time using multi-threading.
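As a rough illustration of running the steps for several dogs at once, the sketch below uses Python's standard thread pool. The names `process_dog` and `process_dogs`, the date format, and the worker count are assumptions made for this example; the case study does not describe the actual code.

```python
from concurrent.futures import ThreadPoolExecutor


def process_dog(dog_id, start_date, end_date):
    """Hypothetical stand-in for the per-dog steps (query, pull, convert, append)."""
    return f"{dog_id}: processed {start_date} to {end_date}"


def process_dogs(dog_ids, start_date, end_date, max_workers=8):
    """Run the per-dog work for many dogs concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of dog_ids, not completion order.
        return list(
            pool.map(lambda dog_id: process_dog(dog_id, start_date, end_date), dog_ids)
        )
```

Threads suit this kind of workload when most of the time is spent waiting on network calls rather than on computation.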
Avoiding Data Duplication
The pulled log paths were checked against the existing log file paths in the Delta table. Only distinct paths were saved back to the Delta table to avoid data duplication.
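The duplicate check can be pictured with a small plain-Python sketch. The function name `select_new_paths` and the example paths are hypothetical; on Spark, a similar effect could plausibly be achieved with `distinct()` followed by a left anti join against the existing Delta table, but the source does not say which approach was used.

```python
def select_new_paths(pulled_paths, existing_paths):
    """Return pulled paths not yet recorded, deduplicated, in first-seen order."""
    seen = set(existing_paths)
    new_paths = []
    for path in pulled_paths:
        if path not in seen:
            # Remember the path so repeats within the pulled list are skipped too.
            seen.add(path)
            new_paths.append(path)
    return new_paths
```

Only the returned paths would then be written back to the table and passed on to the download step.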

Pulling Data from S3
Data for the newly added file paths was pulled from the S3 bucket. Each file path returned a PITD (PetInsight Time Data) object.
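A minimal sketch of the download step follows, assuming paths of the form `s3://bucket/key` and a boto3 S3 client passed in by the caller. The helper names are hypothetical, and decoding the bytes into a PITD object is left out because the source does not describe that format.

```python
def split_s3_path(path):
    """Split 's3://bucket/key' into (bucket, key)."""
    prefix = "s3://"
    if not path.startswith(prefix):
        raise ValueError(f"not an S3 path: {path!r}")
    bucket, _, key = path[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"missing bucket or key: {path!r}")
    return bucket, key


def fetch_log_bytes(s3_client, path):
    """Download one log file; s3_client is assumed to be a boto3 S3 client."""
    bucket, key = split_s3_path(path)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # The response body is a stream; read() returns the raw bytes.
    return response["Body"].read()
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise with a stand-in client.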



Querying Log File Paths from DynamoDB
Based on the parameters (dog's ID, start date, and end date), log file paths of S3 files were pulled from DynamoDB.
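The lookup might resemble the sketch below, which builds arguments for a boto3 DynamoDB client `query` call and follows pagination. The table layout is an assumption: `dog_id` as partition key, `log_date` as a sort key holding ISO date strings, and an `s3_path` attribute are all hypothetical names, since the source does not describe the schema.

```python
def build_log_path_query(table_name, dog_id, start_date, end_date):
    """Build keyword arguments for a boto3 DynamoDB client query() call."""
    return {
        "TableName": table_name,
        # Assumed schema: dog_id is the partition key, log_date the sort key.
        "KeyConditionExpression": "dog_id = :dog AND log_date BETWEEN :start AND :end",
        "ExpressionAttributeValues": {
            ":dog": {"S": dog_id},
            ":start": {"S": start_date},
            ":end": {"S": end_date},
        },
    }


def query_log_paths(dynamodb_client, table_name, dog_id, start_date, end_date):
    """Collect log file paths for one dog, following result pages."""
    params = build_log_path_query(table_name, dog_id, start_date, end_date)
    paths = []
    while True:
        page = dynamodb_client.query(**params)
        paths.extend(item["s3_path"]["S"] for item in page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return paths
        # More results remain; resume from where the previous page ended.
        params["ExclusiveStartKey"] = last_key
```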

PITD to Accel Data Conversion
Each PITD object contained multiple pandas DataFrames. The DataFrames from all PITD objects were traversed and concatenated in batches.
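The batching idea can be sketched without committing to the PITD internals. In the example below, `get_frames` (how to list the DataFrames inside one PITD object) and `concat` (how to join one batch) are supplied by the caller; both, along with the function names, are assumptions made for illustration.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def concat_in_batches(pitd_objects, batch_size, get_frames, concat):
    """Flatten the frames of all PITD objects and concatenate them batch by batch."""
    all_frames = (frame for pitd in pitd_objects for frame in get_frames(pitd))
    return [concat(batch) for batch in iter_batches(all_frames, batch_size)]
```

With pandas, `concat` could be `lambda batch: pd.concat(batch, ignore_index=True)`. Concatenating in batches rather than all at once keeps the peak memory of any single concatenation bounded by the batch size.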
Data Engineering Service
Folio3’s team of Data Engineers made sure all the steps were integrated smoothly according to Kinship's consumer requirements with minimum errors and maximum efficiency.

Deployed on Databricks
All steps are deployed and functional on the Databricks platform, where they run inside Databricks Workflows.


Dataframes Dumped to Delta Table
The concatenated DataFrames were appended to the Delta table for further use by data scientists.
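On Databricks, an append of this kind is commonly written with the Spark DataFrame writer. The sketch below assumes a `SparkSession` is passed in (Databricks notebooks provide one as `spark`) and that the data arrives as a pandas DataFrame; the function name and any table name passed to it are hypothetical.

```python
def append_to_delta(spark, pandas_df, table_name):
    """Append a pandas DataFrame to a Delta table through a SparkSession."""
    # Convert to a Spark DataFrame so the Delta writer can be used.
    spark_df = spark.createDataFrame(pandas_df)
    # mode("append") adds rows and leaves the existing table contents in place.
    spark_df.write.format("delta").mode("append").saveAsTable(table_name)
```

Append mode matches the text above: existing rows stay untouched, which is why the earlier duplicate check on file paths matters.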
Result
With the data pipeline developed and integrated by Folio3's Data Engineering team, all project challenges were met: years of data on a number of animals could be fetched within minutes instead of hours.