In November 2024, I had the privilege of delivering a big data training and consultation program for one of the leading Fortune 500 companies. The goal was to empower data and business analysts with the skills and knowledge needed to design and build an end-to-end big data pipeline—one capable of processing massive datasets, extracting actionable insights, and driving informed decision-making.
This program focused on three key components:
🔹 Data Ingestion – Loading large datasets into the Hadoop Distributed File System (HDFS) for scalable, centralized storage and access.
🔹 Data Processing – Leveraging Apache Spark to process massive datasets and uncover actionable insights.
🔹 Data Visualization – Transferring processed results to Tableau to create interactive visualizations that support decision-making.
By the end of the program, participants were equipped with practical skills to design and implement robust big data pipelines that could accelerate analysis and deliver meaningful insights for their teams.
Inspired by the success of this training, I decided to take things a step further. My goal was to replicate and extend the workflow in a hands-on environment, building an in-house infrastructure to test, refine, and demonstrate a fully scalable big data pipeline in action.
To illustrate this process, I chose the MovieLens dataset—a rich dataset featuring 32 million movie ratings and 2 million tags across nearly 88,000 movies from over 200,000 users.
The goal is to process this dataset and extract the top 20 movies based on two criteria:
🔹 Average Rating – Ratings range from 0.5 to 5 stars, with higher values indicating better user feedback.
🔹 Rating Count – A movie must have a significant number of ratings to ensure reliability. For instance, a movie with an average rating of 4.5 and more than 5,000 ratings would be considered a “good movie.”
To achieve this, I primarily worked with two files:
🔹 Ratings File (ratings.csv) – This file contains detailed records of user ratings, including userId, movieId, rating, and timestamp.
🔹 Movies File (movies.csv) – This file maps each movieId to its title and genres, allowing us to identify the movie details for the top-rated films.
Together, these files provide the foundation for identifying and interpreting the top 20 movies based on user preferences.
Below is an example of the structure of the ratings.csv file, which captures user feedback on movies:
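The header row is the standard MovieLens schema; the sample values shown here are illustrative rather than rows copied from the download:

```
userId,movieId,rating,timestamp
1,296,5.0,1147880044
1,306,3.5,1147868817
2,110,4.0,1141415820
```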
Similarly, the movies.csv file maps movie IDs to titles and genres. Here’s a snapshot of its structure:
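Again, the rows below are illustrative examples in the MovieLens format:

```
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
318,"Shawshank Redemption, The (1994)",Crime|Drama
```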
With the dataset identified, the next step is to build a scalable infrastructure to process and analyze this data effectively. To achieve this, I designed a Kubernetes cluster that integrates Hadoop, Spark, and Jupyter. This cluster enables distributed data storage, processing, and collaborative development, ensuring scalability and efficiency.
The following is a list of the pods running in this cluster:
Hadoop Cluster on Kubernetes
To handle data ingestion, I set up a Hadoop cluster on Kubernetes, consisting of 1 NameNode and 3 DataNodes. This setup enabled distributed storage and provided scalability, making it ideal for managing large datasets.
The following screenshot shows the Hadoop Web UI, highlighting the nodes in operation within the cluster:
Once the cluster was set up, I ingested the MovieLens dataset into HDFS (Hadoop Distributed File System) to centralize and organize the data. Below is a snapshot of the dataset files stored in HDFS. Notably, the ratings.csv file is over 836 MB and contains 32 million movie ratings, while the movies.csv file is 4 MB and includes a list of 88,000 movies.
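The ingestion itself can be done with the standard hdfs dfs -mkdir / -put commands on the NameNode; as a scriptable alternative, here is a minimal Python sketch using the third-party HdfsCLI package. The WebHDFS endpoint, user, and paths are assumptions for this environment, not the exact values used in the cluster:

```python
# Illustrative ingestion script using the HdfsCLI package ("pip install hdfs").
# Endpoint, user, and paths are placeholders for this environment.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")  # NameNode WebHDFS endpoint
client.makedirs("/movielens")
client.upload("/movielens/ratings.csv", "ratings.csv")  # upload(hdfs_path, local_path)
client.upload("/movielens/movies.csv", "movies.csv")
print(client.list("/movielens"))                         # verify the files landed in HDFS
```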
Spark Cluster on Kubernetes
Next, I set up a Spark cluster on Kubernetes, consisting of 1 Master Node and 3 Worker Nodes, to process the dataset. Spark’s distributed computing capabilities enabled efficient processing of the MovieLens dataset, allowing me to quickly extract meaningful insights.
The Spark Web UI provided a detailed view of the cluster in action. Below is a screenshot of the setup, showing active workers and resource utilization:
Using Spark, I processed the dataset to calculate two metrics for each movie:
🔹 Average Ratings – Highlighting movies with the highest user feedback.
🔹 Rating Count – Measuring how frequently a movie was rated, to ensure the reliability of the average ratings.
The result of this data processing is a CSV file containing the top 20 movies, ranked by a combination of average ratings and popularity. Below is a snapshot of the resulting dataset:
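Below is a minimal PySpark sketch of this processing step. The HDFS paths are assumptions, and the 5,000-rating threshold simply reuses the "good movie" criterion described earlier; the actual job may differ in details:

```python
# Illustrative PySpark job: average rating and rating count per movie,
# filtered to frequently rated titles, joined with movie metadata, top 20.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("movielens-top20").getOrCreate()

ratings = spark.read.csv("hdfs://namenode:9000/movielens/ratings.csv",
                         header=True, inferSchema=True)
movies = spark.read.csv("hdfs://namenode:9000/movielens/movies.csv",
                        header=True, inferSchema=True)

top20 = (ratings.groupBy("movieId")
         .agg(F.avg("rating").alias("avg_rating"),
              F.count("rating").alias("rating_count"))
         .filter(F.col("rating_count") >= 5000)          # reliability threshold (illustrative)
         .join(movies, on="movieId")
         .orderBy(F.desc("avg_rating"), F.desc("rating_count"))
         .limit(20))

top20.coalesce(1).write.csv("hdfs://namenode:9000/movielens/top20",
                            header=True, mode="overwrite")
```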
Jupyter Cluster on Kubernetes
To streamline development and collaboration, I created a Jupyter cluster on Kubernetes. This environment served as a versatile platform for:
🔹 Pulling datasets from HDFS for analysis.
🔹 Processing data using Spark.
🔹 Preparing results for visualization and further analysis.
Jupyter’s interactive interface provided an efficient way to work collaboratively and test various data processing workflows. Below is an example of the Jupyter notebook web interface used in this project:
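Inside such a notebook, the session setup might look like the sketch below; the Kubernetes service names, ports, and resource settings are assumptions for illustration:

```python
# Connect a notebook kernel to the standalone Spark master running in the cluster
# and read the dataset directly from HDFS. Service names and ports are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("movielens-notebook")
         .master("spark://spark-master:7077")      # Spark standalone master service
         .config("spark.executor.memory", "2g")
         .getOrCreate())

ratings = spark.read.csv("hdfs://namenode:9000/movielens/ratings.csv",
                         header=True, inferSchema=True)
ratings.printSchema()
```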
Automated Publishing and Visualization with Tableau
To complete the pipeline, I developed a Python module that automated the conversion of processed Spark results into a Tableau-compatible Hyper file format. This module streamlined the process by:
🔹 Converting Data: Transforming the output from Spark into the Hyper file format required by Tableau.
🔹 Publishing to Tableau: Automatically publishing the Hyper file directly to Tableau, making the data instantly available for visualization.
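As a sketch of what such a module can look like, the snippet below uses Tableau's official tableauhyperapi and tableauserverclient packages; the server URL, credentials, project ID, column names, and sample row are placeholders, and the actual module may be structured differently:

```python
# Illustrative sketch: write the top-20 results to a .hyper extract, then publish it.
# All names, URLs, and credentials below are placeholders.
import tableauserverclient as TSC
from tableauhyperapi import (HyperProcess, Connection, Telemetry, CreateMode,
                             TableDefinition, SqlType, Inserter)

rows = [("Shawshank Redemption, The (1994)", 4.4, 120000)]  # placeholder rows from the Spark output

# 1) Convert the processed results into a Hyper extract
table = TableDefinition("top20_movies", [
    TableDefinition.Column("title", SqlType.text()),
    TableDefinition.Column("avg_rating", SqlType.double()),
    TableDefinition.Column("rating_count", SqlType.big_int()),
])
with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(hyper.endpoint, "top20_movies.hyper",
                    CreateMode.CREATE_AND_REPLACE) as conn:
        conn.catalog.create_table(table)
        with Inserter(conn, table) as inserter:
            inserter.add_rows(rows)
            inserter.execute()

# 2) Publish the extract to Tableau Server / Tableau Cloud
auth = TSC.PersonalAccessTokenAuth("token-name", "token-secret", site_id="my-site")
server = TSC.Server("https://tableau.example.com", use_server_version=True)
with server.auth.sign_in(auth):
    datasource = TSC.DatasourceItem(project_id="project-id")
    server.datasources.publish(datasource, "top20_movies.hyper",
                               mode=TSC.Server.PublishMode.Overwrite)
```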
This automation reduced manual effort and ensured that results were updated and visualized efficiently. Using Tableau, I created interactive visualizations, such as scatter plots, to showcase insights like the top 20 movies based on average ratings and popularity.
Based on the plot, The Shawshank Redemption stands out as a clear outlier, combining an exceptionally high average rating with a substantial number of ratings, solidifying its position as a fan favorite.
Extending the Pipeline to Machine Learning Applications
This pipeline lays a solid foundation for exploring more advanced data science tasks, such as building machine learning models. For instance, it could be extended to create a recommendation system that leverages the MovieLens dataset. By applying collaborative filtering techniques or matrix factorization methods, we could predict user preferences and recommend movies they are likely to enjoy. With the pipeline’s scalability and the integration of Spark’s MLlib, the transition from data processing to machine learning becomes seamless, unlocking more possibilities for personalized insights.
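To make this concrete, here is a minimal sketch of such an extension using Spark MLlib's ALS implementation of collaborative filtering; the HDFS path and hyperparameters are illustrative rather than tuned values:

```python
# Illustrative ALS recommender on the MovieLens ratings, using Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("movielens-als").getOrCreate()
ratings = spark.read.csv("hdfs://namenode:9000/movielens/ratings.csv",
                         header=True, inferSchema=True)

train, test = ratings.randomSplit([0.8, 0.2], seed=42)
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=10, maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(train)

rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(model.transform(test))
print(f"Test RMSE: {rmse:.3f}")

# Top-10 movie recommendations for every user
recommendations = model.recommendForAllUsers(10)
```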
Conclusion
This article provided an overview of the end-to-end implementation of a scalable big data pipeline, integrating Hadoop for storage, Spark for processing, and Tableau for visualization. While this implementation focused on movie ratings, the same principles can be applied across various industries, such as healthcare, finance, and e-commerce, to derive impactful insights from data—paving the way for more data-driven innovation and informed decision-making.
Take Your Skills to the Next Level
If you’re interested in data science, AI, or big data, we offer two Dubai Government-accredited programs designed to provide hands-on experience and industry-relevant skills:
🔹 AI Professional Program – A beginner-friendly course covering Python programming, data analysis, machine learning, deep learning, and real-world applications to prepare you for a career in AI and data science. Learn more here.
🔹 Big Data Professional Program – A hands-on training program in Hadoop, Apache Spark, Kafka, and Tableau, equipping you with the skills to build scalable big data pipelines and tackle real-world data challenges. Learn more here.
#BigData #DataScience #ApacheSpark #Hadoop #MachineLearning #DataVisualization #Tableau #Kubernetes #CloudComputing #AI #DataAnalytics #DataEngineering #TechTraining #PythonProgramming #RecommenderSystems #BusinessIntelligence