How to Build a Scalable Big Data Pipeline with Hadoop, Spark, and Tableau

In November 2024, I had the privilege of delivering a big data training and consultation program for one of the leading Fortune 500 companies. The goal was to empower data and business analysts with the skills and knowledge needed to design and build an end-to-end big data pipeline—one capable of processing massive datasets, extracting actionable insights, and driving informed decision-making.

This program focused on three key components:

🔹 Data Ingestion – Loading large datasets into the Hadoop Distributed File System (HDFS) for scalable, centralized data access.

🔹 Data Processing – Leveraging Apache Spark to process massive datasets and uncover actionable insights.

🔹 Data Visualization – Transferring processed results to Tableau to create interactive visualizations that support decision-making.

By the end of the program, participants were equipped with practical skills to design and implement robust big data pipelines that could accelerate analysis and deliver meaningful insights for their teams.

Inspired by the success of this training, I decided to take things a step further. My goal was to replicate and extend the workflow in a hands-on environment, building an in-house infrastructure to test, refine, and demonstrate a fully scalable big data pipeline in action.

To illustrate this process, I chose the MovieLens dataset—a rich dataset featuring 32 million movie ratings and 2 million tags across nearly 88,000 movies from over 200,000 users.

The goal is to process this dataset and extract the top 20 movies based on two criteria:

🔹 Average Rating – Ratings are given on a 5-star scale, with higher values indicating better user feedback.

🔹 Rating Count – A movie must have a significant number of ratings to ensure reliability. For instance, a movie with an average rating of 4.5 and more than 5,000 ratings would be considered a “good movie.”

To achieve this, I primarily worked with two files:

🔹 Ratings File (ratings.csv) – This file contains detailed records of user ratings, including userId, movieId, rating, and timestamp.

🔹 Movies File (movies.csv) – This file maps each movieId to its title and genres, allowing us to identify the movie details for the top-rated films.

Together, these files provide the foundation for identifying and interpreting the top 20 movies based on user preferences.

Below is an example of the structure of the ratings.csv file, which captures user feedback on movies:

Sample structure of ratings.csv, showing columns for userId, movieId, rating, and timestamp

Similarly, the movies.csv file maps movie IDs to titles and genres. Here’s a snapshot of its structure:

Sample structure of movies.csv, showing columns for movieId, title, and genres, which provide details about each movie's metadata

With the dataset identified, the next step is to build a scalable infrastructure to process and analyze this data effectively. To achieve this, I designed a Kubernetes cluster that integrates Hadoop, Spark, and Jupyter. This cluster enables distributed data storage, processing, and collaborative development, ensuring scalability and efficiency.

The following is a list of the pods running in this cluster:

Snapshot of the Kubernetes cluster showing the active pods, including Hadoop DataNodes, Jupyter clients, and Spark nodes

Hadoop Cluster on Kubernetes

To handle data ingestion, I set up a Hadoop cluster on Kubernetes, consisting of 1 NameNode and 3 DataNodes. This setup enabled distributed storage and provided scalability, making it ideal for managing large datasets.

The following screenshot shows the Hadoop Web UI, highlighting the nodes in operation within the cluster:

Hadoop Web UI showing the NameNode and DataNodes in operation, with details about capacity, usage, and block distribution

Once the cluster was set up, I ingested the MovieLens dataset into HDFS (Hadoop Distributed File System) to centralize and organize the data. Below is a snapshot of the dataset files stored in HDFS. Notably, the ratings.csv file is over 836 MB and contains 32 million movie ratings, while the movies.csv file is about 4 MB and lists nearly 88,000 movies.

HDFS directory showing the MovieLens dataset files, including ratings.csv, movies.csv, and other metadata files.
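For reference, the ingestion step itself can be scripted. Below is a minimal sketch that drives the standard hdfs dfs commands from Python; the local download location and HDFS directory are assumptions, not the exact paths used in this cluster.

```python
import subprocess

# Hypothetical paths; adjust to your local download location and HDFS layout.
LOCAL_DIR = "/data/ml-32m"
HDFS_DIR = "/movielens"

# Create the target directory in HDFS, then copy both CSV files into it.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR], check=True)
for name in ("ratings.csv", "movies.csv"):
    subprocess.run(
        ["hdfs", "dfs", "-put", "-f", f"{LOCAL_DIR}/{name}", HDFS_DIR],
        check=True,
    )
```

The same two commands can of course be run directly from a shell on any node with the Hadoop client installed.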

Spark Cluster on Kubernetes

Next, I set up a Spark cluster on Kubernetes, consisting of 1 Master Node and 3 Worker Nodes, to process the dataset. Spark’s distributed computing capabilities enabled efficient processing of the MovieLens dataset, allowing me to quickly extract meaningful insights.

The Spark Web UI provided a detailed view of the cluster in action. Below is a screenshot of the setup, showing active workers and resource utilization:

Spark Web UI displaying the cluster setup, including the active master node, three worker nodes, allocated resources, running applications, and completed jobs

Using Spark, I processed the dataset to calculate two metrics for each movie:

🔹 Average Ratings – Highlighting movies with the highest user feedback.

🔹 Rating Count – Measuring how frequently a movie was rated, to ensure the reliability of the average ratings.
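To make this step concrete, here is a minimal PySpark sketch of the aggregation. The HDFS paths, the NameNode address, and the 5,000-rating threshold are illustrative assumptions rather than the exact job used in this project.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("top-movies").getOrCreate()

# Hypothetical HDFS locations; adjust the NameNode address and paths as needed.
ratings = spark.read.csv("hdfs://namenode:9000/movielens/ratings.csv",
                         header=True, inferSchema=True)
movies = spark.read.csv("hdfs://namenode:9000/movielens/movies.csv",
                        header=True, inferSchema=True)

# Average rating and rating count per movie, keeping only widely rated titles.
top_movies = (
    ratings.groupBy("movieId")
    .agg(F.avg("rating").alias("avg_rating"),
         F.count("rating").alias("rating_count"))
    .filter(F.col("rating_count") >= 5000)          # reliability threshold
    .join(movies, on="movieId")                     # attach title and genres
    .orderBy(F.col("avg_rating").desc(), F.col("rating_count").desc())
    .limit(20)
)

# Write the result as a single CSV file for downstream visualization.
top_movies.coalesce(1).write.mode("overwrite").csv(
    "hdfs://namenode:9000/movielens/top20_movies", header=True)
```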

The result of this data processing is a CSV file containing the top 20 movies, ranked by a combination of average ratings and popularity. Below is a snapshot of the resulting dataset:

Top 20 movies from the MovieLens dataset, ranked by average rating and rating count, demonstrating the results of Spark processing

Jupyter Cluster on Kubernetes

To streamline development and collaboration, I created a Jupyter cluster on Kubernetes. This environment served as a versatile platform for:

🔹 Pulling datasets from HDFS for analysis.

🔹 Processing data using Spark.

🔹 Preparing results for visualization and further analysis.
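As an illustration, a notebook session in this environment might be pointed at the cluster services roughly as follows. The service names and ports are assumptions based on common defaults, not the exact configuration of this cluster.

```python
from pyspark.sql import SparkSession

# Hypothetical Kubernetes service names; in-cluster DNS typically resolves
# them as <service>.<namespace>.svc.
spark = (
    SparkSession.builder
    .appName("movielens-notebook")
    .master("spark://spark-master:7077")                          # Spark master service
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000")  # HDFS NameNode
    .getOrCreate()
)

# From here, DataFrames can be read directly from HDFS paths such as /movielens.
```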

Jupyter’s interactive interface provided an efficient way to work collaboratively and test various data processing workflows. Below is an example of the Jupyter notebook web interface used in this project:

Jupyter notebook showcasing the integration of Spark for data processing, including loading the MovieLens dataset, analyzing ratings, and calculating insights in an interactive development environment

Automated Publishing and Visualization with Tableau

To complete the pipeline, I developed a Python module that automated the conversion of processed Spark results into a Tableau-compatible Hyper file format. This module streamlined the process by:

🔹 Converting Data: Transforming the Spark output into Tableau's Hyper extract (.hyper) format.

🔹 Publishing to Tableau: Automatically publishing the Hyper file directly to Tableau, making the data instantly available for visualization.
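As a rough illustration of how such a module can be put together with the tableauhyperapi and tableauserverclient packages, here is a hedged sketch; the server URL, token, project ID, and column names are placeholders, not the exact module described above.

```python
import tableauserverclient as TSC
from tableauhyperapi import (Connection, CreateMode, HyperProcess, Inserter,
                             SqlType, TableDefinition, TableName, Telemetry)

HYPER_FILE = "top20_movies.hyper"


def write_hyper(rows):
    """Write (title, avg_rating, rating_count) tuples into a .hyper extract."""
    table = TableDefinition(
        table_name=TableName("Extract", "top_movies"),
        columns=[
            TableDefinition.Column("title", SqlType.text()),
            TableDefinition.Column("avg_rating", SqlType.double()),
            TableDefinition.Column("rating_count", SqlType.big_int()),
        ],
    )
    with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
        with Connection(endpoint=hyper.endpoint, database=HYPER_FILE,
                        create_mode=CreateMode.CREATE_AND_REPLACE) as conn:
            conn.catalog.create_schema(schema=table.table_name.schema_name)
            conn.catalog.create_table(table)
            with Inserter(conn, table) as inserter:
                inserter.add_rows(rows)
                inserter.execute()


def publish_to_tableau():
    """Publish the extract to a Tableau site (placeholder server, token, project)."""
    auth = TSC.PersonalAccessTokenAuth("token-name", "token-secret", site_id="my-site")
    server = TSC.Server("https://tableau.example.com", use_server_version=True)
    with server.auth.sign_in(auth):
        datasource = TSC.DatasourceItem(project_id="project-id")
        server.datasources.publish(datasource, HYPER_FILE,
                                   mode=TSC.Server.PublishMode.Overwrite)
```

In a pipeline like the one above, the rows would come from the top-20 CSV produced by Spark, and the publish step would run automatically after each processing job.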

This automation reduced manual effort and ensured that results were updated and visualized efficiently. Using Tableau, I created interactive visualizations, such as scatter plots, to showcase insights like the top 20 movies based on average ratings and popularity.

Scatter plot in Tableau showcasing the top 20 movies by average rating and popularity, generated and published automatically through the pipeline

Based on the plot, The Shawshank Redemption stands out as a clear outlier, with an exceptionally high average rating and a substantial number of ratings, solidifying its position as a fan favorite.

Extending the Pipeline to Machine Learning Applications

This pipeline lays a solid foundation for exploring more advanced data science tasks, such as building machine learning models. For instance, it could be extended to create a recommendation system that leverages the MovieLens dataset. By applying collaborative filtering techniques or matrix factorization methods, we could predict user preferences and recommend movies they are likely to enjoy. With the pipeline’s scalability and the integration of Spark’s MLlib, the transition from data processing to machine learning becomes seamless, unlocking more possibilities for personalized insights.
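As a hedged sketch of how such an extension might look with Spark MLlib's ALS implementation, consider the following; the HDFS path and hyperparameters are illustrative assumptions.

```python
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("movielens-als").getOrCreate()

# Reuse the ratings already stored in HDFS (hypothetical path).
ratings = (
    spark.read.csv("hdfs://namenode:9000/movielens/ratings.csv",
                   header=True, inferSchema=True)
    .select("userId", "movieId", "rating")
)
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

# Matrix-factorization model trained via alternating least squares.
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=10, maxIter=10, regParam=0.1,
          coldStartStrategy="drop")   # drop rows that cannot be scored
model = als.fit(train)

# Top-10 movie recommendations for every user.
model.recommendForAllUsers(10).show(5, truncate=False)
```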

Conclusion

This article provided an overview of the end-to-end implementation of a scalable big data pipeline, integrating Hadoop for storage, Spark for processing, and Tableau for visualization. While this implementation focused on movie ratings, the same principles can be applied across various industries, such as healthcare, finance, and e-commerce, to derive impactful insights from data—paving the way for more data-driven innovation and informed decision-making.

Take Your Skills to the Next Level

If you’re interested in data science, AI, or big data, we offer two Dubai Government-accredited programs designed to provide hands-on experience and industry-relevant skills:

🔹 AI Professional Program – A beginner-friendly course covering Python programming, data analysis, machine learning, deep learning, and real-world applications to prepare you for a career in AI and data science. Learn more here.

🔹 Big Data Professional Program – A hands-on training program in Hadoop, Apache Spark, Kafka, and Tableau, equipping you with the skills to build scalable big data pipelines and tackle real-world data challenges. Learn more here.

#BigData #DataScience #ApacheSpark #Hadoop #MachineLearning #DataVisualization #Tableau #Kubernetes #CloudComputing #AI #DataAnalytics #DataEngineering #TechTraining #PythonProgramming #RecommenderSystems #BusinessIntelligence

Building Scalable, Open-Source Big Data Infrastructure: Lessons and Best Practices from Abu Dhabi

INTRODUCTION: WHY BUILD AN OPEN-SOURCE BIG DATA INFRASTRUCTURE?

In early September 2024, I had the opportunity to deliver a Big Data training program for a large organization in Abu Dhabi. This project involved building an open-source Big Data infrastructure using Hadoop for storage, Apache Spark for processing, and JupyterLab for data analysis, which allowed us to process millions of records. Through this hands-on experience, I saw firsthand how open-source tools can support an organization’s data processing needs without relying on expensive, proprietary cloud services.

For organizations today, an open-source Big Data infrastructure offers substantial advantages: cost-effectiveness, customization, complete control over data security, and the flexibility to scale and adjust components as needed. This article outlines the core setup we implemented in Abu Dhabi, as well as the enhancements I developed upon returning to Dubai, which demonstrate future possibilities for scalable, open-source Big Data solutions.

KEY COMPONENTS OF THE BIG DATA STACK: THE ABU DHABI SETUP

1. Hadoop HDFS for Reliable, Distributed Storage

We began by deploying the Hadoop Distributed File System (HDFS) on virtual machines (VMs) to manage large datasets. HDFS provides reliable, distributed data storage across multiple nodes with built-in fault tolerance, which is essential for handling large datasets efficiently. Running HDFS on VMs delivered stable, predictable performance, avoiding the added complexity of containerized storage and ensuring consistent resource allocation. This VM-based approach also allowed for horizontal scaling, which was ideal for high-performance data management during the training.

Hadoop HDFS console displaying DataNode status, illustrating the distributed storage setup and capacity utilization in our Big Data infrastructure.

2. Apache Spark for High-Speed Data Processing

For data processing, we deployed Apache Spark, also on VMs. Spark’s in-memory computing capabilities made it highly effective for handling the dataset, enabling fast batch processing and real-time analytics. By integrating Spark with HDFS, we created a robust environment for data analysis, allowing the team in Abu Dhabi to experience the full potential of distributed data processing without vendor lock-in.

3. JupyterLab for Data Science and Analysis

To facilitate data exploration and analysis, we installed JupyterLab, which provided a flexible, interactive environment for users to work with the data. While JupyterLab was a powerful tool for analysis, it was set up without centralized management; each user accessed their own environment. This setup was adequate for the training program but limited in terms of collaborative controls, which are beneficial for larger teams.

ENHANCING THE DESIGN IN DUBAI: KUBERNETES FOR SCALABILITY AND CENTRALIZED MANAGEMENT

Upon returning to Dubai, I identified key improvements to make the Big Data infrastructure more scalable and manageable. In Dubai, I set up a Kubernetes cluster using Cilium as the networking overlay and deployed a Hadoop cluster on top of it. Kubernetes offers container orchestration capabilities that simplify scaling and resource management, making it an ideal solution for future growth.

As I explored further enhancements, I realized that a hybrid approach—deploying Hadoop on VMs or physical hardware while running Spark on Kubernetes—could offer the best of both worlds. This setup allows HDFS to benefit from stable, direct-access storage while leveraging Kubernetes for Spark’s dynamic workloads, where containerization enhances resource allocation and task isolation.
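To illustrate the hybrid idea, a Spark session submitted to Kubernetes can still point at a VM-hosted NameNode roughly as sketched below; the API server address, container image tag, and HDFS endpoint are placeholders, not the actual deployment values.

```python
from pyspark.sql import SparkSession

# Placeholder endpoints: executors run as Kubernetes pods, while HDFS stays
# on the VM-based NameNode.
spark = (
    SparkSession.builder
    .appName("hybrid-spark-on-k8s")
    .master("k8s://https://kubernetes-api.example.com:6443")
    .config("spark.kubernetes.container.image", "spark:3.5.0")  # placeholder image tag
    .config("spark.executor.instances", "3")
    .config("spark.hadoop.fs.defaultFS", "hdfs://hdfs-namenode-vm:9000")
    .getOrCreate()
)
```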

Kubernetes console with Hadoop components deployed as pods, showcasing containerized management of the Hadoop cluster for flexible resource allocation.

Centralized Management with JupyterHub

Prior to this experience, I had deployed JupyterHub, a centralized platform that enables user authentication, role management, and persistent storage. JupyterHub allows multiple users to access Jupyter notebooks within a single, managed environment, making it ideal for collaborative workflows and team-based projects. This centralized setup has proven advantageous for streamlined management and enhanced data accessibility across users.
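For illustration, a few of the relevant settings in a jupyterhub_config.py might look like the following minimal sketch; the user names, URLs, and paths are placeholders and would differ in a real deployment.

```python
# jupyterhub_config.py -- minimal sketch with placeholder users and paths.
c.JupyterHub.bind_url = "http://0.0.0.0:8000"             # hub entry point
c.Authenticator.admin_users = {"admin"}                    # role management
c.Authenticator.allowed_users = {"analyst1", "analyst2"}   # user authentication
c.Spawner.default_url = "/lab"                             # open JupyterLab by default
c.Spawner.notebook_dir = "~/work"                          # per-user persistent workspace
```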

JupyterHub Admin Console showing centralized user management, server control, and collaboration tools for data science workflows.

PRACTICAL USE CASES FOR OPEN-SOURCE BIG DATA INFRASTRUCTURE

The Abu Dhabi project highlighted several high-impact applications for open-source Big Data infrastructure that offer organizations strategic advantages:

  • Real-Time Analytics: Spark’s streaming capabilities make it ideal for real-time analytics, particularly relevant for sectors like finance and IoT (a brief sketch follows this list).
  • Machine Learning at Scale: By integrating JupyterHub and Spark MLlib, we can perform machine learning on large datasets, supporting applications like recommendation systems and predictive modeling.
  • Data Lake Management: With HDFS, organizations can create scalable, cost-effective data lakes to store and manage structured and unstructured data for future analysis.
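As a small illustration of the first point, a Structured Streaming job can aggregate an event stream in near real time. The sketch below uses Spark's built-in rate source purely for demonstration; a production pipeline would typically read from Kafka or a similar broker.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows for testing;
# swap in a Kafka source for real event data.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Average the values over 10-second tumbling windows.
windowed = (events
            .groupBy(window("timestamp", "10 seconds"))
            .agg(avg("value").alias("avg_value")))

query = (windowed.writeStream
         .outputMode("update")
         .format("console")
         .option("truncate", False)
         .start())
query.awaitTermination()
```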

These applications illustrate the versatility and power of open-source tools in addressing complex, large-scale data challenges without the high costs associated with proprietary software.

CONCLUSION: LEADING WITH OPEN-SOURCE BIG DATA SOLUTIONS

The Abu Dhabi project underscored the potential of open-source Big Data infrastructure. Using Hadoop for storage, Spark for processing, and JupyterLab for analysis, we built a robust and adaptable infrastructure tailored to organizational needs. Returning to Dubai opened new possibilities, and by enhancing the setup with Kubernetes and JupyterHub for centralized management, we’re now exploring even greater scalability and operational efficiency in data management and processing.

For organizations exploring Big Data infrastructure, open-source solutions offer the scalability, customization, and control essential for gaining a competitive edge in today’s data-driven landscape. If your organization is interested in building a robust, open-source Big Data solution tailored to your needs, feel free to connect with me. You can reach me at ahmed@innosoftgulf.com to discuss the possibilities.