An automated data pipeline that extracts Spotify listening data into AWS cloud infrastructure for scheduled analytics and personalized music insights, built with modern data engineering practices.
It started with a simple question: "Why do I keep listening to the same songs on repeat?"
As someone who spends hours every day with music playing in the background—while coding, studying, or just relaxing—I began noticing patterns in my listening habits. Some weeks I'd be obsessed with indie rock, others with classical piano pieces. But I had no way to quantify these trends or understand what was driving my musical choices.
This curiosity about my own listening behavior led me to build SpotiFriend, an automated data pipeline that extracts, processes, and analyzes Spotify listening data using AWS cloud services. What began as a personal project to understand my music taste evolved into a comprehensive data engineering solution that demonstrates real-world ETL processes, cloud architecture, and analytics automation.
While the goal was to gain insights into my musical preferences, the project became a perfect showcase of modern data engineering practices—from API integration and data extraction to cloud storage, processing, and visualization.
Spotify's annual Wrapped feature is great, but it's limited. It only shows you data once a year, and it doesn't let you dive deep into the patterns. I wanted to answer questions like:
How does my music taste change throughout the day?
Which artists am I discovering vs. returning to?
Do I listen to different genres on weekdays vs. weekends?
How often do I skip songs, and what patterns emerge?
The Spotify API provides access to this data, but only as raw, deeply nested JSON. To get meaningful insights, I needed to build a system that could automatically collect, process, and analyze this data on a regular basis. That's when I decided to create a complete data pipeline using AWS services.
To make SpotiFriend scalable and automated, I designed a cloud-native architecture that could handle data extraction, processing, and analysis with minimal manual intervention. Here's how I approached it:
I built a Python-based ETL system using the Spotify Web API with PKCE authentication for secure access to user data. The system extracts multiple data types:
The extraction process handles both free and premium account features, with graceful degradation for premium-only data points.
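As a rough illustration of graceful degradation, the sketch below runs a set of extractor callables and records `None` for any data point the account cannot access, instead of aborting the whole run. The `PremiumRequiredError` class, the `extract_all` function, and the extractor names are hypothetical and exist only for this example; a real Spotify client would signal the condition through its own exception type.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("spotifriend.extract")


class PremiumRequiredError(Exception):
    """Hypothetical error meaning 'this endpoint needs a premium account'."""


def extract_all(extractors: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Run every extractor; store None for data points the account cannot access."""
    results: Dict[str, Any] = {}
    for name, fetch in extractors.items():
        try:
            results[name] = fetch()
        except PremiumRequiredError:
            # Degrade gracefully: keep the run alive and mark the gap explicitly.
            logger.warning("Skipping premium-only data point: %s", name)
            results[name] = None
    return results


def premium_only_stub() -> Any:
    """Stand-in for an extractor that a free account is not allowed to call."""
    raise PremiumRequiredError("premium account required")
```

Calling `extract_all({"top_tracks": lambda: ["track-1"], "playback_state": premium_only_stub})` would return the top tracks and `None` for the playback state, so downstream steps can tell "no access" apart from "no data".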
I implemented a serverless architecture using AWS services for scalability and cost-effectiveness:
The processing layer transforms the data and checks its quality:
I used Pandas for data manipulation and Boto3 for AWS service integration, ensuring the pipeline could handle both small and large datasets efficiently.
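The transformation step can be sketched roughly as follows. This version deliberately uses only the standard library rather than Pandas so that it runs without extra dependencies. The input shape follows Spotify's "recently played" response (items carrying a `track` object and a `played_at` timestamp), while the choice of output fields and the `flatten_play` and `transform` names are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List


def flatten_play(item: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one nested 'recently played' item into a flat, analysis-friendly row."""
    track = item["track"]
    return {
        "played_at": item["played_at"],
        "track_id": track["id"],
        "track_name": track["name"],
        "artists": [artist["name"] for artist in track.get("artists", [])],
        "duration_ms": track.get("duration_ms"),
    }


def transform(items: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten each item, drop duplicate plays, and sort chronologically."""
    seen = set()
    rows: List[Dict[str, Any]] = []
    for item in items:
        row = flatten_play(item)
        key = (row["played_at"], row["track_id"])
        if key in seen:
            continue  # the same play can show up in overlapping daily pulls
        seen.add(key)
        rows.append(row)
    # ISO 8601 timestamps in a single format sort correctly as plain strings.
    return sorted(rows, key=lambda row: row["played_at"])
```

Deduplicating on the pair of timestamp and track ID keeps repeated runs idempotent, which matters when consecutive pulls overlap.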
For insights and reporting, I implemented:
Here's the complete data engineering architecture I designed for SpotiFriend:
Spotify Web API: User Authentication (PKCE)
Python ETL Script: Spotipy Library
AWS Lambda (Serverless): Data Transformation
AWS S3 Data Lake: Partitioned JSONL Format
AWS Glue: Schema Management
Amazon Athena (SQL)
Amazon QuickSight
EventBridge scheduler triggers daily data collection runs
PKCE authentication ensures secure API access
Serverless architecture scales automatically with demand
The core of SpotiFriend is a robust ETL pipeline that handles data extraction, transformation, and loading with proper error handling and monitoring:
Using the Spotify Web API with PKCE authentication, the system extracts:
The processing layer handles data transformation and quality assurance:
Data is stored in AWS S3 with a well-organized structure:
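The exact bucket layout is not reproduced here. Purely as an illustration of partitioned JSONL storage, the sketch below builds a Hive-style date-partitioned object key and serializes rows as one JSON object per line. The `year=/month=/day=` layout, the dataset name, and the helper names are assumptions; the upload uses boto3's `put_object` call and is imported lazily so the pure helpers run without AWS credentials.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Iterable


def partition_key(dataset: str, run_time: datetime) -> str:
    """Build a date-partitioned key; run_time should be timezone-aware."""
    t = run_time.astimezone(timezone.utc)
    return (
        f"{dataset}/year={t:%Y}/month={t:%m}/day={t:%d}/"
        f"{dataset}_{t:%Y%m%dT%H%M%SZ}.jsonl"
    )


def to_jsonl(rows: Iterable[Dict[str, Any]]) -> bytes:
    """Serialize rows as JSON Lines: one JSON object per line, UTF-8 encoded."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows).encode("utf-8")


def upload(bucket: str, key: str, body: bytes) -> None:
    """Write the object to S3 (requires boto3 and AWS credentials)."""
    import boto3  # lazy import keeps the helpers above dependency-free

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
```

Date partitions in the key let a query engine skip whole days of data when a query filters on year, month, or day.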
The analytics layer provides actionable insights:
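As one hedged example of the kind of question the analytics layer could answer, the sketch below builds an Athena SQL query that counts plays per hour of the day and submits it through boto3's `start_query_execution`. The table and column names (`recently_played`, `played_at`), the database name, and the output location are assumptions, and the hour is computed in UTC rather than local time.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def plays_by_hour_sql(table: str) -> str:
    """Build a query counting plays per hour of day (UTC) for the given table."""
    if not _IDENTIFIER.match(table):
        # Table names are interpolated into the SQL, so reject anything unusual.
        raise ValueError(f"unsafe table name: {table!r}")
    return (
        "SELECT hour(from_iso8601_timestamp(played_at)) AS hour_of_day, "
        "count(*) AS plays "
        f"FROM {table} "
        "GROUP BY 1 ORDER BY 1"
    )


def run_query(sql: str, database: str, output_location: str) -> str:
    """Submit the query to Athena and return its execution ID (requires boto3)."""
    import boto3  # lazy import so the SQL builder works without AWS installed

    response = boto3.client("athena").start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

A call such as `run_query(plays_by_hour_sql("recently_played"), "spotifriend", "s3://example-bucket/athena-results/")` would start the query; both names in that call are placeholders.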
Here are the key technical challenges I solved and the solutions I implemented:
I implemented PKCE (Proof Key for Code Exchange) authentication for secure Spotify API access. Because PKCE does not require a client secret, nothing sensitive has to be embedded in the application, which lets it run securely in production environments.
I used Amazon EventBridge to schedule daily data collection runs, ensuring fresh insights without manual intervention. The system handles failures gracefully with retry logic and error notifications.
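A minimal sketch of the retry logic, assuming exponential backoff and a Lambda handler invoked by a daily EventBridge schedule (for example a `rate(1 day)` expression). The `with_retries` helper, the `collect_listening_data` placeholder, and the default attempt count are hypothetical; error notifications are left out and would typically hook in where the final failure is logged.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("spotifriend.pipeline")


def with_retries(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                logger.exception("Giving up after %d attempts", attempts)
                raise
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            logger.warning("Attempt %d failed; retrying in %.1fs", attempt, delay)
            sleep(delay)
    raise ValueError("attempts must be at least 1")


def collect_listening_data() -> int:
    """Placeholder for the real extract-and-load step; returns rows written."""
    return 0


def lambda_handler(event, context):
    """Entry point that a scheduled EventBridge rule could invoke once a day."""
    rows = with_retries(collect_listening_data)
    return {"status": "ok", "rows": rows}
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can record the delays instead of actually waiting.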
I designed the system to handle both individual user data and potential multi-user scenarios. The serverless architecture scales automatically with demand and incurs cost only for actual usage.
I implemented error handling, logging, and monitoring throughout the pipeline. I used environment variables for configuration management and followed AWS best practices for security and performance.
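Configuration through environment variables can be sketched as below. The variable names (`SPOTIFY_CLIENT_ID`, `S3_BUCKET`, `LOG_LEVEL`) and the `PipelineConfig` class are assumptions chosen for illustration, not the project's actual settings. Failing fast on missing values surfaces misconfiguration at startup instead of halfway through a run.

```python
import logging
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PipelineConfig:
    spotify_client_id: str
    s3_bucket: str
    log_level: str = "INFO"

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "PipelineConfig":
        """Read settings from the environment and fail fast if any are missing."""
        required = ("SPOTIFY_CLIENT_ID", "S3_BUCKET")
        missing = [name for name in required if not env.get(name)]
        if missing:
            raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
        return cls(
            spotify_client_id=env["SPOTIFY_CLIENT_ID"],
            s3_bucket=env["S3_BUCKET"],
            log_level=env.get("LOG_LEVEL", "INFO").upper(),
        )


def configure_logging(config: PipelineConfig) -> None:
    """Set up root logging; unknown level names fall back to INFO."""
    logging.basicConfig(
        level=getattr(logging, config.log_level, logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
```

Accepting the environment as a mapping argument means tests can pass a plain dictionary instead of modifying `os.environ`.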
SpotiFriend has the foundation for several exciting enhancements that would further demonstrate advanced data engineering capabilities:
Implement recommendation systems using Amazon SageMaker to predict user preferences and suggest new music based on listening patterns and audio features.
Scale the system to handle multiple users with proper data isolation, user management, and personalized dashboards for each user.
Implement real-time data streaming using Amazon Kinesis to process listening events as they happen, enabling live analytics and instant insights.
Develop a mobile app that provides personalized music insights, trend notifications, and social features for sharing music discoveries.
SpotiFriend started as a way to understand my music taste, but it became much more than that. It's a comprehensive demonstration of modern data engineering practices, from API integration and cloud architecture to automated processing and analytics.
The project showcases my ability to:
While the insights about my music taste were interesting, the real value came from building a system that could reliably collect, process, and analyze data at scale. It's a perfect example of how data engineering can turn curiosity into actionable insights.
SpotiFriend represents my approach to data engineering: start with a real problem, design a scalable solution, implement it with best practices, and continuously improve based on results. It's not just about the technology—it's about using data to understand the world better.