SpotiFriend

Automated data pipeline that extracts Spotify listening data into AWS cloud infrastructure for scheduled daily analytics and personalized music insights, built with modern data engineering practices.

Python, Spotify API, AWS S3, AWS Lambda, AWS Glue, Amazon Athena, Amazon QuickSight, EventBridge, Pandas, Boto3


Introduction: From Music Discovery to Data Discovery

It started with a simple question: "Why do I keep listening to the same songs on repeat?"

As someone who spends hours every day with music playing in the background—while coding, studying, or just relaxing—I began noticing patterns in my listening habits. Some weeks I'd be obsessed with indie rock, others with classical piano pieces. But I had no way to quantify these trends or understand what was driving my musical choices.

This curiosity about my own listening behavior led me to build SpotiFriend, an automated data pipeline that extracts, processes, and analyzes Spotify listening data using AWS cloud services. What began as a personal project to understand my music taste evolved into a comprehensive data engineering solution that demonstrates real-world ETL processes, cloud architecture, and analytics automation.

While the goal was to gain insights into my musical preferences, the project became a perfect showcase of modern data engineering practices—from API integration and data extraction to cloud storage, processing, and visualization.

Identifying the Problem: Beyond Spotify Wrapped

Spotify's annual Wrapped feature is great, but it's limited. It only shows you data once a year, and it doesn't let you dive deep into the patterns. I wanted to answer questions like:

How does my music taste change throughout the day?

Which artists am I discovering vs. returning to?

Do I listen to different genres on weekdays vs. weekends?

How often do I skip songs, and what patterns emerge?

The Spotify API provides access to this data, but it arrives as raw, nested JSON rather than analysis-ready tables. To get meaningful insights, I needed a system that could automatically collect, process, and analyze this data on a regular schedule. That's when I decided to create a complete data pipeline using AWS services.

Designing the System: From API to Analytics

To make SpotiFriend scalable and automated, I designed a cloud-native architecture that could handle data extraction, processing, and analysis with minimal manual intervention. Here's how I approached it:

Data Extraction Layer

I built a Python-based ETL system on the Spotify Web API, using the Authorization Code flow with PKCE for secure access to user data. The system extracts multiple data types:

Top tracks and artists across different time ranges (short, medium, long term)
Recently played tracks with timestamps and audio features
Playlist data including track details and user behavior
User profile information and listening statistics

The extraction process handles both free and premium account features, with graceful degradation for premium-only data points.
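A minimal sketch of the top-tracks extraction, assuming a Spotipy-style client whose `current_user_top_tracks(limit, time_range)` method returns a dict with an `items` list. The function name `extract_top_tracks` and the output record fields are illustrative and not taken from the project's code.

```python
from typing import Any, Dict, List

# The three time_range values accepted by Spotify's top-items endpoints.
TIME_RANGES = ("short_term", "medium_term", "long_term")


def extract_top_tracks(client: Any, limit: int = 50) -> List[Dict[str, Any]]:
    """Collect the user's top tracks for every time range as flat records.

    `client` is assumed to behave like spotipy.Spotify.
    """
    records: List[Dict[str, Any]] = []
    for time_range in TIME_RANGES:
        page = client.current_user_top_tracks(limit=limit, time_range=time_range)
        for rank, track in enumerate(page.get("items", []), start=1):
            records.append({
                "time_range": time_range,
                "rank": rank,
                "track_id": track["id"],
                "track_name": track["name"],
                "artists": [a["name"] for a in track.get("artists", [])],
            })
    return records
```

Flattening each track into a small record at extraction time keeps the later JSONL and SQL steps simple, at the cost of dropping fields the analysis does not use.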

AWS Cloud Infrastructure

I implemented a serverless architecture using AWS services for scalability and cost-effectiveness:

S3: Data lake storage with partitioned JSONL format for efficient querying
Lambda: Serverless functions for data processing and transformation
Glue: Data catalog and ETL jobs for schema management
Athena: Serverless SQL queries for analytics
EventBridge: Automated scheduling for daily data collection
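To show how the scheduling and processing pieces could connect, here is a hedged sketch of a Lambda entry point invoked by an EventBridge schedule. It assumes the scheduled event carries an ISO 8601 `time` field; the helper `run_date_from_event` and the returned payload are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def run_date_from_event(event: Dict[str, Any]) -> str:
    """Return the run date (YYYY-MM-DD, UTC) from a scheduled event's `time` field.

    Falls back to the current UTC date when the field is absent,
    for example during a manual test invocation.
    """
    stamp = event.get("time")
    if stamp:
        # Timestamps are expected to look like 2024-01-15T06:00:00Z.
        parsed = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    else:
        parsed = datetime.now(timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d")


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Entry point in the (event, context) shape Lambda uses for Python handlers."""
    run_date = run_date_from_event(event)
    # Hypothetical: the extraction, transformation, and upload steps would run here.
    return {"status": "ok", "run_date": run_date}
```

Deriving the run date from the event rather than the wall clock means a retried or replayed invocation writes to the same date partition.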

Data Processing Pipeline

The processing layer handles data transformation and quality:

JSON to JSONL conversion for efficient storage
Data validation and error handling
Partitioning by date for optimal query performance
Schema evolution handling for API changes
Incremental processing to avoid duplicate data

I used Pandas for data manipulation and Boto3 for AWS service integration, ensuring the pipeline could handle both small and large datasets efficiently.
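The JSON to JSONL step can be sketched with the standard library alone. This is an illustration under the assumption that each API item has already been reduced to a flat dict; `to_jsonl` and `from_jsonl` are hypothetical helper names.

```python
import json
from typing import Any, Dict, Iterable, List


def to_jsonl(records: Iterable[Dict[str, Any]]) -> str:
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "".join(
        json.dumps(record, ensure_ascii=False, separators=(",", ":")) + "\n"
        for record in records
    )


def from_jsonl(text: str) -> List[Dict[str, Any]]:
    """Parse JSON Lines back into a list of dicts, skipping blank lines."""
    # Split on "\n" only: json.dumps escapes newlines inside strings,
    # so every "\n" in the payload is a record boundary.
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

One object per line lets query engines such as Athena read each record independently, which is why the pipeline converts nested API responses before storing them.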

Analytics and Visualization

For insights and reporting, I implemented:

Amazon QuickSight: Interactive dashboards for trend analysis
Custom Analytics: SQL queries for specific insights
Automated Reports: Scheduled generation of listening summaries
Real-time Monitoring: Pipeline health and data quality metrics
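As an illustration of the custom SQL analytics, the sketch below submits a query through Boto3's Athena client. The table name `recently_played`, the `played_at` column, and the helper `submit_query` are assumptions made for this example; only `start_query_execution` and its parameters come from the Boto3 API.

```python
from typing import Any

# Hypothetical table and column names; the timestamp is assumed to be an ISO 8601 string.
PLAYS_BY_HOUR_SQL = """
SELECT hour(from_iso8601_timestamp(played_at)) AS hour_of_day,
       count(*) AS plays
FROM recently_played
GROUP BY 1
ORDER BY 1
"""


def submit_query(athena_client: Any, sql: str, database: str, output_location: str) -> str:
    """Start an Athena query and return its execution ID.

    `athena_client` is expected to be boto3.client("athena");
    `output_location` is an S3 URI where Athena writes the results.
    """
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

The call returns immediately; a real job would poll the execution status before reading the result file.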

System Architecture

Here's the complete data engineering architecture I designed for SpotiFriend:

AWS Data Pipeline Architecture

Data Sources: Spotify Web API, User Authentication (PKCE)
Data Extraction: Python ETL Script, Spotipy Library
Cloud Processing: AWS Lambda (Serverless), Data Transformation
Data Storage: AWS S3 Data Lake, Partitioned JSONL Format
Data Catalog: AWS Glue, Schema Management
Analytics: Amazon Athena (SQL), Amazon QuickSight

Automation: EventBridge scheduler triggers daily data collection runs
Security: PKCE authentication ensures secure API access
Scalability: Serverless architecture scales automatically with demand

Data Pipeline Implementation

The core of SpotiFriend is a robust ETL pipeline that handles data extraction, transformation, and loading with proper error handling and monitoring:

1. Data Extraction

Using the Spotify Web API with PKCE authentication, the system extracts:

Top Tracks/Artists: Across short-term (about 4 weeks), medium-term (about 6 months), and long-term (the longest listening history the API exposes) time ranges
Recently Played: Track history with timestamps, audio features, and metadata
Playlist Data: User-created and followed playlists with track details
User Profile: Account information and listening statistics

2. Data Processing

The processing layer handles data transformation and quality assurance:

JSON to JSONL: Conversion for efficient storage and querying
Data Validation: Schema validation and error handling for API responses
Partitioning: Date-based partitioning for optimal query performance
Deduplication: Handling of duplicate records and incremental processing
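Validation and deduplication could look roughly like the sketch below. It assumes each play is identified by its track ID plus its `played_at` timestamp; the field names and the `new_plays` helper are hypothetical, and the `seen` set stands in for whatever state the pipeline keeps between runs.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

REQUIRED_FIELDS = ("track_id", "played_at")  # hypothetical minimal schema

PlayKey = Tuple[str, str]


def is_valid(record: Dict[str, Any]) -> bool:
    """A record is usable only if every required field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def new_plays(records: Iterable[Dict[str, Any]], seen: Set[PlayKey]) -> List[Dict[str, Any]]:
    """Return valid records not seen before, updating `seen` in place."""
    fresh: List[Dict[str, Any]] = []
    for record in records:
        if not is_valid(record):
            continue  # skip malformed items instead of failing the whole run
        key = (record["track_id"], record["played_at"])
        if key in seen:
            continue
        seen.add(key)
        fresh.append(record)
    return fresh
```

Because daily runs of the recently played endpoint can overlap, keying on the pair rather than the track alone keeps genuine repeat listens while dropping re-fetched rows.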

3. Data Storage

Data is stored in Amazon S3, organized into three areas:

Raw Data: Partitioned by data type and date for efficient access
Processed Data: Cleaned and transformed datasets ready for analysis
Metadata: Schema definitions and data lineage information
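One way to realize this layout is a key-building helper plus a thin upload wrapper. The Hive-style `year=/month=/day=` prefixes, the layer names, and both function names are assumptions for illustration; `put_object` is the standard Boto3 S3 call.

```python
from datetime import date
from typing import Any


def partition_key(layer: str, data_type: str, day: date, filename: str = "data.jsonl") -> str:
    """Build an S3 object key partitioned by data type and date (assumed layout)."""
    return (
        f"{layer}/{data_type}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/{filename}"
    )


def upload_jsonl(s3_client: Any, bucket: str, key: str, body: str) -> None:
    """Write a JSONL payload to S3. `s3_client` is expected to be boto3.client("s3")."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
```

For example, `partition_key("raw", "recently_played", date(2024, 1, 5))` yields `raw/recently_played/year=2024/month=01/day=05/data.jsonl`, so a query filtered on a date range only needs to scan the matching prefixes.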

4. Analytics & Insights

The analytics layer provides actionable insights:

Trend Analysis: Listening patterns over time and seasonal variations
Genre Analysis: Music taste evolution and genre preferences
Artist Insights: Discovery patterns and loyalty metrics
Behavioral Analysis: Skip rates, listening duration, and session patterns
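Two of the simpler trend questions, time of day and weekday versus weekend, can be sketched directly from play timestamps. This example assumes ISO 8601 UTC strings such as those in a `played_at` field and uses only the standard library; converting to the listener's local time zone is left out.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable


def parse_played_at(stamp: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp like 2024-01-15T08:30:00.123Z."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def plays_by_hour(stamps: Iterable[str]) -> Dict[int, int]:
    """Count plays per hour of day (UTC)."""
    return dict(Counter(parse_played_at(s).hour for s in stamps))


def weekday_weekend_split(stamps: Iterable[str]) -> Dict[str, int]:
    """Count plays on weekdays (Mon-Fri) versus weekends (Sat-Sun), in UTC."""
    counts = {"weekday": 0, "weekend": 0}
    for stamp in stamps:
        bucket = "weekend" if parse_played_at(stamp).weekday() >= 5 else "weekday"
        counts[bucket] += 1
    return counts
```

The same aggregations translate naturally to SQL GROUP BY queries once the data is cataloged.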

Technical Implementation Highlights

Here are the key technical challenges I solved and the solutions I implemented:

Secure Authentication

Implemented PKCE (Proof Key for Code Exchange) authentication for secure Spotify API access without exposing client secrets. This ensures the application can run securely in production environments.
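The core of PKCE is a random code verifier and its SHA-256 challenge, as defined in RFC 7636. The sketch below shows that derivation with the standard library; the function names are illustrative, and in practice a library such as Spotipy can handle this step internally.

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes: int = 64) -> str:
    """Generate a random URL-safe verifier (RFC 7636 requires 43 to 128 characters)."""
    raw = secrets.token_bytes(n_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def make_code_challenge(verifier: str) -> str:
    """Derive the S256 challenge: base64url(SHA-256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The client sends the challenge with the authorization request and the verifier with the token request, so the server can confirm both came from the same client without any stored secret.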

Automated Scheduling

Used Amazon EventBridge to schedule daily data collection runs, ensuring fresh insights without manual intervention. The system handles failures with retry logic and error notifications.
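Retry logic of this kind is often written as a small wrapper with exponential backoff. The version below is a generic sketch rather than the project's actual implementation; the `with_retries` name, the default attempt count, and the logger name are assumptions.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("spotifriend.retry")  # hypothetical logger name


def with_retries(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == attempts:
                logger.error("giving up after %d attempts: %s", attempts, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting; the final re-raise is where an error notification could be hooked in.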

Scalable Architecture

Designed the system to handle both individual user data and potential multi-user scenarios. The serverless architecture scales automatically with demand and incurs charges only for actual usage.

Code Quality

Implemented proper error handling, logging, and monitoring throughout the pipeline. Used environment variables for configuration management and followed AWS best practices for security and performance.
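Environment-based configuration can be sketched as a small frozen dataclass loaded at startup. The variable names `SPOTIFY_CLIENT_ID`, `S3_BUCKET`, and `LOG_LEVEL` are hypothetical placeholders, not the project's actual settings.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PipelineConfig:
    client_id: str
    bucket: str
    log_level: str = "INFO"


def load_config(env: Mapping[str, str] = os.environ) -> PipelineConfig:
    """Read settings from the environment and fail fast if any are missing."""
    required = ("SPOTIFY_CLIENT_ID", "S3_BUCKET")  # hypothetical variable names
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return PipelineConfig(
        client_id=env["SPOTIFY_CLIENT_ID"],
        bucket=env["S3_BUCKET"],
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Failing at load time surfaces a misconfigured deployment in the first log line instead of partway through a run.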

Future Enhancements

SpotiFriend has the foundation for several exciting enhancements that would further demonstrate advanced data engineering capabilities:

Machine Learning Integration

Implement recommendation systems using Amazon SageMaker to predict user preferences and suggest new music based on listening patterns and audio features.

Multi-User Support

Scale the system to handle multiple users with proper data isolation, user management, and personalized dashboards for each user.

Real-time Processing

Implement real-time data streaming using Amazon Kinesis to process listening events as they happen, enabling live analytics and instant insights.

Mobile Application

Develop a mobile app that provides personalized music insights, trend notifications, and social features for sharing music discoveries.

Conclusion: Beyond the Music

SpotiFriend started as a way to understand my music taste, but it became much more than that. It's a comprehensive demonstration of modern data engineering practices, from API integration and cloud architecture to automated processing and analytics.

The project showcases my ability to:

Design and implement end-to-end data pipelines
Work with cloud-native technologies and serverless architectures
Handle real-world data challenges like authentication, error handling, and scalability
Create meaningful insights from raw data
Build production-ready systems with proper monitoring and maintenance

While the insights about my music taste were interesting, the real value came from building a system that could reliably collect, process, and analyze data at scale. It's a perfect example of how data engineering can turn curiosity into actionable insights.

SpotiFriend represents my approach to data engineering: start with a real problem, design a scalable solution, implement it with best practices, and continuously improve based on results. It's not just about the technology—it's about using data to understand the world better.