In this project, a music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory with JSON metadata on the songs in the app.
The aim of the project is to create an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs Sparkify's users are listening to.
The datasets used in this project are found in public S3 buckets; both files are in JSON format. The song data contains metadata about each song and the artist of that song, and the event data contains information on the users: their activities, location, gender, etc.
- song_data.json (song data)
- log_data.json (events data)
To access this data, Redshift will be utilised to ingest it using the COPY command, which reads the JSON files within the buckets and copies their content into our staging tables.
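As an illustration, the staging COPY statements might look like the sketch below, written as string variables in the style of sql_queries.py. The bucket paths, region, IAM role ARN, and JSONPaths file are placeholders that would come from dwh.cfg, not the project's literal values.

```python
# Illustrative sketch of the staging COPY statements (placeholders, not real paths).
staging_events_copy = """
    COPY staging_events
    FROM '{log_data}'
    IAM_ROLE '{role_arn}'
    REGION '<region>'
    FORMAT AS JSON '{log_jsonpath}';
""".format(log_data="s3://<bucket>/log_data",
           role_arn="<DWH_ROLE_ARN>",
           log_jsonpath="s3://<bucket>/log_json_path.json")

staging_songs_copy = """
    COPY staging_songs
    FROM '{song_data}'
    IAM_ROLE '{role_arn}'
    REGION '<region>'
    FORMAT AS JSON 'auto';
""".format(song_data="s3://<bucket>/song_data",
           role_arn="<DWH_ROLE_ARN>")
```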
This project has five files: create_tables.py, etl.py, sql_queries.py, dwh.cfg, and infrastructure_code.ipynb. A star schema with a fact table and dimension tables was used to optimize for queries on song play analysis (an illustrative table definition is sketched after the list). It includes the following tables:
- staging_songs - information about songs and artists
- staging_events - information on users (their activities, location, gender, etc.)
- songplays - records in the event data associated with song plays, i.e. records with page NextSong (songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
- users - users in the app (user_id, first_name, last_name, gender, level)
- songs - songs in music database (song_id, title, artist_id, year, duration)
- artists - artists in music database (artist_id, name, location, latitude, longitude)
- time - timestamps of records in songplays broken down into specific units (start_time, hour, day, week, month, year, weekday)
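As a concrete illustration of the schema, one dimension table and the fact table might be defined roughly as below. The column types and the distribution/sort keys are illustrative assumptions, not the project's exact DDL.

```python
# Illustrative DDL sketch for one dimension table and the fact table
# (types and dist/sort keys are assumptions, not the project's exact definitions).
user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    );
"""

songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL SORTKEY,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR DISTKEY,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```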
- Create a new IAM user in your AWS account.
- Give it AdministratorAccess and attach policies.
- Use the access key and secret key to create clients for EC2, S3, IAM, and Redshift.
- Create an IAM role that makes Redshift able to access the S3 bucket (ReadOnly).
- Create a Redshift cluster, get the DWH_ENDPOINT (host address) and DWH_ROLE_ARN, and fill in the config file (a rough boto3 sketch follows this list).
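The infrastructure steps in infrastructure_code.ipynb can be sketched roughly as follows with boto3. The role name, cluster identifier, region, credentials, and node sizing are placeholders; the real values come from dwh.cfg and the exact notebook code may differ.

```python
# Rough sketch of the infrastructure-as-code steps (all names/keys are placeholders).
import json
import boto3

KEY, SECRET = "<AWS_ACCESS_KEY>", "<AWS_SECRET_KEY>"

iam = boto3.client("iam", region_name="<region>",
                   aws_access_key_id=KEY, aws_secret_access_key=SECRET)
redshift = boto3.client("redshift", region_name="<region>",
                        aws_access_key_id=KEY, aws_secret_access_key=SECRET)

# IAM role that allows Redshift to read from S3
iam.create_role(
    RoleName="dwhRole",
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow",
                       "Principal": {"Service": "redshift.amazonaws.com"},
                       "Action": "sts:AssumeRole"}]}),
    Description="Allows Redshift to access S3 (ReadOnly).")
iam.attach_role_policy(RoleName="dwhRole",
                       PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess")
role_arn = iam.get_role(RoleName="dwhRole")["Role"]["Arn"]  # -> DWH_ROLE_ARN

# Redshift cluster; DWH_ENDPOINT is available once the cluster is ready
redshift.create_cluster(
    ClusterType="multi-node", NodeType="dc2.large", NumberOfNodes=4,
    DBName="dwh", ClusterIdentifier="dwhCluster",
    MasterUsername="<user>", MasterUserPassword="<password>",
    IamRoles=[role_arn])
```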
- Set up staging and analytics tables to hold the data from the S3 buckets.
- Transfer the data from the S3 buckets into the staging tables in the Redshift cluster.
- Insert the data from the staging tables into the fact and dimension tables (an illustrative INSERT is sketched below).
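For example, loading the fact table joins the two staging tables on song title and artist name. A hedged sketch of such an INSERT statement is shown below; the staging column names and join conditions are assumptions and may differ from the project's actual queries.

```python
# Illustrative sketch of loading the fact table from the staging tables.
songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second' AS start_time,
           e.userId, e.level, s.song_id, s.artist_id,
           e.sessionId, e.location, e.userAgent
    FROM staging_events e
    JOIN staging_songs  s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""
```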
sql_queries.py contains variables holding SQL statements as strings, partitioned into CREATE, DROP, COPY, and INSERT statements. etl.py executes the queries that ingest the JSON data from the S3 bucket. create_tables.py drops old tables (if they exist) and re-creates new ones. Redshift, IAM, and S3 configuration information is found in the dwh.cfg file.
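A minimal sketch of how etl.py might tie these pieces together, assuming psycopg2, a CLUSTER section in dwh.cfg listing host, dbname, user, password, and port in that order, and query lists named copy_table_queries and insert_table_queries exported by sql_queries.py (all of these names are assumptions based on the description above):

```python
# Minimal sketch of etl.py: connect to Redshift using dwh.cfg, run the COPY
# statements, then run the INSERT statements from sql_queries.py.
import configparser
import psycopg2
from sql_queries import copy_table_queries, insert_table_queries

def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            *config["CLUSTER"].values()))
    cur = conn.cursor()
    for query in copy_table_queries + insert_table_queries:
        cur.execute(query)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    main()
```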
Run the infrastructure code to execute sql_queries.py, create_tables.py, and etl.py.
