Movie Network Analysis Project

Category: Data Analysis

Client:

Duration: 1 Month

Project Overview

This project scrapes and analyzes data from Rotten Tomatoes' "Movies on Netflix 2024 (At Home)" category. The primary goal is to build a network graph in which movies are linked according to their Tomatometer scores, then analyze it with PageRank, degree distribution, and the average clustering coefficient to identify which movies are central and how tightly the network is connected.

Data Collection

Scraping Process

Data Source: Rotten Tomatoes' "Movies on Netflix 2024 (At Home)" category.
Tool Used: Apify Rotten Tomatoes Scraper.
Scraping Input: URL - Rotten Tomatoes Netflix Movies.
Settings: Maximum results set to 500.
Output: 864 results scraped in 8 hours.
Data Export: Extracted into CSV format.

Collected Data Attributes

Aspect Ratio
Audience Score
Box Office (Gross USA)
Cast
Director
Distributor
Genre
Original Language
Producer
Production Company
Rating
Release Date (Streaming)
Release Date (Theaters)
Rerelease Date (Theaters)
Runtime
Sound Mix
Synopsis
Title
Tomatometer
URL
Collection View
Writer

Cleaned Dataset

After cleaning, the following attributes were retained:

Audience Score
Box Office (Gross USA)
Genre
Original Language
Release Date (Theaters)
Runtime
Title
Tomatometer
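
The cleaning step can be sketched as follows. This is a minimal illustration, not the project's actual script: the column names mirror the attribute list above, while the percent-string format ("92%"), the rule of dropping rows without a Tomatometer score, and the sample rows are all assumptions made for the example.

```python
import csv
import io

# Attributes retained after cleaning (taken from the list above).
KEPT_COLUMNS = [
    "Audience Score", "Box Office (Gross USA)", "Genre", "Original Language",
    "Release Date (Theaters)", "Runtime", "Title", "Tomatometer",
]


def parse_score(raw):
    """Turn a score such as '92%' into 92; return None when it is missing."""
    text = (raw or "").strip().rstrip("%")
    return int(text) if text.isdigit() else None


def clean_rows(csv_file):
    """Keep only the retained columns and drop movies without a Tomatometer score."""
    cleaned = []
    for row in csv.DictReader(csv_file):
        record = {col: row.get(col, "") for col in KEPT_COLUMNS}
        record["Tomatometer"] = parse_score(record["Tomatometer"])
        if record["Title"] and record["Tomatometer"] is not None:
            cleaned.append(record)
    return cleaned


# Tiny in-memory stand-in for the exported CSV (titles and scores are made up).
sample = io.StringIO(
    "Title,Tomatometer,Audience Score,Director\n"
    "Movie A,92%,85%,Someone\n"
    "Movie B,,70%,Someone Else\n"
)
movies = clean_rows(sample)
print(movies)
```

To run this against the real export, pass an opened file (for example `open(path, newline="", encoding="utf-8")`) instead of the in-memory sample.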

Network Graph Construction

Nodes: Movie titles.
Edges: Relationships based on Tomatometer scores.
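- Weight: Increases with the difference in Tomatometer scores, so it is inversely related to score similarity. Movies with similar scores are joined by low-weight edges; movies with very different scores are joined by high-weight edges.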
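
A minimal NetworkX sketch of this construction follows. It assumes that every pair of movies is connected and that the weight is the absolute difference between the two Tomatometer scores. The source only states that a larger score difference means a larger weight, so the exact formula is an assumption, as are the function name and the placeholder titles and scores.

```python
from itertools import combinations

import networkx as nx


def build_movie_graph(scores):
    """Build a weighted graph: one node per title, one edge per pair of titles.

    `scores` maps title -> Tomatometer score (0-100).
    """
    graph = nx.Graph()
    for title, score in scores.items():
        graph.add_node(title, tomatometer=score)
    for a, b in combinations(scores, 2):
        # Assumed weight: absolute score difference (larger gap = heavier edge).
        graph.add_edge(a, b, weight=abs(scores[a] - scores[b]))
    return graph


# Placeholder data, not taken from the scraped dataset.
scores = {"Movie A": 95, "Movie B": 90, "Movie C": 40}
graph = build_movie_graph(scores)
print(graph.number_of_nodes(), graph.number_of_edges())
print(graph["Movie A"]["Movie C"]["weight"])
```

Connecting every pair produces n(n-1)/2 edges for n movies, so the edge count grows quadratically with the dataset size.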

The network graph consists of:

Nodes: 264 movies.
Edges: Numerous, weighted connections between movies.
Visualization:
- Graph size: 50x50 (default) or 250x250 (requires heavy computational resources).
- A 25,000x25,000 PNG image of the network is included in the project zip file.
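
If the graph size refers to Matplotlib's `figsize` in inches (an assumption; the source does not say), the pixel dimensions of the saved image are the figure size multiplied by the dpi. At Matplotlib's default of 100 dpi, a 250x250 figure comes out at 25,000x25,000 pixels, which matches the PNG mentioned above. The sketch below uses a deliberately small figure and a placeholder graph so that it runs quickly; the helper names, the circular layout, and the output file name are illustrative choices.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import networkx as nx


def pixel_size(figsize_inches, dpi=100):
    """Pixel dimensions of a saved figure: inches multiplied by dots per inch."""
    width, height = figsize_inches
    return int(width * dpi), int(height * dpi)


def draw_graph(graph, path, figsize=(4, 4), dpi=100):
    """Draw the graph with a circular layout and save it as a PNG."""
    fig, ax = plt.subplots(figsize=figsize, dpi=dpi)
    pos = nx.circular_layout(graph)
    nx.draw_networkx(graph, pos=pos, ax=ax, node_size=300, font_size=8)
    ax.set_axis_off()
    fig.savefig(path, dpi=dpi)
    plt.close(fig)


# Placeholder graph, not the scraped dataset.
demo = nx.complete_graph(["Movie A", "Movie B", "Movie C"])
out_path = os.path.join(tempfile.gettempdir(), "movie_network_demo.png")
draw_graph(demo, out_path)

print(pixel_size((50, 50)))    # (5000, 5000)
print(pixel_size((250, 250)))  # (25000, 25000)
```

The memory needed to render grows with the pixel count, which is consistent with the note that the 250x250 setting requires heavy computational resources.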

Analysis Techniques

1. PageRank

Assesses the significance of nodes (movies) based on their connectivity.
Higher PageRank scores indicate more influential or well-connected movies.
Example scores:
- "Fallen Leaves": 0.00522
- "Tótem": 0.00522
- "Four Daughters": 0.00548
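
To make the computation concrete, here is a minimal pure-Python sketch of weighted PageRank by power iteration. It is not the project's code: NetworkX provides `nx.pagerank(G, weight="weight")` for this, and the version below only illustrates the idea. The damping factor of 0.85 is NetworkX's default; the function name, the dictionary-based graph format, and the placeholder scores are assumptions.

```python
def weighted_pagerank(weights, damping=0.85, tol=1e-12, max_iter=500):
    """Weighted PageRank by power iteration.

    `weights` maps node -> {neighbor: edge weight}; for an undirected graph
    each edge appears in both directions.
    """
    nodes = list(weights)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    out_total = {node: sum(weights[node].values()) for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing weight is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_total[v] == 0)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for source in nodes:
            if out_total[source] == 0:
                continue
            for target, w in weights[source].items():
                # Each node passes on its rank in proportion to edge weight.
                new_rank[target] += damping * rank[source] * w / out_total[source]
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Placeholder scores; edge weight is assumed to be the absolute score difference.
scores = {"Movie A": 95, "Movie B": 90, "Movie C": 40}
weights = {
    a: {b: abs(scores[a] - scores[b]) for b in scores if b != a}
    for a in scores
}
ranks = weighted_pagerank(weights)
for title, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{title}: {value:.5f}")
```

Under the assumed absolute-difference weight, a movie whose score is far from most others accumulates more edge weight and therefore ranks higher in this sketch. Two movies with identical scores have identical edge weights to every other movie, so they receive the same PageRank.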

2. Degree Distribution

Plots the distribution of connections (edges) per node.
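
A degree distribution is a count of how many nodes have each degree. The sketch below computes it with NetworkX on a small placeholder graph; the function name and the example edges are illustrative, not taken from the project.

```python
from collections import Counter

import networkx as nx


def degree_distribution(graph):
    """Map each degree value to the number of nodes that have it."""
    return Counter(degree for _, degree in graph.degree())


# Placeholder star-shaped graph: "Movie A" is linked to three other movies.
demo = nx.Graph([
    ("Movie A", "Movie B"),
    ("Movie A", "Movie C"),
    ("Movie A", "Movie D"),
])
dist = degree_distribution(demo)
print(sorted(dist.items()))  # [(1, 3), (3, 1)]
```

The resulting counts can be passed to a Matplotlib bar chart. Note that if every pair of movies is connected, every node has the same unweighted degree (n - 1), and only the weighted degree, `graph.degree(weight="weight")`, varies between movies.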

3. Average Clustering Coefficient

Measures the tendency of nodes to form tightly-knit groups.
Value: 1.0, meaning every movie's neighbors are all connected to one another, which is what a fully connected network produces.
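
For reference, the local clustering coefficient of a node with k neighbors is the number of edges that exist among those neighbors divided by the k(k-1)/2 edges that could exist, and the average is taken over all nodes. The short NetworkX sketch below contrasts two placeholder graphs to show why a complete graph gives exactly 1.0; the graphs are illustrative and not the project's data.

```python
import networkx as nx

# In a complete graph every pair of neighbors is linked, so each node scores 1.
complete = nx.complete_graph(5)

# In a star graph the leaves are never linked to each other, so each node scores 0.
star = nx.star_graph(4)

complete_value = nx.average_clustering(complete)
star_value = nx.average_clustering(star)
print(complete_value)  # 1.0
print(star_value)      # 0.0
```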

Results and Insights

Graph Visualization: High-rated movies are placed lower, and low-rated movies are placed higher in the graph.
Clustering Coefficient: Indicates a highly interconnected network.
PageRank: Highlights central movies in the network.
Degree Distribution: Reflects the diversity in movie connections.

Insights

The graph reveals relationships between movies based on ratings.
It can help predict movie preferences or recommend movies based on similar scores.
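
As a rough illustration of the recommendation idea, the sketch below returns the movies whose Tomatometer scores are closest to a given title. It is a hypothetical helper, not part of the project: the function name, the alphabetical tie-breaking rule, and the placeholder scores are assumptions.

```python
def recommend_by_score(scores, title, k=3):
    """Return up to k other titles whose Tomatometer score is closest to `title`'s."""
    target = scores[title]
    others = [t for t in scores if t != title]
    # Sort by score gap; ties are broken alphabetically for reproducibility.
    others.sort(key=lambda t: (abs(scores[t] - target), t))
    return others[:k]


# Placeholder scores, not taken from the scraped dataset.
scores = {"Movie A": 95, "Movie B": 90, "Movie C": 40, "Movie D": 88}
print(recommend_by_score(scores, "Movie B", k=2))  # ['Movie D', 'Movie A']
```

In terms of the network, this amounts to following the lowest-weight edges out of a movie's node, since low weight corresponds to similar scores.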

Further Questions

Can expanding the dataset provide deeper insights?
Can analysis methods improve network efficiency?

Future Work

Create genre-specific or director-specific graphs.
Analyze relationships based on additional attributes like release year or production company.

How to Use

Load the CSV data.
Run the provided scripts to generate the network graph and calculate metrics.
Visualize and analyze results.

Tools and Technologies

Apify: For web scraping.
NetworkX: For graph construction and analysis.
Matplotlib: For data visualization.

Contact: For queries or contributions, please contact Dhruv Singh.

Download the project assets, including the 25,000x25,000 PNG image and analysis scripts, in the provided zip file.
