Exercise 8

Movie Revenue Prediction

Today we will be working with more movie-related data. This data is real, and it's also a bit messier. We'll need to do some cleaning and feature engineering to get it ready for training.

As in most ML applications on real data, you will likely spend much more time understanding and preparing your data than actually training. So get used to it!

Getting the data

  1. Please download and unzip the dataset from here:

    curl -o ./tmdb-box-office-prediction.zip https://learn-data-science-site.vercel.app/tmdb-box-office-prediction.zip

  2. You will notice that there are two CSV files inside. What do they represent? What is the one major difference in their columns?

  3. This dataset is from an ML competition in which the goal is to predict the amount of money a movie will make at the box office. Because of the competition format, one of the files contains a separate set of unlabeled data. Competitors have to submit their models and results on this unlabeled dataset, and the organizers, who know the true answers, can see whose model performs best.
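Once unzipped, the two files can be loaded with pandas and their columns compared. Here is a minimal sketch using tiny inline stand-ins rather than the real files (the actual filenames are whatever came out of the zip, read with pd.read_csv):

```python
import pandas as pd

# Stand-ins for the two CSVs; with the real files you would instead do
# something like pd.read_csv("<labeled file>.csv") for each.
labeled = pd.DataFrame({"title": ["A"], "budget": [1_000], "revenue": [5_000]})
unlabeled = pd.DataFrame({"title": ["B"], "budget": [2_000]})

# The major column difference: the unlabeled file has no label column.
label_only = set(labeled.columns) - set(unlabeled.columns)
```

With the real files, `label_only` is exactly the column the competition asks you to predict.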

EDA (Exploratory Data Analysis)

  1. What columns are in the dataset? Which ones would you intuitively guess might contribute to the overall revenue of a movie?
  2. Let's quantify this by making a seaborn pairplot with some numerical features of your choice. Do not pick more than ~5 or it might take too long to produce the plot. Make sure to include revenue.
  3. Feel free to make any other plots you think are interesting.

Feature Engineering

Note: even though we are not training with the unlabeled data, any transformations you make to the labeled dataset must also be applied to the unlabeled dataset. That way, when we make our predictions, the model will be able to find all the columns we trained with.

  1. Logically, the year a movie was released should have some influence on the revenue possibilities. A movie from 1950 could not have made 1 billion euros the way a movie can today. Use your pandas knowledge to add a new column to your data called "year" based on the "release_date" column.
    • Hint: You may use the datetime package if you wish, specifically the strptime() method might be useful. You are also welcome to parse the strings manually.
    • Make sure the way you're converting the year doesn't create any dates that are in the future (plot a histogram of the years to check this!)
    • Tip: The newest movie in the dataset was released in 2017
  2. In addition to the release year, the genre of a movie might also impact things. Examine the "genres" column, and look at a few examples from various rows. Think about how you might extract the necessary information from this column. Notice that a movie may have multiple associated genres.
  3. We will use a technique called "multi-hot encoding" to pass genre information to our regressor. For this technique to work, we need to create a new column for every genre in the dataset, perhaps f"genre_{x}", where x is each genre (or genre ID). The values of each genre column must be 1 or 0, depending on whether the movie has this genre or not. If this is unclear, please do ask!
    • Hint: Look up the built-in ast.literal_eval() function to help you parse the genres column. It might come in handy!
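Both steps above can be sketched on toy rows. Assumptions to flag: this uses pd.to_datetime instead of the suggested strptime (either works), and it assumes the dataset's m/d/yy date strings; check a few real rows to confirm the format before relying on it:

```python
import ast
import pandas as pd

# Toy rows mimicking the real columns (assuming m/d/yy date strings).
df = pd.DataFrame({
    "release_date": ["2/20/15", "8/6/04"],
    "genres": [
        "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
        "[{'id': 28, 'name': 'Action'}]",
    ],
})

# Two-digit years are ambiguous: pandas maps '15' -> 2015 but also
# '68' -> 2068, so push anything past the newest movie (2017) back a century.
year = pd.to_datetime(df["release_date"], format="%m/%d/%y").dt.year
df["year"] = year.where(year <= 2017, year - 100)

# Multi-hot encoding: parse the stringified list of dicts, then create
# one 0/1 column per genre name found anywhere in the dataset.
df["genre_list"] = df["genres"].apply(
    lambda s: [g["name"] for g in ast.literal_eval(s)]
)
all_genres = sorted({g for lst in df["genre_list"] for g in lst})
for g in all_genres:
    df[f"genre_{g}"] = df["genre_list"].apply(lambda lst: int(g in lst))
```

ast.literal_eval is used (rather than json.loads) because the genres strings use single quotes, which are valid Python literals but not valid JSON.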

Training

  1. On the labeled dataset, create our X and Y, where X contains the columns:
    • budget
    • popularity
    • runtime
    • year
    • and all of our binary multi-hot encoded genre columns
  2. Perform a train/test split with test_size=0.2 and random_state=42.
  3. Run a Random Forest Regressor with all the default parameters and random_state=42.
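The split-and-fit steps can be sketched as follows. The X and y here are synthetic stand-ins (100 random rows with a planted budget-revenue relationship); in the exercise, X is the filtered DataFrame described above and y is the revenue column:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled data: the real X would also carry
# the genre_* multi-hot columns.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "budget": rng.uniform(1e5, 2e8, 100),
    "popularity": rng.uniform(0, 50, 100),
    "runtime": rng.uniform(80, 180, 100),
    "year": rng.integers(1950, 2018, 100),
})
y = X["budget"] * 3 + rng.normal(0, 1e6, 100)  # planted relationship

# 80/20 split with a fixed seed so results are reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default hyperparameters, seeded for reproducibility.
forest_model = RandomForestRegressor(random_state=42)
forest_model.fit(X_train, y_train)
```

After fitting, `forest_model.predict(X_test)` gives the held-out predictions used in the next section.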

Evaluating

  1. Predict on the 20% test set that you set aside during the train/test split

    • Plot the predicted revenue vs the actual revenue
  2. Use the following code to extract the importance of each feature in determining the revenue

    # plot feature importance
    # X here is the full DataFrame just before train-test splitting
    # (but after you filtered down to the final columns list)
    
    feature_importances = forest_model.feature_importances_
    feature_importances_df = pd.DataFrame({
        "feature": X.columns,
        "importance": feature_importances
    })
    feature_importances_df.sort_values(by="importance", ascending=False, inplace=True)
    
    feature_importances_df
    
    • What is the sum of the importance column?
    • Which features are most important? How important is each individual genre in determining the revenue? And if you sum the importances of all the genres together, do they seem more important now?
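The grouping question can be answered with a couple of lines of pandas. The importance numbers below are made up purely for illustration; the real feature_importances_df comes from the snippet above:

```python
import pandas as pd

# Hypothetical importances in the same shape as feature_importances_df;
# your real values come from the fitted forest.
feature_importances_df = pd.DataFrame({
    "feature": ["budget", "popularity", "runtime", "year",
                "genre_Comedy", "genre_Drama", "genre_Action"],
    "importance": [0.55, 0.15, 0.10, 0.08, 0.05, 0.04, 0.03],
})

# scikit-learn normalizes feature importances, so they sum to 1.
total = feature_importances_df["importance"].sum()

# Each genre column looks weak on its own; summing them gives the
# combined share of importance that genre information carries.
genre_mask = feature_importances_df["feature"].str.startswith("genre_")
genre_total = feature_importances_df.loc[genre_mask, "importance"].sum()
```

Because multi-hot encoding spreads one concept ("genre") across many sparse columns, each column's individual importance understates how much genre matters overall.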

Predicting the unlabeled data

  1. Make sure the unlabeled DataFrame has the exact same columns and format as your X training DataFrame just before your train/test split
  2. Pass this DataFrame to your predictor to create a predictions array.
  3. Add this data as a new column called pred_revenue to the unlabeled DataFrame
  4. Sort the rows of this DataFrame by pred_revenue. What are the movies with the highest predicted revenue? Does this look more or less as you might expect?
  5. Find the IMDB ID of the movie with the highest predicted revenue, and use it as your secret word today ✨
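Steps 1-4 can be sketched on toy data. One convenient way (an assumption, not the only way) to force the unlabeled frame into the training shape is DataFrame.reindex, which also fills any genre columns the unlabeled file happens to lack with 0; the imdb_id value below is a placeholder:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Tiny stand-ins: X/y mimic the labeled training frame, `unlabeled`
# mimics the competition file (note its extra/missing columns).
X = pd.DataFrame({"budget": [1e6, 5e7, 2e8], "year": [1999, 2005, 2016],
                  "genre_Action": [0, 1, 1]})
y = pd.Series([2e6, 1e8, 9e8])
forest_model = RandomForestRegressor(random_state=42).fit(X, y)

unlabeled = pd.DataFrame({"year": [2017, 2010], "budget": [1.5e8, 2e6],
                          "imdb_id": ["tt0000000", "tt0000001"]})

# Align to exactly the training columns, in the same order; columns the
# unlabeled frame lacks (here genre_Action) are filled with 0.
X_unlabeled = unlabeled.reindex(columns=X.columns, fill_value=0)

unlabeled["pred_revenue"] = forest_model.predict(X_unlabeled)
top = unlabeled.sort_values("pred_revenue", ascending=False)
```

The first row of `top` then holds the movie with the highest predicted revenue, whose imdb_id is what step 5 asks for.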

Submit the answer: