Automatic Recipe Tagging with Machine Learning

Introduction

Recipe websites like food.com use tags to categorize their content. However, human tagging is slow and sometimes in accurate. By analyzing a dataset of over 80,000 recipes with their associated tags, I wil show how I built a predictive model that can automatically suggest relevant tags based on a recipe’s characteristics. This automation could significantly improve recipe discovery and organization on cooking platforms.

The dataset contains several relevant columns:

Quantitative features: minutes, n_steps, n_ingredients, calories, total_fat, sugar, sodium, protein, saturated_fat, carbohydrates

Categorical features: ingredients, steps

Target variable: tags

Data Cleaning and Exploratory Data Analysis

The raw dataset required several cleaning steps to ensure quality, since the data was originally scraped from the food.com website. During this process, I:

Removed invalid entries using type validation
Converted nutrition information from string lists to separate columns
Removed recipes with unrealistic cooking times (>1 year)
Standardized tag formatting and removed empty tags
Merged ratings data with reviews data and calculated average ratings per recipe

Univariate Analysis

For this problem, let’s also look at the most common tags:

Bivariate Analysis

The relationship between sugar content and calories looks nice, and there’s a clear pattern. This pattern is good, since it shows the data is accurate and consistent:

Interesting Aggregates

Recipe complexity, measured by the number of steps and ingredients, has an interesting relationship with ratings. Recipes with few steps and many ingrients get worse ratings, perhaps because they’re confusing, while recipes with many steps and ingredients do really well:

Framing a Prediction Problem

Predicting tags is a multiclass classification problem where we predict which tags should be applied to a recipe. The response variable is the ‘tags’ column, specifically the text strings located within each one. Recipes have several tags, but it’d be unreasonable to measure its ability to predict every single one. Later I will discuss the exact metric I like to use to evaluate the model, but the priority here is to minimize false positives (wrong tags) over false negatives (missing tags). The statistical method to assess false positives is precision. However, I’ll also use accuracy for many steps because it’s easy to interpret and the data set is very unbalanced (many tags are rare). In context, precision makes sense because sers are more frustrated by irrelevant recommendations than by missing ones.

The data used in the models will vary, but the target variable will always be tags. Things unknown at the time of prediction, like ratings, user id, date, etc will be excluded from the model. The models will be trained on the remaining columns, which include quantitative and categorical features.

Baseline Model

The baseline model uses a pipeline with two main components, StandardScaler for feature normalization and LogisticRegression classifiers (one per tag).

I used the following features in this model, which are all quantitative and nominal: calories, total_fat, sugar, sodium, protein, saturated_fat, carbohydrates, n_steps, n_ingredients, minutes

In hindsight, this model performs really bad across all metrics for most tags. I tried many different settings with the Logistic Regression model, but turns out numeric data alone is only good at predicting a few things. This looks more needlessly complicated than it is, but I wanted to try a few different things to see if there was any way to squeeze out better results.

Here are the top predicted tags for a random recipe. It looks reasonable, but it’s not very accurate to the actual tags:

Final Model

For the final, more advanced, model, we can introduce a more complex model. This will use multinomial logistic regression. In addition to the raw numbers, we can take vectors from the text (like what Google Search does - see tf-idf) and simplify them a bit with PCA to make predictions faster without sacrificing too much accuracy. Overall, this approach should be more accurate than the baseline model because it uses a better achitecture and the vectors from the text should help the model predict data for things like cuisines and ingredients much better.

Let’s take a look at how this approach (with using text features) impacts the weights across all tags:

Keep in mind that the figure above uses a log scale for the y-axis. However, this shows that while the numeric features are more important overall (higher median), the text features are significant for many outliers and brings up the mean. This shows that the text features are important for some tags, but not all, as we predicted.

Here are the top predicted tags for an actual recipe.

Here’s the corresponding recipe entry:

When we run standard analysis methods on the model overall, we get generally poor results:

Accuracy: 0.0019097636667462401

Precision: 0.1367988040416587

Recall: 0.05092093590853017

F1 Score: 0.061727870107098404

However, when we look at the confusion matrix for any given tag, we see that the model is actually decent in precision and great in accuray.

So what gives? The predicted tags in the above recipe look accurate!

The problem is that expecting the model to be perfect to count something as successful is unreasonable. Since each recipe has several tags, it would be nearly impossible for the model to predict all of them correctly. So, I explored a different approach to seeing how good the model can be when it doesn’t need to be perfect.

One idea I had was to try for a percentage of tags, but this is still a bad metric because it doesn’t account for the fact that maybe the model gets the first 3 tags right and the next 4 wrong. This is below 50%, but going 3/3 would be amazing!

Results

In a real world scenario, where these predictions might be presented to the user in a UI, the most important factor is that a good enough number of the predictions are correct. Then they can simply tap to confirm or select the predictions.

So if we present 6 predictions and the user clicks 4 of them (or 2/3 or 3/5, etc), that’s good because they saved time by not manually typing or selecting a long list of tags (and skimming a short list doesn’t cost that much time).

Looking across all tags, here’s results from some combinations (though this may change depending on tf-idf features, pca components, number of iterations, etc):

1/3:

Accuracy: 0.926
Recall: 0.136
Precision: 0.669

2/2:

Accuracy: 0.550
Recall: 0.097
Precision: 0.712

2/3:

Accuracy: 0.715
Recall: 0.136
Precision: 0.669

2/6:

Accuracy: 0.888
Recall: 0.227
Precision: 0.571

3/5:

Accuracy: 0.645
Recall: 0.200
Precision: 0.599

One limitation to point out is this model does not account for the fact that a recipe tagged “30-minutes-or-less” is technically also “60-minutes-or-less”, so the model may predict both when the actual tag data only has one. Although this doesn’t help with the status, in practice it’s not a big deal because the user can still select what they prefer and it’s still correct nonetheless. It also means the stats make the model look worse than it is because it’s predicting tags that are correct, but the scraped data has it wrong.

The final model shows significant improvement in precision, particularly for cuisine-specific and subjective tags, likely due to incorporating text features alongside numerical characteristics. It hovers around 50-70% in most of these combinations, which is a significant improvement over the baseline model and is likely good enough for practical use.

It’s also highly accurate overall for numerical tags like ‘60-minutes-or-less’ and ‘low-in-something’.

Altogether, the final model is clearly better than the (admittedly terrible) baseline model, and next steps would be reducing the training time (it’s currently around 15 minutes with pretty convervative settings) and seeing how the model predicts more tags that are less common.