Ingredient parsing
Online recipes can be messy, and so can their ingredient lists, which come in many flavors: "1/4 cup cooked shredded chicken", "1/3 cup chopped fresh cilantro" or "Fresh avocado cubes, for garnish, if desired". Yet all we really want to extract from these is chicken, cilantro and avocado. For humans, parsing these ingredient lists and figuring out what to buy at the grocery store is trivial. For a computer, not so much.
Natural Language Processing (NLP) allows computers to extract data from human language. I used NLP and, in particular, the Python NLTK package (Natural Language Toolkit), to split phrases into words and assign each word to a word class (noun, verb, adjective, ...). This process is known as part-of-speech tagging, or POS tagging.
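To make the tokenize-then-tag step concrete, here is a minimal sketch. It is not the code used for this project: `simple_tokenize` and `tag_phrase` are hypothetical helper names, and the regex tokenizer is only a stand-in that keeps fractions such as "1/4" together. When no tokenizer or tagger is passed in, the sketch falls back to NLTK's `word_tokenize` and `pos_tag`, which assumes NLTK is installed and its tokenizer and tagger data have been downloaded.

```python
import re


def simple_tokenize(phrase):
    """Stand-in tokenizer (an assumption, not NLTK's): fractions stay whole."""
    return re.findall(r"\d+/\d+|\w+|[^\w\s]", phrase)


def tag_phrase(phrase, tokenize=None, pos_tag=None):
    """Return [(word, pos_tag), ...] for one ingredient phrase.

    By default this uses NLTK, which requires its tokenizer and tagger
    resources to have been fetched with nltk.download() beforehand.
    """
    if tokenize is None or pos_tag is None:
        import nltk  # imported lazily so the helpers above work without it

        tokenize = tokenize or nltk.word_tokenize
        pos_tag = pos_tag or nltk.pos_tag
    return pos_tag(tokenize(phrase))


if __name__ == "__main__":
    # Tokenization alone needs no NLTK data.
    print(simple_tokenize("1/4 cup cooked shredded chicken"))
    # With NLTK and its data available, this prints (word, tag) pairs.
    # print(tag_phrase("1/3 cup chopped fresh cilantro"))
```

Passing the tokenizer and tagger as arguments keeps the sketch easy to try with a different tagger, or with a stub when NLTK's data is not available.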
Using the words themselves, their POS tags and various combinations of consecutive words and tags as features, we can build a Machine Learning model. Such a model can then predict the role of each word in a phrase: is it an ingredient name, a quantity, a unit of measurement or some other word that can be ignored for this project? To achieve this, I used a model called a Conditional Random Field, or CRF (often used for labeling or parsing sequential data, such as natural language text or biological sequences), and the Python python-crfsuite package.
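As an illustration of what such features can look like, the sketch below turns a POS-tagged phrase into one feature dictionary per word, combining the word, its tag and its neighbors. The feature names, the helper names (`token_features`, `phrase_features`, `train_crf`), the model file name and the training parameters are all assumptions chosen for the example, not the ones used in this project. The training helper relies on python-crfsuite's `Trainer` and `Tagger` classes.

```python
def token_features(tagged, i):
    """Features for word i of a POS-tagged phrase [(word, tag), ...]."""
    word, tag = tagged[i]
    feats = {
        "bias": 1.0,
        "word.lower": word.lower(),
        # True for "2" and for simple fractions such as "1/4"
        "word.isdigit": word.replace("/", "").isdigit(),
        "pos": tag,
    }
    if i > 0:
        prev_word, prev_tag = tagged[i - 1]
        feats["-1:word.lower"] = prev_word.lower()
        feats["-1:pos"] = prev_tag
        feats["-1:pos|pos"] = prev_tag + "|" + tag  # pair of consecutive tags
    else:
        feats["BOS"] = True  # beginning of the phrase
    if i < len(tagged) - 1:
        next_word, next_tag = tagged[i + 1]
        feats["+1:word.lower"] = next_word.lower()
        feats["+1:pos"] = next_tag
        feats["pos|+1:pos"] = tag + "|" + next_tag
    else:
        feats["EOS"] = True  # end of the phrase
    return feats


def phrase_features(tagged):
    """One feature dict per word, in phrase order."""
    return [token_features(tagged, i) for i in range(len(tagged))]


def train_crf(X, y, model_path="ingredients.crfsuite"):
    """Train a CRF on feature sequences X and label sequences y.

    Parameter values here are illustrative placeholders.
    """
    import pycrfsuite  # provided by the python-crfsuite package

    trainer = pycrfsuite.Trainer(verbose=False)
    for xseq, yseq in zip(X, y):
        trainer.append(xseq, yseq)
    trainer.set_params({"c1": 0.1, "c2": 0.01, "max_iterations": 100})
    trainer.train(model_path)

    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return tagger
```

Once trained, `tagger.tag(phrase_features(tagged))` would return one predicted label per word of the phrase.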
For training the model, I used 1000 recipes for which I semi-manually tagged every word of every ingredient phrase as QTY (quantity), UNIT (unit of measurement), NAME (ingredient name) or COM (comment). I split this data into 80% training and 20% testing data. After training the model, the precision of ingredient name prediction on the testing data was about 94%.
I then wrote a small Flask API that returns the model's prediction for an ingredient phrase. You can test it out here using the following form. The result will appear underneath the form and will be color-coded by category: green for quantities, blue for units and red for ingredient names (the rest is classified as comments).
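For readers curious how such an endpoint might be wired up, here is a rough sketch. The route name `/parse`, the JSON field names, the `create_app` factory and the `predict` callable (assumed to return a list of words and a matching list of labels) are all hypothetical. Only the color scheme comes from the description above.

```python
from html import escape

# Color per label, as described above; comments (COM) are left unstyled.
COLORS = {"QTY": "green", "UNIT": "blue", "NAME": "red"}


def render_html(tokens, labels):
    """Wrap each word in a colored <span> according to its label."""
    parts = []
    for token, label in zip(tokens, labels):
        text = escape(token)
        color = COLORS.get(label)
        if color:
            parts.append(f'<span style="color:{color}">{text}</span>')
        else:
            parts.append(text)
    return " ".join(parts)


def create_app(predict):
    """Build a small Flask app around a predict(phrase) callable.

    predict is assumed to return (tokens, labels) for the phrase.
    """
    from flask import Flask, jsonify, request  # requires Flask

    app = Flask(__name__)

    @app.route("/parse", methods=["POST"])
    def parse():
        payload = request.get_json(force=True) or {}
        tokens, labels = predict(payload.get("phrase", ""))
        return jsonify(
            tokens=tokens,
            labels=labels,
            html=render_html(tokens, labels),
        )

    return app
```

A client would POST a JSON body such as `{"phrase": "1/4 cup cooked shredded chicken"}` to the endpoint and display the returned `html` field under the form.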