
Quickly building training datasets for NLP applications

Author
Ander Steele, PhD
September 5, 2024

The most critical component of any NLP classification task is high-quality, labeled training data. It is also the most expensive: generating high-quality labels requires thoughtful design of annotation guidelines, training and managing a workforce of human annotators (often across borders and time zones), and careful review and monitoring of annotations. All of this takes time and money. But with today’s powerful zero-shot models, it’s possible to dramatically speed up the laborious process of sourcing and labeling data.

Sourcing training data

Depending on the nature of your problem, you may be starting without any data, or you may have far too much data to reasonably annotate. In the former case, one option is to source data for annotation by crawling public sites or using existing archives of crawl data, such as Common Crawl. But that lands you in the latter situation: how do you subset down to a reasonably sized training dataset?
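
Before worrying about subsetting, you need the raw text itself. Here is a minimal sketch of pulling plain-text records out of a single Common Crawl WET segment with the warcio library; the file path below is a placeholder, and in practice you would first download a segment listed in the Common Crawl index.

```python
# pip install warcio
from warcio.archiveiterator import ArchiveIterator

# Placeholder path: download any WET segment listed at https://commoncrawl.org/get-started
WET_PATH = "CC-MAIN-example.warc.wet.gz"

def iter_wet_texts(path):
    """Yield (url, plain_text) pairs from a Common Crawl WET file."""
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # WET files store the extracted page text as "conversion" records
            if record.rec_type == "conversion":
                url = record.rec_headers.get_header("WARC-Target-URI")
                text = record.content_stream().read().decode("utf-8", errors="replace")
                yield url, text
```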

One quick way to find relevant examples for annotation is to use generalist models to sample the data. For example, if you were trying to build an NER model that identifies recipe ingredients, you could use GLiNER to look for text samples in Common Crawl where the model predicts ingredient spans.
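
As a rough sketch of that sampling step (the GLiNER checkpoint name and the confidence threshold below are arbitrary example choices, not a recommendation), you could keep only the documents in which the model predicts at least one ingredient span:

```python
# pip install gliner
from gliner import GLiNER

# Any public GLiNER checkpoint works here; this one is just an example
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

def sample_candidates(texts, label="ingredient", threshold=0.5):
    """Keep texts where GLiNER predicts at least one span for the given label."""
    candidates = []
    for text in texts:
        spans = model.predict_entities(text, [label], threshold=threshold)
        if spans:
            candidates.append({"text": text, "spans": spans})
    return candidates

docs = ["Whisk 2 cups of flour with a pinch of salt.", "Our store opens at 9am."]
print(sample_candidates(docs))  # only the first document should survive
```

Everything the model flags goes into a candidate pool for review; the rest of the crawl, typically the vast majority of it, is discarded.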

Of course, the model’s predictions are likely not accurate enough for your use case, but now you have a small sample for review. If the predictions do turn out to be accurate enough, you may wish to simply use the model for inference. If, on the other hand, your application demands higher accuracy, you’ll want to refine these weak labels.

Fig 1: An example of Common Crawl data and GLiNER ingredient predictions

Reviewing annotations

Nothing can replace careful human review, but it may be possible to speed this review up by using powerful LLMs like GPT-4o, Claude 3.5 Sonnet, or Llama 3.1 405B. For example, we may notice that our sample of ingredient data has many false positives.

Fig 2: While technically ingredients, these examples don’t conform to our intent

By prompting a powerful LLM to adjudicate each identified span as correct or incorrect with respect to our annotation guidelines, we can quickly and cheaply refine our machine-generated labels. Inference costs for these LLMs are low enough that you can review a batch of 1,000 annotations for a few dollars. At this point, having discarded some of the false positives, we may have enough data to fine-tune a BERT model for detecting recipe ingredients.
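
Here is a minimal sketch of that adjudication step using the OpenAI Python client; the guideline text and prompt are illustrative placeholders rather than a production prompt, and any sufficiently capable model could stand in for gpt-4o.

```python
# pip install openai   (assumes OPENAI_API_KEY is set in the environment)
from openai import OpenAI

client = OpenAI()

# Illustrative guidelines; a real project would paste in its full annotation guide
GUIDELINES = (
    "An ingredient is a food item used in a recipe, e.g. 'flour' or 'olive oil'. "
    "Brand names, cookware, and nutrition facts are not ingredients."
)

def adjudicate(text: str, span: str) -> bool:
    """Ask the LLM whether a predicted ingredient span matches the guidelines."""
    prompt = (
        f"Annotation guidelines: {GUIDELINES}\n\n"
        f"Text: {text}\n"
        f"Predicted ingredient span: {span}\n\n"
        "Answer with exactly one word: CORRECT or INCORRECT."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("CORRECT")

keep = adjudicate("Whisk 2 cups of flour with a pinch of salt.", "flour")
```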

Human-labeled data

Of course, to know definitively whether the resulting model is good enough, we will need a test set, and for that purpose nothing can replace human-labeled data. But instead of annotating thousands of examples to both train and evaluate a model, we can start by annotating a few hundred samples to evaluate the models trained on machine-generated labels.

With good tooling and a little bit of coffee, a single annotator may be able to do this in a matter of hours. For example, if the average handle time per annotation is 3 minutes, then annotating a 100-example test set is an afternoon’s work, while annotating 1,000 examples for combined training and test sets takes more than a week.
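
Once that small human-labeled test set exists, scoring a model trained on machine-generated labels against it is straightforward. Here is a sketch using seqeval, assuming both the gold annotations and the model’s predictions have been converted to token-level BIO tags (the tag names are illustrative):

```python
# pip install seqeval
from seqeval.metrics import classification_report

# One list of BIO tags per test sentence: gold labels from the human annotator,
# predictions from the model trained on machine-generated labels.
y_true = [["O", "O", "O", "B-INGREDIENT", "O", "O", "O", "B-INGREDIENT"]]
y_pred = [["O", "O", "O", "B-INGREDIENT", "O", "O", "O", "O"]]

print(classification_report(y_true, y_pred))
```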

Conclusions

Moving fast doesn’t have to mean breaking the bank. Time and budgets are limited, but teams still need to build quickly and efficiently to keep pace with the rate of innovation in AI. We faced this very problem when building Tonic Textual: either wait 6 to 12 months and spend hundreds of thousands of dollars preparing human-annotated training data, or find a faster, cheaper way to train our models that wouldn’t cause us to miss the market.

We’ve outlined a methodology for training targeted NLP models at a fraction of the time and cost typically required for data sourcing and annotation. While LLMs can be used to synthesize data, we’ve also had great success using small language models to sample huge (real) datasets and winnow them down to relevant subsets. From that sample, we use larger models to annotate the data and rate the annotations, then pick a smaller subset of labels for human review. This approach has drastically reduced the cost and long cycle times of engaging a team of human annotators to label all of our training data.

You can see the output of this work for yourself in Tonic Textual’s NER models, which detect entities in your text data for the purposes of redacting sensitive information or enriching your vector database for RAG. Sign up for a free Textual account today.
