Harnessing the Power of Hugging Face and AWS for Scalable Image Classification

Published in

EatCodePlay

6 min read · 3 days ago

At RealScout, we recently embarked on a journey to revamp our image classification solution from the ground up. In this article, we share the technologies, infrastructure, and key learnings that guided our new approach. We also touch on how this led to significantly improved performance and scalability.

The Need for Change

Our previous image classification system had served us well for many years. As our company evolves, we tend to look for opportunities to reduce data ingestion times across our infrastructure. Meanwhile, the AWS EC2 instances we relied on were nearing their end of life, requiring more frequent upgrades. Given these challenges, our image classification system became a prime candidate for a revamp, enabling us to achieve performance gains while taking advantage of recent advancements in computer vision.

Designing a New System

To achieve our goals, we opted to rebuild our solution using modern, developer-friendly technologies. We chose a self-managed approach over a fully managed service to retain more flexibility over our application settings and costs. During the design phase, we chose two resources that greatly accelerated our implementation: the Hugging Face platform and AWS’s Deep Learning AMI (Amazon Machine Image).

Hugging Face is a machine learning platform that allows users to find and share pre-trained models and datasets. This open source community includes hundreds of thousands of models for natural language processing, computer vision, audio, and more. Being able to fine-tune the available models with minimal data was a game-changer. Overall, we found the developer experience, particularly in the realm of computer vision, to be fantastic.

The AWS Deep Learning AMI was another key component in our new design. It provided intuitive, out-of-the-box tooling for our image classification needs. This pre-configured environment contains popular deep learning frameworks, tools, and libraries, which reduced the time spent setting up infrastructure and allowed us to focus on model training and optimization right away.

Together, these resources helped streamline our development process. However, a significant portion of our effort went into building a scalable and reliable system. This involved setting up supporting technologies, such as queueing systems (SQS/Celery), autoscaling policies, and daemonizing the application. The next section covers the infrastructure and tooling that made this possible.

Infrastructure and Tooling

AWS:

G5g.* instances with NVIDIA GPUs: These instances provide the GPU compute that image classification tasks require. The cost-effective Graviton processors gave us the largest cost reduction in our switchover (compared to the old Intel-based processors of our legacy system). Additionally, by leveraging spot instances instead of on-demand instances, we were able to further reduce costs without sacrificing performance.
Autoscaling Policies: We used autoscaling to dynamically adjust server capacity based on workload and internal SLAs. With predictable peak usage, we optimized resource allocation by monitoring workloads over a couple of weeks and refining the configuration. During peak hours, we scale out aggressively by adding multiple instances, while scaling in conservatively when the oldest message age drops below a defined threshold. This approach has proven effective, even during unexpected spikes in workload.

Other AWS services we leveraged included CloudFormation for infrastructure as code (IaC), CloudWatch for monitoring and alerts, and SQS as our message broker.
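
The scale-out and scale-in behavior described in the autoscaling item above can be sketched as a small decision function. This is an illustration only: the thresholds, step sizes, instance limits, and the name `desired_capacity` are hypothetical, and in a real deployment this logic would live in autoscaling policies driven by a CloudWatch metric (such as the age of the oldest SQS message) rather than in application code.

```python
def desired_capacity(
    current: int,
    oldest_message_age_s: float,
    scale_out_age_s: float = 300.0,  # hypothetical: backlog is too old above this
    scale_in_age_s: float = 60.0,    # hypothetical: queue is healthy below this
    scale_out_step: int = 3,         # add several instances at once (aggressive)
    scale_in_step: int = 1,          # remove one instance at a time (conservative)
    min_instances: int = 1,          # hypothetical fleet limits
    max_instances: int = 20,
) -> int:
    """Return the instance count to aim for, given the age of the oldest queued message."""
    if oldest_message_age_s >= scale_out_age_s:
        target = current + scale_out_step
    elif oldest_message_age_s < scale_in_age_s:
        target = current - scale_in_step
    else:
        # Between the two thresholds: hold steady to avoid flapping
        target = current
    # Never leave the allowed fleet size
    return max(min_instances, min(max_instances, target))
```

The gap between the two thresholds is a deliberate design choice in this sketch: it keeps the fleet from oscillating when the queue age hovers around a single cut-off.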

SysAdmin and Supporting Software:

Celery: Utilized for task queueing and management in conjunction with SQS. We configured 4 workers for each EC2 instance.
systemd: Runs the application as a daemon, which is crucial for managing background processes effectively.
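
A Celery setup along these lines could be expressed as a settings module. This is a minimal sketch, not our production configuration: the region, queue name, and timeout values are placeholders, and "four workers" is interpreted here as four worker processes per instance. Only the SQS broker and the worker count come from the description above.

```python
# celeryconfig.py: a hypothetical settings module, loadable with
# app.config_from_object("celeryconfig")

# Use Amazon SQS as the message broker. With no credentials in the URL,
# they are resolved from the environment or the instance role.
broker_url = "sqs://"

broker_transport_options = {
    "region": "us-west-2",      # placeholder region
    "visibility_timeout": 600,  # placeholder: seconds before an unacknowledged message is redelivered
    "polling_interval": 1,      # placeholder: seconds between polls of the queue
}

# Placeholder queue name
task_default_queue = "image-classification"

# Four worker processes per EC2 instance, as described above
worker_concurrency = 4

# Each worker process reserves one message at a time, so the remaining
# backlog stays in the queue for other instances to pick up
worker_prefetch_multiplier = 1
```

A systemd unit would then be responsible for starting the Celery worker at boot and restarting it if it exits.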

Curating Real Estate Photos

To fine-tune the image classification model, we first defined the set of labels that the model would recognize in real estate images — bedrooms, bathrooms, pools, kitchens, etc. A significant portion of our time was dedicated to curating a diverse and representative dataset for each label.

We iterated frequently during the fine-tuning stage to ensure the training set had a balanced distribution of images. We took care to include photos with varying lighting, angles, and interior designs, as well as a range of property types (e.g., apartments, single-family homes). We also introduced a “miscellaneous” label to account for images we knew would be encountered in production but were not relevant to the classification requirements (e.g., neighborhood signs and blueprints).
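
One simple way to keep an eye on label balance during curation is to count the images per label and flag the labels that lag behind. The sketch below assumes a hypothetical directory layout of `<dataset_root>/<label>/<image file>`; the helper names, the accepted file extensions, and the 50% threshold are illustrative assumptions rather than part of our actual tooling.

```python
from collections import Counter
from pathlib import Path
from typing import List

# Assumed set of image file extensions
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def label_counts(dataset_root: str) -> Counter:
    """Count image files per label, assuming a <root>/<label>/<image> layout."""
    counts = Counter()
    for label_dir in Path(dataset_root).iterdir():
        if label_dir.is_dir():
            counts[label_dir.name] = sum(
                1 for f in label_dir.iterdir() if f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def underrepresented(counts: Counter, min_share: float = 0.5) -> List[str]:
    """Return labels with fewer than min_share times the largest label's image count."""
    if not counts:
        return []
    largest = max(counts.values())
    return sorted(label for label, n in counts.items() if n < min_share * largest)
```

Running `underrepresented(label_counts("dataset"))` after each curation pass would list the labels that still need more photos.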

Fine-Tuning the Model

A key component of the fine-tuning code was Hugging Face’s Trainer, which streamlines the process of building and fine-tuning machine learning models. The Trainer class, in conjunction with the TrainingArguments class, sets up the model and its configuration, the datasets for training and evaluation, and other settings such as checkpoints for saving models. Below is a summary of the key code components involved in the fine-tuning process.

1. Define Training Arguments: Customizing how the model is trained, including hyperparameters, output directories, and evaluation strategies.

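from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",           # where checkpoints are written
    evaluation_strategy="steps",   # evaluate every N steps (named `eval_strategy` in newer transformers releases)
    save_total_limit=2,            # limit the number of checkpoints kept on disk
    logging_dir="logs",
    load_best_model_at_end=True    # reload the best checkpoint once training finishes
)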

2. Define the Trainer: The Trainer requires several parameters, including the model, training arguments, datasets, and custom functions for metrics and data processing.

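from transformers import Trainer

trainer = Trainer(
    model=model,                  # the pre-trained model being fine-tuned
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_fn,   # turns evaluation predictions into metrics
    data_collator=collate_fn,     # assembles individual examples into a batch
    tokenizer=processor           # image processor saved with the model (`processing_class` in newer transformers releases)
)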
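
The `compute_fn` passed to the Trainer is not shown in this article. As a rough idea of what such a function can look like, here is a minimal sketch that reports top-1 accuracy. It relies on the fact that the Trainer hands `compute_metrics` an object that unpacks into predictions (logits) and label ids; choosing accuracy as the metric is an assumption, and the body is written in plain Python to stay self-contained where a real implementation would more likely use NumPy.

```python
def compute_fn(eval_pred):
    """Compute top-1 accuracy from (logits, labels).

    Illustrative sketch: accuracy is an assumed choice of metric.
    """
    logits, labels = eval_pred
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        # Index of the highest-scoring class for this example
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == label)
        total += 1
    return {"accuracy": correct / total if total else 0.0}
```

The returned dictionary keys become the metric names reported during evaluation.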

3. Training the Model: Once everything is set up, the model is trained with a simple call to train().

trainer.train()

Real-Time Classification: Using the Fine-Tuned Model

Just as the Trainer was the core component in our model fine-tuning code, the Pipeline plays a central role in our classification process. The Pipeline is a powerful abstraction provided by Hugging Face’s Transformers library, designed to simplify the tasks associated with using a model. Hugging Face offers a range of pipelines for tasks such as text classification, sentiment analysis, and more. For our application, we leverage the Image Classification Pipeline. In just a few lines of code, we can easily load a model, classify the image(s), and retrieve the results. Below is a simplified code snippet.

from transformers import pipeline

# Load an image classification pipeline from the fine-tuned model
pipe = pipeline("image-classification", model=path_to_fine_tuned_model)

def analyze(image):
    # Classify the image (or list of images); each result is a list of
    # {'label': ..., 'score': ...} entries, highest score first
    prediction = pipe(image)

    return prediction
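
The article's unit test also calls an `analyze_batch` function that the simplified snippet leaves out. The sketch below shows one way such a helper could be built. It is an assumption, not our actual module: `make_batch_analyzer` and `top_labels` are hypothetical names, and the classifier is passed in as a callable so the logic can be exercised without loading a model.

```python
from typing import Any, Callable, Dict, List, Sequence

# One image's results: [{"label": ..., "score": ...}, ...]
Prediction = List[Dict[str, Any]]


def make_batch_analyzer(
    classifier: Callable[[List[str]], List[Prediction]],
) -> Callable[[Sequence[str]], List[Prediction]]:
    """Wrap a classifier callable (e.g. an image classification pipeline)
    in a function that classifies a batch of image paths."""

    def analyze_batch(image_paths: Sequence[str]) -> List[Prediction]:
        if not image_paths:
            return []  # nothing to classify; skip the model call
        return classifier(list(image_paths))

    return analyze_batch


def top_labels(predictions: List[Prediction]) -> List[str]:
    """Return the highest-scoring label for each image."""
    return [max(p, key=lambda r: r["score"])["label"] for p in predictions]
```

With the pipeline from the snippet above, `analyze_batch = make_batch_analyzer(pipe)` would give a function with the call shape used in the test.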

Here is a unit test that demonstrates the calling code, using a few photos saved locally and their expected labels:

import os
import unittest

import image_classifier as ic

# Directory holding the sample photos used by these tests
IMAGES_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'images')


def image_path(filename):
    """Return the absolute path of a sample photo."""
    return os.path.join(IMAGES_DIR, filename)


class TestClassifier(unittest.TestCase):
    def test_classify_batch_photos(self):
        names = ['bathroom.jpeg', 'bedroom.jpeg', 'dining_room.jpeg', 'kitchen.jpeg']
        classifications = ic.analyze_batch([image_path(name) for name in names])

        # One result list per photo, in input order, highest scoring label first
        self.assertEqual(classifications[0][0]['label'], 'bathroom')
        self.assertEqual(classifications[1][0]['label'], 'bedroom')
        self.assertEqual(classifications[2][0]['label'], 'dining_room')
        self.assertEqual(classifications[3][0]['label'], 'kitchen')

    def test_classify_single_photo(self):
        # Note that an array of labels is returned sorted (with highest scoring label at index 0)
        classification = ic.analyze(image_path('bathroom.jpeg'))[0]['label']
        self.assertEqual(classification, 'bathroom')


if __name__ == '__main__':
    unittest.main()

Image Classification with Fine-Tuned Model

Performance

By leveraging G5g.* instances and optimizing concurrent Celery workers for each instance, we achieved a 3.5x increase in throughput. This improvement was made possible at a fraction of our previous costs, thanks to the cost-effective Graviton processors and our reduced compute needs. Additionally, our autoscaling policy enabled us to efficiently scale with fluctuating traffic, ensuring consistent throughput.

In Summary

This new infrastructure and approach have enabled us to scale our image classification system more efficiently while improving accuracy. Fine-tuning a pre-trained model allowed us to build a robust system in less time than training a model from scratch would have required. Lastly, the Deep Learning AMI simplified our infrastructure setup.

This article covers the core components of our revamped image classification system. While we’ve focused on the key aspects, there are many additional details involved in implementing such a system. If you have any questions or would like to dive deeper into any part of this process, feel free to reach out!

Lastly, I’d like to acknowledge my colleague, Anthony Sosso, for his invaluable contributions to both this project and article!


Written by Christine Betadam
