Image Recognition using Amazon Rekognition

Building a machine learning model is an inherently complex process. This complexity increases substantially when delving into the domain of image recognition. Amazon Rekognition, an object detection service offered by Amazon Web Services (AWS), provides a user-friendly platform for model training, allowing for the refinement of models through the straightforward upload of image datasets. Rekognition provides access to a repertoire of pre-trained models tailored to diverse tasks, encompassing label detection, image moderation, and facial analysis.

Challenge – Building a Fashion Image Recognition Model

The challenge addressed in this paper is the development of a customized fashion image recognition model. Rekognition in its current state does not offer a pre-trained model that meets the desired performance standards. 

Two approaches were initially considered for creating the labeled image set used to train the model:

  1. Manually uploading and labeling clothing images
  2. Leveraging Amazon SageMaker Ground Truth to outsource the labeling process to AWS

The first option would require labeling thousands of clothing images, which would be a time-consuming process. The second approach would streamline the labeling process to some extent, but it would still require considerable manual labeling effort, adding both time and cost.

Recognizing the need for a more time-effective solution, a third approach was considered, one that capitalizes on an existing dataset (DeepFashion2) designed for clothing recognition.

In this approach, a Python script was used to translate the dataset’s labeling system into the .manifest file format that Ground Truth requires. Additional detail on this process can be found in the ‘Creating a manifest file’ documentation.

DeepFashion2 Dataset Overview

The DeepFashion2 dataset was selected for this project as a comprehensive fashion image dataset. It contains 491,000 images showcasing 13 popular clothing categories, sourced from both commercial stores and individual consumers. In total, it encompasses a staggering 801,000 distinct clothing items.

What makes DeepFashion2 uniquely valuable is the detailed labeling it provides. Each clothing item within an image comes with various attributes labeled including scale, occlusion, zoom level, viewpoint, category, style, bounding box information, and precise landmarks. Additionally, the dataset includes 873,000 clothing pairs, bridging the gap between commercial and consumer fashion choices. This wealth of information makes DeepFashion2 an invaluable resource for fashion-related machine learning projects.

Data Set Division and Usage

The DeepFashion2 dataset is organized into three segments: a training set containing 391,000 images, a validation set comprising 34,000 images, and a test set with 67,000 images.

To accelerate the model development process, the smaller validation set was chosen for both training and testing purposes. The set was further subdivided into two segments: 80% allocated for training and 20% for testing. The image below offers a glimpse of the dataset in action:

Sample images from the DeepFashion2 dataset
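The 80/20 subdivision described above could be reproduced along the following lines. This is a minimal sketch, not the project's actual code: the article does not say how the split was drawn, so the seeded random shuffle, the function name split_train_test, and the sample IDs are all assumptions made for illustration.

```python
import random


def split_train_test(image_ids, train_fraction=0.8, seed=42):
    """Shuffle image IDs reproducibly and split them into train/test lists.

    Assumption: a seeded random split. The article does not state
    how its 80/20 split was actually produced.
    """
    ids = sorted(image_ids)           # sort so the result does not depend on input order
    random.Random(seed).shuffle(ids)  # local RNG leaves the global random state untouched
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


if __name__ == "__main__":
    # Hypothetical six-digit image IDs in the style used by DeepFashion2.
    sample_ids = [f"{n:06d}" for n in range(1, 101)]
    train_ids, test_ids = split_train_test(sample_ids)
    print(len(train_ids), "training images,", len(test_ids), "test images")
```

Fixing the seed keeps the two subsets stable between runs, which makes later accuracy figures comparable.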

Critical Dataset Information

The DeepFashion2 dataset is comprehensive, rich in details and annotations that support various modeling and prediction tasks. However, for the specific use case of clothing recognition with Amazon Rekognition, the primary focus centered on extracting two key elements: the bounding_box and the image category. This focus aided in speeding up the processing and making the model training less laborious.

The image categories are as follows: [long_sleeved_dress, long_sleeved_outwear, long_sleeved_shirt, short_sleeved_dress, short_sleeved_outwear, short_sleeved_shirt, shorts, skirt, sling, sling_dress, trousers, vest, vest_dress].

To improve the accuracy of the model and reduce redundancy in clothing types, ‘sling’ and ‘sling_dress’ were consolidated into ‘vest’ and ‘vest_dress’ respectively in the final iteration of the model. This consolidation reduced the 13 categories to 11, resulting in a more precise and efficient model.
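The category consolidation can be expressed as a small lookup applied to each label before training. The sketch below is illustrative only; the names MERGED_CATEGORIES and consolidate_category are hypothetical and do not come from the project's code.

```python
# The 13 category names as listed in this article.
CATEGORIES = [
    "long_sleeved_dress", "long_sleeved_outwear", "long_sleeved_shirt",
    "short_sleeved_dress", "short_sleeved_outwear", "short_sleeved_shirt",
    "shorts", "skirt", "sling", "sling_dress", "trousers", "vest", "vest_dress",
]

# Merge map described in the text: sling -> vest, sling_dress -> vest_dress.
MERGED_CATEGORIES = {"sling": "vest", "sling_dress": "vest_dress"}


def consolidate_category(name):
    """Return the merged category name, or the name itself if it is not merged."""
    return MERGED_CATEGORIES.get(name, name)


# Distinct categories that remain after consolidation.
final_categories = sorted({consolidate_category(name) for name in CATEGORIES})

if __name__ == "__main__":
    print(len(final_categories), "categories:", final_categories)
```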

DeepFashion2 dataset information

Annotation File Structure

Each image within an image set is uniquely identified by a six-digit identifier, such as 000001.jpg. Each image is accompanied by an annotation file in JSON format with the same name, such as 000001.json.

The structure of each annotation file is as follows:

source: A string indicating whether the image is sourced from a commercial store (‘shop’) or user-captured (‘user’).

pair_id: A numeric value. Images originating from the same shop, and their corresponding consumer-taken images, share the same pair_id.

Item 1:

  • category_name: A string specifying the item’s category.
  • category_id: A number corresponding to the category name, where each number represents a specific clothing category.
  • style: A number used to differentiate between clothing items within images sharing the same pair_id. Items with different style numbers in images that share a pair_id differ in attributes such as color, printing, and logo. A shop item and a consumer item form a positive commercial-consumer pair if they come from images with the same pair_id and have the same style number, greater than 0.
  • bounding_box: Expressed as [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner of the bounding box.
  • landmarks: Expressed as [x1, y1, v1, …, xn, yn, vn], where ‘v’ denotes visibility: v=2 (visible); v=1 (occluded); v=0 (not labeled). Different categories have different landmark definitions. The landmark annotation sequences are illustrated in figure 2.
  • segmentation: Given as [[x1, y1, …, xn, yn], […]], where each inner list [x1, y1, …, xn, yn] outlines one polygon for a clothing item. A single clothing item may comprise multiple polygons.
  • scale: A numeric value representing the scale (1 = small, 2 = modest, 3 = large).
  • occlusion: A numeric value indicating the degree of occlusion (1 = slight/none, 2 = medium, 3 = heavy).
  • zoom_in: A numeric value signifying the level of zoom (1 = none, 2 = medium, 3 = significant).
  • viewpoint: A numeric value describing the viewpoint (1 = not worn, 2 = frontal, 3 = side/rear).

Following the initial item (Item 1), subsequent items within the annotation file follow a similar structure, denoted as Item 2, Item 3, and so forth until the last item, denoted as Item n.

‘pair_id’ and ‘source’ serve as image-level labels. Every clothing item within an image inherits the same ‘pair_id’ and ‘source’. This consistent labeling schema simplifies the organization and association of clothing items within each image.

An annotation file with this structure was utilized to generate the .manifest file using a Python script.
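To make the annotation structure concrete, the sketch below pulls out the two fields this project relies on, the category and the bounding box, from a single annotation. It is an illustration under stated assumptions rather than the script used in the project: the function name extract_items and the sample values are made up, and every key other than ‘source’ and ‘pair_id’ is assumed to hold one clothing item.

```python
import json

# Image-level keys; every other key is assumed to describe one clothing item.
IMAGE_LEVEL_KEYS = {"source", "pair_id"}


def extract_items(annotation):
    """Return a (category_name, bounding_box) tuple for each clothing item."""
    items = []
    for key, value in annotation.items():
        if key in IMAGE_LEVEL_KEYS:
            continue
        items.append((value["category_name"], value["bounding_box"]))
    return items


if __name__ == "__main__":
    # Made-up annotation that follows the structure described above.
    sample = json.loads("""
    {
      "source": "user",
      "pair_id": 1,
      "item1": {"category_name": "trousers", "category_id": 8,
                "bounding_box": [120, 300, 380, 760]},
      "item2": {"category_name": "vest", "category_id": 5,
                "bounding_box": [110, 90, 390, 320]}
    }
    """)
    for category, box in extract_items(sample):
        print(category, box)
```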

Dataset Processing

Preparing the DeepFashion2 Dataset

Phase 1: Using the conversion script available in the DeepFashion2 GitHub repository, the DeepFashion2 annotations were converted into COCO annotations. COCO annotations consist of a list of objects that provide details about entities present in an image. Each object contains information such as the class label, bounding box coordinates, and segmentation mask. More details can be found in the COCO format – Rekognition documentation.
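The core of this first phase is a change of bounding-box convention: DeepFashion2 stores corner coordinates [x1, y1, x2, y2], while COCO stores [x, y, width, height]. The following sketch shows that conversion for a single item. It is not the repository's script; the function name to_coco_annotation and the sample values are hypothetical, and fields not needed for bounding-box training (landmarks, segmentation) are left out.

```python
def to_coco_annotation(item, image_id, annotation_id):
    """Build a COCO-style annotation dict from one DeepFashion2 item."""
    x1, y1, x2, y2 = item["bounding_box"]
    width, height = x2 - x1, y2 - y1
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": item["category_id"],
        "bbox": [x1, y1, width, height],  # COCO convention: top-left corner plus size
        "area": width * height,
        "iscrowd": 0,
    }


if __name__ == "__main__":
    # Made-up item for illustration.
    item = {"category_id": 8, "bounding_box": [120, 300, 380, 760]}
    print(to_coco_annotation(item, image_id=1, annotation_id=1))
```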

Phase 2: In the second data processing phase, a script provided by AWS was used to transform the JSON file containing COCO annotations into a manifest file. More information on transforming a COCO dataset is available in the Amazon Rekognition Custom Labels documentation.

This method can be applied to nearly any image recognition dataset: once the annotations are translated into the COCO format, a range of label recognition tasks can be set up without manual image labeling. The complete code and image set used for this process are available in the following GitHub repository.
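For orientation, a manifest file is a JSON Lines file in which each line describes one image and its bounding boxes. The sketch below builds one such line from COCO-style data. It is a simplified illustration, not the AWS-provided script: the function name, the ‘images’ prefix, the job name, and the assumption of three-channel images (depth 3) are all placeholders chosen for this example.

```python
import json
from datetime import datetime, timezone


def coco_to_manifest_line(image, annotations, class_map, bucket, prefix="images"):
    """Return one Ground Truth style manifest line (a JSON string) for one image."""
    boxes = [
        {
            "class_id": ann["category_id"],
            "left": ann["bbox"][0],
            "top": ann["bbox"][1],
            "width": ann["bbox"][2],
            "height": ann["bbox"][3],
        }
        for ann in annotations
    ]
    entry = {
        "source-ref": f"s3://{bucket}/{prefix}/{image['file_name']}",
        "bounding-box": {
            # Assumption: RGB images, hence depth 3.
            "image_size": [{"width": image["width"], "height": image["height"], "depth": 3}],
            "annotations": boxes,
        },
        "bounding-box-metadata": {
            "objects": [{"confidence": 1} for _ in boxes],
            "class-map": {str(class_id): name for class_id, name in class_map.items()},
            "type": "groundtruth/object-detection",
            "human-annotated": "yes",
            "creation-date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S"),
            "job-name": "deepfashion2-import",  # hypothetical job name
        },
    }
    return json.dumps(entry)


if __name__ == "__main__":
    image = {"file_name": "000001.jpg", "width": 640, "height": 960}
    annotations = [{"category_id": 8, "bbox": [120, 300, 260, 460]}]
    print(coco_to_manifest_line(image, annotations, {8: "trousers"}, "MY_BUCKET"))
```

Writing one such line per image, each terminated by a newline, yields the complete manifest file.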

Amazon Rekognition Custom Labels

Amazon Rekognition excels in image recognition and classification tasks. However, it comes with certain limitations when using a pre-made model. The default service does not provide flexibility for fine-tuning with various hyperparameters or making modifications to enhance accuracy.

In this context, Rekognition Custom Labels was chosen. Custom Labels is a feature within Amazon Rekognition that allows users to develop their own label detection models by uploading a custom dataset such as DeepFashion2.

Note: Custom Labels covers custom label detection only. For other Amazon Rekognition functionality, such as face detection, image properties analysis, and text detection, users must rely on Rekognition’s pre-trained capabilities and cannot train entirely distinct models.


To establish the operational model, the process starts with uploading the image set and the manifest file into an S3 bucket. Subsequently, a new Rekognition Custom Labels project is created, which uses the manifest file as the dataset for model training.

Fashion image recognition pipeline architecture diagram

1. Get the data (Images + Annotations)

Clone the following GitHub repository.

2. Create the S3 bucket

Create the folders for organization.


3. Upload the images

Upload the images that will be used for training. There is no need to upload the annotation files.
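The upload can also be scripted. The sketch below assumes the boto3 S3 client and its upload_file method; the bucket name, the ‘images’ prefix, and the helper names are placeholders rather than values taken from the project's repository.

```python
from pathlib import PurePath


def s3_key_for(image_path, prefix="images"):
    """Map a local image path to the S3 key it is uploaded under."""
    return f"{prefix}/{PurePath(image_path).name}"


def upload_images(s3_client, bucket, image_paths, prefix="images"):
    """Upload each image with the given S3 client and return the upload count."""
    count = 0
    for path in image_paths:
        # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
        s3_client.upload_file(str(path), bucket, s3_key_for(path, prefix))
        count += 1
    return count


# Example usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   from pathlib import Path
#   images = sorted(Path("PATH_TO_LOCAL_IMAGES").glob("*.jpg"))
#   upload_images(boto3.client("s3"), "MY_BUCKET", images)
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object before pointing it at a real bucket.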

4. Execute the following Python scripts

  • Change the s3_bucket_name, local_path in
  • Run the Python script that creates the COCO file.
  • Run the Python script that creates the manifest file and uploads it to S3.

5. Create the Rekognition Custom Labels Project

Go to Rekognition Custom Labels and create a new project.

Amazon Rekognition Custom Labels setup

6. Create the Dataset

Select the “Import images labeled by SageMaker Ground Truth” option and provide the URI of the manifest file from the S3 bucket.
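Steps 5 and 6 can also be performed through the AWS SDK instead of the console. The sketch below uses the boto3 Rekognition client calls create_project and create_dataset; the project name, bucket, and manifest key are placeholders, and the helper names are hypothetical.

```python
def build_dataset_request(project_arn, bucket, manifest_key, dataset_type="TRAIN"):
    """Build the create_dataset arguments that point at a manifest file in S3."""
    return {
        "ProjectArn": project_arn,
        "DatasetType": dataset_type,
        "DatasetSource": {
            "GroundTruthManifest": {
                "S3Object": {"Bucket": bucket, "Name": manifest_key}
            }
        },
    }


def create_project_and_dataset(rekognition_client, project_name, bucket, manifest_key):
    """Create a Custom Labels project and a training dataset from the manifest."""
    project_arn = rekognition_client.create_project(ProjectName=project_name)["ProjectArn"]
    response = rekognition_client.create_dataset(
        **build_dataset_request(project_arn, bucket, manifest_key)
    )
    return project_arn, response["DatasetArn"]


# Example usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   client = boto3.client("rekognition", region_name="us-west-2")
#   create_project_and_dataset(client, "fashion-recognition", "MY_BUCKET",
#                              "PATH_TO_MY_MANIFEST")
```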


7. Train the model

Verify the dataset images to ensure the correct number was uploaded. Once verified, start the model training process; be aware that training on the full dataset can take close to 60 hours.


8. Use the model

Once training is complete, the model can be evaluated and used through the AWS SDK or CLI. In this example, the CLI is used to start the model, run inference on an image stored in S3, and stop the model again:

aws rekognition start-project-version \
  --project-version-arn "MODEL_ARN" \
  --min-inference-units 1 \
  --region us-west-2

aws rekognition detect-custom-labels \
  --project-version-arn "MODEL_ARN" \
  --image '{"S3Object": {"Bucket": "MY_BUCKET", "Name": "PATH_TO_MY_IMAGE"}}' \
  --region us-west-2

aws rekognition stop-project-version \
  --project-version-arn "MODEL_ARN" \
  --region us-west-2
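The same inference call is available through the AWS SDK. The sketch below uses the boto3 Rekognition client's detect_custom_labels method and assumes the model has already been started; the helper name detect_clothing and the default confidence threshold of 50 are illustrative choices, not settings taken from the project.

```python
def detect_clothing(rekognition_client, model_arn, bucket, image_key, min_confidence=50.0):
    """Run the trained model on one S3 image and return (label, confidence) pairs."""
    response = rekognition_client.detect_custom_labels(
        ProjectVersionArn=model_arn,
        Image={"S3Object": {"Bucket": bucket, "Name": image_key}},
        MinConfidence=min_confidence,
    )
    return [(label["Name"], label["Confidence"]) for label in response["CustomLabels"]]


# Example usage (requires boto3, AWS credentials, and a running model):
#   import boto3
#   client = boto3.client("rekognition", region_name="us-west-2")
#   for name, confidence in detect_clothing(client, "MODEL_ARN", "MY_BUCKET",
#                                           "PATH_TO_MY_IMAGE"):
#       print(f"{name}: {confidence:.1f}%")
```

A running model is billed for as long as it stays started, so it is worth stopping it once inference is finished.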


The trained model achieved an overall accuracy of 85%, with figures surpassing 90% for certain clothing types. This was accomplished after an extensive training period of 57.178 hours, using 25,719 images for training and 6,434 images for testing. When evaluating images from Fashion Nova, a commercial clothing platform, the model demonstrated a confidence level exceeding 90%.

Exploring Further Improvements

There is significant room for enhancement in the current model. Potential improvements include integrating the active model with an API for image evaluation and refining the visualization of model outputs by clearly outlining the recognized clothing’s location.

Fashion image recognition pipeline in action.
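As a starting point for the visualization improvement mentioned above, the bounding boxes returned by Rekognition can be converted into pixel coordinates for drawing. Rekognition reports each box as Left, Top, Width, and Height values expressed as ratios of the image dimensions. The sketch below performs only that conversion; the function name to_pixel_box is hypothetical, and the drawing itself is left to an imaging library of choice.

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a ratio-based Rekognition box into (left, top, right, bottom) pixels."""
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * image_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * image_height)
    return left, top, right, bottom


if __name__ == "__main__":
    # Made-up box in the shape found under Geometry.BoundingBox in a response.
    box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.25}
    print(to_pixel_box(box, image_width=400, image_height=200))
```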

Exploring Advanced Fashion Image Recognition Models

Numerous intricate machine learning models are available for predicting various fashion image categories and attributes. These models can focus on aspects such as clothing fabric, single clothing item recognition per image, or even individual posture and gender detection. Notable examples of models and datasets in this space include MMFashion, DeepFashion, FashionGen, Clothing1M, and FashionIQ.

For more sophisticated datasets that require fine-tuning and experimentation with multiple machine learning solutions, a different approach from Rekognition becomes necessary. In such cases, the logical progression involves the use of SageMaker in conjunction with Python libraries such as PyTorch.

About TrackIt

TrackIt is an Amazon Web Services Advanced Tier Services Partner specializing in cloud management, consulting, and software development solutions based in Marina del Rey, CA. 

TrackIt specializes in Modern Software Development, DevOps, Infrastructure-As-Code, Serverless, CI/CD, and Containerization with specialized expertise in Media & Entertainment workflows, High-Performance Computing environments, and data storage.

In addition to providing cloud management, consulting, and modern software development services, TrackIt also provides an open-source AWS cost management tool that allows users to optimize their costs and resources on AWS.