Building a Working VQA Model in Python
A Visual Question Answering (VQA) model is a system that answers questions about an image by combining natural language processing and computer vision.
All my code for this project
I will soon be starting an internship at Gentian.io involving geolocation, geospatial data, and building/tuning LLMs, so I wanted to create a functional template for a VQA image-analysis model that I can progressively make more capable.
I began with TensorFlow and built a very straightforward model where I could load in a question and a few potential answers. To start off, I used:
Question: “What color is the ball?”
Answers: “Red”, “Blue”, “White”, “Yellow”, “Purple”, etc.
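As a rough illustration of this question-plus-fixed-answer-set pattern (not the exact TensorFlow model from this project), here is a minimal sketch that scores the candidate answers with an off-the-shelf pretrained VQA model (ViLT) from Hugging Face; the model name, image path, and candidate list are assumptions for illustration.

```python
# Minimal sketch: score a question against a fixed set of candidate answers
# with a pretrained VQA model. Model name and image path are illustrative.
from PIL import Image
import torch
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("ball.jpg")          # hypothetical test image
question = "What color is the ball?"
candidates = ["red", "blue", "white", "yellow", "purple"]

inputs = processor(image, question, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0]

# Restrict the prediction to the candidate answers we care about.
label2id = model.config.label2id
scores = {a: logits[label2id[a]].item() for a in candidates if a in label2id}
print(max(scores, key=scores.get), scores)
```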
I then loaded images into the model, as seen below.
First Image:
Result:
Second Image:
Result:
Third Image:
Result:
This model was reasonably accurate, getting 8 of 10 tests correct, and in the 2 incorrect tests the correct (unchosen) answer still received a high probability.
My next step was to expand the model to understand more parameters.
Parameters:
Why this is important: to create more specific instances or classes of existing entities, the process would go as follows:
- Use the given model to determine color, object, shape, etc.
- Use the resulting values to compare probabilities and decide whether something in the image matches an expected description (e.g. I think this is a red cube; do the detected color and shape both match?), as sketched after this list
- Add any number of specific objects relevant to the task at hand and assign each a value for every trait, allowing for fine-tuned models and task-specific behavior
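Here is a minimal sketch of that trait-comparison step; the entity definitions, probabilities, and threshold are illustrative assumptions, not values from the actual model.

```python
# Minimal sketch of the trait-matching idea described above.
# The per-trait probabilities would come from the VQA model's answers;
# the entity definitions and threshold are illustrative assumptions.
KNOWN_ENTITIES = {
    "red cube": {"color": "red", "shape": "cube"},
    "blue ball": {"color": "blue", "shape": "sphere"},
}

def match_entity(trait_probs, threshold=0.5):
    """trait_probs maps trait -> {value: probability}.
    Returns the known entity whose traits are jointly most probable."""
    best_name, best_score = None, 0.0
    for name, traits in KNOWN_ENTITIES.items():
        score = 1.0
        for trait, expected in traits.items():
            score *= trait_probs.get(trait, {}).get(expected, 0.0)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= threshold else (None, best_score)

# Example: the model is fairly confident about "red" and "cube".
probs = {"color": {"red": 0.9, "blue": 0.05},
         "shape": {"cube": 0.8, "sphere": 0.15}}
print(match_entity(probs))   # -> ('red cube', 0.72)
```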
From here, I tested both object and shape, both of which worked perfectly. The color-matching logic in my code works as follows:
The Euclidean distance between two RGB colors is calculated as: \[ d = \sqrt{(r_1 - r_2)^2 + (g_1 - g_2)^2 + (b_1 - b_2)^2} \]
Where: \[ \begin{aligned} r_1, g_1, b_1 & : \text{ RGB values of the detected dominant color} \\ r_2, g_2, b_2 & : \text{ RGB values of the closest standard color} \end{aligned} \]
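A minimal sketch of that closest-color step, assuming the dominant color comes from a k-means pass over the image pixels with scikit-learn; the standard color palette and image path are illustrative assumptions, not the exact values used in the project.

```python
# Minimal sketch of the closest-color logic described above.
# The image path and the standard color palette are illustrative assumptions.
import cv2
import numpy as np
from sklearn.cluster import KMeans

STANDARD_COLORS = {             # RGB values of the "standard" colors
    "red":    (255, 0, 0),
    "blue":   (0, 0, 255),
    "white":  (255, 255, 255),
    "yellow": (255, 255, 0),
    "purple": (128, 0, 128),
}

def dominant_color(image_path, k=3):
    """Return the most common k-means cluster center (RGB) in the image."""
    bgr = cv2.imread(image_path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB).reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=10).fit(rgb)
    counts = np.bincount(km.labels_)
    return km.cluster_centers_[counts.argmax()]

def closest_standard_color(rgb):
    """Pick the standard color minimizing the squared Euclidean distance,
    which is equivalent to minimizing d from the formula above."""
    r1, g1, b1 = rgb
    return min(
        STANDARD_COLORS,
        key=lambda name: (r1 - STANDARD_COLORS[name][0]) ** 2
                       + (g1 - STANDARD_COLORS[name][1]) ** 2
                       + (b1 - STANDARD_COLORS[name][2]) ** 2,
    )

print(closest_standard_color(dominant_color("ball.jpg")))  # hypothetical image
```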
Technologies Used
- Python
- OpenCV
- PyTorch
- scikit-learn
Sources
https://visualqa.org/
https://huggingface.co/tasks/visual-question-answering