
Building a Working VQA Learning Model in Python

A Visual Question Answering (VQA) model is a system that answers questions about an image using a combination of natural language processing and computer vision.

All my code for this project

Soon I will be starting an internship at Gentian.io involving geolocation, geospatial data, and building/tuning LLMs, so I wanted to create a functional template for a VQA image-analysis model that I can progressively build into something stronger.

I began with TensorFlow and built a very straightforward model into which I could load a question and a few potential answers. To start off, I used:

Question: “What color is the ball?”

Answers: “Red”, “Blue”, “White”, “Yellow”, “Purple”, etc.
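Conceptually, a setup like this pairs an image encoder with a question encoder and ends in a softmax over the candidate answers. The sketch below shows a minimal version of that idea in TensorFlow/Keras; the MobileNetV2 backbone, the LSTM question encoder, and every size in it are illustrative placeholders rather than my exact architecture.

```python
# Minimal VQA-style classifier sketch in TensorFlow/Keras.
# All layer sizes, the vocabulary size, and the backbone are illustrative choices.
import tensorflow as tf

ANSWERS = ["red", "blue", "white", "yellow", "purple"]  # candidate answers
VOCAB_SIZE = 1000        # assumed question vocabulary size
MAX_QUESTION_LEN = 10    # assumed max tokens per question

# Image branch: frozen MobileNetV2 backbone pooled to a single feature vector.
image_in = tf.keras.Input(shape=(224, 224, 3), name="image")
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, pooling="avg")
backbone.trainable = False
img_feat = backbone(image_in)

# Question branch: token IDs -> embedding -> LSTM summary vector.
question_in = tf.keras.Input(shape=(MAX_QUESTION_LEN,), dtype="int32", name="question")
q = tf.keras.layers.Embedding(VOCAB_SIZE, 64)(question_in)
q_feat = tf.keras.layers.LSTM(64)(q)

# Fuse both modalities and classify over the candidate answers.
fused = tf.keras.layers.Concatenate()([img_feat, q_feat])
fused = tf.keras.layers.Dense(128, activation="relu")(fused)
answer_out = tf.keras.layers.Dense(len(ANSWERS), activation="softmax")(fused)

model = tf.keras.Model(inputs=[image_in, question_in], outputs=answer_out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```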

I then loaded images into the model, as seen below.

First image and result: (images omitted)

Second image and result: (images omitted)

Third image and result: (images omitted)

This model was reasonably accurate, answering 8 of 10 test questions correctly; in the 2 incorrect cases, the correct (unchosen) answer still received a high probability.

My next step was to expand the model to understand more parameters.

Parameters:

  • Color
  • Shape
  • Object
  • Why this is important: to create more specific instances or classes of existing entities, the process would go as follows:

    1. Use the given model to determine color, object, shape, etc.
    2. Use those values to create a probability comparison of whether or not something in the image is what I think it is (e.g. I think this is a red cube; do the color and shape match?); a trait-matching sketch appears after the detection list below
    3. Add any number of specific objects relevant to the task at hand and assign them values for each trait, allowing for fine-tuned models and specific tasks

    From here, I tested both object and shape detection, both of which worked perfectly. The logic within my code goes as follows:

  • Color Detection: I use OpenCV for image processing; the image is converted to RGB and reshaped into a 2D array of pixels (each pixel represented as 3 integers for RGB). The pixels are then clustered with KMeans into a chosen number of color groups, which gives the dominant color in the image. From there, I determine the closest standard color by calculating the Euclidean distance between the two RGB values (see the color sketch after this list).
  • The Euclidean distance between two RGB colors is calculated as: \[ d = \sqrt{(r_1 - r_2)^2 + (g_1 - g_2)^2 + (b_1 - b_2)^2} \]

    Where: \[ \begin{aligned} r_1, g_1, b_1 & : \text{ RGB values of the detected dominant color} \\ r_2, g_2, b_2 & : \text{ RGB values of the closest standard color} \end{aligned} \]

  • Shape Detection: Once again I use OpenCV, this time to convert the image to grayscale and apply a Gaussian blur, which reduces noise and smooths the image, both of which help contour detection. Next, I use thresholding to convert the image to a binary form of black and white (intensity >= 60 is set to white, otherwise black), approximate the contours, and count the number of vertices. The vertex count then gives me the general shape, and I plan to expand this to cover more shapes (see the shape sketch after this list).
  • Object Detection: I decided to use the pretrained DETR model from HuggingFace, which preprocesses the image and detects the objects within it. This is a part where I have yet to do more research into the model itself, but it currently works as intended and returns a confidence score for each object it detects (see the object-detection sketch after this list). In the future, I plan on adding specifics, such as restricting detection to a certain part of the image, so more data/recognition can be done with a more specific prompt, such as "What is the object in the top right of the image?"
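As a concrete illustration of the color step, here is a minimal sketch, assuming scikit-learn's KMeans for the clustering and a small hand-picked palette of standard colors; the palette values, number of clusters, and function name are placeholders rather than my exact code.

```python
# Dominant-color detection sketch: cluster pixels with KMeans, then map the
# dominant cluster center to the nearest standard color by Euclidean distance.
import cv2
import numpy as np
from sklearn.cluster import KMeans

STANDARD_COLORS = {                      # assumed reference palette (RGB)
    "red": (255, 0, 0), "blue": (0, 0, 255), "white": (255, 255, 255),
    "yellow": (255, 255, 0), "purple": (128, 0, 128),
}

def dominant_color_name(image_path, k=3):
    bgr = cv2.imread(image_path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    pixels = rgb.reshape(-1, 3).astype(np.float32)   # 2D array: one row per pixel

    # Cluster pixels into k color groups; the largest cluster is the dominant color.
    km = KMeans(n_clusters=k, n_init=10).fit(pixels)
    dominant = km.cluster_centers_[np.bincount(km.labels_).argmax()]

    # Pick the standard color with the smallest Euclidean distance in RGB space.
    def distance(color):
        return np.sqrt(sum((a - b) ** 2 for a, b in zip(dominant, color)))
    return min(STANDARD_COLORS, key=lambda name: distance(STANDARD_COLORS[name]))
```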
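The shape step can be sketched the same way; the blur kernel, the exact threshold value, and the vertex-to-shape mapping below are illustrative rather than exact.

```python
# Shape-detection sketch: grayscale + Gaussian blur, binary threshold,
# contour approximation, then classify by number of vertices.
import cv2

def detect_shape(image_path):
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # Threshold around intensity 60: bright pixels become white, the rest black.
    _, binary = cv2.threshold(blurred, 60, 255, cv2.THRESH_BINARY)

    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return "unknown"

    # Approximate the largest contour and count its vertices.
    largest = max(contours, key=cv2.contourArea)
    approx = cv2.approxPolyDP(largest, 0.04 * cv2.arcLength(largest, True), True)
    vertices = len(approx)

    if vertices == 3:
        return "triangle"
    if vertices == 4:
        return "rectangle"
    return "circle" if vertices > 6 else "polygon"
```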
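And a sketch of the object-detection step with the pretrained DETR model through the HuggingFace transformers library; the facebook/detr-resnet-50 checkpoint, the 0.7 confidence threshold, and the image path are assumptions rather than my exact settings.

```python
# DETR object-detection sketch using HuggingFace transformers.
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("example.jpg")          # placeholder image path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw model outputs into labeled boxes with confidence scores.
target_sizes = torch.tensor([image.size[::-1]])   # (height, width)
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.7)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```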
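Finally, to illustrate steps 2 and 3 from the numbered list above, here is a hypothetical trait-matching sketch: each specific entity is defined by its expected traits, and the detected traits are scored against those definitions. The entity definitions and scoring rule are made up for illustration.

```python
# Hypothetical trait matching: score detected traits against entity definitions.
ENTITIES = {                                # illustrative entity definitions
    "red cube": {"color": "red", "shape": "rectangle"},
    "blue ball": {"color": "blue", "shape": "circle"},
}

def match_score(detected, traits):
    """Fraction of expected traits that the detected traits agree with."""
    return sum(detected.get(k) == v for k, v in traits.items()) / len(traits)

def best_match(detected):
    """Return the best-matching entity and its score."""
    name = max(ENTITIES, key=lambda n: match_score(detected, ENTITIES[n]))
    return name, match_score(detected, ENTITIES[name])

# Example: traits produced by the color/shape/object steps above.
print(best_match({"color": "red", "shape": "rectangle"}))   # ('red cube', 1.0)
```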
Technologies Used

  • TensorFlow
  • OpenCV
  • KMeans clustering
  • HuggingFace (DETR)

Sources

  • https://visualqa.org/
  • https://huggingface.co/tasks/visual-question-answering