I have a dataset of 330 images containing guns. For each image there is an associated text file (parsed roughly as sketched after this list) that contains:
- The number of objects (guns) in the image.
- The coordinates of the bounding box(es) around the gun(s) in the image.
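For reference, this is roughly how I read the annotation files. The exact layout below is my assumption of a simple format (first line is the object count, each following line holds four coordinates), and the file path is just a placeholder:

```python
from pathlib import Path

# Assumed annotation layout (my files may differ slightly):
#   line 1            -> number of guns in the image
#   each further line -> x_min y_min x_max y_max for one gun
def read_annotation(txt_path):
    lines = Path(txt_path).read_text().strip().splitlines()
    num_objects = int(lines[0])
    boxes = [tuple(float(v) for v in line.split())
             for line in lines[1:1 + num_objects]]
    return num_objects, boxes

# Hypothetical path, just to show usage
num_objects, boxes = read_annotation("annotations/image_001.txt")
```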
I need to train a model that takes an image as input and outputs 4 integer values: the coordinates of the bounding box (its vertices).
For training an object detection model like this, should the image be kept as the input and the coordinates as the output of the model? Should there be convolutional layers for feature extraction, followed by fully connected (FC) layers that learn from those features and produce the 4 outputs (the bounding-box coordinates)? A rough sketch of what I have in mind is at the end of this post.
Is this notion of the model architecture correct? Any other tips/suggestions?
I am creating this model entirely in TensorFlow Keras, without using any pretrained models.
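Here is a rough sketch of the architecture I'm describing: conv layers for feature extraction, then FC layers that regress the 4 coordinates. The layer counts, filter sizes, the 224x224 input resolution, and the MSE loss are placeholders I picked for the sketch, not settings I've tuned:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(224, 224, 3)):
    # Convolutional feature extractor followed by FC layers that
    # output the 4 bounding-box coordinates as a regression target.
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        # 4 outputs: (x_min, y_min, x_max, y_max) of the bounding box
        layers.Dense(4, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

model = build_model()
model.summary()
```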