I am trying to build a near-real-time object detection model that should run on a mobile device. As I am new to this specific area of computer vision, I would appreciate any advice on my current progress and feedback on what I could do differently to achieve the goal.
The goal
The goal is to detect garbage in images and classify each item into one of the following disposal methods (3 target classes):
- yellow sack/can (German)
- paper
- glass
In addition, the model should be lightweight enough to run efficiently on a mobile device.
The dataset
I am using the trashnet dataset, which contains 2527 images distributed among six classes: glass, paper, plastic, trash, cardboard and metal. Notably, there is only one item per image, and every image has the same plain white background.
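A minimal sketch of how the six trashnet classes could be collapsed into the three target classes and written out as a label map. The mapping itself is an assumption (plastic and metal to the yellow sack, paper and cardboard to paper, the generic trash class dropped), and the names `TRASHNET_TO_TARGET`, `target_classes`, `build_label_map` and the target label strings are hypothetical. The output follows the `item { id: ... name: ... }` text layout used by label map files in the TensorFlow Object Detection API.

```python
# Assumed mapping from trashnet classes to the three disposal targets.
# None means the class is dropped. Illustrative only, not taken from the post.
TRASHNET_TO_TARGET = {
    "plastic": "yellow_sack",
    "metal": "yellow_sack",
    "paper": "paper",
    "cardboard": "paper",
    "glass": "glass",
    "trash": None,
}


def target_classes(mapping):
    """Return the distinct target names in first-seen order."""
    seen = []
    for target in mapping.values():
        if target is not None and target not in seen:
            seen.append(target)
    return seen


def build_label_map(class_names):
    """Render label map text; ids start at 1 because 0 is reserved for background."""
    items = []
    for idx, name in enumerate(class_names, start=1):
        items.append("item {\n  id: %d\n  name: '%s'\n}\n" % (idx, name))
    return "\n".join(items)


if __name__ == "__main__":
    print(build_label_map(target_classes(TRASHNET_TO_TARGET)))
```

Whatever mapping is used, the number of entries has to match `num_classes` in the pipeline config.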
The methodology
Quite frankly, I am following the YouTube tutorial from Sentdex on mac'n'cheese detection and this Medium post on gun detection. Therefore I am using Google Colab as my environment. I am retraining a pretrained model (ssd_mobilenet_v2_coco_2018_03_29). Training the model and exporting the inference graph are done with the scripts provided by the TensorFlow Object Detection API (model_main.py and export_inference_graph.py). I am using the sample config from TensorFlow for this model.
My steps so far
- I've set up my Google Colab environment similar to the Colab Notebook from the Medium post I mentioned before.
- I split the data into training and test sets (3/4 and 1/4, respectively).
- I labeled my data by using the popular labelImg tool so that every object has a bounding box.
- I deleted every image where the object fills the whole frame or extends beyond the image, since a bounding box wouldn't make much sense there.
- I created the label_map, csv and tfrecord files.
- I played around with the initial_learning_rate and the l2_regularizer > weight rate of the box predictor and feature extractor, set use_dropout=true and increased the batch size to batch_size=32.
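The filtering and splitting steps above can be sketched roughly as follows. This is only an illustration under assumptions: each annotation is taken to be a dict with the keys `filename`, `width`, `height`, `class`, `xmin`, `ymin`, `xmax`, `ymax` (a hypothetical row layout for the csv file), the `margin` parameter is made up, and the split is a stratified variant, which is not necessarily how the split was actually done. It also relies on there being one box per image; with several boxes per image the split would have to be done per filename instead.

```python
import random
from collections import defaultdict


def box_is_usable(row, margin=1):
    """True if the box is non-empty and stays at least `margin` px inside every image border."""
    return (
        row["xmin"] >= margin
        and row["ymin"] >= margin
        and row["xmax"] <= row["width"] - margin
        and row["ymax"] <= row["height"] - margin
        and row["xmax"] > row["xmin"]
        and row["ymax"] > row["ymin"]
    )


def stratified_split(rows, test_fraction=0.25, seed=0):
    """Split rows into (train, test) with roughly the same class ratio in both parts."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row["class"]].append(row)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for cls in sorted(by_class):
        items = by_class[cls][:]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test
```

Usage would be along the lines of `train, test = stratified_split([r for r in rows if box_is_usable(r)])`.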
My current results
Most of the models I built had very poor AP/AR and a fairly high loss, and they tended to overfit. Also, the model is only able to detect one object at a time in new images (maybe because of the dataset?).
Here are some screenshots from my TensorBoard. These were made after around 12k steps. I think this is also the point where the overfitting begins to show, since the AP suddenly rises and the predicted images have an accuracy of around 90-100%.
Scalars:
Predicted images:
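To put the 12k steps into perspective, here is a rough back-of-the-envelope estimate of how many passes over the training data they correspond to at batch_size=32. The training set size is an assumption: 3/4 of all 2527 images is only an upper bound, because some images were deleted during labeling, so the real number of passes is higher. `epochs_seen` is a hypothetical helper.

```python
def epochs_seen(steps, batch_size, num_train_images):
    """Number of passes over the training set after `steps` optimizer steps."""
    return steps * batch_size / num_train_images


if __name__ == "__main__":
    steps, batch_size = 12_000, 32
    # Upper bound on the training set: 3/4 of all 2527 images, before any deletions.
    upper_bound_train = int(2527 * 0.75)
    print("passes over the data:", round(epochs_seen(steps, batch_size, upper_bound_train), 1))
```

Even with this generous estimate of the training set size, the model has seen every image roughly 200 times by step 12k.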
Questions from my side
- Is it problematic that every image has only one object in it? Might this cause problems when running the model on a video stream?
- Are these enough images to build an accurate model?
- Does anyone here have experience in this area and could give me advice on how to fine-tune the pretrained model?
- I also ran the model on a video stream from my webcam, but all models tend to classify the whole screen: the model seems to detect an object but draws the bounding box across the entire frame. Might this be related to the nature of the dataset or the poor model quality?
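One way to check whether the dataset could be related to the full-screen boxes is to look at how much of each training image the labelled box covers. A minimal sketch, assuming each annotation is a dict with the keys `width`, `height`, `xmin`, `ymin`, `xmax`, `ymax` (a hypothetical row layout); `box_area_fraction` and `coverage_histogram` are made-up helper names.

```python
def box_area_fraction(row):
    """Fraction of the image area covered by the labelled box (0.0 to 1.0)."""
    box_area = (row["xmax"] - row["xmin"]) * (row["ymax"] - row["ymin"])
    return box_area / (row["width"] * row["height"])


def coverage_histogram(rows, num_bins=5):
    """Count boxes per equal-width coverage bin, e.g. 0-20%, 20-40%, ... for 5 bins."""
    counts = [0] * num_bins
    for row in rows:
        frac = box_area_fraction(row)
        # Clamp so that a box covering exactly 100% falls into the last bin.
        index = min(int(frac * num_bins), num_bins - 1)
        counts[index] += 1
    return counts
```

If most boxes end up in the highest bins, that would be consistent with the model having mostly seen objects that fill the frame, although it would not prove that this causes the behaviour on the webcam stream.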
This has been a long post, so thank you in advance for taking the time to read it. I hope I was able to make my goal clear and provided enough details for you to follow my current progress.
Current adjusted configuration for the pretrained ssd_mobilenet_v2_coco_2018_03_29 model:
model {
  ssd {
    num_classes: 3
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        #use_dropout: false
        use_dropout: true
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              #weight: 0.00004
              weight: 0.001
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            #weight: 0.00004
            weight: 0.001
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 1
        max_total_detections: 1
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 32
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.01
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "PATH"
  fine_tune_checkpoint_type: "detection"
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "PATH"
  }
  label_map_path: "PATH"
}

eval_config: {
  num_examples: 197
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  #max_evals: 10
  num_visualizations: 20
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "PATH"
  }
  label_map_path: "PATH"
  shuffle: false
  num_readers: 1
}
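The comment in the config notes that the 200K step limit effectively bypasses the learning rate schedule. A small worked example of the usual exponential decay formula, lr = initial_learning_rate * decay_factor^(step / decay_steps), with the values from the config above makes this concrete. `exponential_decay` is a hypothetical helper, not an API function, and whether the exponent is floored to an integer (the `staircase` flag here) is an assumption about how the schedule is applied.

```python
def exponential_decay(initial_lr, decay_factor, decay_steps, step, staircase=True):
    """Learning rate after `step` training steps under exponential decay."""
    if staircase:
        exponent = step // decay_steps  # decay only at whole multiples of decay_steps
    else:
        exponent = step / decay_steps   # smooth, continuous decay
    return initial_lr * decay_factor ** exponent


if __name__ == "__main__":
    # Values taken from the config: initial_learning_rate, decay_factor, decay_steps.
    for step in (0, 12_000, 200_000):
        stepped = exponential_decay(0.01, 0.95, 800_720, step, staircase=True)
        smooth = exponential_decay(0.01, 0.95, 800_720, step, staircase=False)
        print(step, stepped, smooth)
```

Because decay_steps (800720) is larger than num_steps (200000), the stepped variant never leaves 0.01, and even the smooth variant drops by only a little over 1% by the last step.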