1. Summary
With the aid of deep learning, this study classifies and segments 4 classic Vietnamese foods: Cơm Tấm, Bánh Mì, Phở, and Bánh Tráng Nướng. These foods were selected from the 30VNFoods dataset, described in the article "30VNFoods: A Dataset for Vietnamese Foods Recognition".
Why choose only 4 foods?
- We have limited time and many different annotators, so we label and annotate only these 4 dishes for the segmentation problem.
- Additionally, since Colab is our only training environment, the amount of model training we can do during the project is constrained.
Model?
- For the classification and efficiency-comparison problem, we build models ranging from a basic MLP to modern networks such as VGG and ResNet.
- For the segmentation problem, we use U-Net.
2. Processing Datasets
The dataset is divided into 3 parts: train, val and test
Table 1. Number of images per split

| Dish | Train | Val | Test |
|---|---|---|---|
| Bánh mì | 935 | 133 | 268 |
| Bánh tráng nướng | 556 | 80 | 159 |
| Phở | 564 | 81 | 162 |
| Cơm tấm | 659 | 94 | 189 |
The images were collected by the authors from many different sources on the internet, so their sizes vary. For convenient training input, the dish images are resized to 224×224×3 and normalized to the range [0, 1], as sketched below.
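The preprocessing code is not shown in the report; a minimal sketch with torchvision (the same library the augmentation names later in the report come from):

```python
from torchvision import transforms

# Resize every image to 224x224; ToTensor converts HxWxC uint8 images
# to 3x224x224 float tensors with pixel values scaled to [0, 1]
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```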
Data Annotations
- We label and annotate the data on the web platform Segments.ai. It offers plenty of tools for tracing the edges of each food, and it also ships a client library that lets us export and reconstruct the dataset needed for the segmentation problem (a sketch of the export step follows this list).
- From the dataset used to train the classification model, we randomly select 2824 samples for labeling and annotation.
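The export step is not shown in the report; a hedged sketch with the segments-ai Python SDK, where the API key, dataset identifier, and release tag are placeholders:

```python
from segments import SegmentsClient, SegmentsDataset
from segments.utils import export_dataset

# Placeholder credentials and identifiers (assumptions, not from the report)
client = SegmentsClient("YOUR_API_KEY")
release = client.get_release("user/30vnfoods-segmentation", "v1.0")

# Keep only samples that have been labeled and reviewed
dataset = SegmentsDataset(release, labelset="ground-truth",
                          filter_by=["labeled", "reviewed"])

# Export semantic segmentation masks for training
export_dataset(dataset, export_format="semantic")
```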
3. Experiments and Results
Model for classification
- For the classification problem, we use MLP, CNN, and miniVGG networks, as well as networks pre-trained on the ImageNet dataset. For the MLP, we experiment by gradually increasing the number of nodes per hidden layer and the number of hidden layers: if a configuration performs well, we keep adding hidden layers, and vice versa.
- For the CNN, we build both a simple model and a model based on VGG's architecture but shallower (miniVGG; a sketch follows this list). In addition, we use models pre-trained on ImageNet for comparison.
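The report does not spell out the miniVGG layout; a plausible shallow VGG-style network in PyTorch, where the channel widths and classifier size are assumptions:

```python
import torch.nn as nn

class MiniVGG(nn.Module):
    """Shallow VGG-style CNN (hypothetical layout, not the report's exact one)."""

    def __init__(self, num_classes: int = 4):
        super().__init__()

        def vgg_block(cin: int, cout: int) -> nn.Sequential:
            # Two 3x3 convolutions followed by 2x2 max-pooling, as in VGG
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(cout, cout, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        # Three blocks instead of VGG16's five: 224 -> 112 -> 56 -> 28
        self.features = nn.Sequential(
            vgg_block(3, 32), vgg_block(32, 64), vgg_block(64, 128)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 28 * 28, 512), nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```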
Model for segmentation
- Next, to perform image segmentation, we use the U-Net structure described above. The encoder reuses models pre-trained on ImageNet to get better results. We experiment with three pre-trained encoders: VGG16, ResNet18, and ResNet34 (see the construction sketch after Table 2).
Table 2: Segmentation models
| Encoder | VGG16 | ResNet18 | ResNet34 |
|---|---|---|---|
| Skip connections | Copy and concatenate | Copy and concatenate | Copy and concatenate |
| Decoder | Reversed VGG16 + Conv1 | Reversed ResNet18 + Conv1 | Reversed ResNet34 + Conv1 |
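The report does not name a library for assembling these models; one common option is segmentation_models_pytorch, shown here with the ResNet18 encoder from Table 2 and a class count assuming 4 dishes plus clutter:

```python
import segmentation_models_pytorch as smp

# U-Net with a ResNet18 encoder pre-trained on ImageNet;
# 5 output classes: 4 dishes + clutter (background)
model = smp.Unet(
    encoder_name="resnet18",
    encoder_weights="imagenet",
    in_channels=3,
    classes=5,
)
```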
Metrics
Classification problem: we use accuracy,

Accuracy = (TP + TN) / N

where N is the total number of points and:
• True Positive (TP): the number of points of the Positive class that are correctly classified as Positive.
• True Negative (TN): the number of points of the Negative class that are correctly classified as Negative.
Segmentation problem: we use Intersection over Union (IoU),

IoU = |A ∩ B| / |A ∪ B|

where:
• A is the predicted segment
• B is the ground truth segment
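A minimal sketch of computing this per-class IoU on integer label masks (NumPy; the function name and signature are illustrative):

```python
import numpy as np

def class_iou(pred, target, cls):
    """IoU for one class: |A ∩ B| / |A ∪ B| over boolean masks.

    pred, target: integer label maps of the same shape; cls: class id.
    """
    a = pred == cls          # predicted segment A
    b = target == cls        # ground-truth segment B
    union = np.logical_or(a, b).sum()
    if union == 0:           # class absent from both masks
        return float("nan")
    return np.logical_and(a, b).sum() / union
```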
Results
Classification
Table 3: Classification results using various models
| Methods | Accuracy (%) | Loss | Val_Accuracy (%) | Val_Loss | Test_accuracy (%) |
|---|---|---|---|---|---|
| Resnet18_pretrained | 99.926 | 6.78E-05 | 96.907 | 0.1106 | 95.886 |
| Resnet18 | 99.486 | 0.0003 | 80.154 | 0.7141 | 78.663 |
| VGG16_pretrained | 99.266 | 0.0005 | 94.587 | 0.4035 | 95.758 |
| VGG16 | 95.229 | 0.0030 | 78.350 | 0.6939 | 77.763 |
| miniVGG | 99.926 | 0.0001 | 82.989 | 0.6325 | 87.917 |
| SimpleCNN | 99.559 | 0.0008 | 86.597 | 0.3855 | 86.632 |
| MLP_4hidden512node | 53.651 | 0.0678 | 45.103 | 2.8904 | 47.043 |
| MLP_3hidden1024node | 44.403 | 0.1080 | 34.278 | 4.8297 | 38.946 |
| MLP_3hidden512node | 55.486 | 0.0707 | 40.721 | 5.5563 | 44.987 |
| MLP_4hidden | 47.706 | 0.0583 | 37.886 | 2.3706 | 38.303 |
| MLP_3hidden | 49.761 | 0.0512 | 36.082 | 3.0187 | 41.902 |
| MLP_2hidden | 48.844 | 0.0438 | 40.979 | 1.6916 | 41.516 |
Table 3 shows that the pre-trained models achieve the highest results: ResNet18 and VGG16 both exceed 95% accuracy on the test set. Among the networks trained from scratch, miniVGG achieves the best results, better than VGG16 and ResNet18 retrained from scratch.
We then continue experimenting with the miniVGG network to improve its results (a training-setup sketch follows this list):
- Experiment with different optimization algorithms such as Adam, SGD, and RMSProp
- Use L2 regularization
- Take the best optimizer and experiment with different learning rates
- Apply a learning-rate reduction schedule
- Use augmentation:
  - RandomHorizontalFlip
  - RandomGrayscale
  - RandomAdjustSharpness
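A hedged sketch of this training setup in PyTorch/torchvision; the augmentation probabilities, weight decay, and scheduler settings are assumptions, while the learning rates match those in Table 4:

```python
import torch
from torchvision import transforms

# Augmentations from the list above; probabilities are assumptions
train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomGrayscale(p=0.1),
    transforms.RandomAdjustSharpness(sharpness_factor=2, p=0.3),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

model = MiniVGG(num_classes=4)  # the sketch from earlier in this section

# Adam with L2 regularization expressed as weight decay (value assumed)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4)

# Reduce the learning rate when validation loss stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.1, patience=3
)
```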
Table 4: Experimental results on miniVGG
| Methods | Accuracy (%) | Loss | Val_Accuracy (%) | Val_Loss | Test_accuracy (%) |
|---|---|---|---|---|---|
| miniVGG_adam_l2_lr_0.0001_aug | 92.587 | 0.0072 | 89.948 | 0.3201 | 86.246 |
| miniVGG_adam_l2_lr_0.0003 | 95.045 | 0.0050 | 91.494 | 0.2484 | 88.431 |
| miniVGG_adam_l2_lr_0.0001 | 98.458 | 0.0024 | 87.628 | 0.3348 | 88.817 |
| miniVGG_adam_l2_lr_0.001 | 99.853 | 0.0004 | 88.144 | 0.349 | 87.917 |
| miniVGG_RMS | 86.165 | 0.0118 | 77.061 | 0.6142 | 74.293 |
| miniVGG_SGD | 95.559 | 0.0054 | 84.278 | 0.4104 | 84.190 |
| miniVGG_adam | 95.486 | 0.0043 | 87.886 | 0.3675 | 86.118 |
Table 4 shows that Adam is the best optimization algorithm for this problem. We then tested two different learning rates, which gave similar accuracy on the test set; applying augmentation did not improve on the original result.
Parameters used in the segmentation problem (see the augmentation sketch after this list):
- Adam as the optimization algorithm
- A learning-rate reduction schedule
- Augmentation:
  - RandomBrightnessContrast
  - HueSaturationValue
  - HorizontalFlip
  - IAAAdditiveGaussianNoise
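A sketch of this augmentation pipeline with albumentations; the probabilities are assumptions, and IAAAdditiveGaussianNoise, which was removed from recent albumentations releases, is stood in for by GaussNoise:

```python
import albumentations as A

train_aug = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.HueSaturationValue(p=0.3),
    A.GaussNoise(p=0.2),  # stands in for the deprecated IAAAdditiveGaussianNoise
])

# albumentations applies the same spatial transform to image and mask:
# augmented = train_aug(image=image, mask=mask)
```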
Table 5: Final results on the training set of segmentation models

| Name | iou/train | iou_banhmi | iou_banhtrang | iou_comtam | iou_pho | iou_clutter |
|---|---|---|---|---|---|---|
| Unet-ResNet34 | 0.8526 | 0.8262 | 0.8207 | 0.6916 | 0.7174 | 0.9037 |
| Unet-ResNet18 | 0.9158 | 0.9087 | 0.8832 | 0.8760 | 0.8744 | 0.9375 |
| Unet-VGG16 | 0.8818 | 0.8771 | 0.8636 | 0.7854 | 0.8211 | 0.9173 |
Table 6: Final results on the validation set of segmentation models

| Name | iou/valid | iou_banhmi | iou_banhtrang | iou_comtam | iou_pho | iou_clutter |
|---|---|---|---|---|---|---|
| Unet-ResNet34 | 0.8625 | 0.8273 | 0.8529 | 0.7083 | 0.7099 | 0.9084 |
| Unet-ResNet18 | 0.8828 | 0.8655 | 0.8897 | 0.7893 | 0.7571 | 0.9214 |
| Unet-VGG16 | 0.8716 | 0.8627 | 0.8713 | 0.7395 | 0.7463 | 0.9146 |
Tables 5 and 6 confirm once again that the U-Net model with a ResNet18 encoder pre-trained on ImageNet achieves the best results of the three encoders. The clutter class, which is the background, is segmented easily, with IoU above 0.9. Cơm tấm (broken rice) and phở are the two dishes with IoU between 0.7 and 0.78.
Thank you
- This is the result we achieved in a short time; thank you for reading the article to the end.
- If you find this post useful, please give the GitHub repo a star. I appreciate it!
Minh-Hai Tran (Harly)