1. Summary
With the aid of deep learning, this study classifies and segments 4 classic Vietnamese foods: Cơm Tấm, Bánh Mì, Phở, and Bánh Tráng Nướng. These foods were selected from the 30VNFoods dataset, described in the article "30VNFoods: A Dataset for Vietnamese Foods Recognition".
Why choose only 4 foods?
- We have limited time and many different annotators, so we label and annotate only these 4 dishes for the segmentation problem.
- Additionally, since Colab is our only training environment, the amount of model training we can do during the project is constrained.
Model?
- For the classification and efficiency-comparison problem, we build models ranging from a basic MLP to modern networks such as VGG and ResNet.
- For the segmentation problem, we use U-Net.
2. Processing Datasets
The dataset is divided into 3 parts: train, val and test
Table 1. Number of images per split

| Dish | Train | Val | Test |
|---|---|---|---|
| Bánh mì | 935 | 133 | 268 |
| Bánh tráng nướng | 556 | 80 | 159 |
| Phở | 564 | 81 | 162 |
| Cơm tấm | 659 | 94 | 189 |
The images were collected by the authors from many different sources on the internet, so their sizes vary. For convenient training input, the dish images are resized to 224×224×3 and normalized to the range [0, 1], as sketched below.
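The preprocessing code is not shown in the report; a minimal sketch with torchvision (the same library the augmentation names later in the report come from):

```python
from torchvision import transforms

# Resize every image to 224x224; ToTensor converts HxWxC uint8 images
# to 3x224x224 float tensors with pixel values scaled to [0, 1]
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```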
Data Annotations
- We label and annotate the data on the web platform Segments.ai. It offers plenty of tools for tracing the edges of each food, and it also ships a client library that lets us export and reconstruct the dataset needed for the segmentation problem (a sketch of the export step follows this list).
- From the dataset used to train the classification model, we randomly select 2824 samples for labeling and annotation.
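The export step is not shown in the report; a hedged sketch with the segments-ai Python SDK, where the API key, dataset identifier, and release tag are placeholders:

```python
from segments import SegmentsClient, SegmentsDataset
from segments.utils import export_dataset

# Placeholder credentials and identifiers (assumptions, not from the report)
client = SegmentsClient("YOUR_API_KEY")
release = client.get_release("user/30vnfoods-segmentation", "v1.0")

# Keep only samples that have been labeled and reviewed
dataset = SegmentsDataset(release, labelset="ground-truth",
                          filter_by=["labeled", "reviewed"])

# Export semantic segmentation masks for training
export_dataset(dataset, export_format="semantic")
```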
3. Experiments and Results
Model for classification
- For the classification problem, we use MLP, CNN, and miniVGG networks, as well as networks pre-trained on the ImageNet dataset. For the MLP, we experiment by gradually increasing the number of nodes per hidden layer and the number of hidden layers: if a configuration performs well, we keep adding hidden layers, and vice versa.
- For the CNN, we build both a simple model and a model based on VGG's architecture but shallower (miniVGG; a sketch follows this list). In addition, we use models pre-trained on ImageNet for comparison.
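The report does not spell out the miniVGG layout; a plausible shallow VGG-style network in PyTorch, where the channel widths and classifier size are assumptions:

```python
import torch.nn as nn

class MiniVGG(nn.Module):
    """Shallow VGG-style CNN (hypothetical layout, not the report's exact one)."""

    def __init__(self, num_classes: int = 4):
        super().__init__()

        def vgg_block(cin: int, cout: int) -> nn.Sequential:
            # Two 3x3 convolutions followed by 2x2 max-pooling, as in VGG
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(cout, cout, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        # Three blocks instead of VGG16's five: 224 -> 112 -> 56 -> 28
        self.features = nn.Sequential(
            vgg_block(3, 32), vgg_block(32, 64), vgg_block(64, 128)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 28 * 28, 512), nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```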
Model for segmentation
- Next, to perform image segmentation, we use the U-Net structure described above. The encoder reuses models pre-trained on ImageNet to get better results. We experiment with three pre-trained encoders: VGG16, ResNet18, and ResNet34 (see the construction sketch after Table 2).
Table 2: Segmentation models
| Encoder | VGG16 | ResNet18 | ResNet34 |
|---|---|---|---|
| Skip connections | Copy and concatenate | Copy and concatenate | Copy and concatenate |
| Decoder | Reversed VGG16 + Conv1 | Reversed ResNet18 + Conv1 | Reversed ResNet34 + Conv1 |
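The report does not name a library for assembling these models; one common option is segmentation_models_pytorch, shown here with the ResNet18 encoder from Table 2 and a class count assuming 4 dishes plus clutter:

```python
import segmentation_models_pytorch as smp

# U-Net with a ResNet18 encoder pre-trained on ImageNet;
# 5 output classes: 4 dishes + clutter (background)
model = smp.Unet(
    encoder_name="resnet18",
    encoder_weights="imagenet",
    in_channels=3,
    classes=5,
)
```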
Metrics
Classification problem: we use accuracy,

Accuracy = (TP + TN) / N

where N is the total number of points and:
• True Positive (TP): the number of points of the Positive class that are correctly classified as Positive.
• True Negative (TN): the number of points of the Negative class that are correctly classified as Negative.
Segmentation problem: we use Intersection over Union (IoU),

IoU = |A ∩ B| / |A ∪ B|

where:
• A is the predicted segment
• B is the ground truth segment
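A minimal sketch of computing this per-class IoU on integer label masks (NumPy; the function name and signature are illustrative):

```python
import numpy as np

def class_iou(pred, target, cls):
    """IoU for one class: |A ∩ B| / |A ∪ B| over boolean masks.

    pred, target: integer label maps of the same shape; cls: class id.
    """
    a = pred == cls          # predicted segment A
    b = target == cls        # ground-truth segment B
    union = np.logical_or(a, b).sum()
    if union == 0:           # class absent from both masks
        return float("nan")
    return np.logical_and(a, b).sum() / union
```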
Results
Classification
Table 3: Classification results using various models
| Methods | Accuracy (%) | Loss | Val_Accuracy (%) | Val_Loss | Test_accuracy (%) |
|---|---|---|---|---|---|
| Resnet18_pretrained | 99.926 | 6.78E-05 | 96.907 | 0.1106 | 95.886 |
| Resnet18 | 99.486 | 0.0003 | 80.154 | 0.7141 | 78.663 |
| VGG16_pretrained | 99.266 | 0.0005 | 94.587 | 0.4035 | 95.758 |
| VGG16 | 95.229 | 0.0030 | 78.350 | 0.6939 | 77.763 |
| miniVGG | 99.926 | 0.0001 | 82.989 | 0.6325 | 87.917 |
| SimpleCNN | 99.559 | 0.0008 | 86.597 | 0.3855 | 86.632 |
| MLP_4hidden512node | 53.651 | 0.0678 | 45.103 | 2.8904 | 47.043 |
| MLP_3hidden1024node | 44.403 | 0.1080 | 34.278 | 4.8297 | 38.946 |
| MLP_3hidden512node | 55.486 | 0.0707 | 40.721 | 5.5563 | 44.987 |
| MLP_4hidden | 47.706 | 0.0583 | 37.886 | 2.3706 | 38.303 |
| MLP_3hidden | 49.761 | 0.0512 | 36.082 | 3.0187 | 41.902 |
| MLP_2hidden | 48.844 | 0.0438 | 40.979 | 1.6916 | 41.516 |
Table 3 shows that the pre-trained models achieve the highest results: ResNet18 and VGG16 both exceed 95% accuracy on the test set. Among the networks trained from scratch, miniVGG achieves the best results, better than VGG16 and ResNet18 retrained from scratch.
We then continue experimenting with the miniVGG network to improve its results (a training-setup sketch follows this list):
- Experiment with different optimization algorithms such as Adam, SGD, and RMSProp
- Use L2 regularization
- Take the best optimizer and experiment with different learning rates
- Apply a learning-rate reduction schedule
- Use augmentation:
  - RandomHorizontalFlip
  - RandomGrayscale
  - RandomAdjustSharpness
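A hedged sketch of this training setup in PyTorch/torchvision; the augmentation probabilities, weight decay, and scheduler settings are assumptions, while the learning rates match those in Table 4:

```python
import torch
from torchvision import transforms

# Augmentations from the list above; probabilities are assumptions
train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomGrayscale(p=0.1),
    transforms.RandomAdjustSharpness(sharpness_factor=2, p=0.3),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

model = MiniVGG(num_classes=4)  # the sketch from earlier in this section

# Adam with L2 regularization expressed as weight decay (value assumed)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4)

# Reduce the learning rate when validation loss stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.1, patience=3
)
```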
Table 4: Experimental results on miniVGG
| Methods | Accuracy (%) | Loss | Val_Accuracy (%) | Val_Loss | Test_accuracy (%) |
|---|---|---|---|---|---|
| miniVGG_adam_l2_lr_0.0001_aug | 92.587 | 0.0072 | 89.948 | 0.3201 | 86.246 |
| miniVGG_adam_l2_lr_0.0003 | 95.045 | 0.0050 | 91.494 | 0.2484 | 88.431 |
| miniVGG_adam_l2_lr_0.0001 | 98.458 | 0.0024 | 87.628 | 0.3348 | 88.817 |
| miniVGG_adam_l2_lr_0.001 | 99.853 | 0.0004 | 88.144 | 0.349 | 87.917 |
| miniVGG_RMS | 86.165 | 0.0118 | 77.061 | 0.6142 | 74.293 |
| miniVGG_SGD | 95.559 | 0.0054 | 84.278 | 0.4104 | 84.190 |
| miniVGG_adam | 95.486 | 0.0043 | 87.886 | 0.3675 | 86.118 |
Table 4 shows that Adam is the best optimization algorithm for this problem. We then tested two different learning rates, which gave similar accuracy on the test set; applying augmentation did not improve on the original result.
Parameters used in the segmentation problem (see the augmentation sketch after this list):
- Adam as the optimization algorithm
- A learning-rate reduction schedule
- Augmentation:
  - RandomBrightnessContrast
  - HueSaturationValue
  - HorizontalFlip
  - IAAAdditiveGaussianNoise
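A sketch of this augmentation pipeline with albumentations; the probabilities are assumptions, and IAAAdditiveGaussianNoise, which was removed from recent albumentations releases, is stood in for by GaussNoise:

```python
import albumentations as A

train_aug = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.HueSaturationValue(p=0.3),
    A.GaussNoise(p=0.2),  # stands in for the deprecated IAAAdditiveGaussianNoise
])

# albumentations applies the same spatial transform to image and mask:
# augmented = train_aug(image=image, mask=mask)
```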
Table 5: Final results on the training set of segmentation models

| Name | iou/train | iou_banhmi | iou_banhtrang | iou_comtam | iou_pho | iou_clutter |
|---|---|---|---|---|---|---|
| Unet-ResNet34 | 0.8526 | 0.8262 | 0.8207 | 0.6916 | 0.7174 | 0.9037 |
| Unet-ResNet18 | 0.9158 | 0.9087 | 0.8832 | 0.8760 | 0.8744 | 0.9375 |
| Unet-VGG16 | 0.8818 | 0.8771 | 0.8636 | 0.7854 | 0.8211 | 0.9173 |
Table 6: Final results on the validation set of segmentation models

| Name | iou/valid | iou_banhmi | iou_banhtrang | iou_comtam | iou_pho | iou_clutter |
|---|---|---|---|---|---|---|
| Unet-ResNet34 | 0.8625 | 0.8273 | 0.8529 | 0.7083 | 0.7099 | 0.9084 |
| Unet-ResNet18 | 0.8828 | 0.8655 | 0.8897 | 0.7893 | 0.7571 | 0.9214 |
| Unet-VGG16 | 0.8716 | 0.8627 | 0.8713 | 0.7395 | 0.7463 | 0.9146 |
Tables 5 and 6 confirm once again that the U-Net model with a ResNet18 encoder pre-trained on ImageNet achieves the best results of the three encoders. The clutter class, which is the background, is segmented easily, with IoU above 0.9. Cơm tấm (broken rice) and phở are the two dishes with IoU between 0.7 and 0.78.
Thank you
- This is the result we achieved in a short time; thank you for reading the article to the end.
- If you find this post useful, please give the GitHub repo a star. I appreciate it!
Minh-Hai Tran (Harly)