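Table of Contents
- 1. Introduction to the Classify Leaves Competition
- 2. Data Analysis
- 2.1 Training data statistics and inspection
- 2.2 Test data statistics and analysis
- 2.3 Visualizing the training data
- 3. Summary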
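1. Introduction to the Classify Leaves Competition
Description: leaf species classification with 176 classes, 18,353 training images, and 8,800 test images; each class has at least 50 training images. The official description reads: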
The task is predicting categories of leaf images. This dataset contains 176 categories, 18353 training images, 8800 test images. Each category has at least 50 images for training. The test set is split evenly into the public and private leaderboard.
The evaluation metric for this competition is Classification Accuracy.
Good luck and have fun!
Kaggle competition page:
https://www.kaggle.com/c/classify-leaves
2. Data Analysis
import pandas as pd
import numpy as np
from d2l import torch as d2l
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
import os
from PIL import Image
from torchvision import transforms
2.1 Training data statistics and inspection
train_data = pd.read_csv('./data/classify-leaves/train.csv')
train_data.head()
| | image | label |
|---|---|---|
| 0 | images/0.jpg | maclura_pomifera |
| 1 | images/1.jpg | maclura_pomifera |
| 2 | images/2.jpg | maclura_pomifera |
| 3 | images/3.jpg | maclura_pomifera |
| 4 | images/4.jpg | maclura_pomifera |
train_data.describe()
| | image | label |
|---|---|---|
| count | 18353 | 18353 |
| unique | 18353 | 176 |
| top | images/8006.jpg | maclura_pomifera |
| freq | 1 | 353 |
train_data['label'].value_counts()
maclura_pomifera 353
ulmus_rubra 235
prunus_virginiana 223
acer_rubrum 217
broussonettia_papyrifera 214
...
cedrus_deodara 58
ailanthus_altissima 58
crataegus_crus-galli 54
evodia_daniellii 53
juniperus_virginiana 51
Name: label, Length: 176, dtype: int64
labels_unique = train_data['label'].unique()
labels_unique
array(['maclura_pomifera', 'ulmus_rubra', 'broussonettia_papyrifera','prunus_virginiana', 'acer_rubrum', 'cryptomeria_japonica','staphylea_trifolia', 'asimina_triloba', 'diospyros_virginiana','tilia_cordata', 'ulmus_pumila', 'quercus_muehlenbergii','juglans_cinerea', 'cercis_canadensis', 'ptelea_trifoliata','acer_palmatum', 'catalpa_speciosa', 'abies_concolor','eucommia_ulmoides', 'quercus_montana', 'koelreuteria_paniculata',..., 'sassafras_albidum', 'acer_griseum','ailanthus_altissima', 'pinus_thunbergii', 'crataegus_crus-galli','juniperus_virginiana'], dtype=object)
labelencoder = LabelEncoder()
labelencoder.fit(train_data['label'])
train_data['label'] = labelencoder.transform(train_data['label'])
label_map = dict(zip(labelencoder.classes_,labelencoder.transform(labelencoder.classes_)))
label_inv_map = {v:k for k,v in label_map.items()}
label_map
{'abies_concolor': 0,'abies_nordmanniana': 1,'acer_campestre': 2,'acer_ginnala': 3,'acer_griseum': 4,'acer_negundo': 5,'acer_palmatum': 6,'acer_pensylvanicum': 7,'acer_platanoides': 8,'acer_pseudoplatanus': 9,'acer_rubrum': 10,...'zelkova_serrata': 175}
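The mapping above runs from `abies_concolor` (0) to `zelkova_serrata` (175) because `LabelEncoder` assigns integer ids in sorted label order. The sketch below reproduces that idea in plain Python on a few labels taken from the output above. The helper name `fit_label_map` and the four-item sample are assumptions made for illustration; they are not part of the original notebook.

```python
def fit_label_map(labels):
    """Map each distinct label to an integer id, in sorted label order."""
    classes = sorted(set(labels))
    return {name: idx for idx, name in enumerate(classes)}


# A small sample of labels from the training set (illustrative subset only).
sample_labels = ["maclura_pomifera", "ulmus_rubra", "acer_rubrum", "maclura_pomifera"]

label_map = fit_label_map(sample_labels)
label_inv_map = {idx: name for name, idx in label_map.items()}

# Encode for training, then decode again (as needed when writing predictions).
encoded = [label_map[name] for name in sample_labels]
decoded = [label_inv_map[idx] for idx in encoded]

print(encoded)
print(decoded == sample_labels)
```

The inverse map is what turns predicted class ids back into species names when preparing a submission file.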
top20_trainData = train_data['label'].value_counts().sort_values(ascending=False).head(20)
print(top20_trainData)
plt.figure(figsize=(15,10))
sns.barplot(x=top20_trainData.index,y=top20_trainData)
plt.xticks(rotation=70)
plt.title("Top 20 categories of leaf statistics")
plt.show()
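The counts above suggest the classes are noticeably imbalanced. As a rough check, the following sketch computes the ratio between the largest and smallest class and the mean number of images per class, using only figures copied from the outputs above. The dictionary holds just four of the 176 classes, which is enough here because it includes the largest and the smallest.

```python
# Counts copied from the value_counts() output above
# (only the two largest and two smallest classes are listed).
counts = {
    "maclura_pomifera": 353,
    "ulmus_rubra": 235,
    "evodia_daniellii": 53,
    "juniperus_virginiana": 51,
}
total_images, num_classes = 18353, 176

# Ratio between the most and least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())
# Average number of training images per class.
mean_per_class = total_images / num_classes

print(f"largest / smallest class: {imbalance_ratio:.2f}")
print(f"mean images per class: {mean_per_class:.1f}")
```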
2.2 Test data statistics and analysis
test_data = pd.read_csv('./data/classify-leaves/test.csv')
test_data
| | image |
|---|---|
| 0 | images/18353.jpg |
| 1 | images/18354.jpg |
| 2 | images/18355.jpg |
| 3 | images/18356.jpg |
| 4 | images/18357.jpg |
| ... | ... |
| 8795 | images/27148.jpg |
| 8796 | images/27149.jpg |
| 8797 | images/27150.jpg |
| 8798 | images/27151.jpg |
| 8799 | images/27152.jpg |
8800 rows × 1 columns
test_data.describe()
| | image |
|---|---|
| count | 8800 |
| unique | 8800 |
| top | images/20051.jpg |
| freq | 1 |
2.3 Visualizing the training data
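folder_path = "./data/classify-leaves/"
# 3 x 4 grid showing the first 12 training images
fig, ax = plt.subplots(nrows=3, ncols=4, sharex=True, sharey=True, figsize=(18, 12))
ax = ax.flatten()
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
for i in range(12):
    img_path = os.path.join(folder_path, train_data['image'][i])
    data = Image.open(img_path)
    data = transform(data)
    # ToTensor returns (C, H, W); imshow expects (H, W, C).
    # permute((2, 1, 0)) would also swap height and width, transposing the image.
    ax[i].imshow(data.permute((1, 2, 0)))
    # The label column was integer-encoded in 2.1, so the title is the class id
    ax[i].set(title=train_data['label'][i])
    ax[i].title.set_size(25)
# The axes are shared, so clearing the ticks on the first subplot clears them all
ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()
plt.show()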
3. Summary
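- Both the training and test images are RGB. Converting them to grayscale is worth trying, as a way to check whether color features matter much for this dataset.
- The training and test images all have a uniform size, but the leaf occupies a fairly small part of each image, so cropping is worth considering.
- The training set is not particularly large, so data augmentation can be used to enlarge it.
- Because the dataset is small, k-fold cross-validation may give a better result.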
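The k-fold idea in the last point can be sketched as follows. This is a minimal pure-Python illustration, not the method used in the original post: it splits sample indices into k folds while keeping each class spread evenly across folds, which may matter here because the smallest classes have only about 50 images. The helper names `stratified_kfold_indices` and `train_valid_splits` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def train_valid_splits(folds):
    """Yield (train_indices, valid_indices), using each fold once for validation."""
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


# Toy labels: 3 classes with 10 samples each (made-up data).
toy_labels = [c for c in range(3) for _ in range(10)]
folds = stratified_kfold_indices(toy_labels, k=5)
for train_idx, valid_idx in train_valid_splits(folds):
    print(len(train_idx), len(valid_idx))
```

In practice the index lists would select rows of `train_data` for each round, with the model trained k times and the validation accuracies averaged.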