One part of the development cycle that consumes a lot of time is creating a data loader. Almost every real-world project requires the developer to plan this part carefully, and the bigger the dataset, the more effort is needed to build an efficient loader.
In this blog post, I want to share three methods of data loading that are useful for any Keras project.
Built-in Image Generator
This generator assumes your data is arranged in the following format:
└───data
    ├───test_set
    │   ├───cats
    │   └───dogs
    └───training_set
        ├───cats
        └───dogs
The images are placed in folders named after their respective classes, and the test and training sets share the same classes. This arrangement lets you use Keras’ built-in ImageDataGenerator feature.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def get_generator(data_path, batch_size, image_size, label_type, training=True):
    if training:
        # Apply data augmentation only on the training set
        datagen = ImageDataGenerator(
            width_shift_range=0.1, height_shift_range=0.1,
            shear_range=0.2, zoom_range=0.2, horizontal_flip=True,
            rescale=1. / 255.0
        )
        shuffle = True
    else:
        datagen = ImageDataGenerator(
            rescale=1. / 255.0
        )
        shuffle = False
    data_flow = datagen.flow_from_directory(
        directory=data_path, batch_size=batch_size,
        target_size=image_size, class_mode=label_type,
        shuffle=shuffle
    )
    return data_flow
In the code above, data_path is the directory containing all the class folders, e.g. data/training_set or data/test_set. Training and testing data usually require different preprocessing steps; you can apply the transformations you want by passing the corresponding options to ImageDataGenerator.
The flow_from_directory method does the heavy lifting of pulling out and preprocessing your data. One thing to note is the class_mode argument: you have to choose the right class mode to ensure the labels are processed correctly. Refer to the Keras documentation for more control options.
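Here is a quick sketch of how the generator might be used with the directory layout above (the paths, batch size, and image size are placeholder values):
# Placeholder paths and hyperparameters; adjust to your own dataset.
train_flow = get_generator("data/training_set", batch_size=32,
                           image_size=(224, 224), label_type="categorical",
                           training=True)
test_flow = get_generator("data/test_set", batch_size=32,
                          image_size=(224, 224), label_type="categorical",
                          training=False)
# Both generators can be passed directly to model.fit and model.evaluate.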
Custom Data Generator 1
Sometimes you may need preprocessing steps that are not available in the built-in feature. This is when a custom data loader is useful. Take the example dataset below,
data/test_set\cats\cat.4999.jpg cats
data/test_set\cats\cat.5000.jpg cats
data/test_set\dogs\dog.4001.jpg dogs
data/test_set\dogs\dog.4002.jpg dogs
The dataset is for a cats-vs-dogs classifier. All the paths and labels are saved into train and test txt files, which can be loaded into lists using the code below.
with open("train_set.txt", "r") as f:
    data = f.readlines()

img_path_list, label_list = [], []
for d in data:
    img_path, label = d.strip().split('\t')
    img_path_list.append(img_path)
    label_list.append(label)
img_path_list stores all the image paths, while label_list stores their respective labels. Your custom data loader will then look something like this:
import os
from pathlib import Path

import cv2
import numpy as np
from tensorflow.keras.utils import Sequence, to_categorical

class Custom_ImageGenerator(Sequence):
    def __init__(self, data_filenames, labels, label_encode, batch_size, image_size):
        self.data_filenames = data_filenames
        self.labels = labels
        self.label_encode = label_encode
        self.batch_size = batch_size
        self.image_size = image_size
        self.work_dir = Path(".").parent.absolute()

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.data_filenames) / self.batch_size))

    def __getitem__(self, idx):
        # Slice out the file paths and labels for this batch
        data_batch = self.data_filenames[idx * self.batch_size : (idx + 1) * self.batch_size]
        data_labels_batch = self.labels[idx * self.batch_size : (idx + 1) * self.batch_size]

        temp = []
        for img_path in data_batch:
            img = cv2.imread(os.path.join(self.work_dir, img_path))
            img = cv2.resize(img, self.image_size)
            img = img / 255.
            temp.append(img)

        encoded_label = [self.label_encode[label] for label in data_labels_batch]
        processed_img_batch = np.array(temp)
        encoded_label_batch = np.array(to_categorical(encoded_label))
        return processed_img_batch, encoded_label_batch
The whole process of loading and processing the data then looks like this…
if __name__ == "__main__":
    # create_files()
    label_encode = {"cats": 0, "dogs": 1}

    with open("train_set.txt", "r") as f:
        data = f.readlines()

    img_path_list, label_list = [], []
    for d in data:
        img_path, label = d.strip().split('\t')
        img_path_list.append(img_path)
        label_list.append(label)

    train_gen = Custom_ImageGenerator(img_path_list, label_list, label_encode,
                                      batch_size=64, image_size=(224, 224))
To understand how the code works, you need to know a few things. The main ingredient in our custom-made class is Sequence. This is a built-in Keras feature that is optimized for multiprocessing. According to the official documentation, “this structure guarantees that the network will only train once on each sample per epoch, which is not the case with generators.”
By inheriting from Sequence, you must implement two methods, __getitem__ and __len__.
__getitem__: returns the batch of data specified by the index argument.
__len__: returns the number of batches per epoch.
The main thing you need to decide is what happens inside __getitem__: how to fetch and preprocess the data for a given batch_size and index. In this example, I store the image paths and labels as lists, which can be sliced accordingly. All data preprocessing is performed in this method.
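As a minimal usage sketch, assuming model is an already-compiled Keras model whose input and output shapes match the generator’s batches (on recent Keras versions the workers and use_multiprocessing arguments may need to be dropped):
# Minimal sketch: `model` is an assumed, already-compiled Keras model.
model.fit(
    train_gen,               # a Sequence can be passed directly to fit()
    epochs=10,
    workers=4,               # Sequence is safe to load with parallel workers
    use_multiprocessing=True,
)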
Custom Data Generator 2
In some projects, I found that the previous custom generator is not flexible enough. I was working on a CRNN model for Optical Character Recognition (OCR) when I discovered that the training and test models have different input shapes. One way to keep using the previous custom generator would be to change the model so that training and testing share the same input shape.
But since most (almost all) open-source research does not do it this way, it is wise not to reinvent the process, so that you have more references to compare against. And the built-in Keras data loader is not usable here if you are loading from raw data, because the labels are preprocessed differently.
For this, we are going to modify the code a little bit. The data is stored in the same manner as before: image paths and labels. So the first step is to load the txt file to get that information. Then come the changes.
import abc

import numpy as np
from tensorflow.keras.utils import Sequence

class DataLoader(Sequence, metaclass=abc.ABCMeta):
    def __init__(self, filenames, labels, char_list, batch_size, training, image_size=(32, 128)):
        self.filenames = filenames
        self.labels = labels
        self.char_list = char_list
        self.batch_size = batch_size
        self.training = training
        self.w = image_size[0]
        self.h = image_size[1]
        # Maximum label length; set this to the longest label before training
        self.max_label_len = 0

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.filenames) / self.batch_size))

    def __getitem__(self, idx):
        data_batch = self.filenames[idx * self.batch_size: (idx + 1) * self.batch_size]
        label_batch = self.labels[idx * self.batch_size: (idx + 1) * self.batch_size]
        X, y = self._generate_data(data_batch, label_batch)
        return X, y

    def _preprocess_image(self, img, w, h):
        # Pad the image with white pixels up to the target width and height
        if w < self.w:
            add_zeros = np.ones((self.w - w, h)) * 255.0
            img = np.concatenate((img, add_zeros))
        if h < self.h:
            add_zeros = np.ones((self.w, self.h - h)) * 255.0
            img = np.concatenate((img, add_zeros), axis=1)
        img = np.expand_dims(img, axis=2)
        # Normalize images
        img = img / 255.0
        return img

    def encode_to_labels(self, txt, char_list):
        # Encode each output word into digits
        dig_lst = []
        for _, char in enumerate(txt):
            try:
                dig_lst.append(char_list.index(char))
            except ValueError:
                # Character not found in char_list
                print(char)
        return dig_lst

    @abc.abstractmethod
    def _generate_data(self, data_batch, label_batch):
        # Method to generate batches of data
        pass
Our data loader still inherits from Sequence. The addition is using abc.ABCMeta as the metaclass. Through the @abc.abstractmethod decorator, the abc library marks the decorated method as abstract. In our example, we declared _generate_data as abstract so that it must be implemented somewhere else.
That somewhere else is…
import cv2
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

class TrainGenerator(DataLoader):
    def _generate_data(self, data_batch, label_batch):
        batch_imgs, batch_labels, input_len, label_len = [], [], [], []
        for img_path, label in zip(data_batch, label_batch):
            img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
            # Preprocess image; skip images larger than the target size
            w, h = img.shape
            if h > self.h or w > self.w:
                continue
            img = self._preprocess_image(img, w, h)
            label = self.encode_to_labels(label, self.char_list)
            batch_imgs.append(img)
            batch_labels.append(label)
            input_len.append(self.w - 1)
            label_len.append(len(label))

        batch_labels = pad_sequences(batch_labels, maxlen=self.max_label_len,
                                     padding='post', value=len(self.char_list))
        batch_imgs = np.array(batch_imgs).astype('float32')
        batch_labels = np.array(batch_labels).astype('float32')
        inputs_len = np.array(input_len).astype('float32')
        labels_len = np.array(label_len).astype('float32')
        # Training output: labels and lengths are fed as extra model inputs for the CTC loss
        return [batch_imgs, batch_labels, inputs_len, labels_len], batch_labels
class ValidationGenerator(DataLoader):
    def _generate_data(self, data_batch, label_batch):
        batch_imgs, batch_labels = [], []
        for img_path, label in zip(data_batch, label_batch):
            img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
            # Preprocess image; skip images larger than the target size
            w, h = img.shape
            if h > self.h or w > self.w:
                continue
            img = self._preprocess_image(img, w, h)
            label = self.encode_to_labels(label, self.char_list)
            batch_imgs.append(img)
            batch_labels.append(label)

        batch_labels = pad_sequences(batch_labels, maxlen=self.max_label_len,
                                     padding='post', value=len(self.char_list))
        batch_imgs = np.array(batch_imgs).astype('float32')
        # Validation output: plain (images, labels) pairs
        return batch_imgs, batch_labels
Just by observing the outputs, you can see that the training and validation data need to be processed differently. Both classes inherit from the custom data loader, which lets them reuse methods such as _preprocess_image and encode_to_labels while implementing _generate_data separately.
You only need to instantiate TrainGenerator and ValidationGenerator in the main function to load the data, as sketched below. These data loading techniques are not limited to images; as you can see, text labels can be preprocessed the same way. It’s very useful!
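A minimal sketch, assuming the path and label lists have already been parsed from the txt files and that char_list holds every character that can appear in a label (the variable names below are placeholders):
# Placeholder names: train_paths/train_labels and val_paths/val_labels are
# assumed to be lists parsed from the txt files, and char_list is the list
# of all characters that can appear in a label.
train_gen = TrainGenerator(train_paths, train_labels, char_list,
                           batch_size=32, training=True, image_size=(32, 128))
val_gen = ValidationGenerator(val_paths, val_labels, char_list,
                              batch_size=32, training=False, image_size=(32, 128))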
That’s it for this post. I hope you learned something useful. Thanks. You can view the code used here.