Create TensorFlow Dataset from TFRecord files

Thiago G. Martins
1 min read · Nov 5, 2021

Prototyping with YouTube 8M video-level features

This post experiments with the YouTube 8M video-level dataset. The goal is to create a tf.data.Dataset from a set of .tfrecord files.

Link to the original notebook used to create this post.

Requirements

This code works with TensorFlow 2.6.0.
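The notebook cell that printed the version is not reproduced here; a minimal sketch of such a check would be:

```python
import tensorflow as tf

# Print the installed TensorFlow version; the post was written against 2.6.0.
print(tf.__version__)
```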

2.6.0

Load data

The sample data were downloaded following the instructions available on the YouTube 8M dataset download page.

Load raw dataset

Import libraries and specify data_folder.

List .tfrecord files to be loaded.
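The two steps above might look like the sketch below. The folder path is taken from the file listing that follows; the use of glob and the variable name filenames are assumptions, since the original cells are not shown.

```python
import glob
import os

# Folder holding the downloaded shards. The path comes from the listing in
# this post; adjust it to wherever your own files live.
data_folder = "/home/default/video"

# Collect every .tfrecord shard in the folder, sorted for a stable order.
filenames = sorted(glob.glob(os.path.join(data_folder, "*.tfrecord")))

# Print one path per line.
print(*filenames, sep="\n")
```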

/home/default/video/train0093.tfrecord
/home/default/video/train3749.tfrecord

Load .tfrecord files into a raw (not parsed) dataset.
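A minimal sketch of this step, using tf.data.TFRecordDataset and the two shard paths listed above (the variable names are assumptions). The dataset is lazy, so no file is opened until it is iterated.

```python
import tensorflow as tf

# The two shard paths shown in the listing above.
filenames = [
    "/home/default/video/train0093.tfrecord",
    "/home/default/video/train3749.tfrecord",
]

# Each element of raw_dataset is one serialized tf.train.Example, held as a
# scalar tf.string tensor. Nothing is read from disk at this point.
raw_dataset = tf.data.TFRecordDataset(filenames)
print(raw_dataset.element_spec)
```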

Parse raw dataset

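According to the download section of the YouTube 8M dataset page, the video-level data are stored as TensorFlow Example protocol buffers. Each record carries four features: id (a bytes string), labels (a variable-length list of int64 values), mean_rgb (1024 floats) and mean_audio (128 floats).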

Create a function to parse the raw data:

Apply the parse function to each record contained in the raw_dataset:
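The original cells for these two steps are not shown, so the following is a sketch rather than the author's exact code. The feature spec is inferred from the shapes and types printed below; the name feature_description and the choice of VarLenFeature plus tf.sparse.to_dense for the labels are assumptions.

```python
import tensorflow as tf

# Feature spec inferred from the shapes and dtypes printed in this post.
feature_description = {
    "id": tf.io.FixedLenFeature([1], tf.string),
    "labels": tf.io.VarLenFeature(tf.int64),  # number of labels varies per video
    "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),
    "mean_audio": tf.io.FixedLenFeature([128], tf.float32),
}


def parse_function(serialized_example):
    """Parse one serialized tf.train.Example into a dict of dense tensors."""
    parsed = tf.io.parse_single_example(serialized_example, feature_description)
    # VarLenFeature yields a SparseTensor; convert it to a dense 1-D tensor.
    parsed["labels"] = tf.sparse.to_dense(parsed["labels"])
    return parsed


# Shard paths from the listing earlier in the post; files are read lazily.
filenames = [
    "/home/default/video/train0093.tfrecord",
    "/home/default/video/train3749.tfrecord",
]
raw_dataset = tf.data.TFRecordDataset(filenames)

# map() applies parse_function to every record, not to every file.
dataset = raw_dataset.map(parse_function)
print(dataset)
```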

<MapDataset shapes: {
id: (1,),
labels: (None,),
mean_audio: (128,),
mean_rgb: (1024,)
}, types: {
id: tf.string,
labels: tf.int64,
mean_audio: tf.float32,
mean_rgb: tf.float32
}>

Check parsed dataset
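On the real shards, the check amounts to iterating over dataset.take(1) and printing the record. So that the snippet runs without the download, the sketch below does the same against a single synthetic record written to a temporary file; the record values and the file name are made up, and the parser is an assumed one inferred from the shapes shown in this post.

```python
import os
import tempfile

import tensorflow as tf

feature_description = {
    "id": tf.io.FixedLenFeature([1], tf.string),
    "labels": tf.io.VarLenFeature(tf.int64),
    "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),
    "mean_audio": tf.io.FixedLenFeature([128], tf.float32),
}


def parse_function(serialized_example):
    """Parse one serialized tf.train.Example into a dict of dense tensors."""
    parsed = tf.io.parse_single_example(serialized_example, feature_description)
    parsed["labels"] = tf.sparse.to_dense(parsed["labels"])
    return parsed


# Build one synthetic record with the same layout as a video-level example.
example = tf.train.Example(features=tf.train.Features(feature={
    "id": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"demo"])),
    "labels": tf.train.Feature(int64_list=tf.train.Int64List(value=[0, 12])),
    "mean_rgb": tf.train.Feature(float_list=tf.train.FloatList(value=[0.5] * 1024)),
    "mean_audio": tf.train.Feature(float_list=tf.train.FloatList(value=[-1.0] * 128)),
}))

# Write it to a temporary shard (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "synthetic.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

dataset = tf.data.TFRecordDataset([path]).map(parse_function)

# Inspect the first parsed record.
for record in dataset.take(1):
    print(record)
```

Pointing the dataset at the real shards instead of the synthetic file should print a record shaped like the one shown next.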

{
'id': <tf.Tensor:
shape=(1,),
dtype=string,
numpy=array([b'eXbF'],
dtype=object)>,
'labels': <tf.Tensor:
shape=(2,),
dtype=int64,
numpy=array([ 0, 12])>,
'mean_audio': <tf.Tensor:
shape=(128,),
dtype=float32,
numpy=array([-1.2556146 , 0.17297305, ..., 0.81667864], dtype=float32)>,
'mean_rgb': <tf.Tensor:
shape=(1024,),
dtype=float32,
numpy=array([ 0.5198898 , 0.30175963, ..., -0.48050806], dtype=float32)>
}
