Create TensorFlow Dataset from TFRecord files
Prototyping with YouTube 8M video-level features
This post experiments with the YouTube 8M video-level dataset. The goal is to create a tf.data.Dataset from a set of .tfrecord files.
Link to the original notebook used to create this post.

Requirements
This code works with TensorFlow 2.6.0.
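To confirm which version is installed, the version string can be printed directly. This is a minimal check and assumes TensorFlow is importable in the current environment.

```python
import tensorflow as tf

# Print the version of the installed TensorFlow package.
print(tf.__version__)
```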
2.6.0
Load data
The sample data were downloaded following the instructions on the YouTube 8M dataset download page.
Load raw dataset
Import libraries and specify data_folder.
List the .tfrecord files to be loaded.
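One way to collect the files is a glob over the data folder. The sketch below is an assumption about how the listing was produced: list_tfrecord_files is a hypothetical helper, and the folder path is taken from the file paths printed in this post.

```python
import glob
import os


def list_tfrecord_files(data_folder):
    """Return the sorted paths of all .tfrecord files directly inside data_folder."""
    pattern = os.path.join(data_folder, "*.tfrecord")
    return sorted(glob.glob(pattern))


# Folder inferred from the printed paths; adjust it to your own setup.
data_folder = "/home/default/video"
filenames = list_tfrecord_files(data_folder)
for filename in filenames:
    print(filename)
```

Sorting makes the file order reproducible across runs, since glob does not guarantee any particular order.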
/home/default/video/train0093.tfrecord
/home/default/video/train3749.tfrecord
Load the .tfrecord files into a raw (not yet parsed) dataset.
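A raw dataset of this kind can be built with tf.data.TFRecordDataset, which accepts one or more file paths. The wrapper name load_raw_dataset is hypothetical and only serves to keep the sketch self-contained.

```python
import tensorflow as tf


def load_raw_dataset(filenames):
    """Build a dataset of serialized (unparsed) records from TFRecord files."""
    # Each element is a scalar tf.string tensor holding one serialized record.
    return tf.data.TFRecordDataset(filenames)
```

Calling load_raw_dataset on the list of .tfrecord paths gives the raw dataset. Nothing is decoded at this stage, and the files are only read when the dataset is iterated.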
Parse raw dataset
According to the download section of the YouTube 8M dataset page, the video-level data are stored as TensorFlow Example protocol buffers. Each record holds four features: id, labels, mean_rgb, and mean_audio.
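To make the record layout concrete, the sketch below builds one such Example by hand. The feature names and sizes (1024 values for mean_rgb, 128 for mean_audio) are taken from the parsed output shown in this post. The helper make_video_example and all values are synthetic, not real YouTube 8M data.

```python
import tensorflow as tf


def make_video_example(video_id, labels, mean_rgb, mean_audio):
    """Pack one video's features into a tf.train.Example (illustrative helper)."""
    feature = {
        # Video id stored as a single byte string.
        "id": tf.train.Feature(bytes_list=tf.train.BytesList(value=[video_id])),
        # Variable-length list of integer label ids.
        "labels": tf.train.Feature(int64_list=tf.train.Int64List(value=labels)),
        # Fixed-length float vectors.
        "mean_rgb": tf.train.Feature(float_list=tf.train.FloatList(value=mean_rgb)),
        "mean_audio": tf.train.Feature(float_list=tf.train.FloatList(value=mean_audio)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


# Synthetic values, used only to show the structure.
example = make_video_example(b"abcd", [0, 12], [0.0] * 1024, [0.0] * 128)
serialized = example.SerializeToString()
```

Printing example shows the protocol buffer in its text format, which is a convenient way to compare a hand-built record with the layout described on the download page.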
Create a function that parses each serialized record into a dictionary of tensors.
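A parse function consistent with the shapes and types printed in this post could look like the following. It is a sketch rather than the original notebook code: the names FEATURE_DESCRIPTION and parse_example are assumptions, and treating labels as a variable-length feature is inferred from its (None,) shape.

```python
import tensorflow as tf

# Expected layout of one record; sizes follow the parsed output in this post.
FEATURE_DESCRIPTION = {
    "id": tf.io.FixedLenFeature([1], tf.string),
    "labels": tf.io.VarLenFeature(tf.int64),
    "mean_audio": tf.io.FixedLenFeature([128], tf.float32),
    "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),
}


def parse_example(serialized_example):
    """Decode one serialized Example into a dict of dense tensors."""
    parsed = tf.io.parse_single_example(serialized_example, FEATURE_DESCRIPTION)
    # VarLenFeature yields a SparseTensor; convert it to a dense 1-D tensor.
    parsed["labels"] = tf.sparse.to_dense(parsed["labels"])
    return parsed
```

The number of labels differs from video to video, so labels cannot be declared with a fixed length like the other three features.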
Apply the parse function to each record in raw_dataset.
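Applying a function to every element is what Dataset.map does. The sketch below shows the call on stand-in data so that it runs on its own. The helper parse_dataset, the toy byte strings, and the toy parser are all hypothetical. In this post the same call would receive the raw dataset and the parse function.

```python
import tensorflow as tf


def parse_dataset(raw_dataset, parse_fn):
    """Apply parse_fn to every element of raw_dataset."""
    return raw_dataset.map(parse_fn)


# Stand-in data: three byte strings and a toy parser, only to show the call.
toy_raw = tf.data.Dataset.from_tensor_slices([b"a", b"bb", b"ccc"])
toy_parsed = parse_dataset(toy_raw, lambda record: {"length": tf.strings.length(record)})

# element_spec describes the structure of each parsed element.
print(toy_parsed.element_spec)
```

The mapping is lazy: the parse function is traced once to determine shapes and types, and records are decoded only when the dataset is iterated.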
<MapDataset shapes: {
id: (1,),
labels: (None,),
mean_audio: (128,),
mean_rgb: (1024,)
}, types: {
id: tf.string,
labels: tf.int64,
mean_audio: tf.float32,
mean_rgb: tf.float32
}>
Check parsed dataset
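A quick way to inspect a dataset is to take its first element and print it. The sketch below uses a stand-in dataset so that it is runnable by itself, and first_records is a hypothetical helper. The record printed after it in this post comes from the parsed YouTube 8M dataset, not from this toy example.

```python
import tensorflow as tf


def first_records(dataset, n=1):
    """Return the first n elements of a dataset as a list."""
    return list(dataset.take(n))


# Stand-in dataset; replace it with the parsed dataset to inspect real records.
toy_dataset = tf.data.Dataset.range(5)
for record in first_records(toy_dataset, n=1):
    print(record)
```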
{
'id': <tf.Tensor:
shape=(1,),
dtype=string,
numpy=array([b'eXbF'],
dtype=object)>,
'labels': <tf.Tensor:
shape=(2,),
dtype=int64,
numpy=array([ 0, 12])>,
'mean_audio': <tf.Tensor:
shape=(128,),
dtype=float32,
numpy=array([-1.2556146 , 0.17297305, ..., 0.81667864], dtype=float32)>,
'mean_rgb': <tf.Tensor:
shape=(1024,),
dtype=float32,
numpy=array([ 0.5198898 , 0.30175963, ..., -0.48050806], dtype=float32)>
}