Part 1: Conceptual overview
The image above shows, on the left, the first lines of a transcript file (our input) and, on the right, what we want to obtain in videoScope (our output). To get there, we need to go through three main steps:
- Segmentation
- Tokenization
- Time alignment
Segmentation
In LCP, the source text needs to be segmented: segments define the span within which end-users will look for sequences of words. Typically, segments will correspond to sentences.
SRT files often have one sentence per numbered block, as in the image above (one sentence in the two text lines of block 1, another sentence in the two text lines of block 2).
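As a sketch, splitting an SRT file into one segment per numbered block could look like the following. This is an illustration assuming well-formed input (blocks separated by blank lines, with an index line and a timecode line before the text); the function name `srt_segments` is our own, not part of any library:

```python
import re

def srt_segments(srt_text):
    """Split raw SRT text into segments, one per numbered block.

    Each block looks like:
        1
        00:00:01,000 --> 00:00:03,500
        first line of text
        second line of text
    """
    segments = []
    # Blocks are separated by one or more blank lines
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        # lines[0] is the block index, lines[1] the timecodes; the rest is text
        segments.append(" ".join(line.strip() for line in lines[2:]))
    return segments

example = """1
00:00:01,000 --> 00:00:03,500
Hello there, this is
the first sentence.

2
00:00:03,600 --> 00:00:05,000
And a second one."""

print(srt_segments(example))
# → ['Hello there, this is the first sentence.', 'And a second one.']
```

Note how the two text lines of each block are joined into a single segment, matching the one-sentence-per-block pattern described above.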
Tokenization
In addition, each segment is further divided into tokens. Tokens typically correspond to words, which roughly correspond to space-separated bits of text in the input. Consider `shouldn't`, which arguably corresponds to two tokens (`should` and `not`). For the sake of simplicity, we will also use `'` as a delimiter and accordingly map `shouldn't` to two tokens with the forms `shouldn` and `t`.
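Treating whitespace, punctuation and the apostrophe alike as delimiters can be sketched with a single regular expression; this is the simplification described above, not a full-fledged tokenizer:

```python
import re

def tokenize(segment):
    """Split a segment into tokens: runs of word characters, so that
    whitespace, punctuation and the apostrophe all act as delimiters."""
    return re.findall(r"\w+", segment)

print(tokenize("You shouldn't do that."))
# → ['You', 'shouldn', 't', 'do', 'that']
```

As intended, `shouldn't` comes out as the two forms `shouldn` and `t`.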
Time alignment
Time alignment is the process of reporting where the units previously defined (segments, tokens) fall along the time axis in the corpus.
SRT files do not provide timecodes for individual words, but we can use each block's timecodes to align the segments.
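Extracting the start and end times of a block from its timecode line is straightforward; here is a minimal sketch (the helper names `parse_timecode` and `block_span` are ours):

```python
def parse_timecode(tc):
    """Convert an SRT timecode like '00:00:03,500' to seconds."""
    hours, minutes, rest = tc.split(":")
    seconds, millis = rest.split(",")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000

def block_span(timecode_line):
    """Return (start, end) in seconds from a line like
    '00:00:01,000 --> 00:00:03,500'."""
    start, end = [t.strip() for t in timecode_line.split("-->")]
    return parse_timecode(start), parse_timecode(end)

print(block_span("00:00:01,000 --> 00:00:03,500"))
# → (1.0, 3.5)
```

Since a block corresponds to a segment, these spans give us segment-level alignment directly; word-level times would have to be estimated, which is why we stick to segments here.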
Plan
In the next part, we will first apply segmentation and tokenization. These will produce a text-only corpus that we can upload to catchphrase.
In a second step, we will add annotations to the data: (1) we will report which tokens are preceded by `'` (we will flag them as `shortened`), and (2) we will associate each segment with its original text (including the delimiter characters) for display purposes.
In the last step, we will add time alignment and associate each document with a video file, so we can upload the corpus to videoScope.