Part 2: Processing the data
Setup
Make sure you have `lcpcli` installed in your local Python environment:

```
pip install lcpcli==0.2.2
```
The `lcpcli` library provides a tool to help us build LCP corpora in Python.
Corpus builder
At the simplest level, creating an LCP corpus requires very few steps:
```python
from lcpcli.builder import Corpus

corpus = Corpus("my corpus")
corpus.Document(
    corpus.Segment(
        corpus.Token("Hello"),
        corpus.Token("world")
    )
)
corpus.make()
```
The lines above create a text-only corpus consisting of a single document, with a single segment containing two tokens whose forms are `Hello` and `world`. The last line, `corpus.make()`, builds and outputs the files for upload in the current directory.
Our implementation will be a little more sophisticated than this hello-world example, in that we will have multiple segments and two documents, but the basic idea will remain the same.
Segmentation and Tokenization
The `Corpus` class allows us to define arbitrary names instead of `Token`, `Segment` and `Document`. In this tutorial, we will use the more common (albeit less technical) terms `Word`, `Sentence` and `Clip` instead:
```python
corpus = Corpus(
    "Tutorial corpus",
    document="Clip",
    segment="Sentence",
    token="Word"
)
```
To start with, we will only process one SRT file, `database_explorer.srt`, so we will only create a single clip in the corpus, using `clip = corpus.Clip()`.
A first rough approach would be to treat every non-empty line of the SRT file as a segment and split it by space to get words:
```python
clip = corpus.Clip()
with open("database_explorer.srt", "r") as input:
    while line := input.readline():
        line = line.strip()
        if not line:
            continue
        clip.Sentence(*[
            corpus.Word(w.strip())
            for w in line.split(" ")
            if w.strip()
        ])
clip.make()
```
`.Sentence` can be called either on `clip`, as in this snippet, or on `corpus`, which is the method used in the hello-world example (with `.Segment`, respectively). In the hello-world example, the segment is created directly as an argument of `corpus.Document` and, as such, belongs to it; here, the script does not pass sentences as direct arguments of the clip. Instead, the sentences are created on the fly, which is why it is necessary to call `.Sentence` on `clip`.
Although functional, the snippet above outputs unusable data: because each line is mapped to one sentence, some sentences will consist of just a block number, some of timestamps, and many actually correspond to chunks of sentences distributed over multiple transcription blocks.
This commented python script extends the logic illustrated in the snippet to ignore the transcription block numbers and timestamps, and uses sentence delimiters for segmentation instead of mapping one line to one sentence:
- If the line reports a block number or the block's timestamps, it is ignored
- If the line is a linebreak, the next two lines are flagged as a block number and its timestamps
- If the line has some text, it is split by sentence delimiters (`.`, `!`, `?`) and each resulting chunk is in turn split by token delimiters (space, `,`, `'`); each resulting word is added to the current sentence
Whenever the script hits a sentence delimiter, it creates a new sentence with `sentence = clip.Sentence()`, and whenever it processes a word, it adds it to the current sentence with `sentence.Word(w)`.
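The block-skipping and delimiter logic described above can be sketched as a plain Python function, independent of `lcpcli`. The function name `srt_to_sentences` and the exact delimiter handling are illustrative assumptions, not the actual script:

```python
import re

def srt_to_sentences(srt_text):
    """Split SRT transcription text into sentences of words,
    ignoring block numbers and timestamp lines."""
    sentences = [[]]
    for line in srt_text.splitlines():
        line = line.strip()
        # Ignore empty lines, block numbers, and timestamp lines
        if not line or line.isdigit() or "-->" in line:
            continue
        # Split the text on sentence delimiters, keeping the delimiters
        for chunk in re.split(r"([.!?])", line):
            if chunk in ".!?":
                # Sentence delimiter: start a new sentence
                if sentences[-1]:
                    sentences.append([])
                continue
            # Split each chunk on token delimiters (space, comma, quote)
            for w in re.split(r"[ ,']", chunk):
                if w:
                    sentences[-1].append(w)
    return [s for s in sentences if s]
```

Note that a sentence keeps accumulating words across blocks until a delimiter is hit, which is exactly why one line cannot be mapped to one sentence.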
Both `.Sentence` and `.Word` can be called on `corpus` or on their parent (respectively, `clip` and `sentence`), depending on whether you decide to pass them as direct arguments of their parent.
Import
Visit catchphrase and create a new corpus collection, then open its settings by clicking the gear icon, click the "API" tab and create a new API key; write down the key and the secret.
Open a terminal and run the following command, replacing `path/to/output` with a path pointing to the folder containing the files generated by the script, `$API_KEY` with the key you just got, `$API_SECRET` with the secret you just got, and `$PROJECT` with the name of the collection you just created:

```
lcpcli -c path/to/output/ -k $API_KEY -s $API_SECRET -p "$PROJECT" --live
```
You should get a confirmation message, and your corpus should now be visible in your collection after you refresh the page!
Multiple documents
To parse multiple SRT files, the edits to the script are minimal: one just needs to iterate over the SRT files to create a dedicated instance of `Clip` for each, and one can also report the name of the file for reference purposes, as in `clip = corpus.Clip(name="database_explorer")`. Another option is to set the attribute on the instance, as in `clip.name = "database_explorer"`. This applies to all entities, not just to documents.
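That per-file loop could be sketched as follows; the helper name `clip_name` and the `*.srt` glob over the working directory are assumptions, and the processing body is elided:

```python
from pathlib import Path

def clip_name(srt_path):
    # Derive a clip name from the SRT filename, minus its extension
    return Path(srt_path).stem

# Hypothetical per-file loop; `corpus` is the lcpcli Corpus built above:
# for srt_file in sorted(Path(".").glob("*.srt")):
#     clip = corpus.Clip(name=clip_name(srt_file))
#     ...  # process srt_file into clip as before
#     clip.make()
```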
An updated version of the script that processes all the SRT documents in the working directory can be found here.
Adding annotations
For the sake of illustration, we will now add two pieces of annotation:
- one at the sentence level, reporting the original text of each sentence, i.e. including the token-delimiter characters
- one at the word level, reporting whether each word was preceded by a single quote
An updated python script that does that can be found here.
The original text of the sentence is set with `sentence.original = original` (where `original` is a string containing the original text), and whether a word is preceded by `'` is set when creating the word instance, as in `sentence.Word(word, shortened=shortened)` (where `shortened` is appropriately set to `"yes"` or `"no"`).
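The word-level annotation can be computed while tokenizing. Here is a minimal, `lcpcli`-free sketch of that logic; the function name and the delimiter set are assumptions made for illustration:

```python
def split_with_shortened(chunk):
    """Split a text chunk into (word, shortened) pairs, where shortened
    is "yes" if the word was preceded by a single quote; e.g. "it's"
    yields ("it", "no") and ("s", "yes")."""
    words = []
    shortened = "no"
    token = ""
    for ch in chunk:
        if ch in " ,'":
            # Token delimiter: flush the current word
            if token:
                words.append((token, shortened))
            token = ""
            # Flag the next word if the delimiter was a single quote
            shortened = "yes" if ch == "'" else "no"
        else:
            token += ch
    if token:
        words.append((token, shortened))
    return words
```

Each pair could then be passed along as `sentence.Word(word, shortened=shortened)`.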
Time alignment and video
Associating a document with a video only takes one command: `clip.set_media("clip", "database_explorer.mp4")`. The first argument of `set_media` is a string defining the name of the media slot for the media file. Unlike this corpus, some video corpora can have more than one video per document; for example, scenes could be filmed from two different angles in parallel, resulting in two video files per document, so that one media slot could be named `"left"` and the other `"right"`.
The challenging part is time alignment. Command-wise, it again only takes one line to align a segment: `sentence.set_time(0, 25)`. This would align the sentence from 0s to 1s (remember that LCP uses a convention of 25 frames per second).
We will use the timestamps of the transcription blocks to time-align the sentences, while leaving the words unaligned. Because a sentence can span multiple blocks, and a block can contain more than one sentence, we have to be smart about it: a sentence that starts after a sentence delimiter takes the start timecode of the current block, whereas a sentence that continues across blocks keeps the start timecode of the block in which it started.
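Converting SRT timecodes to LCP frame counts could look like the following sketch, assuming timecodes of the form `HH:MM:SS,mmm`; the function name is illustrative, and the resulting numbers are what one would pass to `sentence.set_time`:

```python
def srt_timestamp_to_frames(ts):
    """Convert an SRT timestamp like "00:01:02,500" to a frame
    count at LCP's convention of 25 frames per second."""
    hms, ms = ts.split(",")
    h, m, s = (int(x) for x in hms.split(":"))
    seconds = h * 3600 + m * 60 + s + int(ms) / 1000
    return round(seconds * 25)
```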
An updated python script that implements time alignment can be found here.
Finally, place `database_explorer.mp4` and `presenter_pro.mp4` in your folder so they can be included in the output folder and uploaded to videoScope. (The MP4 files next to `convert.py` in the repository are empty placeholder files and will not play back in videoScope.)