Part 2: Processing the data
Setup
Make sure you have `lcpcli` installed in your local Python environment:

```
pip install lcpcli==0.2.2
```
The `lcpcli` library provides a tool to help us build LCP corpora in Python.
Corpus builder
At the simplest level, creating an LCP corpus requires very few steps:
```python
from lcpcli.builder import Corpus

corpus = Corpus("my corpus")
corpus.Document(
    corpus.Segment(
        corpus.Token("Hello"),
        corpus.Token("world")
    )
)
corpus.make()
```
The lines above create a text-only corpus consisting of a single document, with a single segment containing two tokens whose forms are `Hello` and `world`. The last line, `corpus.make()`, builds and outputs the files for upload in the current directory.
Our implementation will be a little more sophisticated than this hello-world example, in that we will have multiple segments and two documents, but the basic idea will remain the same.
Segmentation and Tokenization
The `Corpus` class allows us to define arbitrary names instead of `Token`, `Segment` and `Document`. In this tutorial, we will use the more common (albeit less technical) terms `Word`, `Sentence` and `Clip` instead:
```python
corpus = Corpus(
    "Tutorial corpus",
    document="Clip",
    segment="Sentence",
    token="Word"
)
```
To start with, we will only process one SRT file, `database_explorer.srt`, so we will only create a single clip in the corpus, using `clip = corpus.Clip()`.
A first rough approach would be to treat every non-empty line of the SRT file as a segment and split it by space to get words:
```python
clip = corpus.Clip()
with open("database_explorer.srt", "r") as input:
    while line := input.readline():
        line = line.strip()
        if not line:
            continue
        clip.Sentence(*[
            corpus.Word(w.strip())
            for w in line.split(" ")
            if w.strip()
        ])
clip.make()
```
`.Sentence` can be called either on `clip`, as in this snippet, or on `corpus`, which is the method used in the hello-world example (with `.Segment`, respectively). In the hello-world example, the segment is created directly as an argument of `corpus.Document` and, as such, belongs to it; here, the script does not pass sentences as direct arguments of the clip. Instead, the sentences are created on the fly, which is why it is necessary to call `.Sentence` on `clip`.
Although functional, the snippet above outputs unusable data: because each line is mapped to one sentence, some sentences will consist of just a block number, some of timestamps, and many actually correspond to chunks of sentences distributed over multiple transcription blocks.
This commented python script extends the logic illustrated in the snippet to ignore the transcription block numbers and timestamps, and uses sentence delimiters for segmentation instead of mapping one line to one sentence:
- If the line reports a block number or the block's timestamps, it is ignored
- If the line is a linebreak, the next two lines are flagged as a block number and its timestamps
- If the line has some text, it is split by sentence delimiters (`.`, `!`, `?`) and each resulting chunk is in turn split by token delimiters (space, `,`, `'`); each resulting word is added to the current sentence
Whenever the script hits a sentence delimiter, it creates a new sentence with `sentence = clip.Sentence()`, and whenever it processes a word, it adds it to the current sentence with `sentence.Word(w)`.
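The block-skipping and delimiter logic described above can be sketched as a plain Python function, independent of `lcpcli`. The function name `srt_to_sentences` and the exact delimiter handling are illustrative assumptions, not the actual script:

```python
import re

def srt_to_sentences(srt_text):
    """Split SRT transcription text into sentences of words,
    ignoring block numbers and timestamp lines."""
    sentences = [[]]
    for line in srt_text.splitlines():
        line = line.strip()
        # Ignore empty lines, block numbers, and timestamp lines
        if not line or line.isdigit() or "-->" in line:
            continue
        # Split the text on sentence delimiters, keeping the delimiters
        for chunk in re.split(r"([.!?])", line):
            if chunk in ".!?":
                # Sentence delimiter: start a new sentence
                if sentences[-1]:
                    sentences.append([])
                continue
            # Split each chunk on token delimiters (space, comma, quote)
            for w in re.split(r"[ ,']", chunk):
                if w:
                    sentences[-1].append(w)
    return [s for s in sentences if s]
```

Note that a sentence keeps accumulating words across blocks until a delimiter is hit, which is exactly why one line cannot be mapped to one sentence.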
Both `.Sentence` and `.Word` can be called on `corpus` or on their parent (respectively, `clip` and `sentence`), depending on whether you decide to pass them as direct arguments of their parent.
Import
Visit catchphrase and create a new corpus collection, then open its settings by clicking the gear icon, click the "API" tab and create a new API key; write down the key and the secret.
Open a terminal and run the following command, replacing `path/to/output` with a path pointing to the folder containing the files generated by the script, `$API_KEY` with the key you just got, `$API_SECRET` with the secret you just got, and `$PROJECT` with the name of the collection you just created:

```
lcpcli -c path/to/output/ -k $API_KEY -s $API_SECRET -p "$PROJECT" --live
```
You should get a confirmation message, and your corpus should now be visible in your collection after you refresh the page!
Multiple documents
To parse multiple SRT files, the edits to the script are minimal: one just needs to iterate over the SRT files to create a dedicated instance of `Clip` for each, and one can also report the name of the file for reference purposes, as in `clip = corpus.Clip(name="database_explorer")`. Another option is to set the attribute on the instance, as in `clip.name = "database_explorer"`. This applies to all entities, not just to documents.
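That per-file loop could be sketched as follows; the helper name `clip_name` and the `*.srt` glob over the working directory are assumptions, and the processing body is elided:

```python
from pathlib import Path

def clip_name(srt_path):
    # Derive a clip name from the SRT filename, minus its extension
    return Path(srt_path).stem

# Hypothetical per-file loop; `corpus` is the lcpcli Corpus built above:
# for srt_file in sorted(Path(".").glob("*.srt")):
#     clip = corpus.Clip(name=clip_name(srt_file))
#     ...  # process srt_file into clip as before
#     clip.make()
```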
An updated version of the script that processes all the SRT documents in the working directory can be found here.
Adding annotations
For the sake of illustration, we will now add two pieces of annotation:
- one at the sentence level, reporting the original text of each sentence, i.e. including the token-delimiter characters
- one at the word level, reporting whether each word was preceded by a single quote
An updated python script that does that can be found here.
The original text of the sentence is set with `sentence.original = original` (where `original` is a string containing the original text), and whether a word is preceded by `'` is set when creating the word instance, as in `sentence.Word(word, shortened=shortened)` (where `shortened` is appropriately set to `"yes"` or `"no"`).
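The word-level annotation can be computed while tokenizing. Here is a minimal, `lcpcli`-free sketch of that logic; the function name and the delimiter set are assumptions made for illustration:

```python
def split_with_shortened(chunk):
    """Split a text chunk into (word, shortened) pairs, where shortened
    is "yes" if the word was preceded by a single quote; e.g. "it's"
    yields ("it", "no") and ("s", "yes")."""
    words = []
    shortened = "no"
    token = ""
    for ch in chunk:
        if ch in " ,'":
            # Token delimiter: flush the current word
            if token:
                words.append((token, shortened))
            token = ""
            # Flag the next word if the delimiter was a single quote
            shortened = "yes" if ch == "'" else "no"
        else:
            token += ch
    if token:
        words.append((token, shortened))
    return words
```

Each pair could then be passed along as `sentence.Word(word, shortened=shortened)`.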
Time alignment and video
Associating a document with a video only takes one command: `clip.set_media("clip", "database_explorer.mp4")`. The first argument of `set_media` is a string defining the name of the media slot for the media file. Unlike this corpus, some video corpora can have more than one video per document; for example, scenes could be filmed from two different angles in parallel, resulting in two video files per document, so that one media slot could be named `"left"` and the other `"right"`.
The challenging part is time alignment. Command-wise, it again only takes one line to align a segment: `sentence.set_time(0, 25)`. This would align the sentence from 0s to 1s (remember that LCP uses a convention of 25 frames per second).
We will use the timestamps of the transcription blocks to time-align the sentences, while leaving the words unaligned. Because a sentence can span multiple blocks, and a block can contain more than one sentence, we have to be smart about it: a sentence that starts after a sentence delimiter takes the start timecode of the current block, whereas a sentence that continues across blocks keeps the start timecode of the block in which it started.
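Converting SRT timecodes to LCP frame counts could look like the following sketch, assuming timecodes of the form `HH:MM:SS,mmm`; the function name is illustrative, and the resulting numbers are what one would pass to `sentence.set_time`:

```python
def srt_timestamp_to_frames(ts):
    """Convert an SRT timestamp like "00:01:02,500" to a frame
    count at LCP's convention of 25 frames per second."""
    hms, ms = ts.split(",")
    h, m, s = (int(x) for x in hms.split(":"))
    seconds = h * 3600 + m * 60 + s + int(ms) / 1000
    return round(seconds * 25)
```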
An updated python script that implements time alignment can be found here.
Finally, place `database_explorer.mp4` and `presenter_pro.mp4` in your folder so they can be included in the output folder and uploaded to videoScope. (The MP4 files next to `convert.py` in the repository are empty placeholder files and will not play back in videoScope.)