Corpus builder
The tool lcpcli ships with a helper python class Corpus to prepare LCP corpora.
The tutorial uses the Corpus class to process SRT files and import a video corpus into LCP.
The following repositories also use the Corpus class to convert existing data sets:
- XML-to-LCP conversion of NOAH's Corpus
- TEI-to-LCP conversion of OFROM
- Conversion of Alto-XML scanned documents to LCP
The various tests in the lcpcli repository give concrete examples on how to use the Corpus class.
Corpus
You need to instantiate the Corpus class to create a new corpus.
Arguments:
name(str, mandatory) is the name of the corpusdocument(str, optional, default"Document") is the name of the document-level layer of the corpussegment(str, optional, default"Segment") is the name of the sentence-level layer of the corpustoken(str, optional, default"Token") is the name of the word-level layer of the corpusauthors(str, optional, default"placeholder") is the name(s) of the author(s) of the corpusinstitution(str, optional, default"") is the name of the institution associated with the corpusdescription(str, optional, default"") is a description of the corpus, as it will be presented to end usersdate(str, optional, default"placeholder") is the date when the corpus was curatedrevision(int | float, optional, default1) is the revision number of the corpusurl(str, optional, default"placeholder") is the source URL of the corpuslicense(str | None, optional, defaultNone) is the code of the license of the corpus
The values of authors, institution, description, date, revision, url and license can be modified in LCP after import.
Example
from lcpcli.builder import Corpus
c = Corpus("my great corpus", document="Book", segment="Sentence", token="Word")
Instance methods
An instance of the Corpus class has an open set of methods, which should all start with a capital letter, and which will create and return an entity in the corpus with the passed attributes (an instance of the class Layer)
All corpora should create at least one entity by calling each of the methods named after the values passed as document, segment and token when instantiating the Corpus class.
Example
from lcpcli.builder import Corpus
c = Corpus("my great corpus")
c.Document(
c.Segment(
c.Token("hello"),
c.Token("world")
)
)
c.make("path/to/output/")
make
Writes all the CSV files and the configuration JSON file of the corpus to the passed directory.
The make method is the only valid method on a Corpus instance that starts with a non-capital letter.
Arguments:
destination(str, mandatory) is a path where to place the output filesis_global(dict, optional, default{}) maps layers to attribute names whose possible values are defined globally, such as theuposon tokens
Returns None
Layer class
This class is never instantiated explicitly (as in Layer(...)). Instances of the Layer class are returned by calling a method that starts with a capital letter on an instance of Corpus, or on another Layer instance.
Upon instantiation, pass named arguments to create corresponding attributes on the entity, as in corpus.Document(title="my document", year=2026). There are three special instantiation cases:
- When instantiating tokens, you should pass a string as the sole unnamed argument to represent the form, as in
corpus.Token("is", lemma="be"). - When instantiating segments, you can pass token instances as unnamed arguments, as in
document.Segment(corpus.Token("hello"), corpus.Token("world")). - You can pass a dictionary as the sole unnamed argument (and no named argument) to create a
GlobalAttributeinstance instead of aLayerinstance, as incorpus.Speaker({"id": "johndoe", "firstname": "John", "lastname": "Doe"})
Instance methods
Just like Corpus instances, Layer instances have an open set of methods, which should all start with a capital letter, and which will create and return a new Layer instance.
In addition, Layer instances come with the methods listed below.
make
If called with no argument: makes the instance (and its children) in the destination corpus and will clear it from memory as soon as possible.
If call with multiple unnamed arguments: each argument should be a Layer instance with at least one attribute that points to another Layer instance. This is typically used for token dependency relations in a segment (see example below).
Arguments:
child: (Layer, mandatory) aLayerinstance to add as a child of the current instance.
Returns:
- the same
Layerinstance it was called on
Examples:
Normal layer
s = c.Segment(c.Token("hello"), c.Token("world"))
s.make() # makes the segment and the tokens
Dependency realtions
t1 = c.Token("the")
t2 = c.Token("cat")
s = c.Segment(t1, t2)
s.make()
c.DepRel.make(
c.DepRel(dependent=t1, udep="root"),
c.DepRel(head=t2, dependent=t1, udep="detj")
)
add
Adds a Layer instance as a child of the current instance.
Arguments:
child: (Layer, mandatory) aLayerinstance to add as a child of the current instance.
Returns:
- the same
Layerinstance it was called on
Example:
s = c.Segment()
t = c.Token("hello")
# Add the token as a child of the segment
s.add(t)
set_media
Call this on a Layer instance at the document level to associate with a media file.
Arguments:
name(str, mandatory) is an arbitrary name for the media fieldfilename(str, mandatory) is the filename of the media file associated with the current document. It can be an audio or a video file.
Returns:
- the same
Layerinstance it was called on
Example:
d = c.Document(title="Interview with the vampire")
d.set_media("interview", "vampire.mp4")
set_char
Sets the character anchoring of the current instance. It is usually not necessary to call this manually, since the character axis is determined by the forms in the token sequence of the corpus; however, one can use set_char to add parallel annotations on that axis.
Arguments:
left(int) is the index of the left anchor on the character axisright(int) is the index of the right anchor on the character axis
Returns:
- the same
Layerinstance it was called on
Example:
forms = " ".split("the LCP corpus")
tokens = [c.Token(f) for f in forms]
s = c.Segment(*tokens)
# making the segment anchors the tokens
s.make()
ne = c.NamedEntity(form=forms[1])
# set the named entity to overlap with the "LCP" token
ne.set_char(tokens[1].get_char())
get_char
Returns the left and right anchors of the current layer entity on the character axis.
Return type: tuple[int, int]
Example:
forms = " ".split("the LCP corpus")
tokens = [c.Token(f) for f in forms]
s = c.Segment(*tokens)
# making the segment anchors the tokens
s.make()
ne = c.NamedEntity(form=forms[1])
# set the named entity to overlap with the "LCP" token
ne.set_char(tokens[1].get_char())
set_time
Sets the time anchoring of the current instance.
Arguments:
start(int) is the index of the left anchor on the time axis (in seconds times 25)end(int) is the index of the right anchor on the time axis (in seconds times 25)
Returns:
- the same
Layerinstance it was called on
Example:
transcript = [
{
"form": "hello",
"start": 0.0,
"end": 0.4
},
{
"form": "world",
"start": 0.45,
"end": 0.8
},
]
tokens = []
for t in transcript:
token = c.Token(t['form'])
token.set_time(int(t['start'] * 25), int(t['end'] * 25))
s = c.Segment(*token)
s.make() # inherits the start and end time anchors from the tokens
get_time
Returns the start and end time anchors of the current layer entity (in seconds times 25).
Return type: tuple[int, int]
Example:
# store the end time of the current document in offset
_, offset = doc.get_time()
# create a new document
doc = c.Document(name="Second Document")
doc.set_media("interview", "vampire.mp4")
hello = c.Token("hello")
world = c.Token("world")
hello.set_time(offset + 0, offset + 10)
world.set_time(offset + 11, offset + 20)
s = doc.Segment(hello, world)
s.make() # inherits the start and end time anchors from the tokens
doc.make() # inherits the start and end time anchors from the sentence
set_xy
Sets the (X1,Y1,X2,Y2) plane anchoring of the current instance, typically used to map annotations to areas of images.
Take care of offsetting the annotations of separate documents so they do not overlap, on the horizontal axis, on the vertical axis, or both.
Arguments:
x1(int) is the left-most point of the annotation areay1(int) is the top-most point of the annotation areax2(int) is the right-most point of the annotation areay2(int) is the bottom-most point of the annotation area
Returns:
- the same
Layerinstance it was called on
Example:
p = c.Page(name="Page 1", image="file.png")
p._attributes["image"]._type = "image" # force an image type on this attribute
p.set_xy(0, 0, 100, 50) # a 100x50px document
s = p.Segment()
hello = s.Token("hello")
world = s.Token("world")
hello.set_xy(10, 10, 33, 40)
world.set_xy(50, 10, 90, 40)
s.make() # s inherits the plane anchoring from its tokens
p.make() # the page's anchors were explicitly set above
get_xy
Returns the (X1,Y1,X2,Y2) plane anchors of the current instance.
Return type: tuple[int, int, int, int]
Example:
_, _, offset, _ = p.get_xy() # horizontal offset
p = c.Page(name="Page 2", image="file2.png")
p.set_xy(offset + 0, 0, offset + 100, 50)
GlobalAttribute class
Pass a dictionary as the sole unnamed argument (and no named argument) of a new-layer method to create a GlobalAttribute instance instead. You can then pass this an attribute of multiple layers.
Arguments:
attributes(dict) is a key-value dictionary of the global attribute's sub-attributes.
Example:
jane = c.Speaker({"firstname": "Jane", "lastname": "Doe"})
john = c.Speaker({"firstname": "John", "lastname": "Doe"})
s1 = doc.Segment(c.Token("Hello"), c.Token("world"), speaker=jane)
s2 = doc.Segment(c.Token("Bye"), c.Token("people"), speaker=john)
s3 = doc.Segment(c.Token("So"), c.Token("long"), speaker=john)
s1.make()
s2.make()
s3.make()