Corpus builder
The tool lcpcli ships with a helper Python class, Corpus, to prepare LCP corpora.
The tutorial uses the Corpus class to process SRT files and import a video corpus into LCP.
The various tests in the lcpcli repository give concrete examples of how to use the Corpus class.
The following repositories also use the Corpus class to convert existing data sets:
Corpus
You need to instantiate the Corpus class to create a new corpus.
Arguments:
name (str, mandatory) is the name of the corpus
document (str, optional, default "Document") is the name of the document-level layer of the corpus
segment (str, optional, default "Segment") is the name of the sentence-level layer of the corpus
token (str, optional, default "Token") is the name of the word-level layer of the corpus
authors (str, optional, default "placeholder") is the name(s) of the author(s) of the corpus
institution (str, optional, default "") is the name of the institution associated with the corpus
description (str, optional, default "") is a description of the corpus, as it will be presented to end users
date (str, optional, default "placeholder") is the date when the corpus was curated
revision (int | float, optional, default 1) is the revision number of the corpus
url (str, optional, default "placeholder") is the source URL of the corpus
license (str | None, optional, default None) is the code of the license of the corpus
The values of authors, institution, description, date, revision, url and license can be modified in LCP after import.
Example
from lcpcli.builder import Corpus
c = Corpus("my great corpus", document="Book", segment="Sentence", token="Word")
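The metadata arguments listed above can be passed in the same call. The following is a minimal sketch: all values are invented for illustration, and date and license are left at their defaults because this page does not specify their expected formats. The import is guarded so that the snippet also runs where lcpcli is not installed.

```python
# Keyword arguments named after the argument list above; all values are made up.
corpus_kwargs = {
    "name": "my great corpus",
    "document": "Book",
    "segment": "Sentence",
    "token": "Word",
    "authors": "Jane Doe, John Doe",
    "institution": "Example University",
    "description": "A small demonstration corpus of two books.",
    "revision": 1.1,
    "url": "https://example.org/my-great-corpus",
}

try:
    from lcpcli.builder import Corpus
except ImportError:
    Corpus = None  # lcpcli is not installed; the dictionary above is still valid

if Corpus is not None:
    # Equivalent to writing out every keyword argument in the call
    c = Corpus(**corpus_kwargs)
```

Since authors, institution, description, revision and url can be modified in LCP after import, rough values are enough for a first import.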
Instance methods
An instance of the Corpus class has an open set of methods, which should all start with a capital letter. Each of these methods creates and returns an entity in the corpus with the passed attributes (an instance of the class Layer).
All corpora should create at least one entity by calling each of the methods named after the values passed as document, segment and token when instantiating the Corpus class.
Example
from lcpcli.builder import Corpus
# token="Word" makes c.Word the method for the word-level layer
c = Corpus("my great corpus", token="Word")
c.Document(
    c.Segment(
        c.Word("hello"),
        c.Word("world")
    )
)
c.make("path/to/output/")
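To illustrate what an "open set of methods" can mean in Python, here is a self-contained toy sketch built on __getattr__. It is not lcpcli's implementation: MiniCorpus and Entity are hypothetical stand-ins for Corpus and Layer, and the sketch only shows how any capitalized method name could create an entity while lowercase names are rejected.

```python
class Entity:
    """Hypothetical stand-in for the Layer class mentioned above."""

    def __init__(self, layer, children, attributes):
        self.layer = layer            # e.g. "Document", "Segment", "Word"
        self.children = children      # nested entities or raw strings
        self.attributes = attributes  # keyword attributes passed by the caller


class MiniCorpus:
    """Toy corpus that accepts any capitalized method name as a layer."""

    def __init__(self, name):
        self.name = name
        self.entities = []

    def __getattr__(self, layer):
        # Only reached when normal attribute lookup fails,
        # so 'name' and 'entities' are never routed through here.
        if not layer[:1].isupper():
            raise AttributeError(layer)

        def create(*children, **attributes):
            entity = Entity(layer, list(children), attributes)
            self.entities.append(entity)
            return entity

        return create


c = MiniCorpus("my great corpus")
doc = c.Document(
    c.Segment(
        c.Word("hello"),
        c.Word("world"),
    ),
    title="Demo",
)
print(doc.layer, [child.layer for child in doc.children])
```

In this toy version the inner calls run first, so the two Word entities are created before the Segment, and the Segment before the Document.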
make
Writes all the CSV files and the configuration JSON file of the corpus to the passed directory.
The make method is the only valid method that starts with a non-capital letter.
Arguments:
destination (str, mandatory) is a path where to place the output files
is_global (dict, optional, default {}) maps layers to attribute names whose possible values are defined globally, such as the upos on tokens
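After calling make, it can be useful to check that the destination directory contains the expected CSV files and the configuration JSON file. The helper below is a hypothetical convenience function, not part of lcpcli; it only counts files by extension. The demonstration uses a temporary directory with made-up file names in place of real make output.

```python
import tempfile
from collections import Counter
from pathlib import Path


def summarize_output(destination):
    """Count the files in `destination` by extension (hypothetical helper)."""
    counts = Counter(
        path.suffix.lower()
        for path in Path(destination).iterdir()
        if path.is_file()
    )
    return dict(counts)


# Demonstration on a temporary directory with stand-in files; a real run
# would pass the directory that was given to c.make() instead.
with tempfile.TemporaryDirectory() as tmp:
    for filename in ("a.csv", "b.csv", "config.json"):  # made-up names
        (Path(tmp) / filename).touch()
    summary = summarize_output(tmp)

print(summary)
```

With a real corpus, the call would be summarize_output("path/to/output/"), using the same path that was passed as destination.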