API Reference¶

This page provides detailed documentation for all classes and functions in the Vivre library. The documentation is automatically generated from the source code docstrings.

Top-Level Functions¶

These are the main functions you’ll use for most tasks.

vivre.read(epub_path: str | Path) → Chapters[source]¶

Parse an EPUB file and extract chapters.

Parameters:

epub_path – Path to the EPUB file

Returns:

Chapters object containing parsed chapters

Raises:

FileNotFoundError – If the EPUB file doesn’t exist
ValueError – If the file is not a valid EPUB

Example

>>> chapters = vivre.read('path/to/epub')
>>> print(f"Found {len(chapters)} chapters")
>>> for title, content in chapters:
...     print(f"Chapter: {title}")

vivre.align(source, target, language_pair, method, _pipeline, **kwargs) → AlignmentResult

Align parallel EPUB files or Chapters objects and return an AlignmentResult.

This function can accept either file paths or Chapters objects, making it flexible for different workflows. The language_pair parameter is required for accurate alignment.

Parameters:

source – Source language EPUB file path or Chapters object
target – Target language EPUB file path or Chapters object
language_pair – Language pair code (e.g., “en-fr”, “es-en”) - REQUIRED
method – Alignment method (currently only “gale-church” supported)
_pipeline – Optional pre-existing VivrePipeline instance for dependency injection
**kwargs – Additional arguments passed to the pipeline

Returns:

AlignmentResult object with methods for different output formats

Raises:

FileNotFoundError – If EPUB files don’t exist (when using file paths)
ValueError – If method is not supported or language_pair is invalid

Example

>>> # Using file paths
>>> result = vivre.align('english.epub', 'french.epub', 'en-fr')
>>> print(result.to_json())
>>> print(result.to_csv())

>>> # Using Chapters objects (seamless workflow)
>>> source_chapters = vivre.read('english.epub')
>>> target_chapters = vivre.read('french.epub')
>>> result = vivre.align(source_chapters, target_chapters, 'en-fr')
>>> print(result.to_text())

>>> # Using dependency injection for better performance
>>> pipeline = VivrePipeline('en-fr')
>>> result = vivre.align(
...     source_chapters, target_chapters, 'en-fr', _pipeline=pipeline
... )
>>> print(result.to_dict())

>>> # Get as dictionary for programmatic access
>>> data = result.to_dict()
>>> print(f"Found {len(data['chapters'])} chapters")

vivre.quick_align(source_epub: str | Path, target_epub: str | Path, language_pair: str) → List[Tuple[str, str]][source]¶

Quick alignment function that returns simple sentence pairs.

This is a convenience function for simple use cases where you just need sentence pairs without the full corpus structure.

Parameters:

source_epub – Path to source language EPUB
target_epub – Path to target language EPUB
language_pair – Language pair code (e.g., “en-fr”, “es-en”) - REQUIRED

Returns:

List of (source_sentence, target_sentence) tuples

Raises:

FileNotFoundError – If either EPUB file doesn’t exist
ValueError – If language_pair is invalid

Example

>>> pairs = vivre.quick_align('english.epub', 'french.epub', 'en-fr')
>>> for source, target in pairs[:3]:
...     print(f"EN: {source}")
...     print(f"FR: {target}")
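The list of (source_sentence, target_sentence) tuples returned here is easy to post-process with the standard library. The following is a minimal sketch of writing such pairs to CSV; `pairs_to_csv` and its header names are hypothetical helpers for illustration and are not part of vivre, and the sample pairs are made up so the snippet runs without any EPUB files.

```python
import csv
import io
from typing import List, Tuple


def pairs_to_csv(
    pairs: List[Tuple[str, str]],
    source_header: str = "source",
    target_header: str = "target",
) -> str:
    """Serialize sentence pairs to a CSV string (hypothetical helper)."""
    buffer = io.StringIO()
    # csv.writer quotes fields that contain commas, so sentences stay intact.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([source_header, target_header])
    writer.writerows(pairs)
    return buffer.getvalue()


# Stand-in for the result of vivre.quick_align(...).
sample = [("Hello, world.", "Bonjour, le monde."), ("Goodbye.", "Au revoir.")]
print(pairs_to_csv(sample))
```

For the full corpus structure with chapter metadata, the to_csv() method on AlignmentResult is the documented route; this sketch only covers flat sentence pairs.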

vivre.get_supported_languages() → List[str][source]¶

Get a list of supported languages for segmentation.

Returns:

List of supported language codes.

Example

>>> languages = vivre.get_supported_languages()
>>> print(f"Supported languages: {languages}")

vivre.clear_pipeline_cache() → None[source]¶

Clear the pipeline cache.

This is useful for testing or when you want to free up memory.

Example

>>> vivre.clear_pipeline_cache()

Core Classes¶

The main classes that provide the core functionality.

Chapter¶

AlignmentResult¶

class vivre.AlignmentResult(corpus: Dict[str, Any])[source]¶

Bases: object

A container for alignment results with multiple output format options.

This class holds the aligned corpus data and provides methods to output it in various formats.

to_csv() → str[source]¶: Return the corpus as CSV string.

to_dict() → Dict[str, Any][source]¶: Return the corpus as a dictionary.

to_json(indent: int = 2) → str[source]¶: Return the corpus as JSON string.

to_text() → str[source]¶: Return the corpus as formatted text.

to_xml() → str[source]¶: Return the corpus as XML string.

Chapters¶

class vivre.Chapters(chapters: List[Chapter], book_title: str = '')[source]¶

Bases: object

A container for parsed chapters with segmentation capabilities.

This class holds the parsed chapters and provides methods to segment the text into sentences.

get_segmented() → List[Tuple[str, List[str]]][source]¶: Get the segmented chapters.

segment(language: str | None = None) → Chapters[source]¶

Segment all chapters into sentences.

Parameters:

language – Language code for segmentation (auto-detected if None)

Returns:

Self with segmented chapters

VivreParser¶

class vivre.VivreParser[source]¶

Bases: object

A robust parser for EPUB files that extracts story content while filtering non-story elements.

This parser follows EPUB standards to extract chapter titles and content from EPUB files, intelligently filtering out front matter, back matter, and other non-story content.

The parser implements a multi-stage approach:

1. EPUB validation and structure analysis
2. Table of contents parsing for chapter titles
3. Content extraction with intelligent filtering
4. Text cleaning and normalization

The parser can handle various EPUB formats and structures, including different table of contents formats (NCX and HTML) and various content organization patterns.

IMPORTANT: This parser is stateless and can be safely reused for multiple EPUB files without state pollution. Each parse_epub() call is independent.

file_path¶: Path to the currently loaded EPUB file, if any.

_is_loaded¶: Boolean indicating whether an EPUB file is currently loaded.

Example

>>> parser = VivreParser()
>>> chapters1 = parser.parse_epub("book1.epub")  # Safe to reuse
>>> chapters2 = parser.parse_epub("book2.epub")  # No state pollution
>>> print(f"Found {len(chapters1)} chapters in book1")
>>> print(f"Found {len(chapters2)} chapters in book2")

NON_STORY_KEYWORDS = {
'de': ['umschlag', 'titel', 'titelseite', 'vorderer umschlag', 'hinterer umschlag', 'danksagung', 'danksagungen', 'inhaltsverzeichnis', 'inhalt', 'index', 'urheberrecht', 'copyright', 'rechtlich', 'haftungsausschluss', 'über den autor', 'autorenbiografie', 'biografie', 'übersetzer', 'übersetzung', 'übersetzernotiz', 'vorwort', 'einleitung', 'anhang', 'bibliografie', 'referenzen', 'zitate', 'notizen', 'glossar', 'credits', 'widmung', 'kolophon'],
'en': ['cover', 'title', 'titlepage', 'front cover', 'back cover', 'acknowledgement', 'acknowledgments', 'acknowledgements', 'table of contents', 'contents', 'toc', 'copyright', 'legal', 'disclaimer', 'about the author', 'author bio', 'biography', 'translator', 'translation', "translator's note", 'preface', 'foreword', 'introduction', 'afterword', 'appendix', 'index', 'bibliography', 'references', 'citations', 'notes', 'glossary', 'credits', 'dedication', 'colophon'],
'es': ['cubierta', 'título', 'página de título', 'cubierta frontal', 'cubierta trasera', 'agradecimientos', 'reconocimientos', 'tabla de contenidos', 'contenidos', 'índice', 'derechos de autor', 'copyright', 'legal', 'descargo de responsabilidad', 'sobre el autor', 'biografía del autor', 'biografía', 'traductor', 'traducción', 'nota del traductor', 'prefacio', 'introducción', 'apéndice', 'bibliografía', 'referencias', 'citas', 'notas', 'glosario', 'créditos', 'dedicatoria', 'colofón'],
'fr': ['couverture', 'titre', 'page de titre', 'couverture avant', 'couverture arrière', 'remerciements', 'table des matières', 'sommaire', 'index', 'copyright', "droits d'auteur", 'légal', 'avertissement', "à propos de l'auteur", "biographie de l'auteur", 'biographie', 'traducteur', 'traduction', 'note du traducteur', 'préface', 'avant-propos', 'introduction', 'appendice', 'bibliographie', 'références', 'citations', 'notes', 'glossaire', 'crédits', 'dédicace', 'colophon'],
'it': ['copertina', 'titolo', 'frontespizio', 'copertina anteriore', 'copertina posteriore', 'ringraziamenti', 'indice', 'contenuti', 'copyright', "diritti d'autore", 'legale', 'disclaimer', "sull'autore", "biografia dell'autore", 'biografia', 'traduttore', 'traduzione', 'nota del traduttore', 'prefazione', 'introduzione', 'appendice', 'bibliografia', 'riferimenti', 'citazioni', 'note', 'glossario', 'crediti', 'dedica', 'colophon']
}¶
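To illustrate how a keyword table like NON_STORY_KEYWORDS could drive filtering, here is a minimal sketch. The matching rule (exact, case-insensitive comparison of the whole title against every language's list) is an assumption made for this example; the parser's actual matching logic is not documented on this page. The helper `is_non_story_title` is hypothetical, and the table is a small subset of the values shown above.

```python
from typing import Dict, List

# Small subset of the NON_STORY_KEYWORDS table, for illustration only.
NON_STORY_KEYWORDS: Dict[str, List[str]] = {
    "en": ["cover", "title", "table of contents", "copyright", "dedication"],
    "fr": ["couverture", "titre", "table des matières", "copyright", "dédicace"],
}


def is_non_story_title(
    title: str, keywords: Dict[str, List[str]] = NON_STORY_KEYWORDS
) -> bool:
    """Return True if the whole title equals a keyword in any language.

    Exact, case-insensitive matching is an assumption of this sketch.
    """
    normalized = title.strip().lower()
    return any(normalized in words for words in keywords.values())


titles = ["Cover", "Chapter 1", "Table of Contents", "The Title of Honour"]
story_titles = [t for t in titles if not is_non_story_title(t)]
print(story_titles)
```

Whole-title matching keeps a chapter such as "The Title of Honour" even though it contains the keyword "title"; a substring rule would drop it, which is why the choice of matching rule matters.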

is_loaded() → bool[source]¶

Check if an EPUB file is currently loaded.

Returns:

True if an EPUB file is loaded, False otherwise.

load_epub(file_path: str | Path) → bool[source]¶

Load and validate an EPUB file from the given path.

This method performs comprehensive validation including:

- Input path validation (None, empty, invalid characters)
- File existence and accessibility checks
- EPUB format validation (ZIP structure, required files)
- Corrupted file detection

The validation process ensures that the file is a valid EPUB by checking:

1. File exists and is readable
2. File is not empty and has minimum size
3. File has ZIP magic number (PK)
4. ZIP structure is valid and contains required EPUB files
5. META-INF/container.xml exists (required for EPUB)

Parameters:

file_path – Path to the EPUB file to load. Can be a string or Path object.

Returns:

True if the file was successfully loaded and validated.

Raises:

FileNotFoundError – If the EPUB file doesn’t exist.
ValueError – If the file path is invalid, file is not readable, or file is not a valid EPUB (empty, corrupted, wrong format).

Example

>>> parser = VivreParser()
>>> success = parser.load_epub("book.epub")
>>> if success:
...     print("EPUB loaded successfully")
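The validation checks listed for load_epub can be approximated with the standard library alone. The sketch below is not vivre's implementation; `looks_like_epub` is a hypothetical helper that mirrors the documented checks (existence, non-empty file, ZIP magic number, valid ZIP structure, presence of META-INF/container.xml) and raises the same exception types the reference lists. The minimum-size check is reduced to a non-empty check here, since the actual threshold is not documented.

```python
import zipfile
from pathlib import Path
from typing import Union


def looks_like_epub(file_path: Union[str, Path]) -> bool:
    """Hypothetical helper mirroring the documented checks; not vivre's code."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"EPUB file not found: {path}")
    if path.stat().st_size == 0:
        raise ValueError("File is empty")
    with path.open("rb") as handle:
        if handle.read(2) != b"PK":  # ZIP magic number
            raise ValueError("File is not a ZIP archive")
    try:
        with zipfile.ZipFile(path) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile as exc:
        raise ValueError("ZIP structure is corrupted") from exc
    if "META-INF/container.xml" not in names:
        raise ValueError("Missing META-INF/container.xml")
    return True
```

Checking the two-byte magic number before opening the archive gives a cheap early rejection for files that are plainly not ZIP archives.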

parse_epub(file_path: str | Path) → List[Chapter][source]¶

Parse an EPUB file and extract chapter titles and text content.

This method performs comprehensive EPUB parsing following EPUB standards:

1. Reads container.xml to locate content.opf
2. Parses content.opf to get manifest and spine
3. Extracts chapter titles from table of contents
4. Processes spine items in reading order
5. Filters out non-story content
6. Extracts chapter text content

Parameters:

file_path – Path to the EPUB file to parse. Can be a string or Path object.

Returns:

List of Chapter objects containing chapter information. Only story chapters are included, with non-story content filtered out.

Raises:

FileNotFoundError – If the EPUB file doesn’t exist.
ValueError – If the file path is invalid, file is not a valid EPUB, or the EPUB structure cannot be parsed.
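The first steps of parse_epub (container.xml, then the OPF manifest and spine) follow the EPUB packaging format and can be sketched with the standard library. This is an illustration of those steps under stated assumptions, not the parser's own code: `spine_documents` is a hypothetical function, and it stops at listing spine documents in reading order without extracting titles or text.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from typing import List

# XML namespaces defined by the EPUB container and package formats.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_documents(epub_path: str) -> List[str]:
    """Return archive paths of spine items in reading order (sketch)."""
    with zipfile.ZipFile(epub_path) as archive:
        # Step 1: container.xml points at the package (OPF) document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml has no rootfile entry")
        opf_path = rootfile.attrib["full-path"]
        package = ET.fromstring(archive.read(opf_path))

    # Step 2: the manifest maps item ids to hrefs relative to the OPF file.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    # Step 3: the spine lists item ids in reading order.
    base = posixpath.dirname(opf_path)
    return [
        posixpath.join(base, manifest[ref.attrib["idref"]])
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
        if ref.attrib["idref"] in manifest
    ]
```

Reading order comes from the spine, not from the manifest or the file names, which is why the spine is resolved against the manifest instead of sorting archive entries.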

Segmenter¶

class vivre.Segmenter[source]¶

Bases: object

A text segmenter that splits text into sentences using spaCy models.

This class provides methods to segment text into meaningful units using language detection and spaCy’s sentence tokenization.

Batch Processing:

- segment_batch(): For single-language batches (requires explicit language)
- segment_mixed_batch(): For mixed-language batches (auto-detects languages)

Note: Some languages (Arabic, Hindi, Thai) use a general-purpose multilingual model (xx_ent_wiki_sm) which may provide lower segmentation accuracy compared to dedicated language models. For higher accuracy with these languages, consider using larger (_lg) or transformer (_trf) spaCy models if available.

get_supported_languages() → List[str][source]¶

Get list of supported language codes.

Note: Some languages (Arabic, Hindi, Thai) use a general-purpose multilingual model (xx_ent_wiki_sm) which may provide lower segmentation accuracy compared to dedicated language models. For higher accuracy with these languages, consider using larger (_lg) or transformer (_trf) spaCy models if available.

Returns:

List of supported language codes.

is_language_supported(language: str) → bool[source]¶

Check if a language is supported.

Parameters:

language – Language code to check.

Returns:

True if language is supported, False otherwise.

segment(text: str, language: str | None = None) → List[str][source]¶

Segment text into sentences using spaCy models.

Parameters:

text – The text to segment.
language – Optional language code (e.g., ‘en’, ‘es’, ‘fr’). If provided, this language is used as given and auto-detection is skipped. If None, the language is auto-detected using langdetect.

Returns:

List of sentence segments.

Raises:

OSError – If the required spaCy model is not installed.
ValueError – If the language is not supported.

segment_batch(texts: List[str], language: str) → List[List[str]][source]¶

Segment multiple texts into sentences using spaCy’s optimized batch processing.

This method uses spaCy’s pipe() method for efficient batch processing, making better use of multi-core CPUs and improving performance significantly for bulk tasks.

IMPORTANT: All texts in the batch must be of the same language. Mixed-language batches are not supported and will result in incorrect segmentation. Use separate batch calls for different languages.

Parameters:

texts – List of texts to segment.
language – Language code (e.g., ‘en’, ‘es’, ‘fr’). All texts in the batch must be of this language.

Returns:

List of sentence segments for each input text.

Raises:

OSError – If the required spaCy model is not installed.
ValueError – If the language is not supported or if texts list is empty.

segment_mixed_batch(texts: List[str]) → List[List[str]][source]¶

Segment multiple texts that may be in different languages.

This method automatically detects the language of each text and groups them by language for efficient batch processing. This is the recommended method for processing mixed-language text collections.

Parameters:

texts – List of texts to segment (can be in different languages).

Returns:

List of sentence segments for each input text, in the same order.

Raises:

OSError – If required spaCy models are not installed.
ValueError – If texts list is empty.
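The group-by-language strategy described for segment_mixed_batch can be sketched independently of spaCy and langdetect. In the example below, `segment_mixed`, `toy_detect`, and `toy_segment_batch` are hypothetical stand-ins written for this illustration; only the overall shape (detect, group, batch per language, restore input order) is taken from the description above.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def segment_mixed(
    texts: List[str],
    detect: Callable[[str], str],
    segment_batch: Callable[[List[str], str], List[List[str]]],
) -> List[List[str]]:
    """Group texts by detected language, batch each group, keep input order."""
    if not texts:
        raise ValueError("texts list is empty")
    # Remember the original position of every text within its language group.
    positions: Dict[str, List[int]] = defaultdict(list)
    for index, text in enumerate(texts):
        positions[detect(text)].append(index)
    # Segment one language at a time, then write results back by position.
    results: List[List[str]] = [[] for _ in texts]
    for language, indices in positions.items():
        batch = segment_batch([texts[i] for i in indices], language)
        for i, sentences in zip(indices, batch):
            results[i] = sentences
    return results


# Toy stand-ins so the sketch runs without spaCy or langdetect.
def toy_detect(text: str) -> str:
    return "fr" if "Bonjour" in text else "en"


def toy_segment_batch(batch: List[str], language: str) -> List[List[str]]:
    return [[s.strip() + "." for s in t.split(".") if s.strip()] for t in batch]


print(segment_mixed(["Hi. Bye.", "Bonjour. Salut.", "One."], toy_detect, toy_segment_batch))
```

Tracking indices per language is what lets each model process its texts in a single batch while the caller still receives results in the order the texts were passed in.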

Aligner¶

class vivre.Aligner(language_pair: str = 'en-es', c: float | None = None, s2: float | None = None, gap_penalty: float | None = None)[source]¶

Bases: object

A class for aligning source and target texts using the Gale-Church algorithm.

This class provides functionality to align segments of text between source and target languages, creating parallel corpora for translation and analysis purposes.

align(source_sentences: List[str], target_sentences: List[str]) → List[Tuple[str, str]][source]¶

Align source and target sentences into parallel segments.

Parameters:

source_sentences – List of source language sentences (pre-tokenized).
target_sentences – List of target language sentences (pre-tokenized).

Returns:

A list of tuples containing aligned (source_segment, target_segment) pairs.
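The constructor parameters c and s2 share their names with the symbols in the Gale-Church length model, where c is the expected number of target characters per source character and s2 is the variance of that ratio. The sketch below shows the length-based match cost from that model using the values published with the original algorithm (c = 1, s2 = 6.8). How Aligner applies these parameters, what defaults it substitutes for None, and how gap_penalty enters the cost are not documented here, so `length_match_cost` is a hypothetical illustration and not the library's code.

```python
import math


def length_match_cost(
    source_len: int, target_len: int, c: float = 1.0, s2: float = 6.8
) -> float:
    """Negative log-probability of pairing segments with these lengths.

    Lengths are character counts. The defaults are the values published
    with the Gale-Church algorithm, not necessarily vivre's defaults.
    """
    if source_len <= 0:
        raise ValueError("source_len must be positive")
    # Standardized difference between observed and expected target length.
    delta = (target_len - source_len * c) / math.sqrt(source_len * s2)
    # Two-sided tail probability of a standard normal at |delta|.
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    # Clamp to avoid log(0) for extreme length mismatches.
    return -math.log(max(tail, 1e-300))


print(length_match_cost(100, 100) < length_match_cost(100, 150))
```

A full aligner would combine this cost with prior penalties for insertions, deletions, and merges inside a dynamic program; the sketch covers only the length term.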

VivrePipeline¶

class vivre.VivrePipeline(language_pair: str = 'en-es', c: float | None = None, s2: float | None = None, gap_penalty: float | None = None)[source]¶

Bases: object

High-level interface for the complete vivre text processing pipeline.

This class provides a convenient interface for processing parallel texts through the complete workflow: parsing EPUB files, segmenting text into sentences, and aligning sentences between languages.

The pipeline supports both single-chapter and multi-chapter processing, with options for automatic language detection and custom alignment parameters.

parser¶: The EPUB parser instance

segmenter¶: The sentence segmenter instance

aligner¶: The text aligner instance

language_pair¶: The language pair for alignment (e.g., “en-es”)

Example

>>> pipeline = VivrePipeline("en-es")
>>> alignments = pipeline.process_parallel_epubs(
...     "english_book.epub", "spanish_book.epub"
... )
>>> for source, target in alignments:
...     print(f"EN: {source}")
...     print(f"ES: {target}")

Process multiple pairs of EPUB files in batch.

This method processes multiple pairs of EPUB files, returning alignments for each pair in a dictionary keyed by the source file path.

Parameters:

epub_pairs – List of (source_path, target_path) tuples
source_language – Source language code (optional, auto-detected if None)
target_language – Target language code (optional, auto-detected if None)
max_chapters_per_book – Maximum chapters per book (optional)

Returns:

Dictionary mapping source file paths to alignment results

get_pipeline_info() → Dict[str, Any][source]¶

Get information about the current pipeline configuration.

Returns:

Dictionary containing pipeline configuration information

process_parallel_chapters(source_chapters: List[Tuple[str, str]], target_chapters: List[Tuple[str, str]], source_language: str | None = None, target_language: str | None = None) → List[Tuple[str, str]][source]¶

Process parallel chapter lists through the pipeline.

This method processes two lists of chapters (title, content pairs) through segmentation and alignment, skipping the parsing step.

Parameters:

source_chapters – List of (title, content) pairs for source language
target_chapters – List of (title, content) pairs for target language
source_language – Source language code (optional, auto-detected if None)
target_language – Target language code (optional, auto-detected if None)

Returns:

List of aligned sentence pairs (source, target)

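process_parallel_epubs(source_epub_path, target_epub_path, source_language, target_language, max_chapters)

Process parallel EPUB files through the complete pipeline.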

This method processes two EPUB files (source and target languages) through the complete pipeline: parsing, segmentation, and alignment.

Parameters:

source_epub_path – Path to source language EPUB file
target_epub_path – Path to target language EPUB file
source_language – Source language code (optional, auto-detected if None)
target_language – Target language code (optional, auto-detected if None)
max_chapters – Maximum number of chapters to process (optional)

Returns:

List of aligned sentence pairs (source, target)

Raises:

FileNotFoundError – If EPUB files don’t exist
ValueError – If parsing or alignment fails

process_parallel_texts(source_text: str, target_text: str, source_language: str | None = None, target_language: str | None = None) → List[Tuple[str, str]][source]¶

Process parallel text content through the pipeline.

This method processes two text strings (source and target languages) through segmentation and alignment, skipping the parsing step.

Parameters:

source_text – Source language text content
target_text – Target language text content
source_language – Source language code (optional, auto-detected if None)
target_language – Target language code (optional, auto-detected if None)

Returns:

List of aligned sentence pairs (source, target)

Pipeline Functions¶

vivre.create_pipeline(language_pair: str = 'en-es', **kwargs: Any) → VivrePipeline[source]¶

Create a new vivre pipeline instance.

This is a convenience function for creating pipeline instances with default or custom parameters.

Parameters:

language_pair – Language pair for alignment
**kwargs – Additional arguments to pass to VivrePipeline constructor

Returns:

Configured VivrePipeline instance

Example

>>> pipeline = create_pipeline("en-fr", gap_penalty=5.0)
>>> alignments = pipeline.process_parallel_texts(
...     "Hello world.", "Bonjour le monde."
... )

CLI Functions¶

Command-line interface functions for processing EPUB files.

vivre.cli.align(source_epub: Path, target_epub: Path, language_pair: str, method: str, format: str, output: Path | None, c: float | None, s2: float | None, gap_penalty: float | None, verbose: bool) → None[source]¶

Align two EPUB files using the complete pipeline.

This command parses both EPUB files, segments the text into sentences, and aligns them using the specified method. The language_pair parameter is required for accurate alignment.

Examples

$ vivre align english.epub french.epub en-fr
$ vivre align english.epub spanish.epub es-en --format csv
$ vivre align english.epub french.epub en-fr --output result.json

vivre.cli.parse(epub_path: Path, show_content: bool, max_chapters: int | None, format: str, output: Path | None, segment: bool, language: str | None, verbose: bool) → None[source]¶

Parse and analyze an EPUB file with comprehensive details.

This command provides detailed analysis of EPUB files including metadata, chapter structure, content statistics, and optional sentence segmentation. It’s the one-stop-shop for analyzing a single EPUB file.

Examples

$ vivre parse book.epub
$ vivre parse book.epub --show-content --max-chapters 3
$ vivre parse book.epub --segment --language en --format csv
$ vivre parse book.epub --verbose --output analysis.json

vivre.cli.main(version: bool) → None[source]¶

Vivre - A library for processing parallel texts.

This CLI provides two powerful commands for EPUB processing and text alignment:

parse - Comprehensive EPUB analysis with metadata, structure, and optional segmentation
align - Parallel text alignment using machine learning techniques

Examples

$ vivre parse book.epub --verbose
$ vivre align english.epub french.epub en-fr --format csv
$ vivre --help

Internal Functions¶

These functions are used internally but may be useful for advanced users.

vivre.api._create_aligned_corpus(source_chapters: List[Chapter], target_chapters: List[Chapter], pipeline: VivrePipeline, book_title: str, language_pair: str) → Dict[str, Any][source]¶: Create the aligned corpus structure.

vivre.api._format_as_text(corpus: Dict[str, Any]) → str[source]¶: Format corpus as plain text.

vivre.api._format_as_csv(corpus: Dict[str, Any]) → str[source]¶: Format corpus as CSV.

vivre.api._format_as_xml(corpus: Dict[str, Any]) → str[source]¶: Format corpus as XML.

vivre.api._parse_source_or_chapters(source: str | Path | Chapters, name: str) → Tuple[List[Chapter], str][source]¶

Parse source or target, whether it’s a file path or Chapters object.

Parameters:

source – File path or Chapters object
name – Name for error messages (“source” or “target”)

Returns:

Tuple of (chapters, book_title)

CLI Formatting Functions¶

vivre.cli._format_alignments_as_text(output_data: dict) → str[source]¶: Format alignments as plain text.

vivre.cli._format_alignments_as_csv(output_data: dict) → str[source]¶: Format alignments as CSV with enhanced metadata.

vivre.cli._format_alignments_as_xml(output_data: dict) → str[source]¶: Format alignments as XML.

vivre.cli._format_parse_as_text(output_data: dict) → str[source]¶: Format parse results as plain text.

vivre.cli._format_parse_as_csv(output_data: dict) → str[source]¶: Format parse results as CSV.

vivre.cli._format_parse_as_xml(output_data: dict) → str[source]¶: Format parse results as XML.
