API¶

`icu4py`¶

icu4py.icu_version: str¶: A string representing the ICU version, for example "78.2".

icu4py.icu_version_info: tuple[int, int, int, int]¶: A tuple of four integers representing the ICU version in the format (major, minor, patch, build), for example, (78, 2, 0, 0).

`icu4py.breakers`¶

This module wraps ICU’s boundary analysis functionality, providing classes for finding text boundaries around characters, words, lines, and sentences.

class icu4py.breakers.BaseBreaker(text: str, locale: str | Locale)¶

Base class for the following breaker subclasses, which cannot be instantiated directly. Wraps ICU’s BreakIterator class.

Parameters:

text – The text to analyze for boundaries.
locale – The locale to use, as either a string (an ICU style C locale) or a Locale object.

text¶

The text being analyzed.

Type:: str

locale¶

The locale being used for boundary analysis.

Type:: Locale

__iter__() → Iterator[str]¶

Iterate over text segments split by boundaries.

Returns:: An iterator of strings, each representing a segment of text between boundaries.

Example usage:

>>> from icu4py.breakers import WordBreaker
>>> breaker = WordBreaker("Hello World", "en_GB")
>>> list(breaker)
['Hello', ' ', 'World']

segments() → Iterator[tuple[int, int]]¶

Iterate over boundary positions as (start, end) tuples.

Returns:: An iterator of (start, end) tuples representing boundary positions.

Example usage:

>>> from icu4py.breakers import WordBreaker
>>> breaker = WordBreaker("Hello World", "en_GB")
>>> list(breaker.segments())
[(0, 5), (5, 6), (6, 11)]

class icu4py.breakers.CharacterBreaker(text: str, locale: str | Locale)¶

BaseBreaker subclass for iterating over character (grapheme cluster) boundaries, handling combining characters and emoji sequences. Wraps ICU’s character-break iterator.

Example usage:

>>> from icu4py.breakers import CharacterBreaker
>>> greeting = "👋🏽 hi"
>>> list(greeting)  # splits by codepoints, emoji and skin tone are separate
['👋', '🏽', ' ', 'h', 'i']
>>> breaker = CharacterBreaker(greeting, "en_GB")
>>> list(breaker)  # splits by grapheme clusters, keeping emoji and skin tone together
['👋🏽', ' ', 'h', 'i']

class icu4py.breakers.WordBreaker(text: str, locale: str | Locale)¶

BaseBreaker subclass for iterating over word boundaries, correctly handling punctuation, hyphenated words, and contractions. Wraps ICU’s word boundary iterator.

Example usage:

>>> from icu4py.breakers import WordBreaker
>>> exclamation = "A self-made rabbit."
>>> list(WordBreaker(exclamation, "en_GB"))
['A', ' ', 'self', '-', 'made', ' ', 'rabbit', '.']

class icu4py.breakers.LineBreaker(text: str, locale: str | Locale)¶

BaseBreaker subclass for iterating over line-break boundaries, which are incicate where text could be wrapped to the next line, correctly handling punctuation and hyphenated words. Wraps ICU’s line-break iterator.

Example usage:

>>> from icu4py.breakers import LineBreaker
>>> review = "It's quite thirst-quenching."
>>> list(LineBreaker(review, "en_GB"))
["It's ", 'quite ', 'thirst-', 'quenching.']

class icu4py.breakers.SentenceBreaker(text: str, locale: str | Locale)¶

BaseBreaker subclass for iterating over sentence boundaries, handling periods within numbers, abbreviations, and trailing punctuation marks. Wraps ICU’s sentence-break iterator.

Example usage:

>>> from icu4py.breakers import SentenceBreaker
>>> tagline = 'You asked "Why?". We answered "Why not?"'
>>> list(SentenceBreaker(tagline, "en_GB@ss=standard"))
['You asked "Why?". ', 'We answered "Why not?"']

(The ss=standard locale extension enables sentence break filters, to filter out false breaks, like the perioad after Dr..)

`icu4py.locale`¶

This module wraps ICU’s locale functionality.

class icu4py.locale.Locale(language: str, country: str | None = None, variant: str | None = None, extensions: dict[str, str] | None = None)¶

A wrapper around ICU’s Locale class.

Represents a specific geographical, political, or cultural region.

Parameters:: language – A valid ISO Language Code: one of the lower-case two-letter codes as defined by ISO-639, like "en". Find a full list of these codes on Wikipedia.

Alternatively, this parameter may be provided as an ICU style C locale string, such as "en_GB" or "de_DE@collation=phonebook". In this case, the other parameters should be left as None.

Parameters:

country – A valid ISO Country Code: one of the upper-case two-letter (A-2) codes as defined by ISO-3166, like GB". Find a full list of these codes on Wikipedia.
variant – A Variant: variant codes are vendor and browser-specific.
extensions – A dictionary of Unicode locale extensions, such as {"collation": "phonebook", "currency": "euro"} (optional).

Per ICU’s behaviour, the Locale constructor performs no validation of the provided locale data. Operations use a best-match approach for locales. However, if input data is completely invalid, the locale is marked as “bogus”, which can be checked with the bogus attribute.

Example usage:

>>> from icu4py.locale import Locale
>>> locale = Locale("en", "GB")
>>> locale.bogus
False
>>> locale.language
'en'
>>> locale.country
'GB'

bogus: bool¶: Whether the locale is bogus (definitely invalid). Returns True if the locale is bogus, False if it is valid.

language: str¶

The locale’s ISO Language Code, like "en" for English.

Note that ICU canonicalizes the language code. For instance, a Locale constructed with the three-letter code "eng" will return "en".

country: str¶

The locale’s ISO Country Code, like "GB" for the United Kingdom.

Returns an empty string if no country code was specified.

variant: str¶

The locale’s variant code. Variant codes are vendor and browser-specific, such as "POSIX".

Returns an empty string if no variant was specified. Note that ICU uppercases variant codes.

extensions: dict[str, str]¶

A dictionary of the locale’s keywords and values (extensions). For example, {"collation": "phonebook", "currency": "USD"}.

Returns an empty dictionary if no extensions were specified.

`icu4py.messageformat`¶

This module wraps ICU’s MessageFormat V1 functionality.

class icu4py.messageformat.MessageFormat(pattern: str, locale: str | Locale)¶

A wrapper around ICU’s version 1 MessageFormat class.

Construct an instance with a message pattern and locale, then call format() with a dictionary of values to format the message.

Parameters:

pattern – The message pattern string.
locale – The locale to use, as either a string (an ICU style C locale) or a Locale object.

pattern: str¶: The message pattern string.

locale: Locale¶: The locale used for formatting, as a Locale object.

format(values: dict[str, Any]) → str¶

Format the message with the given values.

Parameters:

values –

A dictionary of names to values to format the message with.

Currently supported value types are int, float, str, decimal.Decimal, datetime.date, and datetime.datetime.

Returns:

The formatted message string.

Return type:

str

Example usage:

>>> from icu4py.messageformat import MessageFormat
>>> pattern = "{count,plural,one {# file} other {# files}}"
>>> fmt = MessageFormat(pattern, "en_GB")
>>> fmt.format({"count": 1})
'1 file'
>>> fmt.format({"count": 5})
'5 files'

A more complex example:

>>> from icu4py.messageformat import MessageFormat
>>> pattern = (
...     "{num_guests,plural,offset:1 "
...     "=0 {{host} does not throw a party.}"
...     "=1 {{host} invites {guest} to the party.}"
...     "=2 {{host} invites {guest} and one other person to the party.}"
...     "other {{host} invites {guest} and # other people to the party.}}"
... )
>>> fmt = MessageFormat(pattern, "en_GB")
>>> fmt.format({"num_guests": 0, "host": "Alice", "guest": "Bob"})
'Alice does not throw a party.'
>>> fmt.format({"num_guests": 1, "host": "Alice", "guest": "Bob"})
'Alice invites Bob to the party.'
>>> fmt.format({"num_guests": 5, "host": "Alice", "guest": "Bob"})
'Alice invites Bob and 4 other people to the party.'

Formatting a datetime:

>>> import datetime as dt
>>> from icu4py.messageformat import MessageFormat
>>> fmt = MessageFormat("Year {when,date,::yyyy}, month {when,date,::MM}", "en_GB")
>>> fmt.format({"when": dt.datetime(1985, 10, 26, 1, 24)})
'Year 1985, month 10'

API¶

icu4py¶

icu4py.breakers¶

icu4py.locale¶

icu4py.messageformat¶

`icu4py`¶

`icu4py.breakers`¶

`icu4py.locale`¶

`icu4py.messageformat`¶