API

icu4py

icu4py.icu_version: str

A string representing the ICU version, for example "78.2".

icu4py.icu_version_info: tuple[int, int, int, int]

A tuple of four integers representing the ICU version in the format (major, minor, patch, build), for example, (78, 2, 0, 0).

icu4py.breakers

This module wraps ICU’s boundary analysis functionality, providing classes for finding text boundaries around characters, words, lines, and sentences.

class icu4py.breakers.BaseBreaker(text: str, locale: str | Locale)

Base class for the following breaker subclasses, which cannot be instantiated directly. Wraps ICU’s BreakIterator class.

Parameters:
  • text – The text to analyze for boundaries.

  • locale – The locale to use, as either a string (an ICU style C locale) or a Locale object.

text

The text being analyzed.

Type:

str

locale

The locale being used for boundary analysis.

Type:

Locale

__iter__() Iterator[str]

Iterate over text segments split by boundaries.

Returns:

An iterator of strings, each representing a segment of text between boundaries.

Example usage:

>>> from icu4py.breakers import WordBreaker
>>> breaker = WordBreaker("Hello World", "en_GB")
>>> list(breaker)
['Hello', ' ', 'World']
segments() Iterator[tuple[int, int]]

Iterate over boundary positions as (start, end) tuples.

Returns:

An iterator of (start, end) tuples representing boundary positions.

Example usage:

>>> from icu4py.breakers import WordBreaker
>>> breaker = WordBreaker("Hello World", "en_GB")
>>> list(breaker.segments())
[(0, 5), (5, 6), (6, 11)]
class icu4py.breakers.CharacterBreaker(text: str, locale: str | Locale)

BaseBreaker subclass for iterating over character (grapheme cluster) boundaries, handling combining characters and emoji sequences. Wraps ICU’s character-break iterator.

Example usage:

>>> from icu4py.breakers import CharacterBreaker
>>> greeting = "👋🏽 hi"
>>> list(greeting)  # splits by codepoints, emoji and skin tone are separate
['👋', '🏽', ' ', 'h', 'i']
>>> breaker = CharacterBreaker(greeting, "en_GB")
>>> list(breaker)  # splits by grapheme clusters, keeping emoji and skin tone together
['👋🏽', ' ', 'h', 'i']
class icu4py.breakers.WordBreaker(text: str, locale: str | Locale)

BaseBreaker subclass for iterating over word boundaries, correctly handling punctuation, hyphenated words, and contractions. Wraps ICU’s word boundary iterator.

Example usage:

>>> from icu4py.breakers import WordBreaker
>>> exclamation = "A self-made rabbit."
>>> list(WordBreaker(exclamation, "en_GB"))
['A', ' ', 'self', '-', 'made', ' ', 'rabbit', '.']
class icu4py.breakers.LineBreaker(text: str, locale: str | Locale)

BaseBreaker subclass for iterating over line-break boundaries, which are incicate where text could be wrapped to the next line, correctly handling punctuation and hyphenated words. Wraps ICU’s line-break iterator.

Example usage:

>>> from icu4py.breakers import LineBreaker
>>> review = "It's quite thirst-quenching."
>>> list(LineBreaker(review, "en_GB"))
["It's ", 'quite ', 'thirst-', 'quenching.']
class icu4py.breakers.SentenceBreaker(text: str, locale: str | Locale)

BaseBreaker subclass for iterating over sentence boundaries, handling periods within numbers, abbreviations, and trailing punctuation marks. Wraps ICU’s sentence-break iterator.

Example usage:

>>> from icu4py.breakers import SentenceBreaker
>>> tagline = 'You asked "Why?". We answered "Why not?"'
>>> list(SentenceBreaker(tagline, "en_GB@ss=standard"))
['You asked "Why?". ', 'We answered "Why not?"']

(The ss=standard locale extension enables sentence break filters, to filter out false breaks, like the perioad after Dr..)

icu4py.locale

This module wraps ICU’s locale functionality.

class icu4py.locale.Locale(language: str, country: str | None = None, variant: str | None = None, extensions: dict[str, str] | None = None)

A wrapper around ICU’s Locale class.

Represents a specific geographical, political, or cultural region.

Parameters:

language – A valid ISO Language Code: one of the lower-case two-letter codes as defined by ISO-639, like "en". Find a full list of these codes on Wikipedia.

Alternatively, this parameter may be provided as an ICU style C locale string, such as "en_GB" or "de_DE@collation=phonebook". In this case, the other parameters should be left as None.

Parameters:
  • country – A valid ISO Country Code: one of the upper-case two-letter (A-2) codes as defined by ISO-3166, like GB". Find a full list of these codes on Wikipedia.

  • variant – A Variant: variant codes are vendor and browser-specific.

  • extensions – A dictionary of Unicode locale extensions, such as {"collation": "phonebook", "currency": "euro"} (optional).

Per ICU’s behaviour, the Locale constructor performs no validation of the provided locale data. Operations use a best-match approach for locales. However, if input data is completely invalid, the locale is marked as “bogus”, which can be checked with the bogus attribute.

Example usage:

>>> from icu4py.locale import Locale
>>> locale = Locale("en", "GB")
>>> locale.bogus
False
>>> locale.language
'en'
>>> locale.country
'GB'
bogus: bool

Whether the locale is bogus (definitely invalid). Returns True if the locale is bogus, False if it is valid.

language: str

The locale’s ISO Language Code, like "en" for English.

Note that ICU canonicalizes the language code. For instance, a Locale constructed with the three-letter code "eng" will return "en".

country: str

The locale’s ISO Country Code, like "GB" for the United Kingdom.

Returns an empty string if no country code was specified.

variant: str

The locale’s variant code. Variant codes are vendor and browser-specific, such as "POSIX".

Returns an empty string if no variant was specified. Note that ICU uppercases variant codes.

extensions: dict[str, str]

A dictionary of the locale’s keywords and values (extensions). For example, {"collation": "phonebook", "currency": "USD"}.

Returns an empty dictionary if no extensions were specified.

icu4py.messageformat

This module wraps ICU’s MessageFormat V1 functionality.

class icu4py.messageformat.MessageFormat(pattern: str, locale: str | Locale)

A wrapper around ICU’s version 1 MessageFormat class.

Construct an instance with a message pattern and locale, then call format() with a dictionary of values to format the message.

Parameters:
  • pattern – The message pattern string.

  • locale – The locale to use, as either a string (an ICU style C locale) or a Locale object.

pattern: str

The message pattern string.

locale: Locale

The locale used for formatting, as a Locale object.

format(values: dict[str, Any]) str

Format the message with the given values.

Parameters:

values

A dictionary of names to values to format the message with.

Currently supported value types are int, float, str, decimal.Decimal, datetime.date, and datetime.datetime.

Returns:

The formatted message string.

Return type:

str

Example usage:

>>> from icu4py.messageformat import MessageFormat
>>> pattern = "{count,plural,one {# file} other {# files}}"
>>> fmt = MessageFormat(pattern, "en_GB")
>>> fmt.format({"count": 1})
'1 file'
>>> fmt.format({"count": 5})
'5 files'

A more complex example:

>>> from icu4py.messageformat import MessageFormat
>>> pattern = (
...     "{num_guests,plural,offset:1 "
...     "=0 {{host} does not throw a party.}"
...     "=1 {{host} invites {guest} to the party.}"
...     "=2 {{host} invites {guest} and one other person to the party.}"
...     "other {{host} invites {guest} and # other people to the party.}}"
... )
>>> fmt = MessageFormat(pattern, "en_GB")
>>> fmt.format({"num_guests": 0, "host": "Alice", "guest": "Bob"})
'Alice does not throw a party.'
>>> fmt.format({"num_guests": 1, "host": "Alice", "guest": "Bob"})
'Alice invites Bob to the party.'
>>> fmt.format({"num_guests": 5, "host": "Alice", "guest": "Bob"})
'Alice invites Bob and 4 other people to the party.'

Formatting a datetime:

>>> import datetime as dt
>>> from icu4py.messageformat import MessageFormat
>>> fmt = MessageFormat("Year {when,date,::yyyy}, month {when,date,::MM}", "en_GB")
>>> fmt.format({"when": dt.datetime(1985, 10, 26, 1, 24)})
'Year 1985, month 10'