===
API
===

``icu4py``
==========

.. currentmodule:: icu4py

.. data:: icu_version
   :type: str

   A string representing the ICU version, for example ``"78.2"``.

.. data:: icu_version_info
   :type: tuple[int, int, int, int]

   A tuple of four integers representing the ICU version in the format ``(major, minor, patch, build)``, for example, ``(78, 2, 0, 0)``.

``icu4py.breakers``
===================

This module wraps ICU's `boundary analysis`__ functionality, providing classes for finding text boundaries around characters, words, lines, and sentences.

__ https://unicode-org.github.io/icu/userguide/boundaryanalysis/

.. currentmodule:: icu4py.breakers

.. class:: BaseBreaker(text: str, locale: str | Locale)

   Base class for the breaker subclasses below.
   It cannot be instantiated directly.
   Wraps ICU’s |BreakIterator class|__.

   .. |BreakIterator class| replace:: ``BreakIterator`` class
   __ https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1BreakIterator.html#details

   :param text: The text to analyze for boundaries.
   :param locale: The locale to use, as either a string (an ICU style C locale) or a :class:`~icu4py.locale.Locale` object.

   .. attribute:: text

      The text being analyzed.

      :type: str

   .. attribute:: locale

      The locale being used for boundary analysis.

      :type: Locale

   .. method:: __iter__() -> Iterator[str]

      Iterate over text segments split by boundaries.

      :return: An iterator of strings, each representing a segment of text between boundaries.

      Example usage:

      .. doctest::

         >>> from icu4py.breakers import WordBreaker
         >>> breaker = WordBreaker("Hello World", "en_GB")
         >>> list(breaker)
         ['Hello', ' ', 'World']

   .. method:: segments() -> Iterator[tuple[int, int]]

      Iterate over boundary positions as ``(start, end)`` tuples.

      :return: An iterator of ``(start, end)`` tuples representing boundary positions.

      Example usage:

      .. doctest::

         >>> from icu4py.breakers import WordBreaker
         >>> breaker = WordBreaker("Hello World", "en_GB")
         >>> list(breaker.segments())
         [(0, 5), (5, 6), (6, 11)]

.. class:: CharacterBreaker(text: str, locale: str | Locale)

   :class:`BaseBreaker` subclass for iterating over character (grapheme cluster) boundaries, handling combining characters and emoji sequences.
   Wraps ICU's `character-break iterator`__.

   __ https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1BreakIterator.html#afffc1125b180a61857f698e147b1a668

   Example usage:

   .. doctest::

      >>> from icu4py.breakers import CharacterBreaker
      >>> greeting = "👋🏽 hi"
      >>> list(greeting)  # splits by codepoints, emoji and skin tone are separate
      ['👋', '🏽', ' ', 'h', 'i']
      >>> breaker = CharacterBreaker(greeting, "en_GB")
      >>> list(breaker)  # splits by grapheme clusters, keeping emoji and skin tone together
      ['👋🏽', ' ', 'h', 'i']

.. class:: WordBreaker(text: str, locale: str | Locale)

   :class:`BaseBreaker` subclass for iterating over word boundaries, correctly handling punctuation, hyphenated words, and contractions.
   Wraps ICU's `word boundary iterator`__.

   __ https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1BreakIterator.html#a6aa1459cc086397bdb85ccd1bb3c5500

   Example usage:

   .. doctest::

      >>> from icu4py.breakers import WordBreaker
      >>> exclamation = "A self-made rabbit."
      >>> list(WordBreaker(exclamation, "en_GB"))
      ['A', ' ', 'self', '-', 'made', ' ', 'rabbit', '.']

.. class:: LineBreaker(text: str, locale: str | Locale)

   :class:`BaseBreaker` subclass for iterating over line-break boundaries, which indicate where text could be wrapped to the next line, correctly handling punctuation and hyphenated words.
   Wraps ICU's `line-break iterator`__.

   __ https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1BreakIterator.html#aae588706df064825f1bccb2a9165169e

   Example usage:

   .. doctest::

      >>> from icu4py.breakers import LineBreaker
      >>> review = "It's quite thirst-quenching."
      >>> list(LineBreaker(review, "en_GB"))
      ["It's ", 'quite ', 'thirst-', 'quenching.']

.. class:: SentenceBreaker(text: str, locale: str | Locale)

   :class:`BaseBreaker` subclass for iterating over sentence boundaries, handling periods within numbers, abbreviations, and trailing punctuation marks.
   Wraps ICU's `sentence-break iterator`__.

   __ https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1BreakIterator.html#ae161880c561882dad879112e15fde42b

   Example usage:

   .. doctest::

      >>> from icu4py.breakers import SentenceBreaker
      >>> tagline = 'You asked "Why?". We answered "Why not?"'
      >>> list(SentenceBreaker(tagline, "en_GB@ss=standard"))
      ['You asked "Why?". ', 'We answered "Why not?"']

   (The ``ss=standard`` locale extension enables `sentence break filters`__, which filter out false breaks, such as a break after the period in ``Dr.``.)

   __ https://unicode-org.github.io/icu/userguide/boundaryanalysis/#sentence-break-filters

``icu4py.locale``
=================

This module wraps ICU's `locale functionality`__.

__ https://unicode-org.github.io/icu/userguide/locale/

.. currentmodule:: icu4py.locale

.. class:: Locale(language: str, country: str | None = None, variant: str | None = None, extensions: dict[str, str] | None = None)

   A wrapper around ICU's |Locale class|__.

   .. |Locale class| replace:: ``Locale`` class
   __ https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1Locale.html#details

   Represents a specific geographical, political, or cultural region.

   :param language: A valid **ISO Language Code**: one of the lower-case two-letter codes as defined by ISO-639, like ``"en"``.
      Find a full list of these codes `on Wikipedia `__.

      Alternatively, this parameter may be provided as an **ICU style C locale string**, such as ``"en_GB"`` or ``"de_DE@collation=phonebook"``.
      In this case, the other parameters should be left as ``None``.

   :param country: A valid **ISO Country Code**: one of the upper-case two-letter (A-2) codes as defined by ISO-3166, like ``"GB"``.
      Find a full list of these codes `on Wikipedia `__.

   :param variant: A **Variant**: variant codes are vendor and browser-specific.

   :param extensions: A dictionary of Unicode locale extensions, such as ``{"collation": "phonebook", "currency": "euro"}`` (optional).

   Per ICU’s behaviour, the ``Locale`` constructor performs no validation of the provided locale data.
   Operations use a best-match approach for locales.
   However, if input data is completely invalid, the locale is marked as “bogus”, which can be checked with the :attr:`bogus` attribute.

   Example usage:

   .. doctest::

      >>> from icu4py.locale import Locale
      >>> locale = Locale("en", "GB")
      >>> locale.bogus
      False
      >>> locale.language
      'en'
      >>> locale.country
      'GB'

   .. attribute:: bogus
      :type: bool

      Whether the locale is bogus (definitely invalid).
      Returns ``True`` if the locale is bogus, ``False`` otherwise.

   .. attribute:: language
      :type: str

      The locale's ISO Language Code, like ``"en"`` for English.

      Note that ICU canonicalizes the language code.
      For instance, a ``Locale`` constructed with the three-letter code ``"eng"`` will return ``"en"``.

   .. attribute:: country
      :type: str

      The locale's ISO Country Code, like ``"GB"`` for the United Kingdom.
      Returns an empty string if no country code was specified.

   .. attribute:: variant
      :type: str

      The locale's variant code.
      Variant codes are vendor and browser-specific, such as ``"POSIX"``.
      Returns an empty string if no variant was specified.

      Note that ICU uppercases variant codes.

   .. attribute:: extensions
      :type: dict[str, str]

      A dictionary of the locale's keywords and values (extensions).
      For example, ``{"collation": "phonebook", "currency": "USD"}``.
      Returns an empty dictionary if no extensions were specified.

``icu4py.messageformat``
========================

This module wraps ICU’s `MessageFormat V1 functionality`__.

__ https://unicode-org.github.io/icu/userguide/format_parse/messages/

.. currentmodule:: icu4py.messageformat

.. class:: MessageFormat(pattern: str, locale: str | Locale)

   A wrapper around ICU’s version 1 |MessageFormat class|__.

   .. |MessageFormat class| replace:: ``MessageFormat`` class
   __ https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1MessageFormat.html#details

   Construct an instance with a message pattern and locale, then call :meth:`format` with a dictionary of values to format the message.

   :param pattern: The message pattern string.
   :param locale: The locale to use, as either a string (an ICU style C locale) or a :class:`~icu4py.locale.Locale` object.

   .. attribute:: pattern
      :type: str

      The message pattern string.

   .. attribute:: locale
      :type: Locale

      The locale used for formatting, as a :class:`~icu4py.locale.Locale` object.

   .. method:: format(values: dict[str, Any]) -> str

      Format the message with the given values.

      :param values: A dictionary of names to values to format the message with.
         Currently supported value types are ``int``, ``float``, ``str``, |Decimal|__, |date|__, and |datetime|__.

         .. |Decimal| replace:: ``decimal.Decimal``
         __ https://docs.python.org/3/library/decimal.html#decimal.Decimal

         .. |date| replace:: ``datetime.date``
         __ https://docs.python.org/3/library/datetime.html#datetime.date

         .. |datetime| replace:: ``datetime.datetime``
         __ https://docs.python.org/3/library/datetime.html#datetime.datetime

      :return: The formatted message string.
      :rtype: str

      Example usage:

      .. doctest::

         >>> from icu4py.messageformat import MessageFormat
         >>> pattern = "{count,plural,one {# file} other {# files}}"
         >>> fmt = MessageFormat(pattern, "en_GB")
         >>> fmt.format({"count": 1})
         '1 file'
         >>> fmt.format({"count": 5})
         '5 files'

      A more complex example:

      .. doctest::

         >>> from icu4py.messageformat import MessageFormat
         >>> pattern = (
         ...     "{num_guests,plural,offset:1 "
         ...     "=0 {{host} does not throw a party.}"
         ...     "=1 {{host} invites {guest} to the party.}"
         ...     "=2 {{host} invites {guest} and one other person to the party.}"
         ...     "other {{host} invites {guest} and # other people to the party.}}"
         ... )
         >>> fmt = MessageFormat(pattern, "en_GB")
         >>> fmt.format({"num_guests": 0, "host": "Alice", "guest": "Bob"})
         'Alice does not throw a party.'
         >>> fmt.format({"num_guests": 1, "host": "Alice", "guest": "Bob"})
         'Alice invites Bob to the party.'
         >>> fmt.format({"num_guests": 5, "host": "Alice", "guest": "Bob"})
         'Alice invites Bob and 4 other people to the party.'

      Formatting a ``datetime``:

      .. doctest::

         >>> import datetime as dt
         >>> from icu4py.messageformat import MessageFormat
         >>> fmt = MessageFormat("Year {when,date,::yyyy}, month {when,date,::MM}", "en_GB")
         >>> fmt.format({"when": dt.datetime(1985, 10, 26, 1, 24)})
         'Year 1985, month 10'