Documentation

UpdateUnicode extends BackgroundTask

This class contains code used to update SMF's Unicode data files.

Table of Contents

Constants

DATA_URL_IDNA  = 'https://www.unicode.org/Public/idna/latest'
DATA_URL_UCD  = 'https://unicode.org/Public/UCD/latest/ucd'
URLs where we can fetch the Unicode data files.
RECEIVE_NOTIFY_ALERT  = 0x1
RECEIVE_NOTIFY_EMAIL  = 0x2
Constants for notification types.

Properties

$temp_dir  : string
$ucd_version  : string
$unicodedir  : string
$_details  : array<string|int, mixed>
$char_data  : array<string|int, mixed>
$derived_normalization_props  : array<string|int, mixed>
$full_decomposition_maps  : array<string|int, mixed>
$funcs  : array<string|int, mixed>
$prefetch  : array<string|int, mixed>
$script_aliases  : array<string|int, mixed>
$script_stats  : array<string|int, mixed>
$time_limit  : int

Methods

__construct()  : mixed
The constructor.
execute()  : bool
This executes the task.
export_funcs_to_file()  : mixed
Updates Unicode data functions in their designated files.
getMinUserInfo()  : array<string|int, mixed>
Loads minimal info for the previously loaded user IDs.
build_func_array()  : mixed
Helper for get_function_code_and_regex(). Builds the function's data array.
build_idna()  : mixed
Builds maps and regex classes for IDNA purposes.
build_quick_check()  : mixed
Builds regular expressions for normalization quick check.
build_regex_indic()  : mixed
Builds regex classes for join control tests in utf8_sanitize_invisibles.
build_regex_joining_type()  : mixed
Builds regex classes for join control tests in utf8_sanitize_invisibles.
build_regex_properties()  : mixed
Builds regular expression classes for extended Unicode properties.
build_regex_variation_selectors()  : mixed
Builds regular expression classes for filtering variation selectors.
build_script_stats()  : mixed
Helper function for build_regex_joining_type and build_regex_indic.
deltree()  : mixed
Deletes a directory and its contents.
fetch_unicode_file()  : string
Fetches the contents of a Unicode data file.
finalize_decomposition_forms()  : mixed
Finalizes all the decomposition forms.
get_function_code_and_regex()  : array<string|int, mixed>
Builds complete code for the specified element in $this->funcs to be inserted into the relevant PHP file. Also builds a regex to check whether a copy of the function is already present in the file.
lookup_ucd_version()  : mixed
Sets $this->ucd_version to the latest version number of the UCD.
make_temp_dir()  : mixed
Makes a temporary directory to hold our working files, and sets $this->temp_dir to the path of the created directory.
process_casing_data()  : mixed
Processes SpecialCasing.txt and CaseFolding.txt in order to get finalized versions of all case conversion data.
process_derived_normalization_props()  : mixed
Processes DerivedNormalizationProps.txt in order to populate $this->derived_normalization_props.
process_main_unicode_data()  : mixed
Processes UnicodeData.txt in order to populate $this->char_data, $this->full_decomposition_maps, and the 'data' element of most elements of $this->funcs.
should_update()  : bool
Compares the version of SMF's local Unicode data with the latest release.
smf_file_header()  : string
Gets basic boilerplate for the PHP files that will be created.

Constants

DATA_URL_IDNA

public mixed DATA_URL_IDNA = 'https://www.unicode.org/Public/idna/latest'

DATA_URL_UCD

URLs where we can fetch the Unicode data files.

public mixed DATA_URL_UCD = 'https://unicode.org/Public/UCD/latest/ucd'

RECEIVE_NOTIFY_EMAIL

Constants for notification types.

public mixed RECEIVE_NOTIFY_EMAIL = 0x2
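
The values 0x1 and 0x2 are distinct bits, which suggests the notification constants are meant to be combined and tested as a bitmask. The following is a minimal sketch under that assumption. The helper wants_notification() is hypothetical and not part of this class.

```php
<?php
// Values copied from the constants listed on this page.
const RECEIVE_NOTIFY_ALERT = 0x1;
const RECEIVE_NOTIFY_EMAIL = 0x2;

// Hypothetical helper: does the preference bitmask include the given type?
function wants_notification(int $pref, int $type): bool
{
    return ($pref & $type) === $type;
}

// A preference value with both bits set (0x3).
$pref = RECEIVE_NOTIFY_ALERT | RECEIVE_NOTIFY_EMAIL;

var_dump(wants_notification($pref, RECEIVE_NOTIFY_EMAIL));                 // bool(true)
var_dump(wants_notification(RECEIVE_NOTIFY_ALERT, RECEIVE_NOTIFY_EMAIL));  // bool(false)
```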

Properties

$temp_dir

public string $temp_dir = ''

Path to temporary working directory.

$ucd_version

public string $ucd_version = ''

The latest official release of the Unicode Character Database.

$unicodedir

public string $unicodedir = ''

Convenience alias of Config::$sourcedir . '/Unicode'.

$_details

protected array<string|int, mixed> $_details

Holds the details for the task

$char_data

private array<string|int, mixed> $char_data = []

Assorted info about Unicode characters.

$derived_normalization_props

private array<string|int, mixed> $derived_normalization_props = []

Character properties used during normalization.

$full_decomposition_maps

private array<string|int, mixed> $full_decomposition_maps = []

Key-value pairs of character decompositions.

$funcs

private array<string|int, mixed> $funcs = [
    [
        'file' => 'Metadata.php',
        'regex' => '/if \\(!defined\\(\'SMF_UNICODE_VERSION\'\\)\\)(?:\\s*{)?\\n\\tdefine\\(\'SMF_UNICODE_VERSION\', \'\\d+(\\.\\d+)*\'\\);(?:\\n})?/',
        'data' => [
            // 0.0.0.0 will be replaced with correct value at runtime.
            "if (!defined('SMF_UNICODE_VERSION')) {\n\tdefine('SMF_UNICODE_VERSION', '0.0.0.0');\n}",
        ],
    ],
    'utf8_normalize_d_maps' => ['file' => 'DecompositionCanonical.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_normalize_d.'], 'return' => ['type' => 'array', 'desc' => 'Canonical Decomposition maps for Unicode normalization.'], 'data' => []],
    'utf8_normalize_kd_maps' => ['file' => 'DecompositionCompatibility.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_normalize_kd.'], 'return' => ['type' => 'array', 'desc' => 'Compatibility Decomposition maps for Unicode normalization.'], 'data' => []],
    'utf8_compose_maps' => ['file' => 'Composition.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_compose.'], 'return' => ['type' => 'array', 'desc' => 'Composition maps for Unicode normalization.'], 'data' => []],
    'utf8_combining_classes' => ['file' => 'CombiningClasses.php', 'key_type' => 'hexchar', 'val_type' => 'int', 'desc' => ['Helper function for utf8_normalize_d.'], 'return' => ['type' => 'array', 'desc' => 'Combining Class data for Unicode normalization.'], 'data' => []],
    'utf8_strtolower_simple_maps' => ['file' => 'CaseLower.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtolower.'], 'return' => ['type' => 'array', 'desc' => 'Uppercase to lowercase maps.'], 'data' => []],
    'utf8_strtolower_maps' => ['file' => 'CaseLower.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtolower.'], 'return' => ['type' => 'array', 'desc' => 'Uppercase to lowercase maps.'], 'data' => []],
    'utf8_strtoupper_simple_maps' => ['file' => 'CaseUpper.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtoupper.'], 'return' => ['type' => 'array', 'desc' => 'Lowercase to uppercase maps.'], 'data' => []],
    'utf8_strtoupper_maps' => ['file' => 'CaseUpper.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtoupper.'], 'return' => ['type' => 'array', 'desc' => 'Lowercase to uppercase maps.'], 'data' => []],
    'utf8_titlecase_simple_maps' => ['file' => 'CaseTitle.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_convert_case.'], 'return' => ['type' => 'array', 'desc' => 'Simple title case maps.'], 'data' => []],
    'utf8_titlecase_maps' => ['file' => 'CaseTitle.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_convert_case.'], 'return' => ['type' => 'array', 'desc' => 'Full title case maps.'], 'data' => []],
    'utf8_casefold_simple_maps' => ['file' => 'CaseFold.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_casefold.'], 'return' => ['type' => 'array', 'desc' => 'Casefolding maps.'], 'data' => []],
    'utf8_casefold_maps' => ['file' => 'CaseFold.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_casefold.'], 'return' => ['type' => 'array', 'desc' => 'Casefolding maps.'], 'data' => []],
    'utf8_default_ignorables' => ['file' => 'DefaultIgnorables.php', 'key_type' => 'int', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_normalize_kc_casefold.'], 'return' => ['type' => 'array', 'desc' => 'Characters with the \'Default_Ignorable_Code_Point\' property.'], 'data' => []],
    'utf8_regex_properties' => [
        'file' => 'RegularExpressions.php',
        'key_type' => 'string',
        'val_type' => 'string',
        'propfiles' => ['DerivedCoreProperties.txt', 'PropList.txt', 'emoji/emoji-data.txt', 'extracted/DerivedGeneralCategory.txt'],
        'props' => ['Bidi_Control', 'Case_Ignorable', 'Cn', 'Default_Ignorable_Code_Point', 'Emoji', 'Emoji_Modifier', 'Ideographic', 'Join_Control', 'Regional_Indicator', 'Variation_Selector'],
        'desc' => ['Helper function for utf8_sanitize_invisibles and utf8_convert_case.', '', 'Character class lists compiled from:', 'https://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt', 'https://unicode.org/Public/UNIDATA/PropList.txt', 'https://unicode.org/Public/UNIDATA/emoji/emoji-data.txt', 'https://unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt'],
        'return' => ['type' => 'array', 'desc' => 'Character classes for various Unicode properties.'],
        'data' => [],
    ],
    'utf8_regex_variation_selectors' => [
        'file' => 'RegularExpressions.php',
        'key_type' => 'string',
        'val_type' => 'string',
        'desc' => ['Helper function for utf8_sanitize_invisibles.', '', 'Character class lists compiled from:', 'https://unicode.org/Public/UNIDATA/StandardizedVariants.txt', 'https://unicode.org/Public/UNIDATA/emoji/emoji-variation-sequences.txt'],
        'return' => ['type' => 'array', 'desc' => 'Character classes for filtering variation selectors.'],
        'data' => [],
    ],
    'utf8_regex_joining_type' => [
        'file' => 'RegularExpressions.php',
        'key_type' => 'string',
        'val_type' => 'string',
        'desc' => ['Helper function for utf8_sanitize_invisibles.', '', 'Character class lists compiled from:', 'https://unicode.org/Public/UNIDATA/extracted/DerivedJoiningType.txt'],
        'return' => ['type' => 'array', 'desc' => 'Character classes for joining characters in certain scripts.'],
        'data' => [],
    ],
    'utf8_regex_indic' => [
        'file' => 'RegularExpressions.php',
        'key_type' => 'string',
        'val_type' => 'string',
        'desc' => ['Helper function for utf8_sanitize_invisibles.', '', 'Character class lists compiled from:', 'https://unicode.org/Public/UNIDATA/extracted/DerivedCombiningClass.txt', 'https://unicode.org/Public/UNIDATA/IndicSyllabicCategory.txt'],
        'return' => ['type' => 'array', 'desc' => 'Character classes for Indic scripts that use viramas.'],
        'data' => [],
    ],
    'utf8_regex_quick_check' => [
        'file' => 'QuickCheck.php',
        'key_type' => 'string',
        'val_type' => 'string',
        'desc' => ['Helper function for utf8_is_normalized.', '', 'Character class lists compiled from:', 'https://unicode.org/Public/UNIDATA/extracted/DerivedNormalizationProps.txt'],
        'return' => ['type' => 'array', 'desc' => 'Character classes for disallowed characters in normalization forms.'],
        'data' => [],
    ],
    'idna_maps' => ['file' => 'Idna.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => 'Character maps for IDNA processing.'], 'data' => []],
    'idna_maps_deviation' => ['file' => 'Idna.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => '"Deviation" character maps for IDNA processing.'], 'data' => []],
    'idna_maps_not_std3' => ['file' => 'Idna.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => 'Non-STD3 character maps for IDNA processing.'], 'data' => []],
    'idna_regex' => ['file' => 'Idna.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => 'Regular expressions useful for IDNA processing.'], 'data' => []],
]

Info about functions to build in SMF's Unicode data files.
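
The first, unkeyed element of $funcs pairs a regex with a code template for the SMF_UNICODE_VERSION definition. The sketch below shows one way the two could work together: the regex finds an existing definition in a file and the template, with its 0.0.0.0 placeholder substituted, replaces it. The regex and template are copied from the default value above. The version string and the file contents are made up for illustration, and the class's real replacement logic may differ.

```php
<?php
// Regex and template copied from the first element of $funcs.
$regex = '/if \\(!defined\\(\'SMF_UNICODE_VERSION\'\\)\\)(?:\\s*{)?\\n\\tdefine\\(\'SMF_UNICODE_VERSION\', \'\\d+(\\.\\d+)*\'\\);(?:\\n})?/';
$template = "if (!defined('SMF_UNICODE_VERSION')) {\n\tdefine('SMF_UNICODE_VERSION', '0.0.0.0');\n}";

// Hypothetical version string standing in for $this->ucd_version.
$ucd_version = '15.1.0.0';
$new_code = str_replace('0.0.0.0', $ucd_version, $template);

// Hypothetical existing file contents that still carry an older version.
$file_contents = "// Metadata\n\n" . str_replace('0.0.0.0', '14.0.0.0', $template) . "\n";

// Swap the old definition for the new one. A callback avoids any special
// handling of "$" or "\" sequences in the replacement string.
$updated = preg_replace_callback(
    $regex,
    function () use ($new_code) {
        return $new_code;
    },
    $file_contents
);

echo $updated;
```

Because the regex accepts any dotted run of digits as the version, the same pattern matches the definition both before and after the update.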

$prefetch

private array<string|int, mixed> $prefetch = [self::DATA_URL_UCD => ['CaseFolding.txt', 'DerivedAge.txt', 'DerivedCoreProperties.txt', 'DerivedNormalizationProps.txt', 'IndicSyllabicCategory.txt', 'PropertyValueAliases.txt', 'PropList.txt', 'ScriptExtensions.txt', 'Scripts.txt', 'SpecialCasing.txt', 'StandardizedVariants.txt', 'UnicodeData.txt', 'emoji/emoji-data.txt', 'emoji/emoji-variation-sequences.txt', 'extracted/DerivedGeneralCategory.txt', 'extracted/DerivedJoiningType.txt'], self::DATA_URL_IDNA => ['IdnaMappingTable.txt']]

Files to fetch from unicode.org.

$script_aliases

private array<string|int, mixed> $script_aliases = []

Tracks associations between character scripts' short and long names.

$script_stats

private array<string|int, mixed> $script_stats = []

Statistical info about character scripts (e.g. Latin, Greek, Cyrillic).

$time_limit

private int $time_limit = 30

Used to ensure we exit long-running tasks cleanly.

Methods

__construct()

The constructor.

public __construct(array<string|int, mixed> $details) : mixed
Parameters
$details : array<string|int, mixed>

The details for the task

execute()

This executes the task.

public execute() : bool
Return values
bool

Always returns true

export_funcs_to_file()

Updates Unicode data functions in their designated files.

public export_funcs_to_file() : mixed

getMinUserInfo()

Loads minimal info for the previously loaded user IDs.

public getMinUserInfo([array<string|int, mixed> $user_ids = [] ]) : array<string|int, mixed>
Parameters
$user_ids : array<string|int, mixed> = []
Tags
throws
Exception
Return values
array<string|int, mixed>

build_func_array()

Helper for get_function_code_and_regex(). Builds the function's data array.

private build_func_array(string &$func_code, array<string|int, mixed> $data, string $key_type, string $val_type) : mixed
Parameters
$func_code : string

The raw string that contains function code.

$data : array<string|int, mixed>

Data to format as an array.

$key_type : string

How to format the array keys.

$val_type : string

How to format the array values.
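
To illustrate what formatting by $key_type and $val_type could look like, here is a rough sketch. It assumes that 'hexchar' means rendering each byte of a string as a \xNN escape, 'int' means a plain integer, and anything else is a quoted string. These semantics are a guess based on the type names in $funcs. The helper names are hypothetical, and unlike the real method (which appends to $func_code by reference) this version simply returns the generated code.

```php
<?php
// Hypothetical helper: render one key or value according to its declared type.
function format_item($item, string $type): string
{
    if ($type === 'hexchar') {
        // Assumed meaning: every byte becomes a \xNN escape in a double-quoted literal.
        $bytes = (string) $item;
        $escaped = '';
        for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
            $escaped .= '\\x' . strtoupper(bin2hex($bytes[$i]));
        }
        return '"' . $escaped . '"';
    }
    if ($type === 'int') {
        return (string) (int) $item;
    }
    // Fallback: a single-quoted PHP string literal.
    return var_export((string) $item, true);
}

// Hypothetical helper: render a whole data array as PHP source code.
function format_data_array(array $data, string $key_type, string $val_type): string
{
    $code = "[\n";
    foreach ($data as $key => $value) {
        $code .= "\t" . format_item($key, $key_type) . ' => ' . format_item($value, $val_type) . ",\n";
    }
    return $code . ']';
}

// U+00C0 (bytes C3 80) mapped to "A" followed by U+0300 (bytes CC 80).
echo format_data_array(["\xC3\x80" => "A\xCC\x80"], 'hexchar', 'hexchar'), "\n";
```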

build_idna()

Builds maps and regex classes for IDNA purposes.

private build_idna() : mixed

build_quick_check()

Builds regular expressions for normalization quick check.

private build_quick_check() : mixed

build_regex_indic()

Builds regex classes for join control tests in utf8_sanitize_invisibles.

private build_regex_indic() : mixed

Specifically, for Indic scripts like Devanagari.

build_regex_joining_type()

Builds regex classes for join control tests in utf8_sanitize_invisibles.

private build_regex_joining_type() : mixed

Specifically, for cursive scripts like Arabic.

build_regex_properties()

Builds regular expression classes for extended Unicode properties.

private build_regex_properties() : mixed

build_regex_variation_selectors()

Builds regular expression classes for filtering variation selectors.

private build_regex_variation_selectors() : mixed

build_script_stats()

Helper function for build_regex_joining_type and build_regex_indic.

private build_script_stats() : mixed

deltree()

Deletes a directory and its contents.

private deltree(mixed $dir_path) : mixed
Parameters
$dir_path : mixed
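
Deleting a directory and its contents is usually done depth-first, since rmdir() only succeeds on an empty directory. A minimal sketch of that approach follows. The function name delete_tree() is hypothetical and the real deltree() may be implemented differently.

```php
<?php
// Hypothetical stand-in for deltree(): remove a directory and everything in it.
function delete_tree(string $dir_path): void
{
    if (!is_dir($dir_path)) {
        return;
    }

    foreach (scandir($dir_path) as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }

        $path = $dir_path . DIRECTORY_SEPARATOR . $entry;

        if (is_dir($path) && !is_link($path)) {
            // Recurse into real subdirectories first.
            delete_tree($path);
        } else {
            // Files and symlinks are simply unlinked.
            unlink($path);
        }
    }

    // The directory is empty now, so it can be removed.
    rmdir($dir_path);
}
```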

fetch_unicode_file()

Fetches the contents of a Unicode data file.

private fetch_unicode_file(string $filename, string $data_url) : string

Caches a local copy for subsequent lookups.

Parameters
$filename : string

Name of a Unicode datafile, relative to $data_url.

$data_url : string

One of this class's DATA_URL_* constants.

Return values
string

Path to locally saved copy of the file.
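
The fetch-then-cache behavior described above can be sketched as follows. This is only an illustration: fetch_cached() and its $cache_dir parameter are hypothetical, and it uses file_get_contents(), which needs allow_url_fopen enabled for remote URLs. The real method presumably stores files under $this->temp_dir and may use a different HTTP client.

```php
<?php
// Hypothetical stand-in for fetch_unicode_file(): download once, then reuse the local copy.
function fetch_cached(string $filename, string $data_url, string $cache_dir): string
{
    $local_path = $cache_dir . '/' . $filename;

    // Subsequent lookups hit the cached copy.
    if (is_file($local_path)) {
        return $local_path;
    }

    // File names such as 'emoji/emoji-data.txt' need their subdirectory created.
    $local_dir = dirname($local_path);
    if (!is_dir($local_dir) && !mkdir($local_dir, 0777, true) && !is_dir($local_dir)) {
        throw new RuntimeException("Could not create $local_dir");
    }

    $contents = @file_get_contents($data_url . '/' . $filename);
    if ($contents === false) {
        throw new RuntimeException("Could not fetch $filename from $data_url");
    }

    file_put_contents($local_path, $contents);

    return $local_path;
}
```

A call might look like fetch_cached('UnicodeData.txt', 'https://unicode.org/Public/UCD/latest/ucd', $temp_dir), where the URL is the DATA_URL_UCD value documented on this page.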

finalize_decomposition_forms()

Finalizes all the decomposition forms.

private finalize_decomposition_forms() : mixed

This is necessary because some characters decompose to other characters that themselves decompose further.
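
For example, U+01D5 decomposes to U+00DC U+0304, and U+00DC in turn decomposes to U+0055 U+0308, so the full decomposition of U+01D5 is U+0055 U+0308 U+0304. The sketch below shows the recursive expansion on a plain array of code points. The function name and data layout are hypothetical; the class itself works on $this->full_decomposition_maps, whose exact format is not documented here.

```php
<?php
// Hypothetical helper: expand every mapping until no code point decomposes further.
// $maps is keyed by code point; each value is a list of code points.
function fully_decompose(array $maps): array
{
    $expand = function (array $sequence) use (&$expand, $maps): array {
        $out = [];
        foreach ($sequence as $cp) {
            if (isset($maps[$cp])) {
                // This code point decomposes further, so expand it recursively.
                $out = array_merge($out, $expand($maps[$cp]));
            } else {
                $out[] = $cp;
            }
        }
        return $out;
    };

    $full = [];
    foreach ($maps as $cp => $sequence) {
        $full[$cp] = $expand($sequence);
    }

    return $full;
}

$maps = [
    0x00DC => [0x0055, 0x0308], // U WITH DIAERESIS
    0x01D5 => [0x00DC, 0x0304], // U WITH DIAERESIS AND MACRON
];

$full = fully_decompose($maps);
echo implode(' ', array_map(function ($cp) {
    return sprintf('U+%04X', $cp);
}, $full[0x01D5])), "\n";
// prints "U+0055 U+0308 U+0304"
```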

get_function_code_and_regex()

Builds complete code for the specified element in $this->funcs to be inserted into the relevant PHP file. Also builds a regex to check whether a copy of the function is already present in the file.

private get_function_code_and_regex(string $func_name) : array<string|int, mixed>
Parameters
$func_name : string

Key of an element in $this->funcs.

Return values
array<string|int, mixed>

PHP code and a regular expression.

lookup_ucd_version()

Sets $this->ucd_version to the latest version number of the UCD.

private lookup_ucd_version() : mixed

make_temp_dir()

Makes a temporary directory to hold our working files, and sets $this->temp_dir to the path of the created directory.

private make_temp_dir() : mixed
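
One common way to create such a working directory is to pick an unused name under the system temp directory. The sketch below does that and returns the path instead of setting a property. The function name, the prefix, and the choice of sys_get_temp_dir() as the base are assumptions, not the class's documented behavior.

```php
<?php
// Hypothetical stand-in for make_temp_dir(): create a fresh working directory.
function make_working_dir(string $prefix = 'Unicode'): string
{
    $base = rtrim(sys_get_temp_dir(), '/\\');

    // Pick a name that is not already taken.
    do {
        $path = $base . DIRECTORY_SEPARATOR . $prefix . '_' . bin2hex(random_bytes(4));
    } while (file_exists($path));

    if (!mkdir($path, 0700) && !is_dir($path)) {
        throw new RuntimeException("Could not create $path");
    }

    return $path;
}
```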

process_casing_data()

Processes SpecialCasing.txt and CaseFolding.txt in order to get finalized versions of all case conversion data.

private process_casing_data() : mixed

process_derived_normalization_props()

Processes DerivedNormalizationProps.txt in order to populate $this->derived_normalization_props.

private process_derived_normalization_props() : mixed
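
UCD data files such as DerivedNormalizationProps.txt use a simple line format: a code point or a range written as START..END, followed by semicolon-separated fields, with anything after # treated as a comment. A minimal parser for one such line might look like the sketch below. The function name and the returned array shape are hypothetical and do not reflect how this method stores its results.

```php
<?php
// Hypothetical helper: parse one line of a UCD data file.
// Returns null for blank or comment-only lines.
function parse_ucd_line(string $line): ?array
{
    // Drop the trailing comment, if any.
    $hash = strpos($line, '#');
    if ($hash !== false) {
        $line = substr($line, 0, $hash);
    }

    $line = trim($line);
    if ($line === '') {
        return null;
    }

    $fields = array_map('trim', explode(';', $line));

    // The first field is either a single code point or START..END.
    $range = explode('..', array_shift($fields));
    $start = hexdec($range[0]);
    $end = hexdec($range[1] ?? $range[0]);

    return ['start' => $start, 'end' => $end, 'fields' => $fields];
}

$entry = parse_ucd_line('00C0..00C5    ; NFD_QC; N # L&   [6] LATIN CAPITAL LETTER A WITH GRAVE..');
printf("U+%04X..U+%04X %s=%s\n", $entry['start'], $entry['end'], $entry['fields'][0], $entry['fields'][1]);
// prints "U+00C0..U+00C5 NFD_QC=N"
```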

process_main_unicode_data()

Processes UnicodeData.txt in order to populate $this->char_data, $this->full_decomposition_maps, and the 'data' element of most elements of $this->funcs.

private process_main_unicode_data() : mixed

should_update()

Compares the version of SMF's local Unicode data with the latest release.

private should_update() : bool
Return values
bool

Whether SMF should update its local Unicode data or not.
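
A version check of this kind can be sketched with PHP's version_compare(). The example assumes both versions are dotted number strings like the '0.0.0.0' placeholder in $funcs, and pads them to four parts first so that '15.0.0' and '15.0.0.0' count as equal. The function names are hypothetical and the real method may compare differently.

```php
<?php
// Hypothetical helper: pad a dotted version to a fixed number of parts.
function pad_version(string $version, int $parts = 4): string
{
    return implode('.', array_pad(explode('.', $version), $parts, '0'));
}

// Hypothetical stand-in for should_update(): is the local data older than the latest release?
function needs_update(string $local_version, string $latest_version): bool
{
    return version_compare(pad_version($local_version), pad_version($latest_version), '<');
}

var_dump(needs_update('14.0.0.0', '15.0.0')); // bool(true)
var_dump(needs_update('15.0.0.0', '15.0.0')); // bool(false)
```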

smf_file_header()

Gets basic boilerplate for the PHP files that will be created.

private smf_file_header() : string
Return values
string

Standard SMF file header.


        