UpdateUnicode
extends BackgroundTask
in package
This class contains code used to update SMF's Unicode data files.
Table of Contents
Constants
- DATA_URL_IDNA = 'https://www.unicode.org/Public/idna/latest'
- DATA_URL_UCD = 'https://unicode.org/Public/UCD/latest/ucd'
- URLs where we can fetch the Unicode data files.
- RECEIVE_NOTIFY_ALERT = 0x1
- RECEIVE_NOTIFY_EMAIL = 0x2
- Constants for notification types.
Properties
- $temp_dir : string
- $ucd_version : string
- $unicodedir : string
- $_details : array<string|int, mixed>
- $char_data : array<string|int, mixed>
- $derived_normalization_props : array<string|int, mixed>
- $full_decomposition_maps : array<string|int, mixed>
- $funcs : array<string|int, mixed>
- $prefetch : array<string|int, mixed>
- $script_aliases : array<string|int, mixed>
- $script_stats : array<string|int, mixed>
- $time_limit : int
Methods
- __construct() : mixed
- The constructor.
- execute() : bool
- This executes the task.
- export_funcs_to_file() : mixed
- Updates Unicode data functions in their designated files.
- getMinUserInfo() : array<string|int, mixed>
- Loads minimal info for the previously loaded user IDs.
- build_func_array() : mixed
- Helper for get_function_code_and_regex(). Builds the function's data array.
- build_idna() : mixed
- Builds maps and regex classes for IDNA purposes.
- build_quick_check() : mixed
- Builds regular expressions for normalization quick check.
- build_regex_indic() : mixed
- Builds regex classes for join control tests in utf8_sanitize_invisibles.
- build_regex_joining_type() : mixed
- Builds regex classes for join control tests in utf8_sanitize_invisibles.
- build_regex_properties() : mixed
- Builds regular expression classes for extended Unicode properties.
- build_regex_variation_selectors() : mixed
- Builds regular expression classes for filtering variation selectors.
- build_script_stats() : mixed
- Helper function for build_regex_joining_type and build_regex_indic.
- deltree() : mixed
- Deletes a directory and its contents.
- fetch_unicode_file() : string
- Fetches the contents of a Unicode data file.
- finalize_decomposition_forms() : mixed
- Finalizes all the decomposition forms.
- get_function_code_and_regex() : array<string|int, mixed>
- Builds complete code for the specified element in $this->funcs to be inserted into the relevant PHP file. Also builds a regex to check whether a copy of the function is already present in the file.
- lookup_ucd_version() : mixed
- Sets $this->ucd_version to the latest version number of the UCD.
- make_temp_dir() : mixed
- Makes a temporary directory to hold our working files, and sets $this->temp_dir to the path of the created directory.
- process_casing_data() : mixed
- Processes SpecialCasing.txt and CaseFolding.txt in order to get finalized versions of all case conversion data.
- process_derived_normalization_props() : mixed
- Processes DerivedNormalizationProps.txt in order to populate $this->derived_normalization_props.
- process_main_unicode_data() : mixed
- Processes UnicodeData.txt in order to populate $this->char_data, $this->full_decomposition_maps, and the 'data' element of most elements of $this->funcs.
- should_update() : bool
- Compares version of SMF's local Unicode data with the latest release.
- smf_file_header() : string
- Gets basic boilerplate for the PHP files that will be created.
Constants
DATA_URL_IDNA
public
mixed
DATA_URL_IDNA
= 'https://www.unicode.org/Public/idna/latest'
DATA_URL_UCD
URLs where we can fetch the Unicode data files.
public
mixed
DATA_URL_UCD
= 'https://unicode.org/Public/UCD/latest/ucd'
RECEIVE_NOTIFY_ALERT
public
mixed
RECEIVE_NOTIFY_ALERT
= 0x1
RECEIVE_NOTIFY_EMAIL
Constants for notification types.
public
mixed
RECEIVE_NOTIFY_EMAIL
= 0x2
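The hexadecimal values 0x1 and 0x2 suggest that the notification constants are bit flags that can be combined into a single preference value. Assuming that reading is correct, a minimal sketch of combining and testing them might look like the following. The global constants and the `enabled_notify_types()` helper are hypothetical stand-ins for illustration, not part of SMF.

```php
<?php
// Hypothetical stand-ins that mirror the documented constant values.
const RECEIVE_NOTIFY_ALERT = 0x1;
const RECEIVE_NOTIFY_EMAIL = 0x2;

/**
 * Lists the notification types enabled in a bitmask preference.
 * Hypothetical helper for illustration only.
 */
function enabled_notify_types(int $pref): array
{
    $types = [];

    // A bitwise AND isolates one flag from the combined value.
    if ($pref & RECEIVE_NOTIFY_ALERT) {
        $types[] = 'alert';
    }

    if ($pref & RECEIVE_NOTIFY_EMAIL) {
        $types[] = 'email';
    }

    return $types;
}

// A bitwise OR combines both flags into one value (0x3).
$both = RECEIVE_NOTIFY_ALERT | RECEIVE_NOTIFY_EMAIL;
```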
Properties
$temp_dir
public
string
$temp_dir
= ''
Path to temporary working directory.
$ucd_version
public
string
$ucd_version
= ''
The latest official release of the Unicode Character Database.
$unicodedir
public
string
$unicodedir
= ''
Convenience alias of Config::$sourcedir . '/Unicode'.
$_details
protected
array<string|int, mixed>
$_details
Holds the details for the task
$char_data
private
array<string|int, mixed>
$char_data
= []
Assorted info about Unicode characters.
$derived_normalization_props
private
array<string|int, mixed>
$derived_normalization_props
= []
Character properties used during normalization.
$full_decomposition_maps
private
array<string|int, mixed>
$full_decomposition_maps
= []
Key-value pairs of character decompositions.
$funcs
private
array<string|int, mixed>
$funcs
= [['file' => 'Metadata.php', 'regex' => '/if \\(!defined\\(\'SMF_UNICODE_VERSION\'\\)\\)(?:\\s*{)?\\n\\tdefine\\(\'SMF_UNICODE_VERSION\', \'\\d+(\\.\\d+)*\'\\);(?:\\n})?/', 'data' => [
// 0.0.0.0 will be replaced with correct value at runtime.
"if (!defined('SMF_UNICODE_VERSION')) {\n\tdefine('SMF_UNICODE_VERSION', '0.0.0.0');\n}",
]], 'utf8_normalize_d_maps' => ['file' => 'DecompositionCanonical.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_normalize_d.'], 'return' => ['type' => 'array', 'desc' => 'Canonical Decomposition maps for Unicode normalization.'], 'data' => []], 'utf8_normalize_kd_maps' => ['file' => 'DecompositionCompatibility.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_normalize_kd.'], 'return' => ['type' => 'array', 'desc' => 'Compatibility Decomposition maps for Unicode normalization.'], 'data' => []], 'utf8_compose_maps' => ['file' => 'Composition.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_compose.'], 'return' => ['type' => 'array', 'desc' => 'Composition maps for Unicode normalization.'], 'data' => []], 'utf8_combining_classes' => ['file' => 'CombiningClasses.php', 'key_type' => 'hexchar', 'val_type' => 'int', 'desc' => ['Helper function for utf8_normalize_d.'], 'return' => ['type' => 'array', 'desc' => 'Combining Class data for Unicode normalization.'], 'data' => []], 'utf8_strtolower_simple_maps' => ['file' => 'CaseLower.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtolower.'], 'return' => ['type' => 'array', 'desc' => 'Uppercase to lowercase maps.'], 'data' => []], 'utf8_strtolower_maps' => ['file' => 'CaseLower.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtolower.'], 'return' => ['type' => 'array', 'desc' => 'Uppercase to lowercase maps.'], 'data' => []], 'utf8_strtoupper_simple_maps' => ['file' => 'CaseUpper.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtoupper.'], 'return' => ['type' => 'array', 'desc' => 'Lowercase to uppercase maps.'], 'data' => []], 'utf8_strtoupper_maps' => ['file' => 'CaseUpper.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper 
function for utf8_strtoupper.'], 'return' => ['type' => 'array', 'desc' => 'Lowercase to uppercase maps.'], 'data' => []], 'utf8_titlecase_simple_maps' => ['file' => 'CaseTitle.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_convert_case.'], 'return' => ['type' => 'array', 'desc' => 'Simple title case maps.'], 'data' => []], 'utf8_titlecase_maps' => ['file' => 'CaseTitle.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_convert_case.'], 'return' => ['type' => 'array', 'desc' => 'Full title case maps.'], 'data' => []], 'utf8_casefold_simple_maps' => ['file' => 'CaseFold.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_casefold.'], 'return' => ['type' => 'array', 'desc' => 'Casefolding maps.'], 'data' => []], 'utf8_casefold_maps' => ['file' => 'CaseFold.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_casefold.'], 'return' => ['type' => 'array', 'desc' => 'Casefolding maps.'], 'data' => []], 'utf8_default_ignorables' => ['file' => 'DefaultIgnorables.php', 'key_type' => 'int', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_normalize_kc_casefold.'], 'return' => ['type' => 'array', 'desc' => 'Characters with the \'Default_Ignorable_Code_Point\' property.'], 'data' => []], 'utf8_regex_properties' => ['file' => 'RegularExpressions.php', 'key_type' => 'string', 'val_type' => 'string', 'propfiles' => ['DerivedCoreProperties.txt', 'PropList.txt', 'emoji/emoji-data.txt', 'extracted/DerivedGeneralCategory.txt'], 'props' => ['Bidi_Control', 'Case_Ignorable', 'Cn', 'Default_Ignorable_Code_Point', 'Emoji', 'Emoji_Modifier', 'Ideographic', 'Join_Control', 'Regional_Indicator', 'Variation_Selector'], 'desc' => ['Helper function for utf8_sanitize_invisibles and utf8_convert_case.', '', 'Character class lists compiled from:', 'https://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt', 
'https://unicode.org/Public/UNIDATA/PropList.txt', 'https://unicode.org/Public/UNIDATA/emoji/emoji-data.txt', 'https://unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for various Unicode properties.'], 'data' => []], 'utf8_regex_variation_selectors' => ['file' => 'RegularExpressions.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for utf8_sanitize_invisibles.', '', 'Character class lists compiled from:', 'https://unicode.org/Public/UNIDATA/StandardizedVariants.txt', 'https://unicode.org/Public/UNIDATA/emoji/emoji-variation-sequences.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for filtering variation selectors.'], 'data' => []], 'utf8_regex_joining_type' => ['file' => 'RegularExpressions.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for utf8_sanitize_invisibles.', '', 'Character class lists compiled from:', 'https://unicode.org/Public/UNIDATA/extracted/DerivedJoiningType.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for joining characters in certain scripts.'], 'data' => []], 'utf8_regex_indic' => ['file' => 'RegularExpressions.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for utf8_sanitize_invisibles.', '', 'Character class lists compiled from:', 'https://unicode.org/Public/UNIDATA/extracted/DerivedCombiningClass.txt', 'https://unicode.org/Public/UNIDATA/IndicSyllabicCategory.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for Indic scripts that use viramas.'], 'data' => []], 'utf8_regex_quick_check' => ['file' => 'QuickCheck.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for utf8_is_normalized.', '', 'Character class lists compiled from:', 'https://unicode.org/Public/UNIDATA/extracted/DerivedNormalizationProps.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for 
disallowed characters in normalization forms.'], 'data' => []], 'idna_maps' => ['file' => 'Idna.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => 'Character maps for IDNA processing.'], 'data' => []], 'idna_maps_deviation' => ['file' => 'Idna.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => '"Deviation" character maps for IDNA processing.'], 'data' => []], 'idna_maps_not_std3' => ['file' => 'Idna.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => 'Non-STD3 character maps for IDNA processing.'], 'data' => []], 'idna_regex' => ['file' => 'Idna.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => 'Regular expressions useful for IDNA processing.'], 'data' => []]]
Info about functions to build in SMF's Unicode data files.
$prefetch
private
array<string|int, mixed>
$prefetch
= [self::DATA_URL_UCD => ['CaseFolding.txt', 'DerivedAge.txt', 'DerivedCoreProperties.txt', 'DerivedNormalizationProps.txt', 'IndicSyllabicCategory.txt', 'PropertyValueAliases.txt', 'PropList.txt', 'ScriptExtensions.txt', 'Scripts.txt', 'SpecialCasing.txt', 'StandardizedVariants.txt', 'UnicodeData.txt', 'emoji/emoji-data.txt', 'emoji/emoji-variation-sequences.txt', 'extracted/DerivedGeneralCategory.txt', 'extracted/DerivedJoiningType.txt'], self::DATA_URL_IDNA => ['IdnaMappingTable.txt']]
Files to fetch from unicode.org.
$script_aliases
private
array<string|int, mixed>
$script_aliases
= []
Tracks associations between character scripts' short and long names.
$script_stats
private
array<string|int, mixed>
$script_stats
= []
Statistical info about character scripts (e.g. Latin, Greek, Cyrillic).
$time_limit
private
int
$time_limit
= 30
Used to ensure we exit long-running tasks cleanly.
Methods
__construct()
The constructor.
public
__construct(array<string|int, mixed> $details) : mixed
Parameters
- $details : array<string|int, mixed>
  The details for the task.
execute()
This executes the task.
public
execute() : bool
Return values
bool —Always returns true
export_funcs_to_file()
Updates Unicode data functions in their designated files.
public
export_funcs_to_file() : mixed
getMinUserInfo()
Loads minimal info for the previously loaded user IDs.
public
getMinUserInfo([array<string|int, mixed> $user_ids = [] ]) : array<string|int, mixed>
Parameters
- $user_ids : array<string|int, mixed> = []
Return values
array<string|int, mixed>
build_func_array()
Helper for get_function_code_and_regex(). Builds the function's data array.
private
build_func_array(string &$func_code, array<string|int, mixed> $data, string $key_type, string $val_type) : mixed
Parameters
- $func_code : string
  The raw string that contains function code.
- $data : array<string|int, mixed>
  Data to format as an array.
- $key_type : string
  How to format the array keys.
- $val_type : string
  How to format the array values.
build_idna()
Builds maps and regex classes for IDNA purposes.
private
build_idna() : mixed
build_quick_check()
Builds regular expressions for normalization quick check.
private
build_quick_check() : mixed
build_regex_indic()
Builds regex classes for join control tests in utf8_sanitize_invisibles.
private
build_regex_indic() : mixed
Specifically, for Indic scripts like Devanagari.
build_regex_joining_type()
Builds regex classes for join control tests in utf8_sanitize_invisibles.
private
build_regex_joining_type() : mixed
Specifically, for cursive scripts like Arabic.
build_regex_properties()
Builds regular expression classes for extended Unicode properties.
private
build_regex_properties() : mixed
build_regex_variation_selectors()
Builds regular expression classes for filtering variation selectors.
private
build_regex_variation_selectors() : mixed
build_script_stats()
Helper function for build_regex_joining_type and build_regex_indic.
private
build_script_stats() : mixed
deltree()
Deletes a directory and its contents.
private
deltree(mixed $dir_path) : mixed
Parameters
- $dir_path : mixed
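As an illustration of what deleting a directory and its contents involves, here is a minimal recursive sketch. The function name `deltree_sketch()` and its boolean return value are assumptions made for this example; the actual body of deltree() is not shown in this documentation and may differ.

```php
<?php
/**
 * Recursively deletes a directory and everything inside it.
 * Hypothetical sketch; not the actual deltree() implementation.
 */
function deltree_sketch(string $dir_path): bool
{
    if (!is_dir($dir_path)) {
        return false;
    }

    foreach (scandir($dir_path) as $entry) {
        // Skip the self and parent pseudo-entries.
        if ($entry === '.' || $entry === '..') {
            continue;
        }

        $path = $dir_path . DIRECTORY_SEPARATOR . $entry;

        // Recurse into real subdirectories; unlink files and symlinks.
        if (is_dir($path) && !is_link($path)) {
            deltree_sketch($path);
        } else {
            unlink($path);
        }
    }

    // The directory is empty now, so it can be removed.
    return rmdir($dir_path);
}
```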
fetch_unicode_file()
Fetches the contents of a Unicode data file.
private
fetch_unicode_file(string $filename, string $data_url) : string
Caches a local copy for subsequent lookups.
Parameters
- $filename : string
  Name of a Unicode datafile, relative to $data_url.
- $data_url : string
  One of this class's DATA_URL_* constants.
Return values
string —Path to locally saved copy of the file.
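The fetch-then-cache behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the real method: the function name `fetch_cached()`, the `$cache_dir` parameter, and the injectable `$download` callable are all hypothetical, the last one added so the sketch can be exercised without network access.

```php
<?php
/**
 * Returns the path to a local copy of a data file, downloading it
 * only if no cached copy exists yet.
 * Hypothetical sketch; all parameter names are assumptions.
 */
function fetch_cached(string $filename, string $data_url, string $cache_dir, ?callable $download = null): string
{
    // Default to a plain HTTP(S) read when no downloader is supplied.
    $download = $download ?? 'file_get_contents';

    $local_path = $cache_dir . '/' . $filename;

    // Subsequent lookups are served from the cached copy.
    if (is_file($local_path)) {
        return $local_path;
    }

    $contents = $download($data_url . '/' . $filename);

    if ($contents === false) {
        throw new RuntimeException("Could not fetch $filename");
    }

    // File names such as 'emoji/emoji-data.txt' need a subdirectory.
    $dir = dirname($local_path);

    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }

    file_put_contents($local_path, $contents);

    return $local_path;
}
```

Passing a stub as `$download` makes it easy to confirm that a second request for the same file does not trigger another download.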
finalize_decomposition_forms()
Finalizes all the decomposition forms.
private
finalize_decomposition_forms() : mixed
This is necessary because some characters decompose to other characters that themselves decompose further.
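To illustrate why a finalization pass is needed: U+1EA4 decomposes to U+00C2 followed by U+0301, and U+00C2 in turn decomposes to U+0041 followed by U+0302. A minimal sketch of expanding such chains is shown below. The function name `fully_decompose()` and the 'U+XXXX' key format are assumptions chosen for readability; they do not reflect how $this->full_decomposition_maps is actually keyed.

```php
<?php
/**
 * Expands every entry of a decomposition map until no resulting
 * character has a further decomposition of its own.
 * Hypothetical sketch; assumes the map contains no cycles.
 */
function fully_decompose(array $maps): array
{
    // Recursively expand a single character into its final sequence.
    $expand = function (string $char) use (&$expand, $maps): array {
        if (!isset($maps[$char])) {
            return [$char];
        }

        $out = [];

        foreach ($maps[$char] as $part) {
            $out = array_merge($out, $expand($part));
        }

        return $out;
    };

    $full = [];

    foreach (array_keys($maps) as $char) {
        $full[$char] = $expand((string) $char);
    }

    return $full;
}

$maps = [
    'U+00C2' => ['U+0041', 'U+0302'],
    'U+1EA4' => ['U+00C2', 'U+0301'],
];

$full = fully_decompose($maps);
// $full['U+1EA4'] is ['U+0041', 'U+0302', 'U+0301'].
```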
get_function_code_and_regex()
Builds complete code for the specified element in $this->funcs to be inserted into the relevant PHP file. Also builds a regex to check whether a copy of the function is already present in the file.
private
get_function_code_and_regex(string $func_name) : array<string|int, mixed>
Parameters
- $func_name : string
  Key of an element in $this->funcs.
Return values
array<string|int, mixed> —PHP code and a regular expression.
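One way the returned pair of code and regex could be used is to replace an existing copy in a file, or append the code if no copy is found. The sketch below assumes that usage; the `upsert_code()` helper is hypothetical, and the version number '15.1.0.0' is only an example value. The regex itself is copied from the 'regex' element of the first $funcs entry documented above.

```php
<?php
// Regex copied from the first entry of the documented $funcs default value.
$regex = '/if \\(!defined\\(\'SMF_UNICODE_VERSION\'\\)\\)(?:\\s*{)?\\n\\tdefine\\(\'SMF_UNICODE_VERSION\', \'\\d+(\\.\\d+)*\'\\);(?:\\n})?/';

/**
 * Replaces the first match of $regex with $code, or appends $code
 * when the file contents hold no matching copy yet.
 * Hypothetical helper for illustration only.
 */
function upsert_code(string $contents, string $code, string $regex): string
{
    if (preg_match($regex, $contents) === 1) {
        // A callback avoids "$1"-style backreference expansion in $code.
        return preg_replace_callback($regex, fn(array $m): string => $code, $contents, 1);
    }

    return rtrim($contents) . "\n\n" . $code . "\n";
}

// Example replacement code with an illustrative version number.
$new_code = "if (!defined('SMF_UNICODE_VERSION')) {\n\tdefine('SMF_UNICODE_VERSION', '15.1.0.0');\n}";
```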
lookup_ucd_version()
Sets $this->ucd_version to the latest version number of the UCD.
private
lookup_ucd_version() : mixed
make_temp_dir()
Makes a temporary directory to hold our working files, and sets $this->temp_dir to the path of the created directory.
private
make_temp_dir() : mixed
process_casing_data()
Processes SpecialCasing.txt and CaseFolding.txt in order to get finalized versions of all case conversion data.
private
process_casing_data() : mixed
process_derived_normalization_props()
Processes DerivedNormalizationProps.txt in order to populate $this->derived_normalization_props.
private
process_derived_normalization_props() : mixed
process_main_unicode_data()
Processes UnicodeData.txt in order to populate $this->char_data, $this->full_decomposition_maps, and the 'data' element of most elements of $this->funcs.
private
process_main_unicode_data() : mixed
should_update()
Compares version of SMF's local Unicode data with the latest release.
private
should_update() : bool
Return values
bool —Whether SMF should update its local Unicode data or not.
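A version comparison of this kind can be sketched with PHP's built-in version_compare(). The helper names `pad_version()` and `needs_update()` are hypothetical, and padding both versions to four components is an assumption for this example; it matters because version_compare() treats '15.0.0.0' as greater than '15.0.0'.

```php
<?php
/**
 * Pads a dotted version string with zero components so that
 * versions of different lengths compare as expected.
 * Hypothetical helper for illustration only.
 */
function pad_version(string $version, int $parts = 4): string
{
    return implode('.', array_pad(explode('.', $version), $parts, '0'));
}

/**
 * Reports whether the local data is older than the latest release.
 * Hypothetical sketch; not the actual should_update() body.
 */
function needs_update(string $local_version, string $latest_version): bool
{
    return version_compare(pad_version($local_version), pad_version($latest_version), '<');
}
```

Comparing numerically rather than as plain strings also ensures that '9.0.0' is treated as older than '15.0.0'.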
smf_file_header()
Gets basic boilerplate for the PHP files that will be created.
private
smf_file_header() : string
Return values
string —Standard SMF file header.