Anders Sønderberg Mortensen <sslug@sslug> writes:
> But it would still be interesting if there were something simpler
> that does not depend on other libraries.
It should be fairly simple to read RFC2279 and implement something
yourself:
   The table below summarizes the format of these different octet
   types. The letter x indicates bits available for encoding bits of
   the UCS-4 character value.

   UCS-4 range (hex.)    UTF-8 octet sequence (binary)
   0000 0000-0000 007F   0xxxxxxx
   0000 0080-0000 07FF   110xxxxx 10xxxxxx
   0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx
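
As a small worked example of how to read the table (my own illustration, not part of the RFC text), here is a sketch that encodes a code point below 0x800 by hand. The helper name encode_small is made up for this example, and it only covers the first two rows of the table.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one code point below 0x800 by hand,
# following the first two rows of the table.
sub encode_small {
    my ($cp) = @_;
    return chr($cp) if $cp < 0x80;                 # 0xxxxxxx
    die "code point out of range for this sketch\n" if $cp > 0x7FF;
    return chr(0xC0 | ($cp >> 6))                  # 110xxxxx: upper 5 bits
         . chr(0x80 | ($cp & 0x3F));               # 10xxxxxx: lower 6 bits
}

# U+00F8 (the letter ø) becomes the two octets C3 B8.
print unpack("H*", encode_small(0xF8)), "\n";      # prints "c3b8"
```

The second octet always has its top two bits set to 10, which is the property a validator has to check for every octet after the first one in a sequence.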
So something along these lines:
sub validp {
    # Returns 1 if the byte string in $_[0] matches the octet patterns above, else 0.
    my @bytes = unpack "C*", $_[0];
    while (@bytes) {
        if ($bytes[0] < 128) {           # 0xxxxxxx
            splice @bytes, 0, 1;
        } elsif ($bytes[0] < 192) {      # 10xxxxxx may not start a sequence
            return 0;
        } elsif ($bytes[0] < 224) {      # 110xxxxx
            return 0 if @bytes < 2;      # sequence is cut short
            return 0 unless ($bytes[1] & 0xC0) == 0x80;
            splice @bytes, 0, 2;
        } elsif ($bytes[0] < 240) {      # 1110xxxx
            return 0 if @bytes < 3;
            return 0 unless ($bytes[1] & 0xC0) == 0x80;
            return 0 unless ($bytes[2] & 0xC0) == 0x80;
            splice @bytes, 0, 3;
        } elsif ($bytes[0] < 248) {      # 11110xxx
            return 0 if @bytes < 4;
            return 0 unless ($bytes[1] & 0xC0) == 0x80;
            return 0 unless ($bytes[2] & 0xC0) == 0x80;
            return 0 unless ($bytes[3] & 0xC0) == 0x80;
            splice @bytes, 0, 4;
        } elsif ($bytes[0] < 252) {      # 111110xx
            return 0 if @bytes < 5;
            return 0 unless ($bytes[1] & 0xC0) == 0x80;
            return 0 unless ($bytes[2] & 0xC0) == 0x80;
            return 0 unless ($bytes[3] & 0xC0) == 0x80;
            return 0 unless ($bytes[4] & 0xC0) == 0x80;
            splice @bytes, 0, 5;
        } elsif ($bytes[0] < 254) {      # 1111110x
            return 0 if @bytes < 6;
            return 0 unless ($bytes[1] & 0xC0) == 0x80;
            return 0 unless ($bytes[2] & 0xC0) == 0x80;
            return 0 unless ($bytes[3] & 0xC0) == 0x80;
            return 0 unless ($bytes[4] & 0xC0) == 0x80;
            return 0 unless ($bytes[5] & 0xC0) == 0x80;
            splice @bytes, 0, 6;
        } else {                         # 11111110 or 11111111 is invalid
            return 0;
        }
    }
    return 1;
}
It can probably be made shorter, but the above should work and
sketches the general idea of how it is done.
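
One possible shorter form, as a sketch only: derive the number of continuation octets from the first octet, then check them in a loop. The name validp_short is made up for this example. It only checks the octet patterns from the table and does not reject overlong forms such as C0 80.

```perl
use strict;
use warnings;

# Hypothetical compact validator: the lead octet decides how many
# continuation octets (10xxxxxx) must follow.
sub validp_short {
    my @bytes = unpack "C*", $_[0];
    while (@bytes) {
        my $lead  = shift @bytes;
        my $extra = $lead < 0x80 ?  0    # 0xxxxxxx
                  : $lead < 0xC0 ? -1    # 10xxxxxx may not start a sequence
                  : $lead < 0xE0 ?  1    # 110xxxxx
                  : $lead < 0xF0 ?  2    # 1110xxxx
                  : $lead < 0xF8 ?  3    # 11110xxx
                  : $lead < 0xFC ?  4    # 111110xx
                  : $lead < 0xFE ?  5    # 1111110x
                  :                -1;   # 0xFE and 0xFF are invalid
        return 0 if $extra < 0 or @bytes < $extra;
        for my $cont (splice @bytes, 0, $extra) {
            return 0 unless ($cont & 0xC0) == 0x80;
        }
    }
    return 1;
}

print validp_short("\xC3\xB8") ? "valid\n" : "invalid\n";   # prints "valid"
print validp_short("\xC3")     ? "valid\n" : "invalid\n";   # prints "invalid"
```

The mask 0xC0 keeps the top two bits of an octet, so comparing the result with 0x80 tests for the 10xxxxxx pattern.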
--
Peter Makholm | We constantly have to keep in mind why natural
sslug@sslug | languages are good at what they're good at. And to
http://hacking.dk | never forget that Perl is a human language first,
| and a computer language second