How to detect windows-1251 encoding with a regular expression?

Hello, I recently had to detect a variable in the wrong encoding and convert it to UTF-8.

I found this code on the Internet:
if (!preg_match('/^.{1}/us', $data)) {
 $data = iconv("windows-1251", "utf-8", $data); 
}


My question: how does preg_match perform the check here, and what can go wrong with this sample (if anything)?
June 7th 19 at 14:58
4 answers
June 7th 19 at 15:00
Solution
Since PHP 4 there is a function, mb_check_encoding, which checks whether a string is valid in the specified encoding.

<?php
if (mb_check_encoding($data, 'windows-1251')) {
 $data = iconv("windows-1251", "utf-8", $data);
}
Better not to reinvent the wheel; you can use a ready-made solution, for example https://github.com/neitanod/forceutf8

<?php
foreach ($_POST as $key => $value) {
 $_POST[$key] = \ForceUTF8\Encoding::toUTF8($_POST[$key]);
}
- braeden_Schaden commented on June 7th 19 at 15:03
Can your code above (the first snippet) be used to correctly check the encoding and convert from windows-1251 to UTF-8? ForceUTF8 is a bit hard for me to understand :) - Bryana.Sporer commented on June 7th 19 at 15:06
The mb_check_encoding example will trigger only if the string is encoded in windows-1251. It is better to use a ready-made solution for this: https://github.com/neitanod/forceutf8 - braeden_Schaden commented on June 7th 19 at 15:09
thanks - Bryana.Sporer commented on June 7th 19 at 15:12
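A note on the snippets in this thread: windows-1251 is a single-byte encoding in which almost every byte value is assigned, so a validity check against it tends to accept most input, including text that is already UTF-8. A minimal sketch of the opposite check, which tests for valid UTF-8 and converts only when that fails, might look like this. The helper name toUtf8 and the assumption that any non-UTF-8 input is windows-1251 are illustrative and not from the thread; the mbstring and iconv extensions are assumed to be loaded.

```php
<?php
// Hypothetical helper: returns $data as UTF-8.
// Assumption: input that is not valid UTF-8 is windows-1251.
function toUtf8(string $data): string
{
    // mb_check_encoding validates the whole string, not just its start.
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data; // already valid UTF-8, leave untouched
    }

    $converted = iconv('windows-1251', 'UTF-8', $data);
    if ($converted === false) {
        // iconv fails on bytes it cannot map from the source encoding.
        throw new RuntimeException('Conversion from windows-1251 failed');
    }
    return $converted;
}

$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2"; // "Привет" in windows-1251
echo toUtf8($cp1251), PHP_EOL;
echo toUtf8('Привет'), PHP_EOL;       // valid UTF-8 passes through unchanged
```

This still guesses: a windows-1251 string whose bytes happen to form valid UTF-8 would be left unconverted.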
June 7th 19 at 15:02
The important part is the "u" modifier in the regular expression:
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl: the pattern and the subject string are treated as UTF-8. An invalid subject causes the preg_* function to match nothing, and an invalid pattern triggers an error of level E_WARNING.
So the code above can safely be used in production to convert input to UTF-8? I need it for $_GET. There have been cases where a $_GET parameter was passed in a non-UTF-8 encoding. - braeden_Schaden commented on June 7th 19 at 15:05
I, uh, don't know. I can't vouch for this approach. Maybe it works; it needs testing :) - Bryana.Sporer commented on June 7th 19 at 15:08
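The behaviour described in the quoted documentation can be observed directly. The sketch below is illustrative only, with made-up sample strings: with the u modifier, preg_match returns false for a subject that is not valid UTF-8, and preg_last_error() reports PREG_BAD_UTF8_ERROR. As far as I can tell, the validity check covers the whole subject rather than only the part the pattern consumes, so a string with a valid first character and broken bytes later is rejected as well.

```php
<?php
$utf8   = "\xD0\x9F\xD1\x80\xD0\xB8"; // "При" in UTF-8
$cp1251 = "\xCF\xF0\xE8";             // "При" in windows-1251
$mixed  = "A\xCF\xF0";                // ASCII start, invalid bytes after it

var_dump(preg_match('/^.{1}/us', $utf8));            // int(1)
var_dump(preg_match('/^.{1}/us', $cp1251));          // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
var_dump(preg_match('/^.{1}/us', $mixed));           // bool(false)
```

Note that an empty string gives 0 rather than false, because the pattern requires one character; with the `!preg_match(...)` condition from the question this also enters the conversion branch.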
June 7th 19 at 15:04
Very often, before inventing something, you should check whether it is already part of the language. In interviews I ask a lot of questions about how to perform one task or another; a junior tries to come up with the full code, but it often turns out that it is already implemented and built into the language.
It is important that the check does not trigger falsely. So the ready-made solution is mb_check_encoding instead of regexps? - braeden_Schaden commented on June 7th 19 at 15:07
I'm not a PHP programmer at all, but I see that it is a Boolean function that returns either true or false, and it also protects against the "invalid encoding attack". Run a bunch of tests and check. - Bryana.Sporer commented on June 7th 19 at 15:10
Actually, I just started learning CSS - braeden_Schaden commented on June 7th 19 at 15:13
It is just a general call: always try to do everything with standard methods to the last, rather than inventing your own; only if the spec and S.O. give nothing is that acceptable. - Bryana.Sporer commented on June 7th 19 at 15:16
June 7th 19 at 15:06
According to lawrence.ecorp.net/inet/samples/regexp-intro.php, the flags mean "enable Unicode support" and "treat the input as a single line". Thus, the expression tests whether the first character of $data is a valid UTF-8 character, and if it is not, the code under the condition converts from windows-1251 to UTF-8. In single-byte Cyrillic encodings the letters lie in the range 0xC0-0xFF. If $data is Cyrillic, decoding two consecutive bytes as UTF-8 will usually fail, because in UTF-8 a byte in 0xC0-0xF4 must be followed by bytes in 0x80-0xBF. The check will pass "correctly", however, if the text starts with a capital letter and the second character is "ё": in that case decoding the first character as UTF-8 succeeds. If the input arrives in KOI8, a wrongly detected encoding is guaranteed; then again, no client nowadays sends text in a web request in KOI8. So be careful with such a naive test.
mb_check_encoding($data, 'windows-1251')

.. would that be a better alternative to the regular expression? - braeden_Schaden commented on June 7th 19 at 15:09
For sure. It will also let you check for KOI8. - Bryana.Sporer commented on June 7th 19 at 15:12
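The ambiguity described in this answer can be reproduced with a two-letter sample. This is only an illustrative sketch with a made-up input: the bytes 0xC5 0xB8 spell "Её" in windows-1251 but also happen to form one valid UTF-8 character, so a check based on UTF-8 validity treats the string as already being UTF-8 and skips the conversion. The mbstring and iconv extensions are assumed to be available.

```php
<?php
$bytes = "\xC5\xB8"; // "Её" in windows-1251

// The same two bytes form a valid UTF-8 sequence (U+0178),
// so a validity-based check cannot tell the encodings apart.
var_dump(mb_check_encoding($bytes, 'UTF-8')); // bool(true)
var_dump(preg_match('/^.{1}/us', $bytes));    // int(1)

// What the text becomes when it is converted as windows-1251.
$converted = iconv('windows-1251', 'UTF-8', $bytes);
var_dump($converted === "\xD0\x95\xD1\x91");  // bool(true): "Её" in UTF-8
```

Short inputs are the most exposed to this; longer Cyrillic text rarely forms valid UTF-8 by accident.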
