Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
ftfy has gotten by for four years without dependencies on other Python libraries, but now we can spare ourselves some code and some maintenance burden by delegating certain tasks to other libraries that already solve them well. This version now depends on the html5lib and wcwidth libraries. Feature changes: The remove_control_chars fixer will now remove some non-ASCII control characters as well, such as deprecated Arabic control characters and byte-order marks. Bidirectional controls are still left as is. This should have no impact on well-formed text, while cleaning up many characters that the Unicode Consortium deems "not suitable for markup" (see Unicode Technical Report #20). The unescape_html fixer uses a more thorough list of HTML entities, which it imports from html5lib. ftfy.formatting now uses wcwidth to compute the width that a string will occupy in a text console. Heuristic changes: Updated the data file of Unicode character categories to Unicode 9, as used in Python 3.6.0. (No matter what version of Python you're on, ftfy uses the same data.) Pending deprecations: The remove_bom option will become deprecated in 5.0, because it has been superseded by remove_control_chars. ftfy 5.0 will remove the previously deprecated name fix_text_encoding. It was renamed to fix_encoding in 4.0. ftfy 5.0 will require Python 3.2 or later, as planned. Python 2 users, please specify ftfy < 5 in your dependencies if you haven't already. Version 4.2.0 (September 28, 2016) Heuristic changes: Math symbols next to currency symbols are no longer considered 'weird' by the heuristic. This fixes a false positive where text that involved the multiplication sign and British pounds or euros (as in '5×£35') could turn into Hebrew letters. A heuristic that used to be a bonus for certain punctuation now also gives a bonus to successfully decoding other common codepoints, such as the non-breaking space, the degree sign, and the byte order mark. In version 4.0, we tried to "future-proof" the categorization of emoji (as a kind of symbol) to include codepoints that would likely be assigned to emoji later. The future happened, and there are even more emoji than we expected. We have expanded the range to include those emoji, too. ftfy is still mostly based on information from Unicode 8 (as Python 3.5 is), but this expanded range should include the emoji from Unicode 9 and 10. Emoji are increasingly being modified by variation selectors and skin-tone modifiers. Those codepoints are now grouped with 'symbols' in ftfy, so they fit right in with emoji, instead of being considered 'marks' as their Unicode category would suggest. This enables fixing mojibake that involves iOS's new diverse emoji. An old heuristic that wasn't necessary anymore considered Latin text with high-numbered codepoints to be 'weird', but this is normal in languages such as Vietnamese and Azerbaijani. This does not seem to have caused any false positives, but it caused ftfy to be too reluctant to fix some cases of broken text in those languages. The heuristic has been changed, and all languages that use Latin letters should be on even footing now. Version 4.1.1 (April 13, 2016) Bug fix: in the command-line interface, the -e option had no effect on Python 3 when using standard input. Now, it correctly lets you specify a different encoding for standard input. Version 4.1.0 (February 25, 2016) Heuristic changes: ftfy can now deal with "lossy" mojibake. If your text has been run through a strict Windows-1252 decoder, such as the one in Python, it may contain the replacement character � (U+FFFD) where there were bytes that are unassigned in Windows-1252. Although ftfy won't recover the lost information, it can now detect this situation, replace the entire lossy character with �, and decode the rest of the characters. Previous versions would be unable to fix any string that contained U+FFFD. As an example, text in curly quotes that gets corrupted “ like this â€� now gets fixed to be “ like this �. Updated the data file of Unicode character categories to Unicode 8.0, as used in Python 3.5.0. (No matter what version of Python you're on, ftfy uses the same data.) Heuristics now count characters such as ~ and ^ as punctuation instead of wacky math symbols, improving the detection of mojibake in some edge cases. New features: A new module, ftfy.formatting, can be used to justify Unicode text in a monospaced terminal. It takes into account that each character can take up anywhere from 0 to 2 character cells. Internally, the utf-8-variants codec was simplified and optimized. Version 4.0.0 (April 10, 2015) Breaking changes: The default normalization form is now NFC, not NFKC. NFKC replaces a large number of characters with 'equivalent' characters, and some of these replacements are useful, but some are not desirable to do by default. The fix_text function has some new options that perform more targeted operations that are part of NFKC normalization, such as fix_character_width, without requiring hitting all your text with the huge mallet that is NFKC. If you were already using NFC normalization, or in general if you want to preserve the spacing of CJK text, you should be sure to set fix_character_width=False. The remove_unsafe_private_use parameter has been removed entirely, after two versions of deprecation. The function name fix_bad_encoding is also gone. New features: Fixers for strange new forms of mojibake, including particularly clear cases of mixed UTF-8 and Windows-1252. New heuristics, so that ftfy can fix more stuff, while maintaining approximately zero false positives. The command-line tool trusts you to know what encoding your input is in, and assumes UTF-8 by default. You can still tell it to guess with the -g option. The command-line tool can be configured with options, and can be used as a pipe. Recognizes characters that are new in Unicode 7.0, as well as emoji from Unicode 8.0+ that may already be in use on iOS. Deprecations: fix_text_encoding is being renamed again, for conciseness and consistency. It's now simply called fix_encoding. The name fix_text_encoding is available but emits a warning. Pending deprecations: Python 2.6 support is largely coincidental. Python 2.7 support is on notice. If you use Python 2, be sure to pin a version of ftfy less than 5.0 in your requirements.
- Loading branch information