Python | Character Encoding (original) (raw)

Last Updated : 29 May, 2021

Finding the text which is having nonstandard character encoding is a very common step to perform in text processing.
All the text would have been from utf-8 or ASCII encoding ideally but this might not be the case always. So, in such cases when the encoding is not known, such non-encoded text has to be detected and the be converted to a standard encoding. So, this step is important before processing the text further.
Charade Installation :
For performing the detection and conversion of encoding, charade - a Python library is required. This module can be simply installed using sudo easy_install charade or pip install charade.
Let's see the wrapper function around the charade module.
Code : encoding.detect(string), to detect the encoding

Python3 `

-- coding: utf-8 --

import charade def detect(s):

try:
    # check it in the charade list
    if isinstance(s, str):
        return charade.detect(s.encode())
    # detecting the string
      else:
        return charade.detect(s)

# in case of error
# encode with 'utf -8' encoding
except UnicodeDecodeError:
    return charade.detect(s.encode('utf-8'))

`

The detect functions will return 2 attributes :

Confidence : the probability of charade being correct. Encoding : which encoding it is.

Code : encoding.convert(string) to convert the encoding.

Python3 `

-- coding: utf-8 --

import charade

def convert(s):

# if in the charade instance
if isinstance(s, str):
    s = s.encode()

# retrieving the encoding information 
# from the detect() output
encode = detect(s)['encoding']

if encode == 'utf-8':
    return s.decode()
else:
    return s.decode(encoding) 

`

Code : Example

Python3 `

importing library

import encoding

d1 = encoding.detect('geek') print ("d1 is encoded as : ", d1)

d2 = encoding.detect('ascii') print ("d2 is encoded as : ", d2)

`

Output :

d1 is encoded as : (confidence': 0.505, 'encoding': 'utf-8') d2 is encoded as : ('confidence': 1.0, 'encoding': 'ascii')

detect() : It is a charade.detect() wrapper. It encodes the strings and handles the UnicodeDecodeError exceptions. It expects a bytes object so therefore the string is encoded before trying to detect the encoding.
convert() : It is a charade.convert() wrapper. It calls detect() first to get the encoding. Then, it returns a decoded string.