Encoding problem trying to web scrape in python using BeautifulSoup

  • 01-11-2016 02:01PM
    #1
    Registered Users, Registered Users 2 Posts: 120 ✭✭


    Hi,
    I have been following online tutorials that use requests / urllib and BeautifulSoup. Depending on the site, the script throws an error at print(soup.prettify()). The code below works properly inside PyCharm but not in cmd.

    # this is the code I use
    import requests
    from bs4 import BeautifulSoup

    r = requests.get('https://www.facebook.com/')

    soup = BeautifulSoup(r.content, 'html5lib')  # I have tried .encode here

    print(soup.prettify())

    I get this error:

    line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 27842: character maps to <undefined>

    I spent hours searching online for a solution. I tried:

    soup = BeautifulSoup(r.content,'html5lib').encode('utf-8')

    which raises a different error: AttributeError: 'bytes' object has no attribute 'prettify'.

    The response r is reported as ISO-8859-1 encoded. I tried setting r.encoding = 'utf-8', but it throws the same error as above.


    If anyone could help me out, it would be greatly appreciated.


Comments

  • Registered Users, Registered Users 2 Posts: 13 rwsz365


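    I presume you are on Windows? The Windows console uses a legacy code page by default, so Python raises UnicodeEncodeError when you print a character that code page does not contain; see http://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console. You can encode to ASCII and drop the characters it can't represent if you like: `print(soup.prettify().encode('ascii', 'ignore'))`. In Python 3 that prints the bytes repr (`b'...'`), so add `.decode('ascii')` if you want plain text.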
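If you would rather see a placeholder than silently lose characters, here is a minimal sketch of the same idea using only the standard library. The helper name `console_safe` is made up for illustration, and "cp850" is just an assumed console code page; by default the helper falls back to whatever `sys.stdout.encoding` reports.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'.

    console_safe is a hypothetical helper, not part of requests or bs4.
    """
    # sys.stdout.encoding can be None when output is redirected, hence the fallback.
    encoding = encoding or sys.stdout.encoding or "ascii"
    # Round-trip through the console encoding; errors="replace" substitutes '?'.
    return text.encode(encoding, errors="replace").decode(encoding)


# U+2019 is not in cp850, so it comes back as '?'.
print(console_safe("it\u2019s a test", "cp850"))  # prints: it?s a test
```

In the original script this would be used as `print(console_safe(soup.prettify()))`.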

    Your other error comes from calling the encode method on the BeautifulSoup object: what gets assigned to `soup` is then a `bytes` object, not an instance of BeautifulSoup, so it has no method called prettify.
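A rough sketch of the ordering that avoids the AttributeError: keep the BeautifulSoup object around, call prettify on it, and only encode the resulting string at the very end. It assumes bs4 is installed, and it swaps in a small inline HTML string and the built-in "html.parser" (both my assumptions, in place of the live request and html5lib) so it runs offline.

```python
from bs4 import BeautifulSoup

# Tiny inline document so the sketch runs without a network request.
html = "<p>it\u2019s a test</p>"

# "html.parser" is the stdlib-backed parser; the original post used html5lib.
soup = BeautifulSoup(html, "html.parser")  # still a BeautifulSoup object

pretty = soup.prettify()          # str: call prettify while soup is a BeautifulSoup
encoded = pretty.encode("utf-8")  # bytes: encode last, and only if bytes are needed

print(type(soup).__name__, type(pretty).__name__, type(encoded).__name__)
# prints: BeautifulSoup str bytes
```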


  • Registered Users, Registered Users 2 Posts: 1,275 ✭✭✭bpmurray


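    U+2019 is the right single quotation mark (’). Since this isn't normally available in the console (which is typically CP 850 on Western European Windows), you need to encode the output for that code page. prettify returns bytes when given an encoding, so in Python 3 decode them again before printing:
    print(soup.prettify("cp850").decode("cp850"))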
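Another option, sketched here with the standard library only, is to encode the text yourself and write the bytes straight to the console's binary stream. The function name `write_encoded` and the default "cp850" are assumptions for illustration. The `xmlcharrefreplace` error handler turns anything the code page lacks into a numeric character reference, which is still readable in HTML output.

```python
import io
import sys


def write_encoded(text, encoding="cp850", stream=None):
    """Encode text for a code page and write the raw bytes to a binary stream.

    write_encoded is a hypothetical helper; stream defaults to sys.stdout.buffer.
    """
    if stream is None:
        stream = sys.stdout.buffer
    # Characters missing from the code page become references such as &#8217;.
    data = text.encode(encoding, errors="xmlcharrefreplace")
    stream.write(data + b"\n")
    stream.flush()
    return data


# Demonstrate against an in-memory buffer instead of the real console.
buf = io.BytesIO()
write_encoded("it\u2019s a test", "cp850", buf)
print(buf.getvalue())  # prints: b'it&#8217;s a test\n'
```

With the original script, `write_encoded(soup.prettify())` would send the prettified page to the console.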
    

