
Scraping tweets from twitter.

  • 21-09-2014 3:54pm
    #1
    Registered Users Posts: 225 ✭✭


    I am looking to scrape and collect tweets from twitter given a certain search term.

    I have looked around the web and found various modules for Python etc. which claim to do this. However, most of them are outdated and frustratingly difficult to set up (at least for me anyway!) so I have had no luck so far.

    Has anyone any experience or input on this topic?

    I tried a python module named tweepy this morning but after downloading it from the github repository and running into a seemingly never-ending number of errors when trying to set it up from the cmd, I gave up about 20 minutes ago.

    I'm hoping there is an easier way of doing this, appreciate any help anyone can give me!


Comments

  • Moderators, Society & Culture Moderators Posts: 17,642 Mod ✭✭✭✭Graham


    Take a look at the twitter api, you can access tweets without adding in the complications of scraping.


  • Registered Users Posts: 401 ✭✭irishbuzz


    If you think your usage would be within rate limits you could simply use Twitter's API:

    https://dev.twitter.com/rest/reference/get/search/tweets

    Alternatively have a look at Scrapy or ScraperJS


  • Registered Users Posts: 124 ✭✭shanefitz360


    Use the Twitter API with the pypi.python.org/pypi/twitter package


  • Registered Users Posts: 225 ✭✭TheSetMiner


    thanks for the replies, I'll have a look into this today and hopefully have some luck!


  • Registered Users Posts: 225 ✭✭TheSetMiner


    Use the Twitter API with the pypi.python.org/pypi/twitter package

    I visited this link and downloaded the files, but I am a little unsure what the next step is. Sorry, I am a bit new to directories and installing modules etc.

    There was a python wheel file and a .tar.gz file there. I downloaded both. The former could not be opened and the latter opened just like a regular zip file.

    My question is: do I need to put this entire file (zipped or unzipped?) into my python34 "lib" folder, or is there another destination? And will it be installed then, or do I need to run setup.py from the cmd like I read somewhere else?

    Sorry for the questions but the documentation wasn't that clear on setup procedure.

    I'd greatly appreciate any help!


  • Registered Users Posts: 159 ✭✭magooly


    twitter4j


  • Technology & Internet Moderators Posts: 28,799 Mod ✭✭✭✭oscarBravo


    You probably shouldn't be downloading files from PyPI directly, but using something like pip:
    pip install twitter
    


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    Get the book "Mining The Social Web" by Matthew A. Russell. It is an O'Reilly book and is quite good on the subject. It does require some understanding and expertise with the Python language though.

    Regards...jmcc


  • Registered Users Posts: 225 ✭✭TheSetMiner


    Thanks for all the helpful comments.


    I managed to install twitter using pip, which was preinstalled with Python 3.4. I had some success using the search API to get 15 random tweets (very messily encoded in JSON, I think) about a given search query, but no such luck with the streaming API, as I keep getting an error from the stream.py and api.py files when I try to do that.
    Deciphering the tweets from the messy string of JSON is going to be a task for regex I think.

    And has anyone any idea on what might be going wrong with the TwitterStream class?


  • Technology & Internet Moderators Posts: 28,799 Mod ✭✭✭✭oscarBravo


    Deciphering the tweets from the messy string of JSON is going to be a task for regex I think.
    import json
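
    To spell out the hint above: the standard library's json module turns the response string into ordinary dicts and lists, so no regex is needed. This is only a rough sketch. The sample string is made up and heavily trimmed, and the top-level "statuses" key is an assumption based on how the search API's responses looked around the time of this thread.

```python
import json

# Hypothetical, heavily trimmed search response; a real one has many more fields.
raw = """
{"statuses": [
  {"text": "Hello from Dublin", "user": {"screen_name": "example_user"}},
  {"text": "Second tweet", "user": {"screen_name": "another_user"}}
]}
"""


def tweet_texts(raw_json):
    """Return (screen_name, text) pairs from a search-style JSON string."""
    response = json.loads(raw_json)  # str -> nested dicts and lists
    return [
        (status["user"]["screen_name"], status["text"])
        for status in response.get("statuses", [])
    ]


for name, text in tweet_texts(raw):
    print(name, text)
```

    If the library you are using already hands back parsed objects rather than a string, the json.loads step can be skipped and the same key lookups apply.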
    


  • Registered Users Posts: 225 ✭✭TheSetMiner


    update: I saw an answer on stack overflow recommending Twython so I said I'd give it a quick go and thanks to their great, up-to-date documentation I managed to get the streaming api working in less than 15 minutes! Amazingly satisfying to watch the data flowing in on the python shell at last! Now my next step will be to figure out how to store the tweets along with their respective time of post and country of origin.
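
    One possible way to handle the storage step mentioned above is to write one JSON object per line (often called JSON Lines), keeping only the fields of interest. This is a sketch under assumptions, not what was actually used in this thread: write_tweet is a hypothetical helper, the sample tweet is invented, and the field names "text", "created_at" and "place" are assumed from the tweet format of the time.

```python
import io
import json


def write_tweet(out, tweet):
    """Write the text, post time and country of one tweet dict as a JSON line."""
    record = {
        "text": tweet.get("text"),
        "created_at": tweet.get("created_at"),
        # "place" is often missing or null, so fall back to an empty dict.
        "country": (tweet.get("place") or {}).get("country"),
    }
    out.write(json.dumps(record) + "\n")
    return record


# Invented sample tweet for illustration only.
sample = {
    "text": "Hello from Dublin",
    "created_at": "Sun Sep 21 14:54:00 +0000 2014",
    "place": {"country": "Ireland"},
}

# io.StringIO stands in for a real file opened in append mode.
buffer = io.StringIO()
write_tweet(buffer, sample)
write_tweet(buffer, {"text": "No location on this one"})
print(buffer.getvalue())
```

    In a streaming callback the same helper could be called once per incoming tweet with a file opened in append mode, so nothing is lost if the connection drops.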

    Anyone know what the streaming limit is exactly? I've had the shell going non stop for about 10 minutes now, I wonder if I am getting close to the limit?


  • Registered Users Posts: 159 ✭✭magooly


    http://www.eirwig.com uses the Twitter streaming API (Java + Spring MVC) and it runs 24/7.

    There is no streaming limit per se, since as a developer connected to the streaming API you are only seeing approx. 1% of random tweets globally.

    Re: tweet times and country the info is all there in the tweet object returned, you simply need to call the correct getters on the tweet.
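
    In Python the tweet usually arrives as a plain dict rather than an object with getters, so the equivalent is a couple of key lookups. A minimal sketch, assuming the "created_at" timestamp format and the "place" field as they appeared in tweets at the time; tweet_time_and_country is a hypothetical helper and the sample tweet is invented.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "Sun Sep 21 14:54:00 +0000 2014".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweet_time_and_country(tweet):
    """Return (timezone-aware datetime, country name or None) for a tweet dict."""
    posted = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
    # Many tweets carry no location, in which case "place" is missing or null.
    place = tweet.get("place") or {}
    return posted, place.get("country")


# Invented sample tweet for illustration only.
sample = {
    "created_at": "Sun Sep 21 14:54:00 +0000 2014",
    "text": "Hello from Dublin",
    "place": {"country": "Ireland", "country_code": "IE"},
}

posted, country = tweet_time_and_country(sample)
print(posted.isoformat(), country)
```

    Expect the country to come back as None for a large share of tweets, since location is only attached when the user has opted in.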

    You will find all the info you need in the Twitter streaming API docs. Sign up at dev.twitter.com and create your own app to get the connection details.

    It's a great API and a very rewarding experience, as you already know.

