This past weekend was partially spent researching data mining/scraping tools for my pilot project. As part of my literature review this summer, I am also testing out tools for my independent reading and research course to mimic one aspect of potential future data collection. Initially I thought I would examine various social media platforms, but I have decided to limit myself to Twitter, as I am most active there and the platform, its limitations, and uses are familiar to me.
I came across these tools by chance through my regular Twitter feed – I did not seek out this information at all. I’ve already examined @Netlytic and @Gephi, but the tools listed here were all new to me. Below is an abridged list of products that I’ve briefly reviewed.
|Data Miner||Looks at a sequence of tweets with N<50, which can be saved to a spreadsheet. This is a Chrome extension; it is free for 500 page scrapes per month. They also offer free office hours for support, for cases where different pages have data arranged slightly differently. Any .html page should be usable. Video tutorials are available as well. FREE(mium).||Open about privacy and how data is accessed – this is stated on the first page. DataScraping is what they call it. Interesting piece of information: no website will be alerted that I have this on my computer.|
|Mind Manager||Concept mapping tool to record and display an overview. A free trial is available; the paid subscription is $449.||This is an expensive tool – I may want to use the free trial when I am ready to play with its features, but not at the moment.|
|TAGs – Twitter Archiving Google Sheet||This will collect a corpus of tweets with N>50; they can be saved to a spreadsheet. It is for Twitter use and web-based. It is essentially a Google Sheet template. Comprehensive FAQ section – makes me aware of the limitations right away. FREE.||I’ll have to see how big the data set is and try it out. Similar FAQs to Data Miner.|
|Audiense||Can export to PDF or PPT, as well as .csv files. Visualization of Twitter data is possible. They have a resource library that is like a “product” blog – very slick. Looks useful for marketing companies, and the cost is ~$31/mo.||Don’t like that there is no privacy or ethical information about what happens to the data, where it’s stored, etc.|
| ||Looks at Facebook, Instagram, blogs, forums, and Twitter. It is web-based and is called a social media monitoring tool – it is suggested for those who want to find out about online mentions and grow sales. $49/mo for tracking up to 5 keywords a month.||More of a sales tool than a research tool, although online mentions are something I will take a look at with the other tools.|
|Chorus Project||Request page includes space to summarize what the Twitter research is for. This is a Windows desktop product. Need to register in order to use. Page developers are contactable through this link. Data Harvesting has a blog associated with it. Provides visual analytics as well. Specifically designed for academic research. Developed by web developers, social scientists, and programmers. Chorus can collect user-driven (network associated with a person’s tweets) or semantically driven (certain words/hashtags) data. They also have Chorus-TV (TweetVis) – a visual analytic tool for qualitative and quantitative data, as well as a cluster explorer. FREE.||I am comfortable with how ethics is addressed on the registration page, and the extra step of requesting access to the tool is reassuring.|
|Mozdeh||Can be used for Twitter, YouTube comments, subreddits on Reddit, imported academic citations, and TripAdvisor comments – but importing from Facebook is no longer permitted. Can draw time series graphs and mine word associations – the main reason for a word, gender differences in the use of specific words, and the sentiment of content. Restricted to tweets from the past two weeks. Can look at a time series of a single user. FREE.||Not sure if I can make use of this, as it uses recent tweets. But I am interested in the features.|
|Webometric Analyst||Can get data from Mendeley, YouTube, Flickr, and Twitter – can also do image extraction and searches. This is a Windows-based program that uses Bing. It also does link analysis and counts the number of URL citations or title mentions. Tutorials available. FREE.||This may be a good place to go next and play with once the pilot is completed. Not ruling it out, even though the page is not intuitive.|
VERDICT: Free and research-based is the direction I will take for my project. None of the commercial products interest my wallet; they are also not transparent about how data is collected, stored, and used. Mozdeh has some interesting features, but as I want to use older secondary data that has less connection to the present, I will put Mozdeh aside for now. I will use Data Miner, TAGs, Chorus Project, or a combination of all three, and attempt a SWOT analysis before trying any fancier tools.
Note about ethics:
This post initially had a preamble about ethics and conducting research online, but as this is not a description of a potential pilot study, my primary concern at the moment is how the tools operate. My literature summary about conducting research online will wait, again.