Data analysis tools were all the rage in 2013, so it’s no surprise that many people want to integrate some form of data-driven analytic solution into their business. A data solution will naturally include your own proprietary data, but you often need to combine that data with something else to find insights. Analysis projects usually start off with an untested idea, so it’s a good idea to keep costs down by looking for free data sets to combine with your proprietary data. Finding this data is not always easy, however, so this post offers advice and suggestions from our team on where we’ve had luck when starting new analysis projects with clients.
Topic-Specific Data Sets for Fraud Detection, Disaster Response, and More
If you’re looking for topic-specific data sets, non-governmental organizations and universities are some of the best places to find free data. Getting a topic-specific data set usually doesn’t require much more than time, and these organizations have both the personnel and the motivation to curate the data. They usually make it available to the public to advance a cause or promote research, so this is often the first place to look for new data sets.
The bottom line is that if a problem exists, there’s probably an organization collecting data on it. Half the trick is just to learn what that organization’s name is and the rest is a Google search away. The following list is by no means exhaustive, but it should give you an idea of where to look for some topics.
- Disasters / disaster response: http://reliefweb.int/
- Terrorism / force protection: http://thedata.harvard.edu/dvn/dv/start
- Fraud / money laundering: http://offshoreleaks.icij.org/
- Corporations: http://www.corporationwiki.com/
- Technology companies: http://www.crunchbase.com/
- General web data: http://commoncrawl.org/
- Geography / infrastructure: http://www.census.gov/geo/maps-data/data/tiger.html
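Once you have a free data set from one of these sources, combining it with your own records usually comes down to a join on a shared key. Here is a minimal Python sketch of that idea. The field names (`company`, `revenue`, `sector`), the sample rows, and the `normalize` and `enrich` helpers are all made up for illustration; real sources will need their own key cleanup.

```python
def normalize(name):
    """Lowercase and collapse whitespace so near-identical keys match."""
    return " ".join(name.lower().split())


def enrich(own_rows, public_rows, key):
    """Left-join your own rows against a public data set on a shared key."""
    public_by_key = {normalize(row[key]): row for row in public_rows}
    enriched = []
    for row in own_rows:
        match = public_by_key.get(normalize(row[key]), {})
        merged = {**match, **row}  # your own values win on conflicts
        merged["matched"] = bool(match)  # flag rows the public data covered
        enriched.append(merged)
    return enriched


# Hypothetical sample data, not taken from any of the sources listed above.
own_rows = [
    {"company": "Acme Corp", "revenue": 1200},
    {"company": "Globex", "revenue": 800},
]
public_rows = [
    {"company": "acme  corp", "sector": "Manufacturing"},
]

result = enrich(own_rows, public_rows, key="company")
```

The `matched` flag makes it easy to see how much of your proprietary data the free source actually covers before you invest more time in it.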
Data Sets for Social Media Analysis
In many cases, you can get social media data related to your account for free, direct from the source. For instance, if all you want is raw Twitter data, the Twitter API provides a good way to get started. You’ve had to register for an API key ever since the shift to API version 1.1, but it’s still free for most of the applications you’d be running. One of Coursera’s Data Science courses takes this approach to teach users how to get and analyze data sets using some simple sentiment scoring techniques.
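To give a feel for what simple sentiment scoring can look like, here is a minimal Python sketch of a word-list approach: each known word carries a score, and a tweet’s sentiment is the sum of the scores of its words. The tiny lexicon, the sample tweets, and the `sentiment_score` function are assumptions made up for illustration, not the course’s actual material. A real project would load a published word list and pull tweets from the API.

```python
import re

# Hypothetical mini-lexicon: positive words score above 0, negative below.
LEXICON = {
    "good": 2, "great": 3, "love": 3,
    "bad": -2, "terrible": -3, "hate": -3,
}


def sentiment_score(text, lexicon=LEXICON):
    """Sum the scores of known words; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


# Made-up sample tweets standing in for data fetched from the API.
tweets = [
    "I love this product, great support",
    "Terrible service, I hate waiting",
    "Shipping took three days",
]

scores = [sentiment_score(tweet) for tweet in tweets]
```

This kind of scoring ignores negation and sarcasm, so treat it as a quick first pass rather than a finished analysis.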
If you want more than just the raw data, however (or more than the API will allow, for that matter), it may be worth looking for a data partner who has access to the Twitter firehose and can enrich the data for you. Datasift is one such partner for Twitter. While not a true free data source, Datasift does offer a free trial account with $10 of credit for their pay-as-you-go streaming service. If you want a relatively small stream of social data, particularly live Twitter data, this free account can be a great way to get started with prototyping. They support multiple delivery options for storage, which makes it easy to sync the data with whatever platform you’re using.
Integrating Paid Data with Free Data Helps Avoid Analysis Delays
While this post is about finding free data, it’s important to be prepared to pay for some data when starting a new project. Often you can get decent amounts of sample data for relatively low prices. We usually recommend that our customers allocate a few hundred dollars for data purchases. Spending it often isn’t necessary, but by planning for it in advance, you can avoid stalling a project. A relatively small investment up front can also keep your team from assuming that a particular kind of data will be available when it may not be. Even if your new project is exploratory, you can decide faster whether it’s worth a full investment, which means a faster return on investment or a quicker pivot to a different solution. Either way, it’s a win for your team and your business’s bottom line.