How can you collect data?

 

Generating Data as a By-Product of Operations

Much of the data collected by organizations is generated as a by-product of day-to-day operations. An interesting implication is that, because such data is not created for a specific use, it is often most valuable in parts of the organization, or even in industries, other than the one where it was created. The practical lesson is to share data, collaborate and connect, not only to utilise the value of the data but also, in the first instance, to understand where it is valuable.


Web Scraping

Much of the public information on the Internet can be scraped, i.e. the text can be automatically “downloaded” from websites. (If you right-click on a website and choose “inspect”, you can see its source code, which you can then copy by hand or write software to copy from multiple pages.) Sophisticated websites tend to block web scraping, and there is an ongoing debate about the legality of scraping and of using scraped data. Nonetheless, many organizations have scraped large volumes of data, some of which is publicly available, for example through Common Crawl.
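To make the idea of "downloading the text" concrete, here is a minimal sketch using only Python's standard library. It does not fetch anything from the Internet: SAMPLE_HTML is a made-up page source standing in for what "inspect" would show, and the names TextExtractor and extract_text are illustrative choices, not part of any scraping tool mentioned above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text from HTML, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text of an HTML document as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page source, used instead of a real website.
SAMPLE_HTML = """
<html><head><title>Shop</title><style>p {color: red;}</style></head>
<body><h1>Offers</h1><p>Apples: 2 EUR</p><script>var x = 1;</script></body></html>
"""

if __name__ == "__main__":
    print(extract_text(SAMPLE_HTML))
```

A real scraper would first have to download each page and, given the legal debate mentioned above, should check the site's terms of use and robots.txt before doing so.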

Offering a Service

Many tech firms offer services for free, in part to collect data. This is a common way for startups, for example, to overcome the barrier to entry of not having customer data. Think of Google offering its search engine and thereby collecting data on users' interests and preferences, which it then monetises through advertising. Retailers might offer an online shopping service in order to, amongst other things, measure their customers' purchasing behaviour, which can then also be used to improve the offline shopping experience.


Hiring Humans to Label Data

Due to the prevalence of unstructured data (e.g. images, text), data often needs to be labelled before it is available in a useful, structured format. There are many websites, such as Amazon Mechanical Turk, Appen or Freelancer, where one can easily hire humans to label data.


Buying Data from Providers

While a firm's own data is often its most valuable, combining it with other data sets often creates additional value. There are many data providers of various sorts in all industries.
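As a small, hedged illustration of how combining data sets can create value, the sketch below joins a firm's own order counts with purchased population figures to derive a metric that neither data set contains on its own. All names and numbers (internal_sales, provider_population, the postcodes) are invented for the example.

```python
# Hypothetical internal data: number of orders per postcode.
internal_sales = [
    {"postcode": "1011", "orders": 120},
    {"postcode": "2022", "orders": 45},
    {"postcode": "3033", "orders": 80},
]

# Hypothetical purchased data: population per postcode.
provider_population = {
    "1011": 6000,
    "2022": 1500,
}


def orders_per_thousand(sales, population):
    """Left-join sales onto population by postcode and derive a new metric."""
    enriched = []
    for row in sales:
        people = population.get(row["postcode"])  # None if the provider has no entry
        rate = None if not people else round(1000 * row["orders"] / people, 1)
        enriched.append({**row, "population": people, "orders_per_1000": rate})
    return enriched


if __name__ == "__main__":
    for row in orders_per_thousand(internal_sales, provider_population):
        print(row)
```

Postcode "1011" has the most orders in absolute terms, but "2022" has more orders per inhabitant, which only becomes visible once the external data is added. Postcodes missing from the provider's data are kept with an empty rate rather than dropped.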


Sharing Data

Firms can also create data by sharing their own, either privately with (alliance) partners or publicly with the crowd.

  • Public sharing may be particularly beneficial on certain platforms. A firm may create a competition on InnoCentive or Kaggle, where sharing data may lead to innovative, crowd-sourced ideas. NASA is an example of an organization that has embraced open innovation and experimented with various formats. A scientific study (Lifshitz-Assaf, 2018) examined NASA's 2009 experiment, in which it shared its strategic R&D challenges with the public. The research showed that the open model led to a scientific breakthrough at unprecedented speed using unusually limited resources; yet it also challenged the professional identity of the existing R&D professionals. More generally, transparency may reveal uncommon potential partners who can advance innovation efforts (Shipilov & Furr, 2018).

  • As mentioned above, since data is often generated as a by-product of operations, it is often mutually advantageous for organizations to share data privately with one another.


Using Data from Governments

Data is available from a range of institutions including governments, universities, think tanks and NGOs. Open Government Data is a philosophy that promotes transparency and accountability. Notable open data initiatives include those in Korea, France and the United States.


Retrieving Data from Cloud Providers

Many cloud providers also host public data repositories, such as Google Public Datasets or AWS Public Datasets.


Producing Data with Computers

Computer-generated data comes from simulations, such as games played or images modified by a computer. For example, AlphaGo Zero generated its own training data by playing games of Go against itself.
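The self-play idea can be sketched on a much smaller game. The toy example below lets the computer play tic-tac-toe against itself with purely random moves and records every position, move and final result as a data set. It is an assumption-laden illustration of data generation by simulation only, not a description of how AlphaGo Zero works; all function names are hypothetical.

```python
import random

# All eight winning lines on a 3x3 board, indexed 0..8 row by row.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def self_play_game(rng):
    """Play one game with random moves; return (positions, result)."""
    board = [" "] * 9
    positions = []
    player = "X"
    while winner(board) is None and " " in board:
        move = rng.choice([i for i, cell in enumerate(board) if cell == " "])
        positions.append(("".join(board), player, move))  # state before the move
        board[move] = player
        player = "O" if player == "X" else "X"
    return positions, winner(board) or "draw"


def generate_dataset(n_games, seed=0):
    """Return a list of (board, player, move, final_result) records."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    records = []
    for _ in range(n_games):
        positions, result = self_play_game(rng)
        records.extend((b, p, m, result) for b, p, m in positions)
    return records


if __name__ == "__main__":
    data = generate_dataset(1000)
    print(len(data), "labelled positions generated")
```

Each record pairs a board position with the eventual outcome of the game, which is the kind of labelled data a learning system could be trained on without any human input.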

Where does your data come from?