Data mining 101, Methods of collection

Now that we have defined the types of data commonly used in data analytics, we will move forward and look at some of the methods to collect the data.

Please excuse the confusing and somewhat loose usage of terms, which happens quite often. To a certain degree, I consider data analytics and data mining to be one and the same.

 

One of the most important things in data analytics is empirical data, which can be defined as "data produced by an experiment or observation".

Traditionally, the most common means of collecting empirical data is the survey. A primitive but effective survey could be a simple questionnaire or a product sample. This method can be used for everything from behavior patterns and consumer acceptance to understanding how people feel about various topics. You still see it today: someone in a shopping center asks you to test something, a country runs its national census, or people stop you on the street or go door to door asking questions about the community. This is a very tedious method, and eventually the answers must be entered electronically so that the data gathered is somewhat easier to evaluate. It is impractical in my opinion; however, it does cover that “rare breed” of Homo sapiens who do not have a computer or internet access (like my 94-year-old grandma). The data may seem unstructured in the beginning, because it is; it only becomes structured once it is placed into electronic form and then into a database. - WOW, now isn’t that a waste of time and money between initial data input and analytical output...
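
To make the "electronic form, then database" step a bit more concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is only an illustration: the table name, the column names, and the sample answers are all made up for this example and do not come from any real survey.

```python
import sqlite3

# Hypothetical answers typed in from paper questionnaires (made-up values).
PAPER_ANSWERS = [
    ("r001", "Would you buy this product?", "yes"),
    ("r002", "Would you buy this product?", "no"),
    ("r003", "Would you buy this product?", "yes"),
]


def load_answers(rows):
    """Store the transcribed answers in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    # Assumed layout: one row per respondent and question.
    conn.execute(
        "CREATE TABLE answers (respondent TEXT, question TEXT, answer TEXT)"
    )
    conn.executemany("INSERT INTO answers VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def count_by_answer(conn, question):
    """Return a dict mapping each answer to how often it was given."""
    cur = conn.execute(
        "SELECT answer, COUNT(*) FROM answers WHERE question = ? GROUP BY answer",
        (question,),
    )
    return dict(cur.fetchall())


conn = load_answers(PAPER_ANSWERS)
print(count_by_answer(conn, "Would you buy this product?"))
```

Only once the answers sit in a table like this can they be counted and compared, which is exactly the delay between input and output I am complaining about.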

A more common approach today is the online survey. Again covering just about any topic and aspect, it is a bit more practical because the data is saved into a structured database for use at any time. Just go to almost any web site and a pop-up will appear asking you to answer a few questions. In the end, however, it is just a more modernized way of collecting data compared to the above example. - OK, a bit better, but still old or historical data. By the way -

 

Please take a minute to answer our short survey - 5 questions on Real Time Data Analytics

 

The survey is a very typical approach to data collection or production via observation. Another common way to gain data is experimentation, which can produce enormous amounts of data. This is a very interesting source of data; however, it is the one with the most problems when it comes to saving the data somewhere. Electronic collection has become standard in science in today's world, which might seem to make things easier. However, the lack of money in many scientific programs means that data is not saved for comparison, or it reduces the amount of empirical data that can be gathered. This is a physical limitation. Simply put: low budget = fewer hard drives = important empirical data may not be saved for analysis. I would go so far as to claim that this problem is found not only among scientists, but also in many large commercial organizations that use empirical data to improve their offerings, brand protection, and business decisions.

Based on this information, you may see why I think the third type of data described in my earlier blog post may become extremely critical.

As you may imagine from the very basic information above, it becomes easier to understand why it is claimed that many professionals in the field of data analytics will be needed in the future. The need for data analytics is huge, and the number of professionals is quite small. This will of course drive their market value extremely high, and only large companies will be able to afford such highly qualified individuals. Companies such as Nielsen and GFK have focused on data analytics, which has helped them grow into enormous companies. I have also seen reports that IBM, EMC and HP are very concerned about this market as well, knowing that it is estimated at $200 billion by 2015 and knowing the compute power needed to do this type of research. These 5 companies alone could control this market, given the large need many companies have for data analytics, combined with the fact that the very good analysts will stay in scientific research programs at their universities once they finish school. Imagine the costs of data mining in the future if 5 major companies own the market. Scary, isn't it!

I do hope it is becoming clear, though indirectly, how the third type of data I mentioned will become very important. In the next post I will discuss in a bit more detail why I feel the 80% of the data that is not saved is more important than any other structured or unstructured data.

 

With a bit of luck we may even see a solution that can help every field interested in data analytics in the future. Who knows?