
Data mining 101: Which data is worth chasing?

Going against what I learned many years ago when I set out for my degree in computer science, I would like to explain the three types of data that I find interesting when looking at data mining.

First there is structured data, which in my opinion has the least importance. It is historical data, meaning that it describes something that happened in the past, and it is stored somewhere in a very structured way. You can compare this to a card catalog in a library. If you are looking for a book, you start with the first letter of the author’s name, pull out that drawer, go through the cards, which are listed in alphabetical order, and find the card with the location of the book. You can then follow the “mapping system” of the bookshelves in the library and locate the book.
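The card catalog lookup can be sketched in a few lines of Python. This is only a toy illustration, not how any real library system works: the cards, the shelf locations, and the `find_shelf` helper are all made up for the example. The point is that because the cards are kept in sorted order, a lookup can jump almost straight to the right card.

```python
import bisect

# Hypothetical catalog cards: (author, title, shelf location).
# Like the drawers in the library, they are kept sorted by author.
CARDS = sorted([
    ("Orwell, George", "1984", "Aisle 5, Shelf D"),
    ("Austen, Jane", "Emma", "Aisle 3, Shelf B"),
    ("Knuth, Donald", "The Art of Computer Programming", "Aisle 7, Shelf A"),
])
AUTHORS = [card[0] for card in CARDS]


def find_shelf(author):
    """Binary-search the sorted cards; return the shelf location or None."""
    i = bisect.bisect_left(AUTHORS, author)
    if i < len(CARDS) and AUTHORS[i] == author:
        return CARDS[i][2]
    return None


print(find_shelf("Orwell, George"))
```

The lookup is fast precisely because somebody did the work of filing every card in order beforehand, which is what makes the data “structured”.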

 

[Image: Cards]

 

The next and very interesting type of data is unstructured data, which is very important and probably makes up the larger part of saved data anywhere. This can be anything from historical data to real-time data. It is not kept in a very organized way compared to structured data; however, there are many methods nowadays to index this data and help us find what is needed. Comparing this one is simple if you know me. Using the book reference from above: if I order a new book from Amazon, I get it in the mail, open it and take it to my home office. I read through it and place it somewhere in the office. It could be on the bookshelf, in a desk drawer, or just left on my desk. I know the object is saved somewhere, but not in a very tidy way. I also know that it is somewhere in my room if I need it again, but as more books arrive, documents are printed out, and mail comes in, locating it could become difficult, or at least quite slow.
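One common way to index unstructured text is an inverted index: a map from each word to the places where it appears. The sketch below is a minimal version of that idea applied to my messy office. The locations, the texts, and the `build_index` and `where_is` helpers are invented for illustration, and real search tools do far more than this.

```python
import re
from collections import defaultdict

# Hypothetical things lying around the office: where each one is, and what it says.
OFFICE = {
    "desk drawer": "Notes on data mining and analytics",
    "book shelf": "A new book about real time analytics",
    "desk": "Printed mail about a library card",
}


def build_index(items):
    """Map every word to the set of locations whose text contains it."""
    index = defaultdict(set)
    for location, text in items.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(location)
    return index


INDEX = build_index(OFFICE)


def where_is(word):
    """Return the locations that mention the word, in alphabetical order."""
    return sorted(INDEX.get(word.lower(), set()))


print(where_is("analytics"))
```

Nothing in the office got any tidier; the index simply remembers where each word was last seen, so I no longer have to search every pile.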

[Image: Messy]
Thanks to http://www.flickr.com/photos/addie_oh_addie/ for the office pic, under a CC license.

The third category that I will put out there is the data that never gets saved: small events in time that happen, but that nobody records. I have seen studies that claim that 80% of the data created falls into this category. To base this example on the two above, when you get a book either from the library or from my home office, you don’t know how it got there. Was it UPS, DHL, or did the guy next door accept the package because I was not home? This is data, but it was not “saved”. This is also real-time data, which is happening now but will be forgotten in 5 minutes. Wouldn’t it be interesting to know how that book got there as it was happening? (OK, probably not, but you get my point, I hope.)
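To make the “forgotten in 5 minutes” idea concrete, here is a small sketch of a buffer that only remembers the most recent events. Everything in it is hypothetical: the `RecentEvents` class, its capacity of two, and the delivery events are made up to illustrate the point, not taken from any real system.

```python
from collections import deque


class RecentEvents:
    """Keeps only the newest `capacity` events; older ones are dropped for good."""

    def __init__(self, capacity):
        self._buffer = deque(maxlen=capacity)
        self.seen = 0  # how many events went by in total

    def record(self, event):
        self.seen += 1
        self._buffer.append(event)  # silently pushes out the oldest when full

    def recent(self):
        return list(self._buffer)


# Hypothetical journey of a book, from order to desk.
events = [
    "order placed",
    "picked up by carrier",
    "left with neighbor",
    "book on desk",
]

window = RecentEvents(capacity=2)
for event in events:
    window.record(event)

print(window.seen, window.recent())
```

Four events happened, but only the last two can still be asked about. Anything you wanted to know about the earlier ones had to be looked at while it was going by.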

[Image: Internet map]

I feel that the third type of data I have listed, the 80% that is not saved, is the most interesting and critical data of all for data mining or, better said, analytics. If I want to know what is going on now, I cannot just use structured data, because it describes what has already happened and so tells me nothing about the present. With unstructured data I also have the problem of historical vs. real time: many companies are claiming real time but are just not delivering. There is also a scalability problem with unstructured data when looking simultaneously at various data streams, and of course, if a moment is not saved somewhere, then it is gone.

In order to keep these posts short, I will continue in the next part with how these types of data are being used for data mining. It will be interesting to look at today’s common methods, their shortcomings in the near future, and what is being done to help conquer these shortcomings.

 

All other pictures used in my blog that are not directly listed under a CC license are from Wikimedia Commons, under an open-source license.