I am a data science student, and these are the questions I get asked most often: What is the difference between data science and big data? What is all the fuss about? What is so special about it? In Pakistan, at least, there seem to be no jobs in data science, so why spend time learning all these algorithms and methods? In this post I’ll try to answer these questions. So let’s start…
Data science is being called the coolest job of the 21st century, so the curiosity about it is natural. Let us first look at some recent advances in technology:
- Skype now offers real-time voice translation: one person speaks in one language, the other speaks in another, and their voices are translated in real time.
- DeepArt, an algorithm that mimics the styles of some of history’s greatest painters, has been developed by researchers in Germany. Given any picture, the algorithm can render it as a painting in the style of a famous painter. Below is an example of the same image rendered in different styles.
- Watson is a question-answering computer system capable of answering questions posed in natural language, developed in IBM’s DeepQA project. In 2011 Watson competed on Jeopardy! against former winners Brad Rutter and Ken Jennings. Watson won the first-place prize of $1 million.
- AlphaGo is Google’s AI that recently defeated a human champion at the game of Go. To put that in perspective: Go is considered more difficult than chess, since the total number of possible positions in the game exceeds the total number of atoms in the observable universe.
- Cancer detection: Samsung Medison, a global medical equipment company and an affiliate of Samsung Electronics, has developed an algorithm that is better than humans at extracting meaning from cancer pathology reports.
And there are many more such awesome things that have gone from science fiction to reality in just a few years. That brings us to the conclusion that something cool is going on, something that has acted as a catalyst for these huge advancements. What might that be? Well, one of the catalysts is what most people call big data.
What is Big Data?
The simplest definition I could gather is this: data that, due to its velocity, variety and volume, cannot be stored, processed and analysed with conventional methods. That definition is somewhat debatable, but it serves our purpose. So let’s first talk about the conventional methods. Conventionally, the term data mostly means textual data stored in relational databases in neatly formatted columns, which can be easily queried with our beloved SQL and analysed just as easily.
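To make the “conventional” case concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table, its columns and the rows are all made up for illustration; the point is only that neatly structured data can be summarised with a single SQL query.

```python
import sqlite3

# In-memory database; the table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, city TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Ali", "Lahore", 82), ("Sara", "Karachi", 91), ("Omar", "Lahore", 74)],
)

# A typical query over structured data: average score per city.
rows = conn.execute(
    "SELECT city, AVG(score) FROM students GROUP BY city ORDER BY city"
).fetchall()
conn.close()

for city, avg_score in rows:
    print(city, avg_score)
```

Nothing comparable works out of the box for a folder of videos or a stream of free-form text, which is where the big data tools come in.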
When we talk about big data, it has:
- Variety: It can be unstructured data, images, videos, markup, natural language sentences, graphs and pretty much whatever you can imagine.
- Volume: It comes in huge amounts. And by huge I really mean HUGE!
- Velocity: It arrives at great speed. For example, in 60 seconds 259710 tweets are posted on Twitter. That is roughly 4300 tweets per second.
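As a quick check of the velocity figure, the per-second rate follows directly from the per-minute count. A minimal sketch in Python (the variable names are just for illustration); the exact average comes out a little above 4300, so any rounded per-second figure should be read as approximate.

```python
# Tweets posted in one minute, as quoted above.
tweets_per_minute = 259710
seconds_per_minute = 60

# Average rate over that minute.
tweets_per_second = tweets_per_minute / seconds_per_minute

print(f"{tweets_per_second:.1f} tweets per second")
```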
So it is safe to say that we have been dealing with data for a long time; what has changed now is its amount, speed and type.
To support that statement, I urge you to take a look at http://onesecond.designly.com/ and you will realize that by the time you have read this far there have been:
- 16833165 Google searches made
- 24463824 YouTube videos viewed
- 28853982 Facebook likes
- 964999614 emails sent
- 18330710 tweets posted
- 562496 Dropbox files uploaded
So that is the rate at which data is being generated, and while it brings challenges in storing, processing and analyzing it, it also brings opportunities beyond imagination. I’ll talk about that in detail later. But first, let’s take a look at what data science is and where machine learning and other such fields fit in.
What is Data Science?
I may be oversimplifying things, but data science simply means making sense of data. Now, that data can be in any form: relational databases, unstructured data, natural language content, images, videos, graphs, markup, logs, anything. Data science is the study of the algorithms, tools and techniques that help make sense of data. It is pretty much an umbrella term that includes many subfields such as machine learning, computer vision, business intelligence and so on.
I’ll try to elaborate with an example. Consider a task in which YouTube intends to give its users an effective search experience. Suppose that, until now, the only way a user can search is by the name or title of the video. But can the name or title completely reflect the content? No. What if the video is a news channel feed, and each segment carries a whole new story? So we need something more to capture its meaning. Let’s first consider tagging the videos. And assume the user does not tag while uploading, so YouTube has to do it.
Now, we could hire a bunch of people to watch the videos and tag them. Seems like the job is done, right? But every 30 minutes, 150 hours of video are uploaded to YouTube’s servers. That is 300 hours of video per hour, so we would need about 300 people watching at any given moment, or roughly 900 people working in three eight-hour shifts with no breaks, just to match that pace. And what about the videos already on the servers? Well, I don’t think it is possible to tag all of them manually. So we need a computer algorithm that can look at a video and suggest tags for it automatically. That would be cool, right? Well, it has been done, and this blog post covers some of the cutting-edge techniques for video and image tagging.
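The staffing estimate can be checked with a few lines of arithmetic. This is a rough sketch that assumes each tagger watches one video at a time in real time and that three shifts cover the whole day with no breaks; allowing for breaks, days off or review time would push the number higher.

```python
import math

# Upload rate quoted above: 150 hours of video every 30 minutes.
video_hours_uploaded = 150
window_hours = 0.5

# Hours of new video arriving per clock hour.
video_hours_per_hour = video_hours_uploaded / window_hours

# Assumption: a tagger watches in real time, one video at a time,
# so this is also the number of taggers needed at any given moment.
concurrent_taggers = math.ceil(video_hours_per_hour)

# Assumption: three shifts cover 24 hours, with no breaks.
shifts_per_day = 3
total_taggers = concurrent_taggers * shifts_per_day

print(concurrent_taggers, total_taggers)
```

Either way, the headcount runs into the hundreds just to keep up with new uploads, which is the real argument for automating the tagging.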
That brings us to our most important question: why is it so hyped now? These algorithms and techniques have been in practice for a long time. Then why have they become so important now?
Why now?
There are two major catalysts for these advances in data science. The first one is big data. Almost all of the algorithms used in data science are data-hungry: the more data you feed them, the more effective and accurate they generally get. Skype’s translation, Google’s effective search results, Netflix’s movie suggestions, Amazon’s efficient delivery, IBM’s Watson and Google’s AlphaGo: none of these would have been possible without the huge amounts of data these companies have.
The second catalyst is the advancement in computing resources. Especially with advances in the cloud model, an ordinary person can now access resources that were once available only to big companies like Google and Amazon.
So data science is an umbrella term that includes maths, stats, machine learning, AI algorithms and many other techniques, all of which are used to make ‘sense of data’.
The time is ripe for such sophisticated methods, thanks to the advances in processing power and the explosive growth in the amount of data. There are many cool courses on Coursera and other online portals to get you started. And if I can help you in any way, just ping me 🙂