Why DataScience? Why Now?

I am a student of DataScience, and these are the questions I get asked very often: What is the difference between DataScience and BigData? What is all this fuss about? What is so special about it? At least in Pakistan there seem to be no jobs in DataScience, so why spend time learning all these algorithms and methods? In this post I'll try to answer all such questions. So let's start…

Data science is being called the coolest job of the 21st century, so the curiosity about it is natural. Let us first look at some recent advances in technology:

  • Skype is now offering real-time voice translation: one person talks in one language, the other in another, and their voices get translated in real time.
  • DeepArt, an algorithm that mimics the styles of some of history's greatest painters, has been developed by researchers in Germany. Given any picture, this algorithm can generate a painting of it mimicking any famous painter. Below is an example of paintings generated from the same image in different styles.


    The original image is in A; all the others are paintings generated by the algorithm, mimicking the styles of the pictures in the small boxes.

  • Watson is a question-answering computer system, developed in IBM's DeepQA project, capable of answering questions posed in natural language. In 2011 Watson competed on Jeopardy! against former winners Brad Rutter and Ken Jennings, and received the first-place prize of $1 million.
  • AlphaGo, Google's AI, recently defeated the human champion in the game of Go. Just to emphasize the importance of this: Go is considered more difficult than chess, since the total number of possible positions in the game is greater than the total number of atoms in the universe.
  • Cancer detection: Samsung Medison, a global medical equipment company and an affiliate of Samsung Electronics, has developed an algorithm that is better than humans at extracting meaning from cancer pathology reports.

And there are many more such awesome things that have transitioned from science fiction to reality in just a few years. That brings us to the conclusion that something cool is going on, something that acted as a catalyst for these huge advancements. What might that be? Well, one of them is what most people call BigData.

What is BigData?

The simplest definition I could gather is: data that, due to its velocity, variety and volume, cannot be stored, processed and analysed with conventional methods. That is a little objectionable, but it serves our purpose. So let's first talk about the conventional methods. Conventionally, by the term data one mostly means textual data, stored in relational databases in nicely formatted columns, that can be easily queried with our beloved SQL and analysed just as easily.

When we talk about big data, it has:

  • Variety: It can be unstructured data, images, videos, markup, natural language sentences, graphs and pretty much whatever you can imagine.
  • Volume: It comes in huge amounts. And by huge I really mean HUGE!
  • Velocity: It arrives at great speed. For example, about 259,710 tweets are posted on Twitter every 60 seconds. That is over 4,300 tweets per second.

So it would be safe to say that we have been dealing with data for a long time; what has changed now is its amount, speed and type.

To support that statement, I urge you to take a look at http://onesecond.designly.com/ and you will realize that by the time you got here there have been:

  • 16,833,165 Google searches made
  • 24,463,824 YouTube videos viewed
  • 28,853,982 Facebook likes
  • 964,999,614 emails sent
  • 18,330,710 tweets posted
  • 562,496 Dropbox files uploaded

So that is the amount of data being generated, and while it brings challenges in storing, processing and analysing it, it also brings opportunities beyond imagination. I'll talk about that in detail later. But first, let's take a look at what data science is and where machine learning and other such fields fit in.

What is DataScience?

I may be oversimplifying things, but DataScience simply means making sense of data. Now, that data can be in any form: relational databases, unstructured data, natural language content, images, videos, graphs, markup, logs, anything. DataScience is the study of the algorithms, tools and techniques that help make sense of data. It is pretty much an umbrella term that includes many subfields like machine learning, computer vision, business intelligence and so on.

I'll try to elaborate with an example. Consider a task in which YouTube intends to provide its users an effective search experience. Until now, the only way a user can search is by the name or title of a video. But can the name or title completely reflect the content? No. What if the video is from a news channel feed and every segment contains a whole new story? We need something more to add meaning to it. Let's first consider an approach that tags the video, and assume the user does not tag it while uploading, so YouTube has to do it.

Now, what we can do is hire a bunch of people who will watch the videos and tag them. Seems like the job is done, right? But every 30 minutes, 150 hours' worth of video is uploaded to YouTube's servers. So we would require on the order of a thousand people working in three shifts, continuously and with no breaks, just to match that pace. And what about the videos already on the server? I don't think it is possible to tag all of them manually. So we need a computer algorithm that can look at a video and suggest tags for it automatically. That would be cool, right? Well, it has been done, and this blog post covers some of the cutting-edge techniques for video and image tagging.
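A quick back-of-the-envelope check of that staffing claim, assuming the upload figure above (150 hours of video every 30 minutes) and that one person can tag video no faster than real time:

```python
# 150 hours of video uploaded every 0.5 hours of real time
upload_rate = 150 / 0.5           # hours of video arriving per real hour = 300

# To keep pace at 1x playback, that many people must be watching at all times.
concurrent_taggers = upload_rate

# Cover 24 hours with three 8-hour shifts.
shifts_per_day = 3
staff = int(concurrent_taggers * shifts_per_day)

print(staff)  # 900
```

That 900 is a bare minimum; breaks, rewinds, weekends and the existing backlog push the real number well past a thousand, which is exactly why manual tagging cannot scale.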

That brings us to our most important question: why is it so hyped now? These algorithms and techniques have been in practice for a long time, so why have they become so important now?

Why now?

There are two major catalysts for these advances in DataScience. The first one is BigData. Almost all of the algorithms used in DataScience are data-hungry algorithms: the more data you feed them, the more effective and accurate they get. Skype's translation, Google's effective search results, Netflix's movie suggestions, Amazon's effective delivery, IBM's Watson and Google's AlphaGo: none of this would have been possible without the huge amounts of data these companies have.

The second catalyst is the advancement in computing resources. Especially with the rise of the cloud computing model, a common person can now have resources that were once only available to big companies like Google and Amazon.

Conclusion

So DataScience is an umbrella term that includes maths, stats, machine learning, AI algorithms and many other techniques, all of which are used to make 'sense of data'.

The time is ripe for such sophisticated methods, due to the advancements in processing power and the explosive growth in the amount of data. There are many cool courses on Coursera and other online portals to get you started. And if I can help you in any way, just ping me 🙂

Beginning DataScience

About

This is going to be a journey of me teaching a friend of mine the basics of data science. The twist is that he is not from a computer science background, and he definitely doesn't code. So we'll start from the basics of programming and try to reach the point where he can call himself a "Data scientist" 🙂

I'll be posting all the resources and code we are going through, and I hope this post will be helpful to anyone who chooses the same path. Hopefully, if we succeed, anyone will be able to learn the basics of data science by following us.

Initial Plan  

We'll start with basic programming constructs and will choose Python as the initial language, due to its similarity to pseudo-code and its gentle learning curve. Let's get started!

Setup

For getting started with Python, Anaconda is a blessing. It contains all the necessary Python packages, libraries and a cool IDE. And the best thing is it's free and supports all platforms, so go ahead, download and install Anaconda for your OS.
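Once Anaconda is installed, a quick sanity check is to open its Python interpreter and confirm you are on a Python 3 build (a minimal sketch; the exact version string will depend on the Anaconda release you downloaded):

```python
import sys

# Print the interpreter version, e.g. "3.11.5" (varies by Anaconda release)
print(sys.version.split()[0])

# All the material in this series assumes Python 3
assert sys.version_info.major == 3
```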

Programming Basics

For a beginner, How to Think Like a Computer Scientist is a good book. We'll start with the first 10 chapters of the book at a fast pace, and then, for a more detailed understanding, the first 13 chapters of Think Python.

I have found another useful and very handy tutorial by J.R. Johansson. It covers most of the concepts we need right now. So after going through the complete notebook, if you think you know all the concepts discussed, we have achieved our first milestone. You can check your understanding by solving the following exercises on lists and strings.
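As a taste of the list and string basics those exercises cover, here is a tiny illustrative snippet (the words and values are just made up for the example):

```python
# Strings and lists convert back and forth easily in Python
words = "data science is fun".split()   # str -> list of words

print(words)                 # ['data', 'science', 'is', 'fun']
print(len(words))            # 4 words in the list
print(words[0].upper())      # 'DATA'  (indexing + a string method)
print("-".join(words))       # 'data-science-is-fun'  (list -> str)

# List comprehension: keep only the longer words
print([w for w in words if len(w) > 3])  # ['data', 'science']
```

If every line of output above makes sense to you, you are ready for the exercises.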


One to one image matching – An introduction

Hi everyone! This post is the first part of a series in which I will discuss a generic object detection algorithm using Python and some external libraries.

Digital image processing is one of the most important areas today: Facebook is detecting the faces of our friends automatically, Google lets us search using images, and Google Glass and Microsoft's HoloLens have initiated a new era of augmented reality. Big data science is blooming, and one of its major strengths is processing non-trivial data, including images and videos. Facebook, Instagram and Twitter have millions of images posted every day; what if one could extract all the information from these images? Just think about it and you will understand why, in today's age, every tech person must have at least a basic understanding of how images are actually interpreted, stored and matched. So here I will first give an abstract idea of basic image matching, and then in a later post some Python code to actually implement it.

One thing we know for sure is that images are stored in binary (1s and 0s), where the sequence of bits depends on the format (jpg, png, etc.) and on whether it is an 8-bit or 16-bit image. Either way, an image is stored as a sequence of 1s and 0s. So if we have to compare two exactly identical images, it is a trivial task: just consider them two arrays, apply any naive comparison, and there we have it; we can easily find out whether the two images match or not.
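A minimal sketch of that naive comparison, using tiny hand-made pixel grids in place of real images (in practice you would load the pixels with an image library such as Pillow):

```python
# Toy "images": 2x2 grayscale pixel grids with values 0-255
image_a = [[ 52, 119],
           [200,  31]]
image_b = [[ 52, 119],
           [200,  31]]
image_c = [[ 52, 120],   # one pixel differs from image_a
           [200,  31]]

def identical(img1, img2):
    """Exact match: same dimensions and every pixel value equal."""
    return img1 == img2   # nested lists compare element-by-element in Python

print(identical(image_a, image_b))  # True
print(identical(image_a, image_c))  # False
```

This works perfectly, but only for bit-for-bit identical images, which is exactly the limitation the next paragraph runs into.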

[Figure: two identical copies of the Lena test image]

But in reality it is extremely rare to be matching two exactly identical images. Most of the time you will have to match two similar images.

[Figure: two similar, but not identical, versions of the Lena test image]

That trivial technique will not work now, because these images are not exactly the same, only similar.

So this is where digital image processing comes in. Now the task is to find the similarity between the images, and if the similarity is greater than a certain threshold we can say that the images are the same. The steps to do that are as follows:

1. Find unique features of each image.
2. Match the features of the first image to those of the second.
3. Count the number of features matched.
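The steps above can be sketched in pure Python with made-up 2-D descriptors standing in for real image features (an actual system would use SIFT or ORB descriptors from a library such as OpenCV):

```python
def euclidean(d1, d2):
    """Distance between two feature descriptors."""
    return sum((a - b) ** 2 for a, b in zip(d1, d2)) ** 0.5

def match_features(feats1, feats2, max_dist=1.0):
    """Steps 2 and 3: match each feature in image 1 to its nearest
    neighbour in image 2 and count matches closer than max_dist."""
    matches = 0
    for f in feats1:
        nearest = min(euclidean(f, g) for g in feats2)
        if nearest < max_dist:
            matches += 1
    return matches

# Step 1 (assumed already done): descriptors extracted from each image
image1_features = [(0.0, 1.0), (4.0, 4.0), (9.0, 2.0)]
image2_features = [(0.1, 1.1), (4.2, 3.9), (20.0, 20.0)]

n = match_features(image1_features, image2_features)
print(n, n / len(image1_features))  # 2 matches out of 3
```

With 2 of 3 features matched, a threshold such as "more than half the features match" would declare the images similar; the descriptors and threshold here are purely illustrative.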

The word feature is the key here; if one can understand what a feature is, the rest of the task is a piece of cake. A feature is a point in an image that can uniquely identify it against another image. For example, consider the region in the white rectangle: can any point inside this rectangle be called a feature? No, because any point selected would have same-coloured points all around it, and there would be a lot of such points, so it would not be unique.

In comparison, if we select a point that has differently coloured points around it, e.g. any point in the yellow rectangle, with different intensities and a change in colour in a specific direction, then there will be very few points that exactly match it. If we consider enough information about the points around it, it will not match any point but itself, so it would be a good feature. Once you find such a point, you can store information like its colour intensity, the intensities of the points around it, the gradient angle, etc. in any data structure, and we have a feature!

It is very common to consider points on edges as features, and by an edge I mean a place where colour intensity is changing rapidly. Hence, many feature detection algorithms are also called edge detection algorithms. There are a number of these algorithms available; some of them are Harris, Hessian, SIFT and FAST. They all do the same task with slight variations, and their implementations can be found in pretty much any programming language. By applying any of these algorithms we can get something like the image below, where the red dots are the features found.
[Figure: detected features marked as red dots on two similar images]
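To make the "rapid intensity change" idea concrete, here is a toy sketch on a single row of pixels; real detectors like Harris or FAST are far more sophisticated, but the underlying signal is the same:

```python
# One row of grayscale pixels: a dark background with a bright patch
row = [10, 11, 10, 12, 200, 205, 198, 11, 10]

def edge_positions(pixels, threshold=50):
    """Return the indices where the intensity jump to the next pixel
    exceeds the threshold: candidate feature locations."""
    return [i for i in range(len(pixels) - 1)
            if abs(pixels[i + 1] - pixels[i]) > threshold]

print(edge_positions(row))  # [3, 6] -> the two borders of the bright patch
```

The flat dark and flat bright regions produce no candidates at all, which matches the white-rectangle/yellow-rectangle intuition above.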

Now, since all of these points are unique features, we can compare the features of the two images and count how many of them match; if enough matches are found, we can say that the images are similar.
So here is the basic idea of how two images can be matched. I hope I was successful in explaining it; in case of any questions, do contact me!