Hi, welcome to the article on types of data.
In the previous article we learnt to:
- Define the terms labelled data and unlabelled data
- Explain supervised learning and unsupervised learning
Learning Objectives:
At the end of this topic you will be able to:
- Explain what data is and where it is available
- Explain what a variable is
- Explain the types of variables
- Explain what dependent and independent variables are
- Differentiate between dependent and independent variables
- Explain what structured and unstructured data are
- Differentiate between structured and unstructured data
What is data?
As we learnt earlier, "data is any information about something". This information can be in the form of numbers, text (characters, words, sentences, paragraphs and so on), images, audio, video and more. This information is available everywhere. In fact, we live in an environment surrounded by data in different forms.
For example: When you wake up in the morning to the sound of an alarm, the specific time you wake up is data. After waking up you check your mobile for notifications. Your mobile again contains a lot of data, such as the number of calls received or missed and the number of SMS or WhatsApp messages received or sent. All of these, including the contents of the messages, are also data in the form of text, audio, pictures or video.
So our actions generate data, which is registered by our brain and recalled when required. That is, data is always stored somewhere and recalled whenever it is required.
For example: When you want to call someone using your smartphone, you search for the name in your phone book and make the call. Here your phone book stores the contact information, which is nothing but data. The data here is in the form of name, contact number, email id, picture and so on. After your call ends, your smartphone displays the number of times you have called that person on different dates and also shows the duration of each of those calls. The call log data is stored in the form of different attributes like contact name, date and time of the call, and duration of the call in minutes or seconds. Each of these attributes, such as contact name, duration of a call and so on, represents a specific piece of information, and together they make up the call log data. These attributes are called variables, which are also known as data items as they constitute the data.
What is a variable?
"A variable represents one specific characteristic of the data and gives specific information about the data under consideration". In the above example, the contact name variable tells us the name of the person to whom the call was made. Similarly, the call duration variable tells us the duration of the call. A variable can take any kind of value, such as numbers or characters.
For example: In the call log data, the call duration in seconds will be a numeric variable and the person's name will be a character variable. It is called a variable because its value can change across different data points or over time. For example, the call duration for different calls will be different. Similarly, the date and time of each call will also be different. So you can see that the values of a variable may change as you move from one record, or one row of data, to the next.
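The call log described above can be sketched in a few lines of Python. This is only an illustration: the contact names, dates and durations are made up, and the field names (contact_name, date, duration_seconds) are assumptions chosen for readability.

```python
# Hypothetical call log: each dict is one record (row),
# and each key is one variable (data item).
call_log = [
    {"contact_name": "Asha", "date": "2023-01-05", "duration_seconds": 125},
    {"contact_name": "Ravi", "date": "2023-01-05", "duration_seconds": 42},
    {"contact_name": "Asha", "date": "2023-01-06", "duration_seconds": 310},
]

# Collect the values that one variable takes across all records.
durations = [record["duration_seconds"] for record in call_log]
names = [record["contact_name"] for record in call_log]

print(durations)  # numeric variable: [125, 42, 310]
print(names)      # character variable: ['Asha', 'Ravi', 'Asha']
```

Notice that the value of duration_seconds changes from one record to the next, which is exactly why it is called a variable.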
Types of variables
There are two types of variables. One, which can be measured or counted, is known as a numeric or quantitative variable. The other, which represents a characteristic and cannot be measured or counted, is known as a categorical or qualitative variable.
Numeric variables are those which are quantifiable. That is, they can be measured or counted and expressed in a numerical format. Categorical variables, on the other hand, are like adjectives: they express a feeling or characteristic but cannot be quantified or measured.
We can explain this with the example of a mobile phone. If the price of the phone is mentioned as cheap, it is a categorical variable. But it is not well defined, and what you may call cheap could be expensive for someone else. However, if you mention the exact price of the phone in rupees, say Rs. 6000, then it is a numeric variable, because it can be measured and expressed numerically.
Numerical Variables
The numeric variables can be subdivided into two categories: continuous and discrete. Continuous variables are those numeric variables which can take any value within an interval of real numbers, that is integers, fractions, decimals and so on. Their values remain within that interval. For example: The height of a person, say 1.52 meters, the age of a person, say 23 years or 25.5 years, a temperature of 36.9 degrees, the time and so on are all continuous variables.
Discrete variables, on the other hand, are those which are countable and can only take whole numbers, that is integers, as values. For example: The number of contacts in your phone book and the number of calls made on a particular day are whole numbers, and each value is distinct for each scenario.
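As a rough sketch, the difference between continuous and discrete variables maps onto Python's float and int types. The counts below are made up, and the helper is_discrete is a hypothetical name introduced only for this illustration. Treating "stored as an int" as "discrete" is a simplification, since a measurement can also be rounded to a whole number.

```python
# Continuous variables: any real value within an interval, stored here as floats.
height_m = 1.52
age_years = 25.5
temperature = 36.9

# Discrete variables: countable whole numbers, stored here as ints.
contacts_in_phone_book = 240  # hypothetical count
calls_made_today = 7          # hypothetical count


def is_discrete(value):
    """Return True for whole-number (int) values, False otherwise."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


print(is_discrete(calls_made_today))  # True
print(is_discrete(height_m))          # False
```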
Categorical Variables
Categorical variables are not numeric in nature and consist of string or text values. For example: yes or no, good or bad, slow or fast, characteristics like colours, fat or slim, or binary variables such as zero or one. These variables can be subdivided into ordinal and nominal variables.
Ordinal variables are those categorical variables which can be arranged in a logical order. That is, they can be ranked in ascending or descending order based on their values. For example: The rating of a product expressed as poor, average or good. We know good is greater than average, which is greater than poor. Similarly, if the price of a phone is expressed as cheap or expensive, we know that expensive is greater than cheap, and so on.
Nominal variables are those categorical variables which cannot be logically ranked based on their values. For example: The colour of your phone, such as black, white or silver, or its brand, such as Samsung, Apple or Nokia.
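The following sketch shows one way the ordinal and nominal distinction plays out in code. The sample values are made up, and the name RATING_ORDER is an assumption introduced for this example. An ordinal variable can be sorted once its order is stated, whereas for a nominal variable we can only count how often each value occurs.

```python
from collections import Counter

# Ordinal: the order of the categories is meaningful, so we state it explicitly.
RATING_ORDER = {"poor": 0, "average": 1, "good": 2}

ratings = ["good", "poor", "average", "good"]
sorted_ratings = sorted(ratings, key=RATING_ORDER.get)
print(sorted_ratings)  # ['poor', 'average', 'good', 'good']

# Nominal: there is no meaningful order, so we only count occurrences.
colours = ["Black", "White", "Silver", "Black"]
colour_counts = Counter(colours)
print(colour_counts["Black"])  # 2
```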
We can summarize the learning so far in the diagram given. Data contains information, which we get in the form of different variables. A variable can be numeric, comprising numbers, or categorical, comprising string values. The numeric variables can further be subdivided into continuous variables comprising real numbers and discrete variables comprising whole numbers. The categorical variables can also be subdivided into ordinal variables, which can be arranged in a meaningful sequence, and nominal variables, which cannot be arranged in any meaningful order.
Independent & Dependent Variables
Independent variables are the ones whose values do not depend on any other variables. Their values can be changed or manipulated easily. For example: The number of calls you make in a day or the amount you spend on any given day could be examples of independent variables, as their values do not depend on other variables in the data.
Dependent variables are the ones whose values depend on other, independent variables and cannot be changed easily. They bear a relationship with those variables. Changing the value of the independent variables will affect the value of the dependent variable. For example: The battery percentage of your smartphone will depend on a lot of factors, including the number of calls made or the duration of the calls. This variable, that is battery percentage, is a dependent variable whose values would depend on the number of calls and the duration of calls, which are independent variables.
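The battery example can be sketched as a small function in which the inputs are the independent variables and the returned value is the dependent variable. This is a toy model only: the function name battery_percentage and the drain rates (1.0 percent per call, 0.5 percent per minute) are made-up assumptions, not real measurements.

```python
def battery_percentage(number_of_calls, total_call_minutes,
                       start_percentage=100.0,
                       drain_per_call=1.0,
                       drain_per_minute=0.5):
    """Toy model of battery left after calls. The drain rates are assumptions."""
    used = number_of_calls * drain_per_call + total_call_minutes * drain_per_minute
    # The battery cannot go below zero percent.
    return max(0.0, start_percentage - used)


# Changing the independent variables changes the dependent variable.
print(battery_percentage(number_of_calls=4, total_call_minutes=30))   # 81.0
print(battery_percentage(number_of_calls=10, total_call_minutes=90))  # 45.0
```

We can choose the number of calls and their duration freely, but the battery percentage simply follows from them.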
We will deal with dependent and independent variables while we learn about the machine learning models later. The diagram given here shows the difference between dependent and independent variables.
Structured Data
Let us now discuss structured data. What is structured data? As the name suggests, the data needs to have a structure. Structured data always has a format or structure when it is stored.
For example: The phone book of your smartphone follows a structure so that, whenever you add a new contact, it gets displayed in the same format, showing the first name and last name along with the contact number.
Another example would be your exam results, which are displayed online in the same format for each student. They show the subject names along with the marks and the status, pass or fail. Here the data is structured in a proper tabular format.
By tabular format we mean the variables and their values are displayed in different rows and columns. The first row contains the header, also called the variable names, which tells us what information each column contains. The subsequent rows contain the values of these variables for different data points. Machines can easily read this kind of data, and you can always carry out mathematical operations on these data points for numeric variables.
An example of structured data is given here:
It contains the marks obtained by different students in different subjects.
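To see why machines find tabular data easy to read, here is a minimal sketch using Python's built-in csv module. The student names, subjects and marks are made up for illustration. The first row is the header with the variable names, and every later row holds the values for one student.

```python
import csv
import io

# Hypothetical marks table; the first row is the header (variable names).
raw_table = """student,maths,science,english
Anil,78,85,69
Bina,92,88,95
Chetan,35,52,61
"""

# DictReader uses the header row to label the values in each later row.
rows = list(csv.DictReader(io.StringIO(raw_table)))

# Because every row has the same structure, numeric treatment is straightforward.
for row in rows:
    marks = [int(row[subject]) for subject in ("maths", "science", "english")]
    row["average"] = sum(marks) / len(marks)

averages = {row["student"]: row["average"] for row in rows}
print(averages)
```

Because each column always holds the same variable, the same few lines work for three students or three thousand.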
Unstructured Data
What is unstructured data? As the term suggests, data which does not have a well-defined structure and is not arranged in any tabular format is called unstructured data. This kind of data is easily readable for humans, but it is difficult for machines to read.
For example: The messages you receive on your smartphone or your WhatsApp conversations do not follow any pattern or structure and do not have any tabular format. They can be very easily understood by humans but are difficult for machines to comprehend. A text extracted from your favorite novel, comments on your social media account, an image or an audio clip: none of these has a structure of its own, and hence all are examples of unstructured data.
But you must be thinking: if machines are unable to read image or audio data, then how is the image recognition of Facebook able to identify you once you upload any of your pictures, and how do Siri and Alexa work by easily responding to your voice commands? As you are also aware, these technologies use machine learning for making smart decisions. The solution is very simple: the machines first convert the unstructured data into a structured format so that they can easily read and work with it.
For example: Consider the example discussed earlier. We have comments collected from social media about a particular model of mobile phone which is not working properly. People have posted their complaints and negative reviews of the phone on Facebook. This is an example of unstructured data. So how do machines convert this data into a structured format? First of all, machines extract keywords from each of the comments. A keyword is extracted based on the placement of the word in the comments and the frequency of occurrence of that word. Next, the machine creates a matrix based on those words and assigns a value to each of them, known as the term frequency, which is the number of times the word occurs in the comments. Once the matrix is created, the machine can very easily read the data and work with it.
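The term frequency matrix described above can be sketched as follows. This is a simplified illustration, not the exact procedure any particular product uses: the comments are made up, and every word is counted rather than selecting keywords by their placement.

```python
from collections import Counter

# Hypothetical comments standing in for the social media posts.
comments = [
    "battery drains fast and phone heats",
    "phone heats while charging",
    "battery is bad battery drains",
]

# Count how often each word occurs in each comment.
counts = [Counter(comment.lower().split()) for comment in comments]

# The vocabulary gives the matrix its column headers.
vocabulary = sorted(set(word for c in counts for word in c))

# One row per comment, one column per word, each cell a term frequency.
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each comment has now become a row of numbers with the same columns, which is the tabular structure machines can work with.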
Another example: We have images of persons. Let us find out how machines convert these images into a structured and readable format. First of all, the machine captures and transforms the images into a format where only the eyes, nose, lips and jawline are visible. Next, it breaks the captured image into 128 pixels and creates an outline based on the image.
Next, it assigns values and creates sub-matrices based on these pixel points. The matrix is unique for each person's image. Once we have the unique matrix, the machine can very easily read the data, as it is in a structured format, and model those images.
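The general idea of turning an image into numbers can be sketched with a tiny made-up image. This is not the face recognition procedure described above, only an illustration that an image can be stored as a matrix of pixel values. The 3x3 grayscale values are assumptions chosen for the example.

```python
# Hypothetical 3x3 grayscale image: 0 is black, 255 is white.
image = [
    [0, 128, 255],
    [64, 128, 192],
    [255, 128, 0],
]

height = len(image)
width = len(image[0])

# Flatten the matrix into one row of numbers, like one record in a table.
feature_row = [pixel for row in image for pixel in row]

# Scale each value to the range 0 to 1, a common preparation step.
scaled_row = [pixel / 255 for pixel in feature_row]

print(height, width)  # 3 3
print(feature_row)    # [0, 128, 255, 64, 128, 192, 255, 128, 0]
```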
Don't worry if you don't fully understand how unstructured data is converted into structured data. Knowing what kind of data can be called unstructured, and knowing that machines always convert this kind of data into some structured equivalent, is sufficient at this stage. That is all for this article. In the next article you will learn about Graphical and Analytical Representation of Data.