Machine Learning : Graphical and Analytical Representation of Data


Hi, welcome to the article on graphical and analytical representation of data.

In the previous article, we learnt how to:

  1. Explain what data is and where it is available
  2. Explain what a variable is
  3. Explain the types of variables
  4. Explain what dependent and independent variables are
  5. Differentiate between dependent and independent variables
  6. Explain what structured and unstructured data are
  7. Differentiate between structured and unstructured data

Learning Objectives:

By the end of this article, you will be able to:

  1. Explain what data analysis is and how it is linked to ML
  2. Explain the types of data analysis
  3. Explain the advantages of representing data graphically
  4. List and explain the major types of graphs used for data analysis

Now we will discuss what we do with the available data to bring out insights from it. Suppose you are watching a cricket match and a new batsman comes to the crease at the striker's end. The captain of the fielding team, in consultation with the bowler, decides which field setting to use and which ball to bowl so that the batsman gets out early. Let us list the points the captain and the bowler might have discussed:

  • The batsman's strengths in the past
  • The batsman's weaknesses
  • Does the batsman struggle with swing and pace?
  • Is he comfortable with spin? etc.

Technically speaking, the captain and the bowler consulted the batsman's historical data, or batting statistics, and made the required decision based on it. The bowler also decides which areas to bowl in based on how the batsman treated his previous delivery, so that the runs conceded can be minimized.

[Figure: Analyzing data about bowling]


Data Analysis

The process of studying the available data and drawing valuable insights or information from it with the help of software is termed data analysis. Data analysis is used every day, in many ways, to enable businesses to make smart and accurate decisions. When you get your semester results, your professor suggests the topics or subjects you must pay more attention to in order to get better results. This, again, is an example of data analysis.

[Figure: Professor analyzing data for better marks]

Your professor analyses your marks in each subject, identifies your weakest subjects and then highlights the major topics in those subjects which you must pay attention to in order to score better marks. In this case, since the data was very small, the human brain was able to analyze it efficiently.

Now suppose the professor has to analyze the grades of each and every student in the university, or in the city or town they live in. This becomes very difficult for the human brain since there is a huge amount of data. So we use analytical tools to analyze the data and derive the desired result.

Some of the most commonly used analytical tools in the industry are MS Excel, Python, R, SAS, KNIME, SPSS, Matlab, Minitab, Tableau, etc. New tools are being developed every day for better and more accurate results with minimal human intervention.

[Figure: Tools for analyzing data]


What has Machine Learning got to do with Data Analysis?

We have understood what data analysis is. But what has machine learning got to do with data analysis? As you learnt earlier, machine learning uses different algorithms to predict future outcomes, while data analysis is about analyzing the data and bringing out valuable insights from it using analytical tools. So data analysis is the part of machine learning which analyses your data and brings out insights before any algorithm is applied to it.

[Figure: Cricket match prediction example]

For example: Let's consider the cricket match scenario. Based on the previous match performance of your opponents, you analyze and decide which bowler to use against a given batsman, or which ball to bowl at the batsman, to earn a reward in the form of a wicket. But when you want to predict which team is going to win the match based on historical data, you would use a machine learning algorithm to predict the winner. In that prediction we would also consider certain factors such as the pitch condition, the weather at the time of the match, etc. These will help us predict the result of the match. It is all about identifying a trend or a relationship with any of the conditions which affect the prediction.

[Figure: Cricket match factors example]


Exploratory Data Analysis (EDA)

This type of data analysis is known as exploratory data analysis (EDA). The objective of EDA is to find patterns and relationships among the data variables. EDA is an integral and most important part of data analytics, as it helps in finding patterns and trends in the data.

Just as you get a summary of the storyline of your favorite novel on its back cover, EDA gives a summary of the dataset. The summary is usually accompanied by different types of graphs and descriptive statistics, which make it easier for the audience to understand and interpret.

[Figure: EDA summary example]

For example: Before the start of every cricket match, the captain and the coach discuss the performance of the players of the opponent team. They study the opponents' strengths, weaknesses, the number of times they have won, the last 5 innings ratio and so on. All these figures are just descriptive statistics of the opponent team's performance, and this is what EDA is all about.

Before we start any modelling on a dataset, it is always necessary to know about the variables that we will be dealing with, and this is where EDA comes into the picture. Once we have done EDA, we know the dataset and hence our modelling becomes easier. This can be summarized in the diagram given:

[Figure: EDA model example]

Using machine learning, we create models where we analyze data using different algorithms. We also explore the data using different EDA methods. The four major types, or we can say the four levels, of data analytics are descriptive analysis, diagnostic analysis, predictive analysis and prescriptive analysis. Let us discuss each of them in detail.

[Figure: Levels of data analytics]

Descriptive Analysis

In descriptive analysis, we get to know what is happening based on the available data. It is the first level of information that you obtain from the data. Let us consider the cricket match we talked about earlier. Using the match data, you will know which players didn't perform well. Another example of descriptive analysis is a daily or weekly sales summary for a marketing officer, which can help in identifying the sales trend over the week or which sales representative had the maximum sales on a given day. Yet another example is the daily summary of transactions available with a bank, which helps it know the amount of money deposited in a day or the number of loans sanctioned. This analysis gives us a high-level picture of the data, and we can say that it is step one of a data analysis journey.
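As a rough illustration of the weekly sales summary mentioned above, here is a minimal Python sketch. The day names and sales figures are made up for this example and do not come from any real dataset.

```python
# Hypothetical daily sales for one week (illustrative numbers only)
daily_sales = {"Mon": 120, "Tue": 95, "Wed": 140, "Thu": 110, "Fri": 160}

# Descriptive statistics: they only describe what has already happened
total_sales = sum(daily_sales.values())
average_sales = total_sales / len(daily_sales)
best_day = max(daily_sales, key=daily_sales.get)
worst_day = min(daily_sales, key=daily_sales.get)

print(f"Total sales for the week: {total_sales}")
print(f"Average sales per day: {average_sales}")
print(f"Best day: {best_day}, worst day: {worst_day}")
```

Nothing here predicts or explains anything. It simply summarizes the data, which is exactly the role of descriptive analysis.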

Diagnostic Analysis

Most data-driven companies are never content with the information gathered in descriptive analysis. They go deeper to learn the cause of the events. This is called diagnostic analysis, and it is the second step of a data analysis process. Let us take the example of the cricket match. Once we have won or lost the match, we dive deep into the data and investigate the main reason for the outcome. Was it the toss, the pitch conditions, the players not performing well, or some other reason? This type of analysis is time intensive, as you dive deep into the issue to find out the reason for the occurrence. Companies use diagnostic analytics to derive insights from their problems.

Predictive Analysis

The next step, predictive analysis, is the major step of a data analysis process. It takes into account the key findings from the descriptive and diagnostic analyses and predicts future events. This type of data analysis uses machine learning algorithms and tools, which we will cover in subsequent articles, to predict future events. For example: To predict the outcome of the next match based on the available data, we build a forecasting model using ML algorithms.

Prescriptive Analysis

The next step in data analysis is prescriptive analysis. It prescribes which actions must be taken to eliminate the problems which impact the future trends of the events. This analysis takes into account all of the previous analyses and then recommends action items. For example: Once the cricket match is over and the losing side has analysed the cause of the outcome, they also identify the things they should avoid and the things they should focus on to improve their performance in the next matches.

[Figure: Levels of data analytics in detail]


Graphical Data Representation

Graphical or visual representation of tabular or numeric data is one of the most popular and commonly used analytical techniques. Here the visual representation is done using graphs or charts, such as bars in a bar chart, lines in a line chart or slices in a pie chart.

[Figure: Charts and graphs example]

For such visual representation, we use neither algorithms nor complex models, and no complex mathematical formula is involved. It is one of the simplest techniques for drawing insights from the data.

[Figure: Things not required for graph and chart representation]

This not only helps us study the relationship between variables but also helps us identify trends and patterns across the variables. Now let's talk about some of the major types of graphs which are commonly used for data analysis.

[Figure: Major types of graphs]


Line Graph

The graph shown here is a line graph. It graphically represents the runs scored per over in a cricket match. Here the x-axis, or horizontal axis, represents the over numbers, and the y-axis, or vertical axis, represents the runs scored in an over. Based on the runs scored per over, the data points are plotted using their x and y values. Once the points are plotted, the dots are joined using a line. Looking at the chart, we can easily identify which was the most expensive over and which was the most economical over. Such inferences or conclusions can be drawn from this graph.

[Figure: Line graph]
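The same inference can be sketched in a few lines of Python. The runs below are hypothetical and only stand in for the data behind such a line graph. Each (over, runs) pair is one point that the line would join.

```python
# Hypothetical runs scored in overs 1 to 10 (illustrative data, not from a real match)
runs_per_over = [4, 7, 2, 9, 6, 12, 3, 8, 5, 6]

# Pair each over number (x value, starting at 1) with the runs scored in it (y value)
points = list(enumerate(runs_per_over, start=1))

# The highest point on the line is the most expensive over, the lowest the most economical
most_expensive = max(points, key=lambda point: point[1])
most_economical = min(points, key=lambda point: point[1])

print(f"Most expensive over: {most_expensive[0]} ({most_expensive[1]} runs)")
print(f"Most economical over: {most_economical[0]} ({most_economical[1]} runs)")
```

A plotting library would draw these same points and join them, but the reading of the graph boils down to this comparison of y values.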


Bar Graph

Now, instead of plotting the dots, we could have represented the data using bars. The chart would then have been called a bar chart. Here is an example of a bar chart.

[Figure: Bar chart]


Histogram/Frequency Polygon

Now let's move on to a special type of bar chart called a histogram or frequency polygon. A histogram is used to count the frequency of occurrence of an event. The only visible difference from a bar chart is that there are no gaps between the bars. It is used to represent continuous variables only.

[Figure: Histogram/Frequency polygon chart]

In our cricket match example, let's represent the runs scored in an over on the x-axis and the number of overs in which that many runs were scored on the y-axis. The runs on the x-axis are grouped into intervals of 2, as x-axis values in a histogram are represented in intervals, which are known as bins. So here the intervals would be 0-2, 2-4, 4-6, etc. The frequency of that many runs being scored in an over is shown on the y-axis. In the chart here you can see that the y-axis value for the 0-2 runs interval is 4, which means there were 4 overs in which the runs scored were between 0 and 2.

[Figure: Histogram/Frequency polygon chart for the cricket match example]

Similarly, you can see that the y-axis value for the 6-8 interval is 16. That means there were 16 overs in which the runs scored were between 6 and 8, and so on. In this graph we can see that most of the overs had 6-8 runs scored in them. There are a few low-scoring overs and a few high-scoring overs.
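The binning step can be sketched in Python as follows. The runs are hypothetical and do not reproduce the counts in the chart. The sketch also assumes each bin includes its lower edge but not its upper edge (so 2 runs falls in 2-4, not 0-2), which is one common convention that the article does not specify.

```python
from collections import Counter

BIN_WIDTH = 2  # interval size used in the article

# Hypothetical runs scored in each of 15 overs (illustrative data only)
runs_per_over = [1, 3, 5, 6, 7, 7, 6, 9, 4, 7, 11, 6, 0, 8, 7]

def bin_label(runs, width=BIN_WIDTH):
    """Return the (start, end) interval that a runs value falls into."""
    start = (runs // width) * width
    return (start, start + width)

# Count how many overs fall into each bin: this is the height of each bar
frequencies = Counter(bin_label(runs) for runs in runs_per_over)

# Print a simple text histogram, one row per bin
for (start, end), count in sorted(frequencies.items()):
    print(f"{start:>2}-{end:<2} | {'#' * count}")
```

With this made-up data, the 6-8 bin is the tallest, which mirrors the pattern described for the chart.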

So there is a pattern of the frequencies rising, reaching a maximum and then declining. If we draw a smooth line based on the frequencies, we will get a bell-shaped figure as shown here.

[Figure: Bell-shaped figure]

Whenever a histogram is bell shaped, that is, the frequencies slowly increase up to some point and then start decreasing, we say that the data is approximately normally distributed. We see normally distributed data in many places in nature.

For example: If you plot a histogram of the number of people in each height range, you might see that a few are rather short and a few are quite tall, but the majority fall in a specific range. The same is true for marks in an exam. Some score high and some score low, but the majority hover around a specific range. Many modelling techniques assume, or work best with, normally distributed data. If the data is not normally distributed, we can apply a mathematical transformation to make it model ready, so that it more closely follows a normal distribution.

[Figure: Normally distributed data example]


Pie Chart

Now let's move on to pie charts. As the name suggests, a pie chart is displayed as a pie cut into slices, like the pieces of a pizza. Each slice represents the share of one value in the whole. It is one of the most widely used charts when we want to display the proportions of different values or attributes.

[Figure: Pie chart]

Let's consider the cricket match example. We want to find out which player made most of the runs. Player 1 made the largest contribution of 55%, followed by players 2, 3 and 4. The rest of the players together contributed 10% of the total score.

[Figure: Pie chart for the cricket match example]
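The slice sizes of a pie chart are just percentage shares of a total. Here is a minimal Python sketch. The individual run totals are hypothetical, chosen only so that Player 1 has 55% and the remaining players 10% as in the example; the split between players 2, 3 and 4 is an assumption.

```python
# Hypothetical runs per player (illustrative numbers only)
runs_by_player = {
    "Player 1": 110,
    "Player 2": 40,
    "Player 3": 20,
    "Player 4": 10,
    "Others": 20,
}

total_runs = sum(runs_by_player.values())

# Each slice of the pie is the player's share of the total, as a percentage
shares = {name: 100 * runs / total_runs for name, runs in runs_by_player.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

Since the slices always add up to 100%, a pie chart suits data where the parts make up a meaningful whole.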


Scatter Plot

Now we will discuss scatter plots, one of the most useful graphs in data analytics. This is a two-dimensional visualization in which each data point is represented by a dot, placed at the intersection of the values of two variables represented on the x-axis and y-axis respectively. This graph is very useful for examining the relationship between two variables. Sometimes we have a line drawn through the dots; this is the best-fit line for the points.

[Figure: Scatter plot]

For example: Let us consider the match we talked about earlier. Now we have the information for both innings, and we are focusing on the dot balls per over. This is displayed using a scatter plot. From the graph we can infer that the initial overs had the maximum number of dot balls, and as the innings progressed the dot balls reduced. In the middle of the innings, we see that both teams gained some momentum and had minimal dot balls, that is, mostly scoring balls. If we draw a best-fit curve through the points, it will first decrease and then start increasing after 10 overs.

[Figure: Scatter plot for the cricket match example]
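To make the idea of a best-fit line concrete, here is a small sketch that fits a straight line by ordinary least squares. The article does not say how its line was fitted, so least squares is an assumption. The dot-ball counts are hypothetical and cover only the first five overs, where the trend is falling.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares straight line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums used by the least-squares formula
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical dot balls in overs 1 to 5 (illustrative data only)
overs = [1, 2, 3, 4, 5]
dot_balls = [5, 4, 4, 3, 2]

slope, intercept = fit_line(overs, dot_balls)
print(f"Best-fit line: dot_balls = {slope:.2f} * over + {intercept:.2f}")
```

A negative slope says that dot balls fall as the overs go by. A trend that falls and then rises, as described for the full innings, would need a curve rather than a single straight line.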


Box and Whisker Plot

Now let's look at box plots. A box plot, also known as a box and whisker plot, is a graphical representation of the statistical summary of a numeric variable. The graph is drawn as a rectangle with whiskers extending from both ends and a line dividing the rectangular block. From the graph we can obtain the following:

  • Minimum value
  • First quartile or the 25th percentile
  • Median value or the 50th percentile
  • Third quartile or the 75th percentile
  • Interquartile range
  • Maximum value
  • Outliers

[Figure: Box and whisker plot]

Now what are quartiles?

The quartiles divide the data into four equal parts. They are the first quartile or 25th percentile, the second quartile or 50th percentile (the median), and the third quartile or 75th percentile. The difference between the third and first quartiles is known as the interquartile range, which is equal to the length of the rectangular block in the graph.

Suppose a dataset consists of 100 observations and we arrange the data in ascending order. Then the first quartile will be represented by the 25th value of the dataset. The second quartile will be the 50th value in the dataset, and it will also be the median value. And the third quartile will be the 75th value of the dataset.

[Figure: Box and whisker plot quartile explanation]
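The 100-observation example can be sketched in Python as below. The data is hypothetical (simply the numbers 1 to 100 in reverse order). The sketch follows the simple "nth value" convention described above. Statistical libraries usually interpolate between neighbouring values, so Python's `statistics.quantiles` may return slightly different numbers.

```python
from statistics import quantiles

# Hypothetical dataset of 100 observations, deliberately not in order
observations = list(range(100, 0, -1))

# Step 1: arrange the data in ascending order
ordered = sorted(observations)

def nth_value(sorted_values, n):
    """Return the n-th value (counting from 1) of an already sorted list."""
    return sorted_values[n - 1]

# Step 2: pick the 25th, 50th and 75th values
first_quartile = nth_value(ordered, 25)
second_quartile = nth_value(ordered, 50)  # treated as the median in this convention
third_quartile = nth_value(ordered, 75)

# Step 3: the interquartile range is the length of the box
interquartile_range = third_quartile - first_quartile

print(first_quartile, second_quartile, third_quartile, interquartile_range)

# For comparison: the standard library interpolates, so values can differ slightly
print(quantiles(ordered, n=4))
```

The small differences between the two approaches matter little for large datasets, but they explain why different tools can report slightly different quartiles for the same data.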

Let us discuss the box plot with an example where we consider Sachin Tendulkar's scores in his first hundred One Day International matches. From the graph we can see that the maximum value is 115, his highest score in those matches, and that zero was the minimum or lowest score; both are represented by the ends of the whiskers. Now if we arrange all the hundred scores in ascending order, then the first quartile, or the 25th record, will be 8, which means that in 25 innings out of his first hundred Tendulkar scored 8 or fewer runs. The third quartile, or the 75th record, will be 52 runs, which means that in 75 innings he scored 52 or fewer runs. Another way to read this is that in 25 innings he scored more than 52 runs.

[Figure: Box and whisker plot for the Sachin Tendulkar cricket example]

The length of the rectangular block will be 44 runs, which is the difference between the third and first quartiles. From this rectangle, or the interquartile range, we figure out that in about half of his first hundred innings Tendulkar scored between 8 and 52 runs. The median score for Sachin Tendulkar, or the 50th score, is 24 runs.

Now suppose that for the 101st match the score recorded for Tendulkar is 500. This value of 500 will be considered an outlier. Outliers are plotted as individual points along the vertical axis of the box plot. As an outlier represents an extreme value among the data points, it lies above the top whisker, as shown here.
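The article does not state how a value is judged to be an outlier. One widely used convention, assumed here, is the 1.5 × IQR rule: any value more than 1.5 times the interquartile range below the first quartile or above the third quartile is flagged. A minimal sketch using the quartiles quoted above:

```python
# Quartiles quoted in the article for Tendulkar's first hundred ODI scores
FIRST_QUARTILE = 8
THIRD_QUARTILE = 52

# Interquartile range: the length of the box
iqr = THIRD_QUARTILE - FIRST_QUARTILE

# Assumed convention (not stated in the article): the 1.5 x IQR rule
lower_fence = FIRST_QUARTILE - 1.5 * iqr
upper_fence = THIRD_QUARTILE + 1.5 * iqr

def is_outlier(score):
    """Return True if the score lies outside the fences."""
    return score < lower_fence or score > upper_fence

for score in (0, 115, 500):
    print(f"Score {score}: outlier = {is_outlier(score)}")
```

Under this rule the highest score of 115 still sits inside the upper fence, while a score of 500 falls far outside it and would be drawn as a separate point above the top whisker.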


Time Series

Now we will discuss time series. This again is a line graph, in which the x-axis contains time and the y-axis can be any numerical variable which changes with time. In this example we are considering housing prices.

[Figure: Time series]


Points To Remember

So far we have learnt about various types of graphs. Let us now look at some important points we need to take into consideration while using graphs.

  • The type of graph used impacts your understanding of the data
  • We must always be careful about how we represent the data and which type of graph we use

So that is all for this article. In the next article you will get an introduction to Python.

