Whether you are just going to look at your data and draw conclusions visually, or you are going to write a computer program to manipulate the data, a data structure can help. A data structure is designed to organize data to suit a specific purpose. Typically data to be analysed with software like Excel is stored in a file format that is readable by Excel, and the data are organized in columns to fit into the columns of the spreadsheet. In what follows we will make a distinction between a “spreadsheet” and a “table” because a table in Excel has special properties.
When analyzing data, it is useful to decide on the data that needs to be processed, and then organize that data into a data structure. A well designed data structure can make the operations that need to be done on the data very efficient.
We can put just about anything into any cell and Excel won’t complain about it.
In order to do analysis on data, we need to have the same type of data in either the rows or columns. The data that we want to do operations on should be grouped together so we can perform meaningful mathematical operations. Many operations will be done on entire columns of data.
In cells B5, C5 and D5 we have counted the number of cells in those columns. We know that we have 3 items in column B, 3 in C and 3 in D. However, we could change the total in column C to be a sum, i.e., we are going to add the numbers in that column:
There is a limit to the number of operations a person can do based on the data types in a column. For example in D2 we have the number 1.33 while in D3 we have a structure containing 3 numbers and D4 has text of some sort – it doesn’t make sense to add them. In this case we can count the number of cells that have stuff in them but that is about all.
In column C we just have numbers so we can add them, find the average, find minimum and maximum values … we can do a variety of different types of calculations:
At this point we would need more information about the data before we can do any meaningful analysis.
This is just a simple spread-sheet but Excel does have a data structure called a table which allows us to do additional things with a data set easily. So let’s turn this small example into a table structure and explore the offerings.
This is a drop-down list which allows for both sorting and filtering options. Sorting on a column results in all rows being sorted appropriately. This simplifies sorting compared to when the data isn’t in a table format.
We can see that Excel recognizes that the data in the column is text, and offers to build Text Filter rules.
Excel recognizes the data in the column as numbers and offers Number Filters.
Arranging data in meaningful groups, checking the data, annotating the data and working with table structures allows for quick and simple data exploration.
Why do we care about tables in Excel? They are a nice example of a simple data structure that enables us to do some operations on data easily.
In the world of analytics:
- Users of the R language place tables into a “data frame” structure
- Users of the Python Pandas module use “pandas.DataFrame’s”.
- Both of these structures hold the contents of tables from Excel or databases.
Much of business data comes in tables and once we understand the structure and its limitations we can focus on the analysis in any of a number of tools to achieve the desired results.