What is Data?
Data refers to any recorded piece of information. For example, yesterday's recorded temperature is a form of data. The current day of the week is also data. Likewise, the text you're reading at this moment is data, as is the following image of three coffee beans!
These four examples cover the four forms of data we usually work with in data science: numeric, label, text and image. Each form calls for its own techniques:
- Numeric: This form includes recorded or observed numerical values like 1, 2, 3, or 4.5. Numeric data is typically stored in a tabular format; we'll look at some in a table shortly.
- Label: Labels usually consist of short text and often represent a category. Examples include "Sunday", "Monday", "white bird", "black cat", "blue", "green", "setosa", or "versicolor".
- Text: By text here we usually mean long text, like the paragraphs on this page, which may be converted into a numerical representation for analysis. This conversion draws on techniques from the field of machine learning known as Natural Language Processing.
- Image: Images, such as the displayed coffee beans, can also be converted into a numerical representation. This transformation draws on techniques from another field of machine learning known as Computer Vision.
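As a rough illustration, the four forms might be represented in R as follows. This is only a sketch: the variable names and values are made up for the example, and the tiny matrix merely stands in for an image's pixel intensities.

```r
# Numeric: a recorded temperature (illustrative value)
temperature <- 21.5

# Label: a short category, stored as a factor
day <- factor("Monday", levels = c("Sunday", "Monday"))

# Text: a longer piece of free text, stored as a character string
sentence <- "The text you're reading at this moment is data."

# Image: a hypothetical 2 x 2 grayscale image as a matrix of pixel intensities
pixels <- matrix(c(0, 0.5, 0.5, 1), nrow = 2)

class(temperature)  # "numeric"
class(day)          # "factor"
class(sentence)     # "character"
dim(pixels)         # 2 2
```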
Let's dive in and examine our first real-world dataset. It is structured in a tabular format and comprises both numerical and categorical data. The measurements were collected by the botanist Edgar Anderson and published in 1936 by the statistician and biologist Ronald Fisher. Known as the iris dataset, it is popular in data science education because of its well-structured format and the clarity it provides in making inferences. We will stay with this dataset as we explore further throughout this course.
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
7 | 3.2 | 4.7 | 1.4 | versicolor |
6.3 | 3.3 | 6 | 2.5 | virginica |
Five variables were recorded for each flower and arranged in a tabular format, with columns representing the variables and rows containing the corresponding observations. In this instance, we have five columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species) and three observations. The columns, placed side by side, are often referred to as variables or, alternatively, data items.
Taking a closer look at the table, the first observation reveals that the value for Sepal.Length is 5.1, and the value for Species is setosa.
Interestingly, these two values exhibit different characteristics. 5.1 is a numerical value, while setosa is a label.
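R ships with a copy of the iris dataset as a built-in dataframe named iris, so this distinction can be checked directly. The sketch below assumes the three rows shown in the table correspond to rows 1, 51, and 101 of that built-in copy; the name iris_sample is made up for the example.

```r
# Pick out the three observations shown in the table
# (assumed to be rows 1, 51, and 101 of the built-in iris dataframe)
iris_sample <- iris[c(1, 51, 101), ]
print(iris_sample)

# Sepal.Length holds numbers, while Species holds labels (a factor in R)
class(iris_sample$Sepal.Length)  # "numeric"
class(iris_sample$Species)       # "factor"
```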
Dataframes
We'll delve into the unfamiliar terms used in the following code snippet at a later stage. For now, let's focus on how we can create our own data. We can accomplish this using the data.frame function in R, which allows us to construct a dataframe. We'll store this dataframe in a variable, which we'll name df. You are encouraged to execute the following code and try it out for yourself!
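A minimal version of that code, assuming the two-line form that the walkthrough below describes (columns a and b, holding the values 1, 2 and 3, 4):

```r
# Create a dataframe with two columns, a and b, and store it in df
df <- data.frame(a = c(1, 2), b = c(3, 4))
# Display the contents of the dataframe
print(df)
```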
Let's go through each line of code in the above example. In the first line, a new dataframe object named df is created. The dataframe is constructed using the data.frame function, which comes pre-loaded in every R session. The a column of the dataframe is populated with the values 1 and 2, specified using the c function (another function built into R; the "c" stands for combine). The b column is populated with the values 3 and 4, also specified using the c function.

The second line prints out the contents of the dataframe df, using the print function to display the dataframe object. The output shows the values in the a and b columns, with each row representing a separate observation.
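When run, the two lines should print something like the small table shown in the comments below. The sketch also uses a few built-in helpers (nrow, ncol, names, and the $ operator) to inspect the result; these go beyond the original two-line example and are included only as an illustration.

```r
df <- data.frame(a = c(1, 2), b = c(3, 4))
print(df)
#   a b
# 1 1 3
# 2 2 4

nrow(df)   # 2, the number of observations (rows)
ncol(df)   # 2, the number of variables (columns)
names(df)  # "a" "b"
df$a       # 1 2, the values in column a
```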