What is Data?

Data refers to any recorded piece of information. For example, yesterday's recorded temperature is a form of data. The current day of the week is also a piece of information. Likewise, the text you're reading at this moment is data, as is the following image of three coffee beans!

[Image: three coffee beans]

These four examples cover the four forms of data we usually work with in data science: numeric, label, text and image. Each form calls for its own techniques:

  • Numeric: Recorded or observed numerical values like 1, 2, 3, or 4.5. Numeric data is typically stored in a tabular format, and we'll look at some in a table very shortly.
  • Label: Short pieces of text that usually classify something. Examples include "Sunday", "Monday", "white bird", "black cat", "blue", "green", "setosa", or "versicolor".
  • Text: By text here we usually mean long text, like the paragraphs on this page. Text can be converted into a numerical representation for analysis using techniques from the field of Natural Language Processing.
  • Image: Images, such as the coffee beans displayed above, can also be converted into a numerical representation. This is done using techniques from another field, Computer Vision.

Let's dive in and examine our first real-world dataset. Structured in a tabular format, it contains both numerical and categorical (label) data. It is called the iris dataset: the measurements were collected by the botanist Edgar Anderson and published by the statistician and biologist Ronald Fisher in 1936. The dataset is popular in data science education because it is well structured and makes it easy to draw clear inferences. We will keep returning to it throughout this course.

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          setosa
7             3.2          4.7           1.4          versicolor
6.3           3.3          6             2.5          virginica

Five variables were recorded for each flower and arranged in a tabular format, with columns representing the variables and rows containing the corresponding observations. In this instance, we have five distinct columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The table shown here contains 3 observations, one per row. The columns are often referred to as variables or, alternatively, data items.

Taking a closer look at the table, the first observation reveals that the value for Sepal.Length is 5.1, and the value for Species is setosa.

Interestingly, these two values exhibit different characteristics. 5.1 is a numerical value, while setosa is a label.
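
For readers who already have R at hand, the same distinction can be checked directly. The sketch below assumes the copy of the iris dataset that ships with R under the name iris, and assumes that rows 1, 51 and 101 of that copy are the three flowers shown in the table; the name three_flowers is only an illustrative choice.

```r
# Assumption: the built-in `iris` dataset that ships with R is used here.
# Assumption: rows 1, 51 and 101 correspond to the three flowers in the table.
three_flowers <- iris[c(1, 51, 101), ]

# Display the three selected observations
print(three_flowers)

# A measurement column holds numeric values
class(three_flowers$Sepal.Length)   # "numeric"

# The Species column holds labels, which R stores as a "factor"
class(three_flowers$Species)        # "factor"
```

R calls a label column a factor, which is its built-in type for categorical data.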

Dataframes

We'll delve into the unfamiliar terms used in the following code snippet at a later stage. For now, let's focus on how we can create our own data. We can accomplish this using the data.frame function in R, which allows us to construct a dataframe. We'll store this dataframe in a variable, which we'll name df. You are encouraged to execute the following code and try it out for yourself!
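
The snippet can be reconstructed from the line-by-line walkthrough in this section. A minimal version, assuming the usual <- assignment operator (the original may equally have used =), might look like this:

```r
# Create a dataframe with two columns, a and b, and store it in df
df <- data.frame(a = c(1, 2), b = c(3, 4))

# Display the contents of the dataframe
print(df)
#   a b
# 1 1 3
# 2 2 4
```

The numbers 1 and 2 at the start of each printed row are row labels that R adds automatically; they are not part of the data.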

Let's go through each line of code in the example. In the first line, a new dataframe object named df is created. The dataframe is constructed using the data.frame function, which comes pre-loaded in every R session. The a column of the dataframe is populated with the values 1 and 2, specified using the c function (another function built into R; the "c" stands for combine). The b column of the dataframe is populated with the values 3 and 4, also specified using the c function.

The second line prints out the contents of the dataframe df. The print function is used to display the dataframe object. The output will show the values in the a and b columns of the dataframe, with each row representing a separate observation.
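
As a small follow-up sketch, a few other built-in R functions can confirm the shape of the dataframe described here. The dataframe is recreated first so the example runs on its own; nrow, ncol, names and the $ operator are all part of base R.

```r
# Recreate the dataframe from the walkthrough
df <- data.frame(a = c(1, 2), b = c(3, 4))

nrow(df)    # number of observations (rows): 2
ncol(df)    # number of variables (columns): 2
names(df)   # column names: "a" "b"
df$a        # values stored in column a: 1 2
```

Each column can be pulled out on its own with $, which is handy once a dataframe has more variables than fit comfortably on screen.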