What is Data?
Data refers to any recorded piece of information. For example, yesterday's recorded temperature is a form of data. The current day of the week is also data. Likewise, the text you're reading at this moment is data, as is the following image of three coffee beans!
These four examples cover the four forms of data we usually work with in data science: numeric, label, text and image. Each form is handled differently and calls for its own techniques:
- Numeric: This form includes recorded or observed numerical values like 1, 2, 3, or 4.5. Numeric data is typically stored in a tabular format; we'll look at some very shortly in a table below.
- Label: These often consist of short text, and are often a kind of classification. Examples include "Sunday", "Monday", "white bird", "black cat", "blue", "green", "setosa", or "versicolor".
- Text: By text here we usually mean long text, like the paragraphs on this page, which may be converted into a numerical representation for analysis. This conversion relies on techniques from Natural Language Processing, a field that draws heavily on machine learning.
- Image: Images, such as the displayed coffee beans, can also be converted into a numerical representation. This transformation relies on techniques from Computer Vision, another field that draws heavily on machine learning.
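As a rough illustration, here is how one value of each form might be held in plain Python. All of the values are made up for this example, and the tiny 2x2 grid of brightness values only stands in for a real image.

```python
# Hypothetical example values, one for each form of data.
temperature = 21.5      # numeric: a recorded measurement
day_of_week = "Monday"  # label: a short category name
paragraph = "Data refers to any recorded piece of information."  # text: longer free-form writing

# image: a tiny 2x2 grayscale picture stored as rows of pixel brightness values (0-255)
image = [
    [0, 128],
    [255, 64],
]

# Show which Python type holds each form.
examples = [
    ("numeric", temperature),
    ("label", day_of_week),
    ("text", paragraph),
    ("image", image),
]
for name, value in examples:
    print(name, type(value).__name__)
```

Note that the image is already made of numbers, while the label and the text would need a conversion step before most numerical analysis.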
Let's dive in and examine our first real-world dataset. Structured in a tabular format, it contains both numerical and categorical data. It is called the iris dataset, and it was published by the statistician and biologist Ronald Fisher in 1936, using measurements collected by the botanist Edgar Anderson. The dataset is popular in data science education because it is well structured and supports clear inferences. We will keep returning to it throughout this course.
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
7 | 3.2 | 4.7 | 1.4 | versicolor |
6.3 | 3.3 | 6 | 2.5 | virginica |
Five variables were recorded for each flower and arranged in a tabular format, with columns representing the variables and rows containing the corresponding observations. In this instance, we have five columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The table shows 3 observations. The columns, placed side by side, are often referred to as variables or, alternatively, data items.
Taking a closer look at the table, the first observation reveals that the value for Sepal.Length is 5.1, and the value for Species is setosa.
Interestingly, these two values exhibit different characteristics. 5.1 is a numerical value, while setosa is a label.
Dataframes
We'll come back to the unfamiliar terms used in the following code snippet at a later stage. For now, let's focus on how we can create our own data. We can accomplish this using pd.DataFrame from the Pandas library in Python. We will get into packages in a future lesson; for now, just know that a package makes more functionality available for us to use, such as this way of creating a dataframe. We'll store this dataframe in a variable, which we'll name df. You are encouraged to execute the following code and try it out for yourself!
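The snippet itself is not reproduced in this text, so what follows is a minimal reconstruction based on the line-by-line description that comes next. It assumes the conventional import pandas as pd alias; the three lines being described are the ones after the import.

```python
import pandas as pd  # assumed alias, so that the name pd.DataFrame is available

data = {"a": [1, 2], "b": [3, 4]}
df = pd.DataFrame(data)
print(df)
```

Running it should print a small table with the column names a and b across the top and two rows of numbers underneath.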
Let's go through each line of code. In the first line, a new dictionary object that we name data is created. Let's cover what a dictionary is briefly now, and again in a future lesson. A dictionary is a key-value mapping: it associates each key with a value. In our example, there are two keys, named a and b. Each key has a list of numbers as its value: 1 and 2 for a, and 3 and 4 for b.
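To get a feel for the key-value idea, the short sketch below builds the same dictionary and looks things up in it. It uses only plain Python, so no packages are needed.

```python
# A dictionary maps each key to a value; here each value is a list of numbers.
data = {"a": [1, 2], "b": [3, 4]}

print(list(data.keys()))  # the keys: ['a', 'b']
print(data["a"])          # the value stored under key "a": [1, 2]
print(data["b"][0])       # the first number in the list stored under "b": 3
```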
The second line creates an object of the DataFrame class. This class is found within the Pandas package. When it is instantiated, it can take a dictionary as an argument, using the keys as column names and the values in each list as the entries of that column, one per row. The result is then saved into the variable df.
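One way to check how the dictionary was turned into columns and rows is to inspect the resulting dataframe. The sketch below assumes the same import pandas as pd alias and uses the columns and shape attributes of a dataframe.

```python
import pandas as pd

data = {"a": [1, 2], "b": [3, 4]}
df = pd.DataFrame(data)

print(list(df.columns))  # column names come from the dictionary keys: ['a', 'b']
print(df.shape)          # (number of rows, number of columns): (2, 2)
print(df["a"].tolist())  # entries of column a come from the list under key "a": [1, 2]
```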
The third line prints out the contents of the dataframe df using the print function, which comes ready to use with every Python installation. It will show the column names in the dataframe, followed by rows of data, each row representing a separate observation.
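The same approach can rebuild the three iris rows shown earlier. This is only a sketch: the variable name iris_sample is made up for the example, and is_numeric_dtype is a Pandas helper used here to tell numeric columns apart from label columns.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# The three observations from the iris table, one dictionary key per column.
iris_sample = pd.DataFrame({
    "Sepal.Length": [5.1, 7.0, 6.3],
    "Sepal.Width": [3.5, 3.2, 3.3],
    "Petal.Length": [1.4, 4.7, 6.0],
    "Petal.Width": [0.2, 1.4, 2.5],
    "Species": ["setosa", "versicolor", "virginica"],
})

print(iris_sample)

# Report which columns hold numeric values and which hold labels.
for column in iris_sample.columns:
    kind = "numeric" if is_numeric_dtype(iris_sample[column]) else "label"
    print(column, kind)
```

The four measurement columns should be reported as numeric and Species as a label, matching the distinction drawn between 5.1 and setosa.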