Pandas Fundamentals
Overview¶
Now that we have a basic idea how to visualize some data, we can turn to representing the data itself. In our previous episode, we used lists to store our temperature and the times of day. This worked fine for our very limited scope, but once we have to deal with larger data sets, this will quickly become unwieldy.
This is where pandas comes into play.
In this section we will introduce the fundamental data types pandas uses and look at some of their capabilities.
A Series in Time¶
For the beginning we will start with a new script series.py
and carry over the data we used in the previous episode.
from datetime import datetime, timedelta
start_time = datetime(year=2020, month=12, day=31, hour=0)
air_temperature = [
5.0, 5.6, 5.6, 5.6, 5.6, 6.1, 6.7, 7.2,
7.8, 8.3, 8.3, 8.9, 7.8, 7.2, 7.2, 7.2,
6.7, 6.7, 6.7, 6.7, 6.1, 5.6, 5.0, 5.0
]
hours = [
start_time + timedelta(hours=offset)
for offset in range(len(air_temperature))
]
To make use of the functionality that pandas offers we should turn these lists into a time series.
For this we import the Series
data type from the framework and insert the data into it.
We also add the hours as the index to the data, instead of keeping them in a separate variable.
from datetime import datetime, timedelta
from pandas import Series
start_time = datetime(year=2020, month=12, day=31, hour=0)
air_temperature = [
5.0, 5.6, 5.6, 5.6, 5.6, 6.1, 6.7, 7.2,
7.8, 8.3, 8.3, 8.9, 7.8, 7.2, 7.2, 7.2,
6.7, 6.7, 6.7, 6.7, 6.1, 5.6, 5.0, 5.0
]
hours = [
start_time + timedelta(hours=offset)
for offset in range(len(air_temperature))
]
measurements = Series(
name="Air Temperature [°C]",
data=air_temperature,
index=hours
)
print(measurements)
Output
2020-12-31 00:00:00 5.0
2020-12-31 01:00:00 5.6
2020-12-31 02:00:00 5.6
2020-12-31 03:00:00 5.6
2020-12-31 04:00:00 5.6
2020-12-31 05:00:00 6.1
2020-12-31 06:00:00 6.7
2020-12-31 07:00:00 7.2
2020-12-31 08:00:00 7.8
2020-12-31 09:00:00 8.3
2020-12-31 10:00:00 8.3
2020-12-31 11:00:00 8.9
2020-12-31 12:00:00 7.8
2020-12-31 13:00:00 7.2
2020-12-31 14:00:00 7.2
2020-12-31 15:00:00 7.2
2020-12-31 16:00:00 6.7
2020-12-31 17:00:00 6.7
2020-12-31 18:00:00 6.7
2020-12-31 19:00:00 6.7
2020-12-31 20:00:00 6.1
2020-12-31 21:00:00 5.6
2020-12-31 22:00:00 5.0
2020-12-31 23:00:00 5.0
Name: Air Temperature [°C], dtype: float64
Note that the Series
automatically picked up that the index is a date-time
and that the underlying data type for our values is a 64-bit floating point format.
The new data type on its own already offers some functionality that can be very useful for some quick data processing tasks.
We can for example do some quick statistics or transform the values from Celsius to Fahrenheit in one sweep.
Example 1: Quick Statistics¶
from datetime import datetime, timedelta
from pandas import Series
start_time = datetime(year=2020, month=12, day=31, hour=0)
air_temperature = [
5.0, 5.6, 5.6, 5.6, 5.6, 6.1, 6.7, 7.2,
7.8, 8.3, 8.3, 8.9, 7.8, 7.2, 7.2, 7.2,
6.7, 6.7, 6.7, 6.7, 6.1, 5.6, 5.0, 5.0
]
hours = [
start_time + timedelta(hours=offset)
for offset in range(len(air_temperature))
]
measurements = Series(
name="Air Temperature [°C]",
data=air_temperature,
index=hours
)
# Example 1: Quick statistics
print("=== Statistics ===")
print(measurements.describe())
Output
Example 2: Converting all data at once¶
from datetime import datetime, timedelta
from pandas import Series
start_time = datetime(year=2020, month=12, day=31, hour=0)
air_temperature = [
5.0, 5.6, 5.6, 5.6, 5.6, 6.1, 6.7, 7.2,
7.8, 8.3, 8.3, 8.9, 7.8, 7.2, 7.2, 7.2,
6.7, 6.7, 6.7, 6.7, 6.1, 5.6, 5.0, 5.0
]
hours = [
start_time + timedelta(hours=offset)
for offset in range(len(air_temperature))
]
measurements = Series(
name="Air Temperature [°C]",
data=air_temperature,
index=hours
)
# Example 2: Modifying the whole Series at once
measurements_farenheit = measurements * 1.8 + 32
measurements_farenheit.name = "Air Temperature [°F]"
print("=== Measurements in °F ===")
print(measurements_farenheit)
Output
=== Measurements in °F ===
2020-12-31 00:00:00 41.00
2020-12-31 01:00:00 42.08
2020-12-31 02:00:00 42.08
2020-12-31 03:00:00 42.08
2020-12-31 04:00:00 42.08
2020-12-31 05:00:00 42.98
2020-12-31 06:00:00 44.06
2020-12-31 07:00:00 44.96
2020-12-31 08:00:00 46.04
2020-12-31 09:00:00 46.94
2020-12-31 10:00:00 46.94
2020-12-31 11:00:00 48.02
2020-12-31 12:00:00 46.04
2020-12-31 13:00:00 44.96
2020-12-31 14:00:00 44.96
2020-12-31 15:00:00 44.96
2020-12-31 16:00:00 44.06
2020-12-31 17:00:00 44.06
2020-12-31 18:00:00 44.06
2020-12-31 19:00:00 44.06
2020-12-31 20:00:00 42.98
2020-12-31 21:00:00 42.08
2020-12-31 22:00:00 41.00
2020-12-31 23:00:00 41.00
Name: Air Temperature [°F], dtype: float64
Accessing Data¶
From Pythons’ lists, tuples, set or dictionaries you are likely familiar with the indexing operator […]
.
We can use the same operator to access elements or slices of our series.
Not all indexes are numbers
If not specified otherwise, pandas will automatically index the data with 0-based concurrent integer values.
In our case we instead used datetime
-instances to index the data.
This makes much sense considering that we usually would like to ask:
“What was the air temperature (in this place) at a given date and time?”
Consequently, when accessing the data, we have to provide datetime
-objects as the index.
Following the conventions laid out in the previous lesson, we consider the time from 7:00 to 17:00 as the daytime. Let us print only the section for the daytime, as well as the temperature at noon.
from datetime import datetime, timedelta
from pandas import Series
start_time = datetime(year=2020, month=12, day=31, hour=0)
air_temperature = [
5.0, 5.6, 5.6, 5.6, 5.6, 6.1, 6.7, 7.2,
7.8, 8.3, 8.3, 8.9, 7.8, 7.2, 7.2, 7.2,
6.7, 6.7, 6.7, 6.7, 6.1, 5.6, 5.0, 5.0
]
hours = [
start_time + timedelta(hours=offset)
for offset in range(len(air_temperature))
]
measurements = Series(
name="Air Temperature [°C]",
data=air_temperature,
index=hours
)
day_start = datetime(2020, 12, 31, hour=7)
noon = datetime(2020, 12, 31, hour=12)
day_end = datetime(2020, 12, 31, hour=17)
print("Temperature at noon:", measurements[noon])
print("Daytime temperatures:")
print(measurements[day_start:day_end])
Output
Temperature at noon: 7.8
Daytime temperatures:
2020-12-31 07:00:00 7.2
2020-12-31 08:00:00 7.8
2020-12-31 09:00:00 8.3
2020-12-31 10:00:00 8.3
2020-12-31 11:00:00 8.9
2020-12-31 12:00:00 7.8
2020-12-31 13:00:00 7.2
2020-12-31 14:00:00 7.2
2020-12-31 15:00:00 7.2
2020-12-31 16:00:00 6.7
2020-12-31 17:00:00 6.7
Name: Air Temperature [°C], dtype: float64
You may note that when accessing a singular index you receive the direct value for the given entry. If you provide a slice however, the result will be a series again, reduced to the section you specified.
Data Frames¶
Usually, we do not only have a singular data value for each point in time, but are measuring or observing multiple quantities.
Let’s create a new script data_frame.py
and set up a second series of measurements besides our original one to work with.
We add the dew-point temperature (in °C) for the same time frame as well.
With multiple measurements over the same time, combining the data into a table-like structure can be a sensible choice.
For this purpose, pandas offers the DataFrame
.
from datetime import datetime, timedelta
from pandas import DataFrame
start_time = datetime(year=2020, month=12, day=31, hour=0)
air_temperature = [
5.0, 5.6, 5.6, 5.6, 5.6, 6.1, 6.7, 7.2,
7.8, 8.3, 8.3, 8.9, 7.8, 7.2, 7.2, 7.2,
6.7, 6.7, 6.7, 6.7, 6.1, 5.6, 5.0, 5.0
]
dew_point = [
-4.4, -3.3, -2.2, -2.8, -2.2, -1.7, -1.1, 0.0,
1.1, 3.3, 5.6, 6.1, 6.7, 6.7, 5.6, 4.4,
5.0, 3.3, 0.6, 1.1, -1.1, -2.2, -2.8, -2.8
]
hours = [
start_time + timedelta(hours=offset)
for offset in range(len(air_temperature))
]
LABEL_TEMPERATURE = "Air Temperature [°C]"
LABEL_DEW_POINT = "Dew Point [°C]"
measurements = DataFrame(
data={
LABEL_TEMPERATURE: air_temperature,
LABEL_DEW_POINT: dew_point
},
index=hours
)
print(measurements)
Output
Air temperature [°C] Dew Point [°C]
2020-12-31 00:00:00 5.0 -4.4
2020-12-31 01:00:00 5.6 -3.3
2020-12-31 02:00:00 5.6 -2.2
2020-12-31 03:00:00 5.6 -2.8
2020-12-31 04:00:00 5.6 -2.2
2020-12-31 05:00:00 6.1 -1.7
2020-12-31 06:00:00 6.7 -1.1
2020-12-31 07:00:00 7.2 0.0
2020-12-31 08:00:00 7.8 1.1
2020-12-31 09:00:00 8.3 3.3
2020-12-31 10:00:00 8.3 5.6
2020-12-31 11:00:00 8.9 6.1
2020-12-31 12:00:00 7.8 6.7
2020-12-31 13:00:00 7.2 6.7
2020-12-31 14:00:00 7.2 5.6
2020-12-31 15:00:00 7.2 4.4
2020-12-31 16:00:00 6.7 5.0
2020-12-31 17:00:00 6.7 3.3
2020-12-31 18:00:00 6.7 0.6
2020-12-31 19:00:00 6.7 1.1
2020-12-31 20:00:00 6.1 -1.1
2020-12-31 21:00:00 5.6 -2.2
2020-12-31 22:00:00 5.0 -2.8
2020-12-31 23:00:00 5.0 -2.8
In the above example we provide the data as a dictionary, associating labels with the actual data points. These labels end up as the column names in the data frame. We assign the actual texts for these labels to constants so we have one central point where we can re-define or change them if necessary. Also, development tools that provide auto-completion can pick up on the definition and offer it as a completion option down the line, reducing the need for typing and chance for hard to spot typing errors.
Basically all features available for the Series
type also work on DataFrame
s.
Accessing data¶
Accessing the data is slightly different, since besides the index we also need to specify the column range to be accessed. This results in multiple ways of handling the data access.
The regular index-operator […]
expects a column label and will return the respective column as a series.
from datetime import datetime, timedelta
from pandas import DataFrame
start_time = datetime(year=2020, month=12, day=31, hour=0)
air_temperature = [
5.0, 5.6, 5.6, 5.6, 5.6, 6.1, 6.7, 7.2,
7.8, 8.3, 8.3, 8.9, 7.8, 7.2, 7.2, 7.2,
6.7, 6.7, 6.7, 6.7, 6.1, 5.6, 5.0, 5.0
]
dew_point = [
-4.4, -3.3, -2.2, -2.8, -2.2, -1.7, -1.1, 0.0,
1.1, 3.3, 5.6, 6.1, 6.7, 6.7, 5.6, 4.4,
5.0, 3.3, 0.6, 1.1, -1.1, -2.2, -2.8, -2.8
]
hours = [
start_time + timedelta(hours=offset)
for offset in range(len(air_temperature))
]
LABEL_TEMPERATURE = "Air Temperature [°C]"
LABEL_DEW_POINT = "Dew Point [°C]"
measurements = DataFrame(
data={
LABEL_TEMPERATURE: air_temperature,
LABEL_DEW_POINT: dew_point
},
index=hours
)
print(measurements[LABEL_DEW_POINT])
Output
2020-12-31 00:00:00 -4.4
2020-12-31 01:00:00 -3.3
2020-12-31 02:00:00 -2.2
2020-12-31 03:00:00 -2.8
2020-12-31 04:00:00 -2.2
2020-12-31 05:00:00 -1.7
2020-12-31 06:00:00 -1.1
2020-12-31 07:00:00 0.0
2020-12-31 08:00:00 1.1
2020-12-31 09:00:00 3.3
2020-12-31 10:00:00 5.6
2020-12-31 11:00:00 6.1
2020-12-31 12:00:00 6.7
2020-12-31 13:00:00 6.7
2020-12-31 14:00:00 5.6
2020-12-31 15:00:00 4.4
2020-12-31 16:00:00 5.0
2020-12-31 17:00:00 3.3
2020-12-31 18:00:00 0.6
2020-12-31 19:00:00 1.1
2020-12-31 20:00:00 -1.1
2020-12-31 21:00:00 -2.2
2020-12-31 22:00:00 -2.8
2020-12-31 23:00:00 -2.8
Name: Dew Point [°C], dtype: float64
Accessing multiple columns at once
To access a selection of columns, we pass in a list of column names in the desired order, which is often done in larger data sets.
Here the […]
to denote a list are nested inside the […]
to denote an index, which might lead to the wrongful impression
that the resulting [[…]]
is a special thing on its own. It is not.
For clarity, some people prefer to write it as [ […] ]
instead.
The property loc
gives label-based access to the elements of a data frame.
It follows the pattern dataframe_name.loc[index_slice, column_slice]
.
For example:
from datetime import datetime, timedelta
from pandas import DataFrame
start_time = datetime(year=2020, month=12, day=31, hour=0)
air_temperature = [
5.0, 5.6, 5.6, 5.6, 5.6, 6.1, 6.7, 7.2,
7.8, 8.3, 8.3, 8.9, 7.8, 7.2, 7.2, 7.2,
6.7, 6.7, 6.7, 6.7, 6.1, 5.6, 5.0, 5.0
]
dew_point = [
-4.4, -3.3, -2.2, -2.8, -2.2, -1.7, -1.1, 0.0,
1.1, 3.3, 5.6, 6.1, 6.7, 6.7, 5.6, 4.4,
5.0, 3.3, 0.6, 1.1, -1.1, -2.2, -2.8, -2.8
]
hours = [
start_time + timedelta(hours=offset)
for offset in range(len(air_temperature))
]
LABEL_TEMPERATURE = "Air Temperature [°C]"
LABEL_DEW_POINT = "Dew Point [°C]"
measurements = DataFrame(
data={
LABEL_TEMPERATURE: air_temperature,
LABEL_DEW_POINT: dew_point
},
index=hours
)
day_start = datetime(2020, 12, 31, 7)
day_end = datetime(2020, 12, 31, 17)
day_temperature = measurements.loc[day_start:day_end, LABEL_TEMPERATURE]
Output
2020-12-31 07:00:00 7.2
2020-12-31 08:00:00 7.8
2020-12-31 09:00:00 8.3
2020-12-31 10:00:00 8.3
2020-12-31 11:00:00 8.9
2020-12-31 12:00:00 7.8
2020-12-31 13:00:00 7.2
2020-12-31 14:00:00 7.2
2020-12-31 15:00:00 7.2
2020-12-31 16:00:00 6.7
2020-12-31 17:00:00 6.7
Name: Air Temperature [°C], dtype: float64
The iloc
attribute works similar to loc
, except that it takes integer-based indexes instead of index/column labels:
from datetime import datetime, timedelta
from pandas import DataFrame
start_time = datetime(year=2020, month=12, day=31, hour=0)
air_temperature = [
5.0, 5.6, 5.6, 5.6, 5.6, 6.1, 6.7, 7.2,
7.8, 8.3, 8.3, 8.9, 7.8, 7.2, 7.2, 7.2,
6.7, 6.7, 6.7, 6.7, 6.1, 5.6, 5.0, 5.0
]
dew_point = [
-4.4, -3.3, -2.2, -2.8, -2.2, -1.7, -1.1, 0.0,
1.1, 3.3, 5.6, 6.1, 6.7, 6.7, 5.6, 4.4,
5.0, 3.3, 0.6, 1.1, -1.1, -2.2, -2.8, -2.8
]
hours = [
start_time + timedelta(hours=offset)
for offset in range(len(air_temperature))
]
LABEL_TEMPERATURE = "Air Temperature [°C]"
LABEL_DEW_POINT = "Dew Point [°C]"
measurements = DataFrame(
data={
LABEL_TEMPERATURE: air_temperature,
LABEL_DEW_POINT: dew_point
},
index=hours
)
day_start_index = 7
day_end_index = 18
day_temperature = measurements.iloc[day_start_index:day_end_index, 0]
Output
2020-12-31 07:00:00 7.2
2020-12-31 08:00:00 7.8
2020-12-31 09:00:00 8.3
2020-12-31 10:00:00 8.3
2020-12-31 11:00:00 8.9
2020-12-31 12:00:00 7.8
2020-12-31 13:00:00 7.2
2020-12-31 14:00:00 7.2
2020-12-31 15:00:00 7.2
2020-12-31 16:00:00 6.7
2020-12-31 17:00:00 6.7
Name: Air Temperature [°C], dtype: float64
Wrap up of Fundamentals¶
At this point we have seen the basic working parts of pandas and matplotlib. Of course there is a lot more to learn which we will pick up along the way when we work on a real-life example. Before we continue, it is a good idea to revisit what we learned so far and try our hands at some exercises.
Key points¶
- The pandas-framework offers data types to represent data in a structured way:
- The
Series
is a named sequence of data - The
DataFrame
is a table-like structure
- The