Loading Data

Data Files¶

In most cases the information to be processed is not written directly into the code. Instead, the data is often placed inside a file or database and loaded at the beginning of the program. You do not have to write the whole code for reading the file yourself, pandas already offers various functions to deal with many file formats.

The documentation provides an overview of the supported file types.

Reading the Data¶

We will start by reading a small file to illustrate the basic principle. You can download the example file water_levels.csv, so you do not have to search for suitable data on your own.

About the Data

In this example we will be taking a look at the water level of the Elbe river, measured in the city of Dresden, Germany. The data is presented a file containing comma seperated values (CSV), with the time and date as the first column and the water level in centimeters as the second column.

Note that the file itself does not contain any column headers.

Disclaimer: Data Source and Preprocessing

The data was taken from the water level information system of the German federal government. The original data was presented in a block-wise linear fashion. Some additional information has been cut and the data has been arranged in a column-wise fashion together with the corresponding ISO-8601 timestamps.

To read the data, we will employ pandas’ read_csv(…)-function. In the following code section we load the example file.

water_levels.py

from pandas import read_csv
from pathlib import Path

# Path of the data file current working directory
# Adapt this depending on where you saved the file
data_file = Path("water_levels.csv")

# Optional: Print file path so we can check that if the file isn't found
print("Try to load file:", data_file.absolute())

water_level_data = read_csv(data_file, header=None)

print(water_level_data)

Output

Try to load file: […]/water_levels.csv
                     0    1
0     2024-05-04 00:15  100
1     2024-05-04 00:30  100
2     2024-05-04 00:45  100
3     2024-05-04 01:00  100
4     2024-05-04 01:15  100
...                ...  ...
2875  2024-06-03 23:00  267
2876  2024-06-03 23:15  267
2877  2024-06-03 23:30  267
2878  2024-06-03 23:45  266
2879  2024-06-03 24:00  266

[2880 rows x 2 columns]

When inspecting the output we notice that we got a data frame with 2880 rows and 2 columns. Each of them has been labeled with a numeric index. It would be sensible to label the columns with words or phrases instead.

water_levels.py

from pandas import read_csv
from pathlib import Path

# Path of the data file current working directory
# Adapt this depending on where you saved the file
data_file = Path("water_levels.csv")

# Optional: Print file path so we can check that if the file isn't found
print("Try to load file:", data_file.absolute())

water_level_data = read_csv(data_file, header=None)

# Add the column headers
# Use constants so they can be referred to and adapted easily.
LABEL_DATETIME = "Date & Time"
LABEL_WATER_LEVEL = "Water level [cm]"
water_level_data.columns = [LABEL_DATETIME, LABEL_WATER_LEVEL]

print(water_level_data)

Output

Try to load file: […]/water_levels.csv
           Date & Time  Water level [cm]
0     2024-05-04 00:15               100
1     2024-05-04 00:30               100
2     2024-05-04 00:45               100
3     2024-05-04 01:00               100
4     2024-05-04 01:15               100
...                ...               ...
2875  2024-06-03 23:00               267
2876  2024-06-03 23:15               267
2877  2024-06-03 23:30               267
2878  2024-06-03 23:45               266
2879  2024-06-03 24:00               266

[2880 rows x 2 columns]

Now that looks quite a bit better. As a side effect, if we later decide to plot the data, the axis labels are already there.

If we check the types of the elements in the Date & Time column, we notice that there is a small issue.

>>> # Apply the build-in type(…) function to each element 
>>> # and produce a summary of the results via Series.value_counts(…)
>>> water_level_data[LABEL_DATETIME].apply(type).value_counts()
Date & Time
<class 'str'>    2880
Name: count, dtype: int64

All of our entries in this colum are strings! While this might be acceptable for some quick printing, it will be quite a headace once we try to do some actual calculation. Instead we would prefer to have the column entries as timestamp objects.

Luckily, pandas has you covered:

water_levels.py

from pandas import read_csv
from pathlib import Path

# Path of the data file current working directory
# Adapt this depending on where you saved the file
data_file = Path("water_levels.csv")

# Optional: Print file path so we can check that if the file isn't found
print("Try to load file:", data_file.absolute())

water_level_data = read_csv(
    data_file,       # What to read
    header=None,     # File contains no headers
    parse_dates=[0]  # Parse the first column as date
)

# Add the column headers
# Use constants so they can be referred to and adapted easily.
LABEL_DATETIME = "Date & Time"
LABEL_WATER_LEVEL = "Water level [cm]"
water_level_data.columns = [LABEL_DATETIME, LABEL_WATER_LEVEL]

print(water_level_data)

This time, pandas will automatically infer the data type for the selected column(s):

>>> water_level_data[LABEL_DATETIME].apply(type).value_counts()
Date & Time
<class 'pandas._libs.tslibs.timestamps.Timestamp'>    2880
Name: count, dtype: int64

Good to know

Since the columns have no labels when they are loaded, pandas numbers them from left to right, starting with 0. The Timestamp data type is almost identical to Pythons native datetime and compatible with it in many cases. Further details can be found in the Timestamp documentation

Finally, we have to contend with the fact that our row indexes currently are numbers. You do not usually do this when investigating the dataset. Rather you would want to use the timestamps themselves as an index, so we can query for

What was the water level at 2024-0-05 12:30?

Instead of

What is the water level for the row with the same index as the timestamp 2024-05-05 12:30?

We can turn our column into an index.

water_levels.py

from pandas import read_csv
from pathlib import Path

# Path of the data file current working directory
# Adapt this depending on where you saved the file
data_file = Path("water_levels.csv")

# Optional: Print file path so we can check that if the file isn't found
print("Try to load file:", data_file.absolute())

water_level_data = read_csv(
    data_file,       # What to read
    header=None,     # File contains no headers
    parse_dates=[0]  # Parse the first column as date
)

# Add the column headers
# Use constants so they can be referred to and adapted easily.
LABEL_DATETIME = "Date & Time"
LABEL_WATER_LEVEL = "Water level [cm]"
water_level_data.columns = [LABEL_DATETIME, LABEL_WATER_LEVEL]
water_level_data.set_index(LABEL_DATETIME)

print(water_level_data)

Output

Try to load file: […]/water_levels.csv
                     Water level [cm]
Date & Time                          
2024-05-04 00:15:00               100
2024-05-04 00:30:00               100
2024-05-04 00:45:00               100
2024-05-04 01:00:00               100
2024-05-04 01:15:00               100
...                               ...
2024-06-03 23:00:00               267
2024-06-03 23:15:00               267
2024-06-03 23:30:00               267
2024-06-03 23:45:00               266
2024-06-04 00:00:00               266

[2880 rows x 1 columns]

Note how the column label has shifted to the left, because now it is an index label.

Here is an example how we now can query for the water level (using the column name LABEL_WATER_LEVEL) at a certain time:

>>> from pandas import Timestamp
>>> when = Timestamp("2024-06-03 23:00")
>>> water_level_data[LABEL_WATER_LEVEL][when]
267

Querying in pandas

There are many methods to query data inside a data frame. We will see a few other options along the way.

Key points¶

pandas can load many data formats natively
The data loading can be fine-tunined with parameters to save work
Having a proper index is advisable