A Real Data Set

A wild Dataset appears!

Before we continue

From here on we will create our first real data analysis project. It is highly recommended that you put it into its own folder weather_analysis and use this folder as your working directory.

In the following we want to take a look at a real data set. Our example is taken from NOAA’s weather data archive, which hosts weather recordings from all over the world, covering over 100 years.

The particular data set we are interested in is from New York, Central Park (station number 725053 94728, if you want to look up more details in the station list) and covers the year 2020.

Download the example data set

To separate data and code, let’s place the file in a subfolder called data.

Current project structure

📁 weather_analysis
└── 📁 data
    └── 725053-94728-2020.gz

The data comes as a gz-compressed file. You do not need to unpack it to work with it; pandas can handle this file type just fine. If you were to do so and look inside, here is a glimpse of what you would see:

2020 01 01 01    67    22 10028   200    36 -9999     0 -9999
2020 01 01 02    61    17 10035   280    62 -9999     5 -9999
2020 01 01 03    61     6 10033   280    36 -9999    -1 -9999
2020 01 01 04    56   -11 10034   270    36 -9999    -1 -9999
2020 01 01 05    50   -22 10035   270    46 -9999     0 -9999
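
If you are curious, you can take such a peek yourself without unpacking the file. A minimal sketch using Python’s built-in gzip module (the path assumes the project layout from above):

import gzip

# Open the compressed file in text mode and print the first five lines
with gzip.open("data/725053-94728-2020.gz", mode="rt") as file:
    for line_number, line in enumerate(file):
        print(line, end="")
        if line_number >= 4:
            break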

Each data set should be accompanied by a proper description outlining how the values inside are supposed to be understood. In our case the data is in a format called ISD Lite. The NOAA website provides the documentation of this format as a PDF.

Upon closer inspection of the document, it becomes clear that we have 12 columns in our data set, separated from each other by one or more spaces.

With this information in mind, we can make a first attempt at reading the data set. We laid some groundwork for this in the previous episode, which we can now re-use. For our analysis we start by creating a new Python script weather_analysis/__main__.py.

Current project structure

📁 weather_analysis
├── __main__.py
└── 📁 data
    └── 725053-94728-2020.gz

__main__.py
from pandas import read_csv
from pathlib import Path

# Path of the data file, relative to the current working directory
# Adapt this depending on where you saved the file
data_file = Path("data/725053-94728-2020.gz")

# Loading the data
# No header in data file, use one or more whitespaces as column separator
weather_data = read_csv(data_file, header=None, sep=r"\s+")

print(weather_data)
Output
        0   1   2   3   4   5      6     7     8     9   10    11
0     2020   1   1   1  67  22  10028   200    36 -9999   0 -9999
1     2020   1   1   2  61  17  10035   280    62 -9999   5 -9999
2     2020   1   1   3  61   6  10033   280    36 -9999  -1 -9999
3     2020   1   1   4  56 -11  10034   270    36 -9999  -1 -9999
4     2020   1   1   5  50 -22  10035   270    46 -9999   0 -9999
...    ...  ..  ..  ..  ..  ..    ...   ...   ...   ...  ..   ...
8776  2020  12  31  19  67  11  10202 -9999 -9999 -9999  -1 -9999
8777  2020  12  31  20  61 -11  10213 -9999    21 -9999   0 -9999
8778  2020  12  31  21  56 -22  10230   300    36 -9999  -1 -9999
8779  2020  12  31  22  50 -28  10239 -9999 -9999 -9999   0 -9999
8780  2020  12  31  23  50 -28  10247   290    36 -9999   0 -9999

[8781 rows x 12 columns]

While there is still a lot of fine-tuning to do, the columns are identified correctly.

To get the data to load, we do not have to care about the compression of the file; pandas can deal with that on its own.
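
Pandas infers the compression from the file’s suffix. If the suffix were missing or unusual, the compression could also be stated explicitly via the compression parameter of read_csv; a small sketch:

# The ".gz" suffix lets pandas infer the compression automatically;
# stating it explicitly works as well
weather_data = read_csv(data_file, header=None, sep=r"\s+", compression="gzip")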

Specifying the column separator is a bit tricky, however. Since the number of spaces between the columns depends on how many digits the respective values contain, we cannot give a fixed character sequence as the separator in this case. But the sep parameter also accepts a thing called Regular Expressions (or regex for short), which allows us to express “one or more whitespace characters” in a way the computer understands. Since those regular expressions often use backslashes \ (which signal the start of an escape sequence in normal Python strings), we put an r in front of the string’s quotes to indicate that Python should take the string as-is without interpreting the backslashes.

Regular Expressions

Regular Expressions are sequences of characters that can be used to describe the structure of a text. They can be a very handy tool if you have to investigate text a lot, but they are also known to quickly become quite complicated. A good way to experiment with them is the Regex101 website.
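
To get a feeling for the expression we used, you can also try it directly with Python’s built-in re module; a quick sketch using one line of our data:

>>> import re
>>> re.split(r"\s+", "2020 01 01 01    67    22")
['2020', '01', '01', '01', '67', '22']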

Labels and Indexes

Let’s start by doing what we learned before and properly label the columns. However, since we have a lot of labels this time, it is sensible to move them into a separate file to avoid cluttering the main code with a bunch of definitions.

For this purpose we create label.py and put the label definitions in there. An added benefit is that we can import the label module as a whole and then use the required prefix label.… to add some context to each of the names.

Current project structure

📁 weather_analysis
├── __main__.py
├── label.py
└── 📁 data
    └── 725053-94728-2020.gz

Let’s take a look at the label definitions first.

label.py
# This module contains all the label names
# we want to use throughout our program

YEAR = "Year"
MONTH = "Month"
DAY = "Day"
HOUR = "Hour"
AIR_TEMP = "Air Temperature [°C]"
DEW_POINT = "Dew Point [°C]"
PRESSURE = "Sea Level Pressure [HPa]"
WIND_DIRECTION = "Wind Direction [°]"
WIND_SPEED = "Wind Speed [m/s]"
SKY_CONDITION = "Sky Condition & Coverage Code"
PRECIPITATION_1H = "Precipitation (1h) [mm]"
PRECIPITATION_6H = "Precipitation (6h) [mm]"

# Sort the columns into lists depending on their later use
DATE_TIME_COLUMNS = [YEAR, MONTH, DAY, HOUR]
MEASUREMENT_COLUMNS = [
    AIR_TEMP,
    DEW_POINT,
    PRESSURE,
    WIND_DIRECTION,
    WIND_SPEED,
    SKY_CONDITION,
    PRECIPITATION_1H,
    PRECIPITATION_6H
]

We sort out the date and time columns since we later want to unify them into one column and use that as the index. There is quite a bit of duplication involved in assigning the labels and then sorting them into the respective lists. If you happen to know the “walrus operator” :=, you can use it here to make the code a bit less redundant.

Use := like an extremely potent Spice

When used expertly, it can make for a really nice addition. Otherwise it ruins the experience for everyone. If in doubt, get a second opinion or leave it out!

label.py
# This module contains all the label names
# we want to use throughout our program

DATE_TIME_COLUMNS = [
    YEAR := "Year",
    MONTH := "Month",
    DAY := "Day",
    HOUR := "Hour"
]

MEASUREMENT_COLUMNS = [
    AIR_TEMP := "Air Temperature [°C]",
    DEW_POINT := "Dew Point [°C]",
    PRESSURE := "Sea Level Pressure [HPa]",
    WIND_DIRECTION := "Wind Direction [°]",
    WIND_SPEED := "Wind Speed [m/s]",
    SKY_CONDITION := "Sky Condition & Coverage Code",
    PRECIPITATION_1H := "Precipitation (1h) [mm]",
    PRECIPITATION_6H := "Precipitation (6h) [mm]"
]

The nice effect is that we now have a constant to refer to for each column as well as for each column group. This only works because we can group columns that are next to each other. If that is not the case in another data set, we either have to re-order the columns first or stick with the initial approach.

Now we can use our definitions to actually set the column labels.

__main__.py
from pandas import read_csv
from pathlib import Path

# Importing the whole module is fine, since we need all the labels
# Also the required module name prefix adds helpful context
import label

# Path of the data file, relative to the current working directory
# Adapt this depending on where you saved the file
data_file = Path("data/725053-94728-2020.gz")

# Loading the data
# No header in data file, use one or more whitespaces as column separator
weather_data = read_csv(data_file, header=None, sep=r"\s+")

# Set the proper column labels
# IMPORTANT: this only works so nicely
# because the columns are provided in the correct order!
weather_data.columns = label.DATE_TIME_COLUMNS + label.MEASUREMENT_COLUMNS

print(weather_data)
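
As a quick illustration of the payoff: once the labels are set, we can refer to a single column or a whole column group without retyping any strings. A small sketch (the variable names are just for illustration):

# A single column via its label constant ...
air_temperature = weather_data[label.AIR_TEMP]

# ... or a whole column group at once
measurements = weather_data[label.MEASUREMENT_COLUMNS]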

Creating the Index

Again, we would like the date and time to be the index of our data. Unfortunately, this time the information is strewn across multiple columns, which we need to turn into a single date-time column first.

Pandas has a to_datetime(…) function which can do the job for us. This function takes a selection of the data frame containing all the relevant columns and converts them into a single column that we can later turn into an index. First we add a label for this index-to-be column.
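
To see what to_datetime(…) does with such a selection, here is a small stand-alone sketch with made-up values:

>>> from pandas import DataFrame, to_datetime
>>> parts = DataFrame(data={
...     "Year": [2020, 2020],
...     "Month": [1, 1],
...     "Day": [1, 1],
...     "Hour": [1, 2]
... })
>>> to_datetime(parts)
0   2020-01-01 01:00:00
1   2020-01-01 02:00:00
dtype: datetime64[ns]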

label.py
# This module contains all the label names
# we want to use throughout our program

DATE_TIME_COLUMNS = [
    YEAR := "Year",
    MONTH := "Month",
    DAY := "Day",
    HOUR := "Hour"
]

MEASUREMENT_COLUMNS = [
    AIR_TEMP := "Air Temperature [°C]",
    DEW_POINT := "Dew Point [°C]",
    PRESSURE := "Sea Level Pressure [HPa]",
    WIND_DIRECTION := "Wind Direction [°]",
    WIND_SPEED := "Wind Speed [m/s]",
    SKY_CONDITION := "Sky Condition & Coverage Code",
    PRECIPITATION_1H := "Precipitation (1h) [mm]",
    PRECIPITATION_6H := "Precipitation (6h) [mm]"
]

# The column we create to use as an index
INDEX_COLUMN = "Date & Time"

And then we can use that label to actually create the column and set it as an index. Further, we delete the now obsolete columns for the year, month, day and hour.

__main__.py
from pandas import read_csv, to_datetime
from pathlib import Path

# Importing the whole module is fine, since we need all the labels
# Also the required module name prefix adds helpful context
import label

# Path of the data file, relative to the current working directory
# Adapt this depending on where you saved the file
data_file = Path("data/725053-94728-2020.gz")

# Loading the data
# No header in data file, use one or more whitespaces as column separator
weather_data = read_csv(data_file, header=None, sep=r"\s+")

# Set the proper column labels
# IMPORTANT: this only works so nicely
# because the columns are provided in the correct order!
weather_data.columns = label.DATE_TIME_COLUMNS + label.MEASUREMENT_COLUMNS

# Combine the date and time columns into one, correctly parsing date and time
weather_data[label.INDEX_COLUMN] = to_datetime(
    weather_data[label.DATE_TIME_COLUMNS]
)

weather_data = weather_data.set_index(label.INDEX_COLUMN)

# We don't need the separate date and time columns anymore
weather_data = weather_data.drop(label.DATE_TIME_COLUMNS, axis="columns")

print(weather_data)
Output
                     Air Temperature [°C]  ...  Precipitation (6h) [mm]
Date & Time                                ...                         
2020-01-01 01:00:00                    67  ...                    -9999
2020-01-01 02:00:00                    61  ...                    -9999
2020-01-01 03:00:00                    61  ...                    -9999
2020-01-01 04:00:00                    56  ...                    -9999
2020-01-01 05:00:00                    50  ...                    -9999
...                                   ...  ...                      ...
2020-12-31 19:00:00                    67  ...                    -9999
2020-12-31 20:00:00                    61  ...                    -9999
2020-12-31 21:00:00                    56  ...                    -9999
2020-12-31 22:00:00                    50  ...                    -9999
2020-12-31 23:00:00                    50  ...                    -9999

[8781 rows x 8 columns]

Two things in this code stand out and might warrant further explanation.

  1. Operations that modify a data frame do not actually change the data frame in question; they return a modified copy instead. To make the changes stick, the result has to be assigned to the same variable (if you want to replace the original) or to a new variable (if you want to keep both the changed and the original version). See the sketch after this list.
  2. Some operations can either be applied to the data frame as a whole or along a certain axis (i.e. row-wise or column-wise). This is what the axis parameter controls. Possible values for this parameter are 0, 1, "rows" or "columns". Using the string values here can aid legibility.
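
A minimal sketch for the first point, re-using the drop(…) call from above (just an illustration, not part of our script):

# This line alone does NOT change weather_data ...
weather_data.drop(label.SKY_CONDITION, axis="columns")

# ... only assigning the result back makes the change stick
weather_data = weather_data.drop(label.SKY_CONDITION, axis="columns")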

Understanding the Axis Direction

It is easy to misunderstand the axis directions. The parameter does not specify on which axis to execute an operation, but rather along which axis to do so.

Consider this example:

>>> from pandas import DataFrame
>>> my_data = DataFrame(data={
...     "column 0": [0, 1, 2, 3],
...     "column 1": [10, 11, 12, 13]
... })
>>> # Calculate the mean per column by default
>>> my_data.mean()
column 0     1.5
column 1    11.5
dtype: float64

>>> # Calculate the mean along the rows (i.e. for each column)
>>> my_data.mean(axis="rows")
column 0     1.5
column 1    11.5
dtype: float64

>>> # Calculate the mean along the columns (i.e. for each row)
>>> my_data.mean(axis="columns")
0    5.0
1    6.0
2    7.0
3    8.0
dtype: float64

It took us a moment to get here, but we managed to properly load and label a data set that previously was just a compressed bunch of numbers.

Our next step will be to clean the data a bit so we can use it for our actual calculations.

Key Points

  • Real-life data usually only comes as “just a chunk of numbers”
    • An accompanying document outlines how to interpret these
  • Pandas can deal with most compression formats on its own
  • Regular Expressions can be helpful when dealing with non-fixed separators