A Real Data Set
A wild Dataset appears!
Before we continue
From here on we will create our first real data analysis project.
It is highly recommended that you put it into its own folder weather_analysis
and use this folder as your working directory.
In the following we want to take a look at a real data set. Our example is taken from NOAA’s weather data archive, which hosts weather recordings from all over the world, covering over 100 years.
The particular data set we are interested in comes from New York, Central Park (station number 725053 94728, in case you want to look up more details in the station list) and covers the year 2020.
To separate data and code, let’s place the file in a subfolder called data.
The data comes as a gz-compressed file.
You do not need to unpack it to work with it; pandas can handle this file type just fine.
If you were to do that and look inside, here is a glimpse of what you would see:
2020 01 01 01 67 22 10028 200 36 -9999 0 -9999
2020 01 01 02 61 17 10035 280 62 -9999 5 -9999
2020 01 01 03 61 6 10033 280 36 -9999 -1 -9999
2020 01 01 04 56 -11 10034 270 36 -9999 -1 -9999
2020 01 01 05 50 -22 10035 270 46 -9999 0 -9999
…
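By the way, you do not have to unpack the file to produce such a glimpse yourself: Python’s built-in gzip module can read it on the fly. A minimal sketch, assuming the file lies in the data subfolder mentioned above:
>>> from gzip import open as gzip_open
>>> # Open the compressed file in text mode
>>> # and print the first five lines as-is
>>> with gzip_open("data/725053-94728-2020.gz", mode="rt") as file:
...     for _ in range(5):
...         print(file.readline(), end="")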
Each data set should be accompanied by a proper description outlining how the values inside are supposed to be understood. In our case the data is in a format called ISD Lite. The NOAA website hosts documentation of the data format as a PDF.
Upon closer inspection of the document it becomes clear that we have 12 columns in our data set, each separated by one or more spaces.
With this information in mind, we can make a first attempt at reading the data set.
We laid some groundwork for this in the previous episode, which we can now re-use.
For our analysis we start by creating a new Python script weather_analysis/__main__.py.
from pandas import read_csv
from pathlib import Path
# Path of the data file, relative to the current working directory
# Adapt this depending on where you saved the file
data_file = Path("data/725053-94728-2020.gz")
# Loading the data
# No header in data file, use one or more whitespaces as column separator
weather_data = read_csv(data_file, header=None, sep=r"\s+")
print(weather_data)
Output
0 1 2 3 4 5 6 7 8 9 10 11
0 2020 1 1 1 67 22 10028 200 36 -9999 0 -9999
1 2020 1 1 2 61 17 10035 280 62 -9999 5 -9999
2 2020 1 1 3 61 6 10033 280 36 -9999 -1 -9999
3 2020 1 1 4 56 -11 10034 270 36 -9999 -1 -9999
4 2020 1 1 5 50 -22 10035 270 46 -9999 0 -9999
... ... .. .. .. .. .. ... ... ... ... .. ...
8776 2020 12 31 19 67 11 10202 -9999 -9999 -9999 -1 -9999
8777 2020 12 31 20 61 -11 10213 -9999 21 -9999 0 -9999
8778 2020 12 31 21 56 -22 10230 300 36 -9999 -1 -9999
8779 2020 12 31 22 50 -28 10239 -9999 -9999 -9999 0 -9999
8780 2020 12 31 23 50 -28 10247 290 36 -9999 0 -9999
[8781 rows x 12 columns]
While there is still a lot of fine-tuning to do, the columns get identified correctly.
To get the data to load, we do not have to care about the compression of the file; pandas can deal with that on its own.
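If you ever want to be explicit about it, for example because a file lacks a telling extension, read_csv also accepts a compression parameter; its default value "infer" derives the format from the file extension. A minimal sketch of the equivalent explicit call:
# Naming the compression explicitly; the default "infer" works here as well
weather_data = read_csv(data_file, header=None, sep=r"\s+", compression="gzip")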
Specifying the column separator is a bit tricky, however.
Since the amount of spaces between the columns depends on how many digits the respective values contain,
we cannot directly give a fixed character sequence as the separator in this case.
But the sep parameter also accepts a thing called a Regular Expression (or regex for short),
which allows us to express “one or more whitespace characters” in a way the computer understands.
Since those regular expressions often use backslashes \
(which signal the start of a special symbol in normal Python strings),
we put an r in front of the string’s quotes to indicate that Python should process the string as-is,
without paying special attention to the backslashes.
Regular Expressions
Regular Expressions are sequences of characters that can be used to describe the structure of a text. They can be a very handy tool if you have to investigate text a lot, but are also known to quickly become quite complicated. A good way to experiment with them is the Regex101 website.
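To get a first impression of how our separator pattern behaves, here is a minimal sketch using Python’s built-in re module, which understands the same r"\s+" expression; the sample text is a shortened line from the glimpse above:
>>> from re import split
>>> split(r"\s+", "2020 01 01  01   67")
['2020', '01', '01', '01', '67']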
Labels and Indexes
Let’s start by doing what we learned before and properly label the columns. However, since we have a lot of labels this time, it would be sensible to move them into a separate file so as not to clutter the main code with a bunch of definitions.
For this purpose we create label.py and put the label definitions in there.
An added benefit is that we can import the label module as a whole and then use the required prefix (as in label.YEAR) to add some context to each of the names.
Current project structure

weather_analysis/
├── __main__.py
├── label.py
└── data/
    └── 725053-94728-2020.gz
Let’s take a look at the label definitions first.
# This module contains all the label names
# we want to use throughout our program
YEAR = "Year"
MONTH = "Month"
DAY = "Day"
HOUR = "Hour"
AIR_TEMP = "Air Temperature [°C]"
DEW_POINT = "Dew Point [°C]"
PRESSURE = "Sea Level Pressure [hPa]"
WIND_DIRECTION = "Wind Direction [°]"
WIND_SPEED = "Wind Speed [m/s]"
SKY_CONDITION = "Sky Condition & Coverage Code"
PRECIPITATION_1H = "Precipitation (1h) [mm]"
PRECIPITATION_6H = "Precipitation (6h) [mm]"
# Sort the columns into lists depending on their later use
DATE_TIME_COLUMNS = [YEAR, MONTH, DAY, HOUR]
MEASUREMENT_COLUMNS = [
AIR_TEMP,
DEW_POINT,
PRESSURE,
WIND_DIRECTION,
WIND_SPEED,
SKY_CONDITION,
PRECIPITATION_1H,
PRECIPITATION_6H
]
We separate out the date and time columns, since we later want to unify them into one column and use that as the index.
There is quite a bit of duplication involved in first assigning the labels and then sorting them into the respective lists.
If you happen to know the “walrus operator” :=, you can use it here to make the code a bit less redundant.
Use := like an extremely potent spice
When used expertly, it can make for a really nice addition. Otherwise it ruins the experience for everyone. If in doubt, get a second opinion or leave it out!
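If the operator is new to you, the gist is that := assigns a name as part of an expression, and the assigned value also becomes the value of the expression itself. A minimal sketch:
>>> numbers = [start := 10, start + 1, start + 2]
>>> numbers
[10, 11, 12]
>>> start
10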
# This module contains all the label names
# we want to use throughout our program
DATE_TIME_COLUMNS = [
YEAR := "Year",
MONTH := "Month",
DAY := "Day",
HOUR := "Hour"
]
MEASUREMENT_COLUMNS = [
AIR_TEMP := "Air Temperature [°C]",
DEW_POINT := "Dew Point [°C]",
PRESSURE := "Sea Level Pressure [hPa]",
WIND_DIRECTION := "Wind Direction [°]",
WIND_SPEED := "Wind Speed [m/s]",
SKY_CONDITION := "Sky Condition & Coverage Code",
PRECIPITATION_1H := "Precipitation (1h) [mm]",
PRECIPITATION_6H := "Precipitation (6h) [mm]"
]
The nice effect is that we now have a constant to refer to for each column as well as for each column group. This only works because we can group columns that are next to each other. If that is not the case in another data set, we either have to re-order the columns first or stick with the initial approach.
Now we can use our definitions to actually set the column labels.
from pandas import read_csv
from pathlib import Path
# Importing the whole module is fine, since we need all the labels
# Also the required module name prefix adds helpful context
import label
# Path of the data file, relative to the current working directory
# Adapt this depending on where you saved the file
data_file = Path("data/725053-94728-2020.gz")
# Loading the data
# No header in data file, use one or more whitespaces as column separator
weather_data = read_csv(data_file, header=None, sep=r"\s+")
# Set the proper column labels
# IMPORTANT: this only works so nicely
# because the columns are provided in the correct order!
weather_data.columns = label.DATE_TIME_COLUMNS + label.MEASUREMENT_COLUMNS
print(weather_data)
Creating the Index
Again, we would like the date and time to be the index for our data. Unfortunately, this time the information is strewn across multiple columns, which we first need to turn into a single date-time column.
Pandas has a to_datetime(…) function which can do the job for us.
This function takes a section of the data frame with all the relevant columns
and converts it into a single column that we can later turn into an index.
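To get a feeling for the function before applying it to our data, here is a minimal sketch on a tiny hand-made data frame; note that the column names have to identify the date and time parts for the conversion to work:
>>> from pandas import DataFrame, to_datetime
>>> parts = DataFrame(data={
...     "year": [2020, 2020],
...     "month": [1, 1],
...     "day": [1, 1],
...     "hour": [1, 2]
... })
>>> to_datetime(parts)
0   2020-01-01 01:00:00
1   2020-01-01 02:00:00
dtype: datetime64[ns]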
First we add a label for this index-to-be column.
# This module contains all the label names
# we want to use throughout our program
DATE_TIME_COLUMNS = [
YEAR := "Year",
MONTH := "Month",
DAY := "Day",
HOUR := "Hour"
]
MEASUREMENT_COLUMNS = [
AIR_TEMP := "Air Temperature [°C]",
DEW_POINT := "Dew Point [°C]",
PRESSURE := "Sea Level Pressure [hPa]",
WIND_DIRECTION := "Wind Direction [°]",
WIND_SPEED := "Wind Speed [m/s]",
SKY_CONDITION := "Sky Condition & Coverage Code",
PRECIPITATION_1H := "Precipitation (1h) [mm]",
PRECIPITATION_6H := "Precipitation (6h) [mm]"
]
# The column we create to use as an index
INDEX_COLUMN = "Date & Time"
And then we can use that label to actually create the column and set it as an index. Further, we delete the now obsolete columns for the year, month, day and hour.
from pandas import read_csv, to_datetime
from pathlib import Path
# Importing the whole module is fine, since we need all the labels
# Also the required module name prefix adds helpful context
import label
# Path of the data file, relative to the current working directory
# Adapt this depending on where you saved the file
data_file = Path("data/725053-94728-2020.gz")
# Loading the data
# No header in data file, use one or more whitespaces as column separator
weather_data = read_csv(data_file, header=None, sep=r"\s+")
# Set the proper column labels
# IMPORTANT: this only works so nicely
# because the columns are provided in the correct order!
weather_data.columns = label.DATE_TIME_COLUMNS + label.MEASUREMENT_COLUMNS
# Combine the date and time columns into one, correctly parsing date and time
weather_data[label.INDEX_COLUMN] = to_datetime(
weather_data[label.DATE_TIME_COLUMNS]
)
weather_data = weather_data.set_index(label.INDEX_COLUMN)
# We don't need the separate date and time columns anymore
weather_data = weather_data.drop(label.DATE_TIME_COLUMNS, axis="columns")
print(weather_data)
Output
Air Temperature [°C] ... Precipitation (6h) [mm]
Date & Time ...
2020-01-01 01:00:00 67 ... -9999
2020-01-01 02:00:00 61 ... -9999
2020-01-01 03:00:00 61 ... -9999
2020-01-01 04:00:00 56 ... -9999
2020-01-01 05:00:00 50 ... -9999
... ... ... ...
2020-12-31 19:00:00 67 ... -9999
2020-12-31 20:00:00 61 ... -9999
2020-12-31 21:00:00 56 ... -9999
2020-12-31 22:00:00 50 ... -9999
2020-12-31 23:00:00 50 ... -9999
[8781 rows x 8 columns]
Two things in this code stand out and might warrant further explanation.
- Operations that modify a data frame do not actually change the data frame in question in place. To make these changes stick, the result has to be assigned back to the same variable (if you want to replace the original) or to a new variable (if you want to keep both the changed and the original version); see the sketch after this list.
- Some operations can either be applied to the data frame as a whole or along a certain axis (i.e. row-wise or column-wise). This is what the axis parameter influences. Possible values for this parameter are 0, 1, "rows" or "columns". Using the string values here can aid with legibility.
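Here is a minimal sketch of the first point, re-using the drop(…) method from above:
>>> from pandas import DataFrame
>>> my_data = DataFrame(data={"column 0": [0, 1, 2, 3]})
>>> my_data.drop("column 0", axis="columns")  # returns a new, empty frame
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
>>> my_data  # the original is unchanged
   column 0
0         0
1         1
2         2
3         3
>>> my_data = my_data.drop("column 0", axis="columns")  # assign back to make the change stick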
Understanding the Axis Direction
It is easy to misunderstand the axis directions. The parameter does not specify on which axis to execute an operation, but rather along which axis to do so.
Consider this example:
>>> from pandas import DataFrame
>>> my_data = DataFrame(data={
...     "column 0": [0, 1, 2, 3],
...     "column 1": [10, 11, 12, 13]
... })
>>> # Calculate the mean per column by default
>>> my_data.mean()
column 0 1.5
column 1 11.5
dtype: float64
>>> # Calculate the mean along the rows (i.e. for each column)
>>> my_data.mean(axis="rows")
column 0 1.5
column 1 11.5
dtype: float64
>>> # Calculate the mean along the columns (i.e. for each row)
>>> my_data.mean(axis="columns")
0 5.0
1 6.0
2 7.0
3 8.0
dtype: float64
It took us a moment to get here, but we managed to properly load and label a data set that previously was just a compressed bunch of numbers.
Our next step will be to clean the data a bit so we can use it for our actual calculations.
Key Points
- Real-life data usually only comes as “just a chunk of numbers”
- An accompanying document outlines how to interpret these
- Pandas can deal with most compression formats on its own
- Regular Expressions can be helpful when dealing with non-fixed separators