Skip to content

Cleaning Data

Now we have a data set loaded and labelled. However the data is not ready to use yet. Here is a quick To-Do-list of what still needs to be done to get the data analysis ready:

  1. All missing measurements are represented by the value -9999. This should be represented by something more fitting, so the numeric placeholder value does not throw off our statistics.
  2. Many columns are scaled by a factor to turn the measurements into integer values. For example an air temperature of 24.6 °C is represented by the value 246. We should correct the values back to their original measuremnents, so they are easier to interpret down the line.
  3. The wind direction uses 0 for calm winds and conversely 360 degrees to represent North. We can derive a new column indicating calm winds and then replace 0 with an indicator for no data, so we can use 0 to indicate North instead as it is common practise in aviation and marine navigation.

Let`s tackle these tasks one-by-one.

Dealing with Missing Measurements

For any missing data we need a placeholder that can express “there is no data here”. Usually, you would use the Python value None for that. When dealing with numeric data there is another option that we can refer to.

The standard for the floating point data type (IEEE 754, c.f. Wikipedia) defines multiple values to represent if something is not a number, NaN for short.

Peculiarities of NaN

  1. The NaN value is offered by multiple Python libraries. The most common approach is to use
  2. Technically, NaN is a floating point number, which can be rather confusing. However that allows NaN to be used as values in calculations without crashing the program. This added robustness can be quite useful. The result of computing anything with NaN will be NaN in turn.
  3. By definition NaN is not equal to anything not even itself. It also can not be compared to anything, the result will always be False. So you can not check for NaN using any comparison operators.

Here are some examples to demonstrate these points:

>>> from math import nan
>>> type(nan)
<class 'float'>
>>> from math import nan
>>> nan == nan
False
>>> nan > 0
False
>>> nan < 0
False
>>> nan == 0
False
>>> from math import nan
>>> nan + 3
nan
>>> 0 / nan
nan
>>> nan - nan
nan
>>> 0 * nan
nan

For our use case it would be a good idea to represent the missing values as not a number. This is also pandas’ default way of dealing with missing data and the robustness in calcualtions is a great benefit when processing whole rows, columns or data frames at once.

Let us improve our program accordingly.

__main__.py
from pandas import read_csv, to_datetime
from pathlib import Path
from math import nan

# Importing the whole module is fine, since we need all the label
# Also the required module name prefix adds helpful context
import label

# Path of the data file current working directory
# Adapt this depending on where you saved the file
data_file = Path("data/725053-94728-2020.gz")

# Loading the data
# No header in data file, use one or more whitespaces as column separator
weather_data = read_csv(data_file, header=None, sep=r"\s+")

# Set the proper column labels
# IMPORTANT: this only works so nicely
# because the columns are provided in the correct order!
weather_data.columns = label.DATE_TIME_COLUMNS + label.MEASUREMENT_COLUMNS

# Combine the date and time columns into one, correctly parsing date and time
weather_data[label.INDEX_COLUMN] = to_datetime(
    weather_data[label.DATE_TIME_COLUMNS]
)

weather_data = weather_data.set_index(label.INDEX_COLUMN)

# We don't need the separate date and time columns anymore
weather_data = weather_data.drop(label.DATE_TIME_COLUMNS, axis="columns")

# Replace the missing value indicator -9999 by nan
weather_data = weather_data.replace({-9999: nan})

print(weather_data)
Output
                     Air Temperature [°C]  ...  Precipitation (6h) [mm]
Date & Time                                ...                         
2020-01-01 01:00:00                  67.0  ...                      NaN
2020-01-01 02:00:00                  61.0  ...                      NaN
2020-01-01 03:00:00                  61.0  ...                      NaN
2020-01-01 04:00:00                  56.0  ...                      NaN
2020-01-01 05:00:00                  50.0  ...                      NaN
...                                   ...  ...                      ...
2020-12-31 19:00:00                  67.0  ...                      NaN
2020-12-31 20:00:00                  61.0  ...                      NaN
2020-12-31 21:00:00                  56.0  ...                      NaN
2020-12-31 22:00:00                  50.0  ...                      NaN
2020-12-31 23:00:00                  50.0  ...                      NaN

[8781 rows x 8 columns]

You may observe that in the temperature column the values now appear as 67.0 instead of 67 as before. The introduction of nan into a column silently converts the column from integer to floating point.

Rescaling Columns

Some columns have been scaled by a factor of 10 to allow for more accurate measurements to still be presented as integer numbers. While this can make sense for working with older or embedded systems, for working with the data, we probably like to use a decimal fraction representation of our values. Note that this will convert the integer values to floating point with all consequential limitations.

The columns that should be rescaled (according to the ISD Lite documentation) by a factor of 1/10 are

  • Air temperature
  • Dew point temperature
  • Sea level pressure
  • Wind speed
  • Precipitation over 1 hour
  • Precipitation over 6 hours

To keep our main program a bit more focussed, we can add the list of scaled columns in the label.py file.

label.py
# This module contains all the label names
# We want to use throughout our program

DATE_TIME_COLUMNS = [
    YEAR := "Year",
    MONTH := "Month",
    DAY := "Day",
    HOUR := "Hour"
]

MEASUREMENT_COLUMNS = [
    AIR_TEMP := "Air Temperature [°C]",
    DEW_POINT := "Dew Point [°C]",
    PRESSURE := "Sea Level Pressure [HPa]",
    WIND_DIRECTION := "Wind Direction [°]",
    WIND_SPEED := "Wind Speed [m/s]",
    SKY_CONDITION := "Sky Condition & Coverage Code",
    PRECIPITATION_1H := "Precipitation (1h) [mm]",
    PRECIPITATION_6H := "Precipitation (6h) [mm]"
]

# The column we create to use as an index
INDEX_COLUMN = "Date & Time"

# The columns that have been scaled by a factor of 10
# in the ISD Lite format
SCALED_COLUMNS = [
    AIR_TEMP, DEW_POINT, PRESSURE, WIND_SPEED,
    PRECIPITATION_1H, PRECIPITATION_6H
]

All that is left for us to do now is to reverse the scaling by dividing the scaled columns by 10.

__main__.py
from pandas import read_csv, to_datetime
from pathlib import Path
from math import nan

# Importing the whole module is fine, since we need all the label
# Also the required module name prefix adds helpful context
import label

# Path of the data file current working directory
# Adapt this depending on where you saved the file
data_file = Path("data/725053-94728-2020.gz")

# Loading the data
# No header in data file, use one or more whitespaces as column separator
weather_data = read_csv(data_file, header=None, sep=r"\s+")

# Set the proper column labels
# IMPORTANT: this only works so nicely
# because the columns are provided in the correct order!
weather_data.columns = label.DATE_TIME_COLUMNS + label.MEASUREMENT_COLUMNS

# Combine the date and time columns into one, correctly parsing date and time
weather_data[label.INDEX_COLUMN] = to_datetime(
    weather_data[label.DATE_TIME_COLUMNS]
)

weather_data = weather_data.set_index(label.INDEX_COLUMN)

# We don't need the separate date and time columns anymore
weather_data = weather_data.drop(label.DATE_TIME_COLUMNS, axis="columns")

# Replace the missing value indicator -9999 by nan
weather_data = weather_data.replace({-9999: nan})

# Undo the column scaling
weather_data[label.SCALED_COLUMNS] = weather_data[label.SCALED_COLUMNS] / 10

print(weather_data)

We now have some very nice data that we can work with, so it is time to present it nicely.

There are further improvements that can be done. They will be the topic of the exercises.

Key Points

  • NaN kan be used to represent missing data
    • There are some caveats to keep in mind
  • Data frames have functions to replace data
  • Arithmetic operations can also be applied to whole rows, columns or data frames