Skip to content

Modifying Dataframes

Incomplete Data

We intend to also note down the cleaning habits of our cat. For this purpose we have created a new series of measurements.

cleaning = Series(
    data={"Monday": 2, "Friday": 1, "Saturday": 3},
    index=days_of_week,
    name="Cleaning"
)
print(cleaning)
Output
Monday       2.0
Tuesday      NaN
Wednesday    NaN
Thursday     NaN
Friday       1.0
Saturday     3.0
Sunday       NaN
Name: Cleaning, dtype: float64

Note that not all weekdays have a value associated with it. Incomplete data is a common problem in real-world measurements. Pandas tends to represent “no data” as NaN which can be a pitfall.

Calculating with DataFrames

Our vetinary friend wants to help us and requests we send them the temparatures we measured. Since they live in the US, they would prefer to have the measurements in Farenheidt:

print(measurements["Temperature"] * 1.8 + 32)
Output
Monday       51.62
Tuesday      46.76
Wednesday    45.68
Thursday     46.04
Friday       48.92
Saturday     51.98
Sunday       54.32
Name: Temperature, dtype: object

Adding another column to a Dataframe

To extend our dataframe, we can use

measurements.join(cleaning)
print(measurements)

This seems not to have worked as we expected! The reason is that many dataframe manipulations return a copy with the result instead of manipulating the original dataframe. We can assign the result to our original dataframe (or a new variable)

measurements = measurements.join(cleaning)
print(measurements)
Output
           Sneezes  Temperature  Humidity  Cleaning
Monday          32         10.9      62.5       2.0
Tuesday         41          8.2      76.3       NaN
Wednesday       56          7.6      82.4       NaN
Thursday        62          7.8      98.2       NaN
Friday          30          9.4      77.4       1.0
Saturday        22         11.1      58.9       3.0
Sunday          17         12.4      41.2       NaN

Side Note: Advanced filtering

Dataframes offer additional methods to generate filter masks.

missing_data = measurements.isnull()
print(missing_data)
Output
           Sneezes  Temperature  Humidity  Cleaning
Monday       False        False     False     False
Tuesday      False        False     False      True
Wednesday    False        False     False      True
Thursday     False        False     False      True
Friday       False        False     False     False
Saturday     False        False     False     False
Sunday       False        False     False      True

We can reduce this regarding rows or columns using the any()-method:

print(missing_data.any(axis="columns"))
print()  # Empty line as separator
print(missing_data.any(axis="index"))
Output
Monday       False
Tuesday       True
Wednesday     True
Thursday      True
Friday       False
Saturday     False
Sunday        True
dtype: bool

Sneezes        False
Temperature    False
Humidity       False
Cleaning        True
dtype: bool

The axis-parameter dictates that a summary of the columns is to be made, not along the columns (and similar for indexes).

These can be used to combine into handy filters like:

complete_rows = ~missing_data.any(axis="columns")
print(measurements[complete_rows])
Output
          Sneezes  Temperature  Humidity  Cleaning
Monday         32         10.9      62.5       2.0
Friday         30          9.4      77.4       1.0
Saturday       22         11.1      58.9       3.0

Changing singular data

We now learn from our roommate, that out cat was observed cleaning itself only one time on Tuesday. Let’s update our table:

measurements.loc["Tuesday", "Cleaning"] = 1
print(measurements)
Output
           Sneezes  Temperature  Humidity  Cleaning
Monday          32         10.9      62.5       2.0
Tuesday         41          8.2      76.3       1.0
Wednesday       56          7.6      82.4       NaN
Thursday        62          7.8      98.2       NaN
Friday          30          9.4      77.4       1.0
Saturday        22         11.1      58.9       3.0
Sunday          17         12.4      41.2       NaN

Replacing Multiple values

The replace-method of dataframes is a very powerful tool. For example, we want to replace the NaN values by more appropriate None

from math import nan  # to get the constant that encodes NaN
measurements.replace(to_replace={nan: None}, inplace=True)
print(measurements)
Output
           Sneezes Temperature Humidity Cleaning
Monday          32        10.9     62.5        2
Tuesday         41         8.2     76.3        1
Wednesday       56         7.6     82.4     None
Thursday        62         7.8     98.2     None
Friday          30         9.4     77.4        1
Saturday        22        11.1     58.9        3
Sunday          17        12.4     41.2     None

We set inplace=True here to modify the dataframe directly.

Key Points

  • Direct Calculations can be done on dataframes or series and apply the effect of the calculation to each cell
  • Singular data can be changed by directly writing to the location