Modifying Dataframes¶

Incomplete Data¶

We intend to also note down the cleaning habits of our cat. For this purpose we have created a new series of measurements.

cleaning = Series(
    data={"Monday": 2, "Friday": 1, "Saturday": 3},
    index=days_of_week,
    name="Cleaning"
)
print(cleaning)

Output

Monday       2.0
Tuesday      NaN
Wednesday    NaN
Thursday     NaN
Friday       1.0
Saturday     3.0
Sunday       NaN
Name: Cleaning, dtype: float64

Note that not all weekdays have a value associated with it. Incomplete data is a common problem in real-world measurements. Pandas tends to represent “no data” as NaN which can be a pitfall.

Calculating with DataFrames¶

Our vetinary friend wants to help us and requests we send them the temparatures we measured. Since they live in the US, they would prefer to have the measurements in Farenheidt:

print(measurements["Temperature"] * 1.8 + 32)

Output

Monday       51.62
Tuesday      46.76
Wednesday    45.68
Thursday     46.04
Friday       48.92
Saturday     51.98
Sunday       54.32
Name: Temperature, dtype: object

Adding another column to a Dataframe¶

To extend our dataframe, we can use

measurements.join(cleaning)
print(measurements)

This seems not to have worked as we expected! The reason is that many dataframe manipulations return a copy with the result instead of manipulating the original dataframe. We can assign the result to our original dataframe (or a new variable)

measurements = measurements.join(cleaning)
print(measurements)

Output

           Sneezes  Temperature  Humidity  Cleaning
Monday          32         10.9      62.5       2.0
Tuesday         41          8.2      76.3       NaN
Wednesday       56          7.6      82.4       NaN
Thursday        62          7.8      98.2       NaN
Friday          30          9.4      77.4       1.0
Saturday        22         11.1      58.9       3.0
Sunday          17         12.4      41.2       NaN

Side Note: Advanced filtering¶

Dataframes offer additional methods to generate filter masks.

missing_data = measurements.isnull()
print(missing_data)

Output

           Sneezes  Temperature  Humidity  Cleaning
Monday       False        False     False     False
Tuesday      False        False     False      True
Wednesday    False        False     False      True
Thursday     False        False     False      True
Friday       False        False     False     False
Saturday     False        False     False     False
Sunday       False        False     False      True

We can reduce this regarding rows or columns using the any()-method:

print(missing_data.any(axis="columns"))
print()  # Empty line as separator
print(missing_data.any(axis="index"))

Output

Monday       False
Tuesday       True
Wednesday     True
Thursday      True
Friday       False
Saturday     False
Sunday        True
dtype: bool

Sneezes        False
Temperature    False
Humidity       False
Cleaning        True
dtype: bool

The axis-parameter dictates that a summary of the columns is to be made, not along the columns (and similar for indexes).

These can be used to combine into handy filters like:

complete_rows = ~missing_data.any(axis="columns")
print(measurements[complete_rows])

Output

          Sneezes  Temperature  Humidity  Cleaning
Monday         32         10.9      62.5       2.0
Friday         30          9.4      77.4       1.0
Saturday       22         11.1      58.9       3.0

Changing singular data¶

We now learn from our roommate, that out cat was observed cleaning itself only one time on Tuesday. Let’s update our table:

measurements.loc["Tuesday", "Cleaning"] = 1
print(measurements)

Output

           Sneezes  Temperature  Humidity  Cleaning
Monday          32         10.9      62.5       2.0
Tuesday         41          8.2      76.3       1.0
Wednesday       56          7.6      82.4       NaN
Thursday        62          7.8      98.2       NaN
Friday          30          9.4      77.4       1.0
Saturday        22         11.1      58.9       3.0
Sunday          17         12.4      41.2       NaN

Replacing Multiple values¶

The replace-method of dataframes is a very powerful tool. For example, we want to replace the NaN values by more appropriate None

from math import nan  # to get the constant that encodes NaN
measurements.replace(to_replace={nan: None}, inplace=True)
print(measurements)

Output

           Sneezes Temperature Humidity Cleaning
Monday          32        10.9     62.5        2
Tuesday         41         8.2     76.3        1
Wednesday       56         7.6     82.4     None
Thursday        62         7.8     98.2     None
Friday          30         9.4     77.4        1
Saturday        22        11.1     58.9        3
Sunday          17        12.4     41.2     None

We set inplace=True here to modify the dataframe directly.

Key Points

Direct Calculations can be done on dataframes or series and apply the effect of the calculation to each cell
Singular data can be changed by directly writing to the location