Task 3: Cleaning the data

The data as loaded is not yet ready for our work. For technical reasons, the data representation has a few peculiarities:

  • According to the data documentation, the value -9999 indicates missing data.
  • Some data columns have been scaled by a factor.
  • A wind direction of 0 means that the direction is undetermined; North is designated by 360.
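
Before changing anything, it can help to see how often the missing-data marker actually occurs. The following is a minimal sketch on a toy table; the column names (`air_temp`, `wind_dir`, `sky_cover`) and the values are made up for illustration and will differ from your data set.

```python
import pandas as pd

# Toy stand-in for the loaded data; column names and values are hypothetical.
df = pd.DataFrame({
    "air_temp": [215, -9999, 198],        # one missing value
    "wind_dir": [360, 0, -9999],          # North, undetermined, missing
    "sky_cover": [-9999, -9999, -9999],   # a column without any real data
})

# Count how often the missing-data marker occurs in each column.
missing_counts = (df == -9999).sum()
print(missing_counts)
```

A column whose count equals the number of rows carries no measurements at all.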

Tasks

  1. Replace the value -9999 with something more appropriate, for example the constant nan from the math library.
  2. Replace the measurements where no wind direction is given in a similar fashion (0 → nan).
  3. Now the value 0 is free to represent North as usual, so replace the direction 360 → 0. This will come in handy in a later task.
  4. A precipitation value of -1 indicates that only trace amounts were detected. Introduce a new column for each of the 1-hour and 6-hour measurements that specifically indicates whether a trace amount was measured. Afterwards, replace the numeric value -1 → 0 to avoid mistakes when doing statistics on the measurements.
  5. Check for columns that have no useful data at all and remove them if convenient.
  6. Re-scale the columns so they all use a factor of 1 (and can be read and interpreted more easily by humans).
  7. Check whether entries are missing for some dates/hours. First consider how many hours the given year should have (account for the additional day in leap years, if applicable). How many rows are missing in your data set? (If your data set has a significant number of rows missing, consider choosing another one.) For this you may find the pandas.date_range() function useful.
  8. Add suitable placeholders for those missing rows, so the averaging works as expected.
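
The value replacements in tasks 1 to 4 and the re-scaling in task 6 could look roughly like the sketch below. It is one possible approach on a toy table, not the reference solution: the column names (`air_temp`, `wind_dir`, `precip_1h`) and the scale factor of 10 are assumptions, so check the data documentation for the real ones.

```python
from math import nan

import pandas as pd

# Toy stand-in for the loaded data; column names and values are hypothetical.
df = pd.DataFrame({
    "air_temp": [215, -9999, 198, 201],
    "wind_dir": [360, 0, 90, -9999],
    "precip_1h": [-1, 0, 12, -9999],
})

# Task 1: turn the missing-data marker into nan (columns become float).
df = df.replace(-9999, nan)

# Tasks 2 and 3: the order matters. First 0 -> nan (undetermined),
# then 360 -> 0 (North); the other way round would erase North.
df["wind_dir"] = df["wind_dir"].replace(0, nan)
df["wind_dir"] = df["wind_dir"].replace(360, 0)

# Task 4: remember trace amounts in a separate column, then set -1 -> 0.
df["precip_1h_trace"] = df["precip_1h"] == -1
df["precip_1h"] = df["precip_1h"].replace(-1, 0)

# Task 6: undo the scaling (a factor of 10 is assumed here).
# This happens after the trace step so that -1 is still recognisable.
for column in ["air_temp", "precip_1h"]:
    df[column] = df[column] / 10

print(df)
```

The 6-hour precipitation column would be treated the same way as `precip_1h`.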

Hints for Solving the Task

If you are seriously stuck, you can take a look at the solution hints.
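
For the missing-rows check, one possible starting point is sketched below. It assumes that the table is indexed by hourly timestamps, which may not match how you loaded the data; the year 2020, the column name `air_temp`, and the three dropped hours are purely illustrative.

```python
import calendar

import pandas as pd

year = 2020  # illustrative; use the year of your own data set

# Number of hours the year should have, accounting for leap years.
expected_hours = (366 if calendar.isleap(year) else 365) * 24

# One timestamp for every hour of the year.
full_index = pd.date_range(
    start=f"{year}-01-01 00:00",
    end=f"{year}-12-31 23:00",
    freq=pd.Timedelta(hours=1),
)

# Toy data with three hours removed to mimic gaps (hypothetical values).
observed = pd.DataFrame({"air_temp": 20.0}, index=full_index.delete([5, 100, 2000]))

# Timestamps that should exist but do not appear in the data.
missing = full_index.difference(observed.index)
print(expected_hours, len(observed), len(missing))

# Add placeholder rows: reindexing fills the new rows with NaN.
filled = observed.reindex(full_index)
```

Because the placeholder rows contain NaN, pandas skips them by default when computing means.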