Task 2: Loading the Data

To load the data you can use the pandas.read_csv() function. (read_csv() documentation)

Hints

  • In these data sets the separator between the data fields is not a comma but one or more whitespace characters. You can use the regular expression "\s+" to express this in Python.
  • Note the parse_dates parameter of the read_csv() function, which can come in extremely handy.
  • Note that the data set as provided has no header.
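
A minimal sketch of how the separator and header hints fit together, assuming a made-up sample with six whitespace-separated fields (the sample values and the column layout are illustrative only and are not taken from the real data set):

```python
import io

import pandas as pd

# Hypothetical sample: year, month, day, hour and two measurement values,
# separated by a varying number of spaces, with no header line.
sample = io.StringIO(
    "2015  1  1  0   -12    1023\n"
    "2015  1  1  1   -15    1021\n"
    "2015  1  1  2   -18    1020\n"
)

# sep=r"\s+" treats any run of whitespace as a single separator;
# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv(sample, sep=r"\s+", header=None)

print(df.shape)
print(df.head())
```

Without header=None, pandas would use the first data row as column names and the frame would come out one row short.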

As noted previously, the downloaded data is compressed as a gzip (.gz) file. You could decompress it before working with it, which is especially useful if you want to inspect the data beforehand with a plain text editor or another tool. However, the read_csv() function can read such a file directly.
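
The following sketch illustrates reading a gzip-compressed file without decompressing it first. It writes a small, made-up sample to a temporary .gz file (the file name sample.gz and its contents are assumptions for illustration) and relies on read_csv() inferring the compression from the file extension:

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical content standing in for the real, downloaded file.
text = "2015  1  1  0   -12\n2015  1  1  1   -15\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.gz")

    # Write the sample as gzip-compressed text.
    with gzip.open(path, "wt") as f:
        f.write(text)

    # The default compression="infer" detects gzip from the ".gz" suffix.
    df_gz = pd.read_csv(path, sep=r"\s+", header=None)

print(df_gz)
```

If a file does not carry a recognizable extension, the compression can be stated explicitly with compression="gzip".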

Tasks

  1. Consider first what the loaded data should look like
  2. Load the data set using the read_csv() function from pandas. Combine the year, month, day and hour columns into a single timestamp column.
  3. Set the timestamp to be the index of your dataframe
  4. Display the loaded data and compare the result with your expectations
  5. Do a plausibility check:
       • Check the number of rows and columns
       • Check that the data inside the rows is displayed correctly (i.e. no columns got joined or torn apart), especially the date column
  6. Assign a proper header based on the information from the data documentation
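
The steps above could be sketched roughly as follows. This is only one possible approach on a made-up sample: the column names value_a and value_b are placeholders (the real names come from the data documentation), and the timestamp is assembled with pd.to_datetime() after loading rather than through the parse_dates parameter mentioned in the hints.

```python
import io

import pandas as pd

# Hypothetical sample; the real file has its own columns and values.
sample = io.StringIO(
    "2015  1  1  0   -12    1023\n"
    "2015  1  1  1   -15    1021\n"
    "2015  1  1  2   -18    1020\n"
)

# Placeholder header; replace it with the names from the data documentation.
names = ["year", "month", "day", "hour", "value_a", "value_b"]
df = pd.read_csv(sample, sep=r"\s+", header=None, names=names)

# pd.to_datetime() can assemble timestamps from columns that are
# named year, month, day (and optionally hour).
date_parts = ["year", "month", "day", "hour"]
df["timestamp"] = pd.to_datetime(df[date_parts])

# Drop the separate date parts and use the timestamp as the index.
df = df.drop(columns=date_parts).set_index("timestamp")

# Plausibility check: number of rows and columns, and the covered time span.
print(df.shape)
print(df.index.min(), df.index.max())
print(df.head())
```

Comparing df.shape and the first and last timestamps with what you expected from the raw file is a quick way to spot joined or split columns.
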
Hints for Solving the Task

If you are seriously stuck, you can take a look at the solution hints.