Exercise 6

Pandas Quest

Today we will use pandas to analyze a groundbreaking dataset. The Hipparcos space astrometry mission, launched in 1989 by the European Space Agency, stands as one of history's most successful space endeavors. It charted the positions of over a million stars: roughly a hundred thousand with extremely high precision, and the rest with good, albeit lower, precision. Additional information about Hipparcos is available here: https://www.cosmos.esa.int/web/hipparcos

Today, you will be diving into the publicly released dataset of this ESA mission! (And you don't need to know anything about astronomy.)

Please fork and then clone the exercise repository from GitHub: https://github.com/ds-grundlagen/pandas-quest

  1. Download and unzip the dataset with one of:
    • wget -O hipparcos-star-catalog.zip https://www.kaggle.com/api/v1/datasets/download/konivat/hipparcos-star-catalog
    • curl -L -o hipparcos-star-catalog.zip https://www.kaggle.com/api/v1/datasets/download/konivat/hipparcos-star-catalog
  2. Open the dataset with pandas and look at the first few rows. Make a list of all the columns. How many rows are there?
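As a rough sketch of the inspection step, the snippet below reads a tiny made-up CSV from memory instead of the real catalog. The sample values are invented for illustration; with the real file you would pass the path of the unzipped CSV (whatever it is called on your machine) to read_csv.

```python
import io

import pandas as pd

# Tiny stand-in for the real catalog; the values are made up for illustration.
csv_text = """HIP,RAdeg,DEdeg,Plx
1,0.0009,1.0890,3.54
2,0.0038,-19.4988,21.90
3,0.0050,38.8593,2.81
"""

# With the real file you would pass its path instead of a StringIO buffer.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())            # first rows (5 by default)
columns = list(df.columns)  # column names as a plain Python list
n_rows = len(df)            # number of rows; df.shape gives (rows, columns)
print(columns, n_rows)
```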
  3. The positions of the stars in the sky are given by two coordinates, Right Ascension (RA) and Declination, both measured in degrees; the columns are RAdeg and DEdeg. Plot the position of all the stars in the dataset, with RA on the X-axis and Declination on the Y-axis. Adjust the visual attributes of your plot so it's easy to see the overall distribution.
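One possible way to keep a dense scatter plot readable is to use small, semi-transparent markers. The sketch below uses randomly generated coordinates rather than the real catalog, and the marker size, transparency and figure size are arbitrary starting points, not recommended values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in coordinates; the real data comes from RAdeg and DEdeg.
rng = np.random.default_rng(0)
sky = pd.DataFrame({
    "RAdeg": rng.uniform(0, 360, 5000),
    "DEdeg": rng.uniform(-90, 90, 5000),
})

fig, ax = plt.subplots(figsize=(10, 5))
# Small markers (s) and transparency (alpha) reduce overplotting.
ax.scatter(sky["RAdeg"], sky["DEdeg"], s=1, alpha=0.3)
ax.set_xlabel("Right Ascension (deg)")
ax.set_ylabel("Declination (deg)")
```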
  4. One way of measuring distance in astronomy is via a phenomenon called "parallax". The corresponding column in the DataFrame is Plx, in units of milliarcseconds (mas). Use describe() to inspect this column. Do you see anything strange?
    • It turns out that many rows have no measurement (values of np.nan) and some others have unphysical values.
    • Create a new DataFrame which only includes rows where Plx is not NaN and is also greater than 0.1.
      • Hint: use boolean indexing with the Series method notna() to help
    • Add a new column to this filtered df called d_kpc which is $\frac{1}{\mathrm{Plx}}$. With Plx in mas, this represents the distance to each star in kiloparsecs.
    • Add a new column called d_ly which will represent distance in lightyears. Use the conversion factor $1\ \mathrm{kpc} = 3261.564\ \mathrm{ly}$.
    • Plot the distribution of distances in lightyears using a seaborn histplot
    • Try also making the same plot with log_scale=True. Which plot better shows the distance distribution?
    • What is the closest star to Earth in the database, and how far away is it in lightyears? Does this closely match the distance and coordinates you find on Google for the nearest star to Earth (other than the Sun)?
    • For the rest of the exercise, please continue working with this filtered dataset.
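The filtering and distance steps above could be sketched as follows on a small made-up table. The HIP numbers and parallax values are invented, and the names stars, LY_PER_KPC and nearest are just illustrative choices.

```python
import numpy as np
import pandas as pd

LY_PER_KPC = 3261.564  # conversion factor given in the exercise

# Made-up parallaxes (mas), including a missing and two unphysical values.
df = pd.DataFrame({
    "HIP": [1, 2, 3, 4, 5, 6],
    "Plx": [np.nan, -1.5, 0.05, 2.0, 100.0, 500.0],
})

# Boolean indexing: keep rows with a measured parallax above 0.1 mas.
# .copy() makes the result independent, so adding columns is safe.
mask = df["Plx"].notna() & (df["Plx"] > 0.1)
stars = df[mask].copy()

stars["d_kpc"] = 1 / stars["Plx"]             # Plx in mas -> distance in kpc
stars["d_ly"] = stars["d_kpc"] * LY_PER_KPC   # kpc -> lightyears

# Row with the smallest distance.
nearest = stars.loc[stars["d_ly"].idxmin()]
print(nearest)
```

Note that a comparison such as Plx > 0.1 is already False for NaN, so the notna() check is redundant here; it is kept because it states the intent explicitly.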
  5. Stars are often classified by their "spectral type", which determines much of the physics that dictates how they behave and evolve. The column for this is SpType. In Astronomy 101, one learns that the main spectral types are O, B, A, F, G, K, M.
    • Examine the SpType column. How many different spectral types are there?
    • It turns out, these classes get really specific and there are tons and tons of subclasses. But in general, the first letter of the class comes from one of the 7 above.
    • Let's now create a new column that derives the simple, single lettered spectral class:
      • Write a function that takes a string and returns its first letter if that letter is one of the 7 above. Otherwise, it returns the exact string "Other".
      • Create a new column called sp_class with the .apply() method, populated using the function above.
      • Use .value_counts() to look at the distribution of this new column.
      • Make a countplot using seaborn to visualize this distribution. The order of the categories in the plot should be ["O", "B", "A", "F", "G", "K", "M", "Other"]
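A minimal sketch of the classification step is shown below on a handful of made-up spectral types. The function name simple_class is hypothetical, and treating missing or non-string values as "Other" is an assumption the exercise text does not spell out.

```python
import pandas as pd

MAIN_CLASSES = ("O", "B", "A", "F", "G", "K", "M")
ORDER = list(MAIN_CLASSES) + ["Other"]

def simple_class(sp_type):
    """Return the first letter if it is a main class, else "Other"."""
    # Assumption: missing values and empty strings also count as "Other".
    if isinstance(sp_type, str) and sp_type and sp_type[0] in MAIN_CLASSES:
        return sp_type[0]
    return "Other"

# Made-up examples, including subclasses, odd prefixes and a missing value.
stars = pd.DataFrame({"SpType": ["G2V", "M5", "K0III", "sdB", None, "B9", "DA"]})
stars["sp_class"] = stars["SpType"].apply(simple_class)

# Counts in the requested category order; absent classes show up as 0.
counts = stars["sp_class"].value_counts().reindex(ORDER, fill_value=0)
print(counts)
```

The same ORDER list can be reused when fixing the category order in a plot.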
  6. Make a new DataFrame based on this one which contains only the columns HIP, d_ly and sp_class
    • Save this 3-columned DataFrame to a CSV file without the index included
    • From the root directory of your project, check your work by running the command uv run ./check-hipparcos.py ./path/to/your/final-results.csv
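Selecting columns and writing a CSV without the index might look like the sketch below. The table is made up, and the CSV is rendered to a string here only so the example leaves no file behind; passing a file path as the first argument of to_csv writes to disk instead.

```python
import pandas as pd

# Made-up rows standing in for the filtered catalog.
stars = pd.DataFrame({
    "HIP": [4, 5, 6],
    "Plx": [2.0, 100.0, 500.0],
    "d_ly": [1630.782, 32.61564, 6.523128],
    "sp_class": ["G", "K", "M"],
})

# A list of column names selects (and orders) the columns to keep.
result = stars[["HIP", "d_ly", "sp_class"]]

# index=False leaves out the row index; without a path, to_csv returns a string.
csv_text = result.to_csv(index=False)
print(csv_text)
```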

Submit the answer: