Pandas for data stuff: Code Challenge

Alex Wilson
·Jul 14, 2017

Python Friday #4

After evaluating my planned projects, I found that most of them involve generating, collecting, and analyzing data. To prepare, I did a bunch of searching around the internet and found this website called Data Quest.

Data Quest is Fantastic!

I’m only just finishing my first course, but it has been a great experience so far. It’s what got me started messing around with pandas 🐼, and it’s what made me realize I need a deep understanding of pandas to do some of the things I plan to accomplish in the near future.

Getting Started

I haven’t created anything yet that generates or collects any data, so I did some Googling in an attempt to find some free data sets. I stumbled upon this Project Data Sets site and found a bunch of .csv files to mess around with.

Scenario

In this challenge, I’ll be using the Cars.csv file from the project data sets website I mentioned earlier.

I know from the file description on the website that this file has data for over 400 cars.

The first 3 things that come to mind for me with cars are MPG, price, and car features. I don’t particularly care about how quickly a car can reach 80 MPH, just so long as it can reach it.

Unfortunately, the data set doesn’t have anything in it that I look at other than MPG. That’s OK, there is still plenty to be done!

Tasks

  1. Read the .csv file directly into a pandas data frame. This will make all future steps much easier.
  2. Find what the name of each column is.
  3. Find the car with the highest MPG.
  4. Find the car with the lowest MPG.
  5. Find the average MPG of all cars.
  6. Find the most common MPG.

Rules

  1. Use pandas to the max! Meaning, use only pandas features to achieve the above items.
  2. That’s it!
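As a rough sketch of how the six tasks map onto pandas calls, here is one possible shape for them. The tiny inline data set and the "Car" column name are made up for illustration (only the "MPG" column name comes from this post), and io.StringIO is just standing in for the real file path.

```python
import io
import pandas as pd

# Hypothetical stand-in for Cars.csv: invented rows, semicolon-separated.
SAMPLE_CSV = io.StringIO(
    "Car;MPG\n"
    "Car A;18.0\n"
    "Car B;31.5\n"
    "Car C;18.0\n"
    "Car D;24.0\n"
)

# Task 1: read the file straight into a data frame.
cars = pd.read_csv(SAMPLE_CSV, sep=";")

# Task 2: the column names.
column_names = list(cars.columns)

# Tasks 3 and 4: idxmax/idxmin give the row label of the extreme value,
# and .loc pulls out that whole row.
best = cars.loc[cars["MPG"].idxmax()]
worst = cars.loc[cars["MPG"].idxmin()]

# Task 5: the average MPG.
average_mpg = cars["MPG"].mean()

# Task 6: the most common MPG. mode() returns a Series because ties are possible.
most_common_mpg = cars["MPG"].mode()

print(column_names)
print(best["Car"], worst["Car"], average_mpg, most_common_mpg.tolist())
```

Note that idxmax and idxmin return only the first matching row when several cars share the same value.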

Let’s see the code

Here is my solution code. AS USUAL I am not claiming that my code is THE way to do these code challenges. It’s just one way to get the results I was seeking.

Some Gotchas

1) The data set provided by that website, while convenient, was not comma-separated. It was semicolon-separated. In this regard, pandas is a truly baller utility! read_csv lets you specify the separator directly.

import pandas
CARS = pandas.read_csv('your_csv_path.csv', sep=';')

Before I specified that separator, pandas printed out all kinds of weird stuff without throwing an error in the terminal, most likely because each whole line was being read as a single field.
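To see why a missing separator fails quietly, here is a small sketch with invented rows (the "Car" column name is an assumption, not taken from the real file). It reads the same text twice, once with the default comma separator and once with sep=';'.

```python
import io
import pandas as pd

# Invented semicolon-separated sample; not the real Cars.csv contents.
RAW = "Car;MPG\nCar A;18.0\nCar B;31.5\n"

# Default separator is a comma, so every line ends up as one single field.
wrong = pd.read_csv(io.StringIO(RAW))

# With the separator given, the line is split into two columns.
right = pd.read_csv(io.StringIO(RAW), sep=";")

print(list(wrong.columns))  # a single column whose name is the whole header line
print(wrong.shape, right.shape)
```

Checking the shape or the column list right after reading is a cheap way to catch this before doing any math.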

2) I was trying to do math on strings! When you are trying to break data out of the MPG column, you need to be sure you create a series with numeric values and not strings.

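# Change the column type IN the dataframe.
# Row 0 holds the label 'DOUBLE', so convert from row 1 onward;
# on assignment pandas aligns by index, which leaves row 0 as NaN.
CARS['MPG'] = CARS['MPG'].iloc[1:].astype(float)

I started at row one because row zero of the MPG column held the value ‘DOUBLE’, which looks like a data-type label rather than a number, so it can’t be converted to a float. I immediately gave up on fighting that and just re-assigned from row one.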
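For comparison, here are two other ways the label row could be handled. This is only a sketch under assumptions: the sample rows are invented, and I am assuming the label row is the second line of the file, with 'STRING' as a made-up label for the hypothetical "Car" column.

```python
import io
import pandas as pd

# Invented sample with a type-label row directly under the header.
RAW = (
    "Car;MPG\n"
    "STRING;DOUBLE\n"
    "Car A;18.0\n"
    "Car B;0\n"
    "Car C;31.5\n"
)

# Option A: skip the label row while reading (file line 1, counting from 0).
cars_a = pd.read_csv(io.StringIO(RAW), sep=";", skiprows=[1])

# Option B: read everything, turn anything non-numeric into NaN,
# then drop the rows where MPG could not be parsed.
cars_b = pd.read_csv(io.StringIO(RAW), sep=";")
cars_b["MPG"] = pd.to_numeric(cars_b["MPG"], errors="coerce")
cars_b = cars_b.dropna(subset=["MPG"]).reset_index(drop=True)

print(cars_a["MPG"].dtype, cars_b["MPG"].dtype)
```

Option A relies on knowing exactly where the label row is. Option B is more forgiving, but it silently drops any other row with an unparseable MPG.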

3) There were a bunch of zero values in the data set. To clean the data I created a filter to remove those zero values from the data set.

# Create a zero filter
NON_ZERO_FILTER = CARS['MPG'] != 0
# Filtered MPG
MPG = CARS[NON_ZERO_FILTER]
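Here is a small sketch of what that filter changes, using invented rows and a hypothetical "Car" column. The comparison produces a boolean Series, and indexing the data frame with it keeps only the rows marked True.

```python
import pandas as pd

# Invented values; the zero plays the role of the zero rows in the real file.
cars = pd.DataFrame(
    {
        "Car": ["Car A", "Car B", "Car C", "Car D"],
        "MPG": [18.0, 0.0, 9.0, 31.5],
    }
)

non_zero_filter = cars["MPG"] != 0  # boolean Series, one entry per row
filtered = cars[non_zero_filter]    # keeps only rows where the mask is True

# The zero row wins "lowest MPG" and drags the average down unless removed.
lowest_unfiltered = cars.loc[cars["MPG"].idxmin(), "Car"]
lowest_filtered = filtered.loc[filtered["MPG"].idxmin(), "Car"]

print(lowest_unfiltered, lowest_filtered)
print(cars["MPG"].mean(), filtered["MPG"].mean())
```

The filtered frame keeps its original row labels, which is why .loc with the label from idxmin still finds the right row.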

Next Time

Jupyter notebooks and a deep dive into the pandas documentation!

Take-away

This code challenge took me about an hour. Those little gotchas are time-consuming, but I feel like they make me stronger.

  • Pandas only feels a tiny bit alien. The data frame is so powerful. I think having that amount of power with .csv in the terminal is just foreign to me.
  • I really like that project data sets website. If you know of any other good data sets that are between 500 and 5000 rows, let me know!
  • AMC Ambassador Brougham: I have never even heard of this vehicle before, but apparently it only gets about 9 MPG.

Using previous challenges

I covered unit tests in the last challenge. There won’t really be anything to run unit tests on in today’s challenge, but you could write functional tests to check types when utilizing pandas if you feel like it. Again, not really needed though, because pandas pretty-prints to the console.
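If you did want such a type check, it might look something like this sketch. The check_mpg_column helper is hypothetical (it is not part of pandas or of my solution), and it leans on is_numeric_dtype from pandas.api.types.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_mpg_column(frame):
    """Return a list of problems found in the MPG column (empty if none)."""
    if "MPG" not in frame.columns:
        return ["missing MPG column"]
    problems = []
    if not is_numeric_dtype(frame["MPG"]):
        # Strings such as 'DOUBLE' or '18.0' land here.
        problems.append("MPG is not numeric")
    elif (frame["MPG"] == 0).any():
        problems.append("MPG contains zero values")
    return problems


# Invented frames to exercise the helper.
good = pd.DataFrame({"MPG": [18.0, 31.5]})
bad = pd.DataFrame({"MPG": ["DOUBLE", "18.0"]})

print(check_mpg_column(good))
print(check_mpg_column(bad))
```

A check like this would have flagged both the string gotcha and the zero gotcha before any math ran.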

Next week, I will attempt to set up a Jupyter notebook, and do a deep dive into the pandas documentation.

Thanks for reading! 💚