Pandas Basics
Pandas DataFrames
Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the Numpy package and its key data structure is called the DataFrame. DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables.
There are several ways to create a DataFrame. One way way is to use a dictionary. For example:
eyJsYW5ndWFnZSI6InB5dGhvbiIsInNhbXBsZSI6ImRpY3QgPSB7XCJjb3VudHJ5XCI6IFtcIkJyYXppbFwiLCBcIlJ1c3NpYVwiLCBcIkluZGlhXCIsIFwiQ2hpbmFcIiwgXCJTb3V0aCBBZnJpY2FcIl0sXG4gICAgICAgXCJjYXBpdGFsXCI6IFtcIkJyYXNpbGlhXCIsIFwiTW9zY293XCIsIFwiTmV3IERlaGxpXCIsIFwiQmVpamluZ1wiLCBcIlByZXRvcmlhXCJdLFxuICAgICAgIFwiYXJlYVwiOiBbOC41MTYsIDE3LjEwLCAzLjI4NiwgOS41OTcsIDEuMjIxXSxcbiAgICAgICBcInBvcHVsYXRpb25cIjogWzIwMC40LCAxNDMuNSwgMTI1MiwgMTM1NywgNTIuOThdIH1cblxuaW1wb3J0IHBhbmRhcyBhcyBwZFxuYnJpY3MgPSBwZC5EYXRhRnJhbWUoZGljdClcbnByaW50KGJyaWNzKSJ9
As you can see with the new brics
DataFrame, Pandas has assigned a key for each country as the numerical values 0 through 4. If you would like to have different index values, say, the two letter country code, you can do that easily as well.
eyJsYW5ndWFnZSI6InB5dGhvbiIsInByZV9leGVyY2lzZV9jb2RlIjoiZGljdCA9IHtcImNvdW50cnlcIjogW1wiQnJhemlsXCIsIFwiUnVzc2lhXCIsIFwiSW5kaWFcIiwgXCJDaGluYVwiLCBcIlNvdXRoIEFmcmljYVwiXSxcbiAgICAgICBcImNhcGl0YWxcIjogW1wiQnJhc2lsaWFcIiwgXCJNb3Njb3dcIiwgXCJOZXcgRGVobGlcIiwgXCJCZWlqaW5nXCIsIFwiUHJldG9yaWFcIl0sXG4gICAgICAgXCJhcmVhXCI6IFs4LjUxNiwgMTcuMTAsIDMuMjg2LCA5LjU5NywgMS4yMjFdLFxuICAgICAgIFwicG9wdWxhdGlvblwiOiBbMjAwLjQsIDE0My41LCAxMjUyLCAxMzU3LCA1Mi45OF0gfVxuaW1wb3J0IHBhbmRhcyBhcyBwZFxuYnJpY3MgPSBwZC5EYXRhRnJhbWUoZGljdCkiLCJzYW1wbGUiOiIjIFNldCB0aGUgaW5kZXggZm9yIGJyaWNzXG5icmljcy5pbmRleCA9IFtcIkJSXCIsIFwiUlVcIiwgXCJJTlwiLCBcIkNIXCIsIFwiU0FcIl1cblxuIyBQcmludCBvdXQgYnJpY3Mgd2l0aCBuZXcgaW5kZXggdmFsdWVzXG5wcmludChicmljcykiLCJzb2x1dGlvbiI6ImJyaWNzLmluZGV4ID0gW1wiQlJcIiwgXCJSVVwiLCBcIklOXCIsIFwiQ0hcIiwgXCJTQVwiXVxucHJpbnQoYnJpY3MpIiwic2N0Ijoic3VjY2Vzc19tc2coXCJHcmVhdCBqb2IhXCIpIn0=
Another way to create a DataFrame is by importing a csv file using Pandas. Now, the csv cars.csv
is stored and can be imported using pd.read_csv
:
eyJsYW5ndWFnZSI6InB5dGhvbiIsInByZV9leGVyY2lzZV9jb2RlIjoiZiA9IG9wZW4oJ2NhcnMuY3N2JywgXCJ3XCIpXG5mLndyaXRlKFwiXCJcIixjYXJzX3Blcl9jYXAsY291bnRyeSxkcml2ZXNfcmlnaHRcblVTLDgwOSxVbml0ZWQgU3RhdGVzLFRydWVcbkFVUyw3MzEsQXVzdHJhbGlhLEZhbHNlXG5KQVAsNTg4LEphcGFuLEZhbHNlXG5JTiwxOCxJbmRpYSxGYWxzZVxuUlUsMjAwLFJ1c3NpYSxUcnVlXG5NT1IsNzAsTW9yb2NjbyxUcnVlXG5FRyw0NSxFZ3lwdCxUcnVlXCJcIlwiKVxuZi5jbG9zZSgpIiwic2FtcGxlIjoiIyBJbXBvcnQgcGFuZGFzIGFzIHBkXG5pbXBvcnQgcGFuZGFzIGFzIHBkXG5cbiMgSW1wb3J0IHRoZSBjYXJzLmNzdiBkYXRhOiBjYXJzXG5jYXJzID0gcGQucmVhZF9jc3YoJ2NhcnMuY3N2JylcblxuIyBQcmludCBvdXQgY2Fyc1xucHJpbnQoY2FycykiLCJzb2x1dGlvbiI6IiMgSW1wb3J0IHBhbmRhcyBhcyBwZFxuaW1wb3J0IHBhbmRhcyBhcyBwZFxuXG4jIEltcG9ydCB0aGUgY2Fycy5jc3YgZGF0YTogY2Fyc1xuY2FycyA9IHBkLnJlYWRfY3N2KCdjYXJzLmNzdicpXG5cbiMgUHJpbnQgb3V0IGNhcnNcbnByaW50KGNhcnMpIiwic2N0Ijoic3VjY2Vzc19tc2coXCJHcmVhdCBqb2IhXCIpIn0=
Indexing DataFrames
There are several ways to index a Pandas DataFrame. One of the easiest ways to do this is by using square bracket notation.
In the example below, you can use square brackets to select one column of the cars
DataFrame. You can either use a single bracket or a double bracket. The single bracket will output a Pandas Series, while a double bracket will output a Pandas DataFrame.
eyJsYW5ndWFnZSI6InB5dGhvbiIsInByZV9leGVyY2lzZV9jb2RlIjoiZiA9IG9wZW4oJ2NhcnMuY3N2JywgXCJ3XCIpXG5mLndyaXRlKFwiXCJcIixjYXJzX3Blcl9jYXAsY291bnRyeSxkcml2ZXNfcmlnaHRcblVTLDgwOSxVbml0ZWQgU3RhdGVzLFRydWVcbkFVUyw3MzEsQXVzdHJhbGlhLEZhbHNlXG5KQVAsNTg4LEphcGFuLEZhbHNlXG5JTiwxOCxJbmRpYSxGYWxzZVxuUlUsMjAwLFJ1c3NpYSxUcnVlXG5NT1IsNzAsTW9yb2NjbyxUcnVlXG5FRyw0NSxFZ3lwdCxUcnVlXCJcIlwiKVxuZi5jbG9zZSgpIiwic2FtcGxlIjoiIyBJbXBvcnQgcGFuZGFzIGFuZCBjYXJzLmNzdlxuaW1wb3J0IHBhbmRhcyBhcyBwZFxuY2FycyA9IHBkLnJlYWRfY3N2KCdjYXJzLmNzdicsIGluZGV4X2NvbCA9IDApXG5cbiMgUHJpbnQgb3V0IGNvdW50cnkgY29sdW1uIGFzIFBhbmRhcyBTZXJpZXNcbnByaW50KGNhcnNbJ2NhcnNfcGVyX2NhcCddKVxuXG4jIFByaW50IG91dCBjb3VudHJ5IGNvbHVtbiBhcyBQYW5kYXMgRGF0YUZyYW1lXG5wcmludChjYXJzW1snY2Fyc19wZXJfY2FwJ11dKVxuXG4jIFByaW50IG91dCBEYXRhRnJhbWUgd2l0aCBjb3VudHJ5IGFuZCBkcml2ZXNfcmlnaHQgY29sdW1uc1xucHJpbnQoY2Fyc1tbJ2NhcnNfcGVyX2NhcCcsICdjb3VudHJ5J11dKSIsInNvbHV0aW9uIjoiIyBJbXBvcnQgcGFuZGFzIGFuZCBjYXJzLmNzdlxuaW1wb3J0IHBhbmRhcyBhcyBwZFxuY2FycyA9IHBkLnJlYWRfY3N2KCdjYXJzLmNzdicsIGluZGV4X2NvbCA9IDApXG5cbiMgUHJpbnQgb3V0IGNvdW50cnkgY29sdW1uIGFzIFBhbmRhcyBTZXJpZXNcbnByaW50KGNhcnNbJ2NhcnNfcGVyX2NhcCddKVxuXG4jIFByaW50IG91dCBjb3VudHJ5IGNvbHVtbiBhcyBQYW5kYXMgRGF0YUZyYW1lXG5wcmludChjYXJzW1snY2Fyc19wZXJfY2FwJ11dKVxuXG4jIFByaW50IG91dCBEYXRhRnJhbWUgd2l0aCBjb3VudHJ5IGFuZCBkcml2ZXNfcmlnaHQgY29sdW1uc1xucHJpbnQoY2Fyc1tbJ2NhcnNfcGVyX2NhcCcsICdjb3VudHJ5J11dKSIsInNjdCI6InN1Y2Nlc3NfbXNnKFwiR3JlYXQgam9iIVwiKSJ9
Square brackets can also be used to access observations (rows) from a DataFrame. For example:
eyJsYW5ndWFnZSI6InB5dGhvbiIsInByZV9leGVyY2lzZV9jb2RlIjoiZiA9IG9wZW4oJ2NhcnMuY3N2JywgXCJ3XCIpXG5mLndyaXRlKFwiXCJcIixjYXJzX3Blcl9jYXAsY291bnRyeSxkcml2ZXNfcmlnaHRcblVTLDgwOSxVbml0ZWQgU3RhdGVzLFRydWVcbkFVUyw3MzEsQXVzdHJhbGlhLEZhbHNlXG5KQVAsNTg4LEphcGFuLEZhbHNlXG5JTiwxOCxJbmRpYSxGYWxzZVxuUlUsMjAwLFJ1c3NpYSxUcnVlXG5NT1IsNzAsTW9yb2NjbyxUcnVlXG5FRyw0NSxFZ3lwdCxUcnVlXCJcIlwiKVxuZi5jbG9zZSgpIiwic2FtcGxlIjoiIyBJbXBvcnQgY2FycyBkYXRhXG5pbXBvcnQgcGFuZGFzIGFzIHBkXG5jYXJzID0gcGQucmVhZF9jc3YoJ2NhcnMuY3N2JywgaW5kZXhfY29sID0gMClcblxuIyBQcmludCBvdXQgZmlyc3QgNCBvYnNlcnZhdGlvbnNcbnByaW50KGNhcnNbMDo0XSlcblxuIyBQcmludCBvdXQgZmlmdGggYW5kIHNpeHRoIG9ic2VydmF0aW9uXG5wcmludChjYXJzWzQ6Nl0pIiwic29sdXRpb24iOiIjIEltcG9ydCBjYXJzIGRhdGFcbmltcG9ydCBwYW5kYXMgYXMgcGRcbmNhcnMgPSBwZC5yZWFkX2NzdignY2Fycy5jc3YnLCBpbmRleF9jb2wgPSAwKVxuXG4jIFByaW50IG91dCBmaXJzdCA0IG9ic2VydmF0aW9uc1xucHJpbnQoY2Fyc1swOjRdKVxuXG4jIFByaW50IG91dCBmaWZ0aCBhbmQgc2l4dGggb2JzZXJ2YXRpb25cbnByaW50KGNhcnNbNDo2XSkiLCJzY3QiOiJzdWNjZXNzX21zZyhcIkdyZWF0IGpvYiFcIikifQ==
You can also use loc
and iloc
to perform just about any data selection operation. loc
is label-based, which means that you have to specify rows and columns based on their row and column labels. iloc
is integer index based, so you have to specify rows and columns by their integer index like you did in the previous exercise.
eyJsYW5ndWFnZSI6InB5dGhvbiIsInByZV9leGVyY2lzZV9jb2RlIjoiZiA9IG9wZW4oJ2NhcnMuY3N2JywgXCJ3XCIpXG5mLndyaXRlKFwiXCJcIixjYXJzX3Blcl9jYXAsY291bnRyeSxkcml2ZXNfcmlnaHRcblVTLDgwOSxVbml0ZWQgU3RhdGVzLFRydWVcbkFVUyw3MzEsQXVzdHJhbGlhLEZhbHNlXG5KQVAsNTg4LEphcGFuLEZhbHNlXG5JTiwxOCxJbmRpYSxGYWxzZVxuUlUsMjAwLFJ1c3NpYSxUcnVlXG5NT1IsNzAsTW9yb2NjbyxUcnVlXG5FRyw0NSxFZ3lwdCxUcnVlXCJcIlwiKVxuZi5jbG9zZSgpIiwic2FtcGxlIjoiIyBJbXBvcnQgY2FycyBkYXRhXG5pbXBvcnQgcGFuZGFzIGFzIHBkXG5jYXJzID0gcGQucmVhZF9jc3YoJ2NhcnMuY3N2JywgaW5kZXhfY29sID0gMClcblxuIyBQcmludCBvdXQgb2JzZXJ2YXRpb24gZm9yIEphcGFuXG5wcmludChjYXJzLmlsb2NbMl0pXG5cbiMgUHJpbnQgb3V0IG9ic2VydmF0aW9ucyBmb3IgQXVzdHJhbGlhIGFuZCBFZ3lwdFxucHJpbnQoY2Fycy5sb2NbWydBVVMnLCAnRUcnXV0pIiwic29sdXRpb24iOiIjIEltcG9ydCBjYXJzIGRhdGFcbmltcG9ydCBwYW5kYXMgYXMgcGRcbmNhcnMgPSBwZC5yZWFkX2NzdignY2Fycy5jc3YnLCBpbmRleF9jb2wgPSAwKVxuXG4jIFByaW50IG91dCBvYnNlcnZhdGlvbiBmb3IgSmFwYW5cbnByaW50KGNhcnMuaWxvY1syXSlcblxuIyBQcmludCBvdXQgb2JzZXJ2YXRpb25zIGZvciBBdXN0cmFsaWEgYW5kIEVneXB0XG5wcmludChjYXJzLmxvY1tbJ0FVUycsICdFRyddXSkiLCJzY3QiOiJzdWNjZXNzX21zZyhcIkdyZWF0IGpvYiFcIikifQ==