Python for Data Analysis: Pandas

Basic

Hyerang Raina Kim
3 min readAug 12, 2020

Tool review

Anaconda?

  • Python + Libraries + I/O Program

Jupyter Notebook

  • Python Debugging Tool

Pandas Data Structure

columns = [‘apples’, ‘oranges’]

index = [0,1,2,3]

Pandas Basic Commands

  • Importing data: read_excel(), read_csv()
  • Selection: df.loc(), df.iloc()
  • Changing index/column: columns/index, reset_index()

1. Importing pandas

import pandas as pd

2. Importing data

pd.read_excel('<filepath> + <filename>.xlsx')
pd.read_csv('<filepath> + <filename>.csv')

Filepath can be written either in absolute path and relative path.

  • For relative path, “./” -> means current jupyter notebook file location

Example:

fpath = './data/exam.xlsx'
raw = pd.read_excel(fpath)
#shows the upper part of the data up to 5 items
raw.head()
#shows the bottom part of the data up to 5 items
raw.tail()
#shows the first item of the data
raw.head(1)
#shows the amount and type of the index/column data
raw.info()
#shows the basic statistics for each column
raw.describe()
#set the certain column to be index
raw.set_index('index')
#assign the index when importing data
pd.read_excel('filename', index_col = <column_info>)
fpath = './data/exam.xlsx'
raw = pd.read_excel(fpath, index_col = 'index')

3. Selection

1) Single value selection

  • df.iloc[row,column]: select by index
  • df.loc[row,column]: select by name
raw.iloc[1,1] #3
raw.loc[1,'oranges'] #3

2) Multiple value selection

  • df.loc[index_name, column_name]
  • df.iloc[index_number, column_number]

Put list type data into Index/Column info (slicing)

raw.loc[1, ['apples','oranges']]#list slicing
raw.loc[1, 'apples':]

3) Column Selection
To select the entire single column,

  • data.<column_name>
  • data[‘<column_name>’]
  • data.loc[:, <column_name>]

4) Conditional Data Selection

df[condition] → Only True value data will be printed

  • condition: List or series that has a type of True/False
cond = raw['orange'] < 10
print(cond)
# Then it prints the True/False data for each column

5) Multiple conditions

cond = [True, False, True]
raw[cond]
# Then it prints the columns that satisfy the conditions listed
cond = (raw['apple'] >= 5) & (raw['orange'] >= 3)
raw[cond]

4. Adding Data

  • Add the column data
  • df[‘column_name’] = data
  • data → single value (same value for all), group (list, pandas series)

5. Merging the data table

# importing the data (index_col = 'index')fpath = './data/exam.xlsx'
A = pd.read_excel(fpath, index_col = 'index')
# importing the second data (index_col = 'index')fpath2 = './data/exam_extra.xlsx'
B = pd.read_excel(fpath2, index_col = 'index')
pd.merge(A, B, how = "left", left_on = "<column_name>" or left_index = True, right_on = "<column_name>" or right_index = True)

👉 how

  • left: data from the left table base
  • right: data from the right table base
  • inner: data from both
  • outer: entire data

6. Saving the merged data

# Merged datatotal = pd.merge(A, B, how = "left", left_index = True, right_index = True)# Saving the data in exam_total.xlsxfpath = './data/exam_total.xlsx'
total.to_excel(fpath)

Pandas data frame always has its index value. We use index = False if we don’t want its index value to be saved.

fpath = './data/exam_total_withoutindex.xlsx'total.to_excel(fpath, index = False)

Cleanup & Aggregation

import pandas as pd# importing datafpath = './data/babyNamesUS.csv'
raw = pd.read_csv(fpath)
# see the first five rowsraw.head()# see the data structureraw.info()

Aggregation

df.pivot_table(index = '<column_name>', columns = '<column_name>', values = '<column_name>', aggfunc = 'sum')

✔️ aggfunc options: sum, count, mean …

How to fill in the empty space in the data table?

  • Common values (ex. 0)
  • Temporary values (ex. mean, max, min, or any values around)
  • exempting the emptied data
name_df = name_df.fillna(0)

Sorting

name_df.sort_values(by = '<column_name>', ascending = False)

Check the data types by each column

df['<column>'].unique()
df['<column>'].value_counts()
# example by typesraw['StateCode'].unique()>>> array(['AK','AL','AR','AZ','CA','CO], dtype = object)# example by countsraw['StateCode'].value_counts()>>> CA 361128
AL 128556
AZ 108599
CO 101403
...

Sign up to discover human stories that deepen your understanding of the world.

--

--

No responses yet

Write a response