Python Basic, Cheat Sheets & ETL

Python Basic foundation understanding – How to get help

This is for beginner of python learner
Python is an object oriented programming
Keyword = Objects
Objects have properties 
We can apply methods (@actions) on objects
Basic objects
– data types
        – string = “a”, “b”, “text”
        – integer = 1,2,3
        – float = 1.0,2.0,3.0
        – Boolean True, False
– data containers 
   – variables  e.g.  file = ‘test.csv’, a =1, b =2.0
   – lists e.g. list_name = [‘test.csv’, 1, 2.0]
   – dictionary e.g. dict ={ key: value} => dict = { ‘apple’:1, ‘banana’: 2 , ‘coconut’: 3}
String methods
e.g. .strip(), upper, lower()
Integer/float methods
Math operations 
List methods 
e.g. append(), remove()
Dictionary methods 
 e.g. keys()
How to get help
1. Find out the data types  => type(object)
2. Find out the built-in methods => dir(object)

3. Get documentation of built in methods => help (object.method)

Personally, I would recommend the beginner to familiarize themselves with below 4 frequently used python packages for data science:

1. Numpy – http://www.numpy.org/
Cheat Sheet –  from DataCamp. One of the best place to learn data science skill, try it out at https://www.datacamp.com/

2. Scipy – https://www.scipy.org/

3. Matplotlib – https://matplotlib.org/
Cheat Sheet –  from DataCamp. One of the best place to learn data science skill, try it out at https://www.datacamp.com/

4. Pandas – https://pandas.pydata.org/
Cheat Sheet 1 –  from DataCamp. One of the best place to learn data science skill, try it out at https://www.datacamp.com/

Cheat Sheet 2 –  from https://pandas.pydata.org/

Additional data containers in Pandas:

  • Series
    • series = pd.Series ([1,-3,5,-7],index = [‘a’,’b’,’c’,’d’])
  • DataFrame
    • dataframe = pd.DataFrame({‘column_header1’:[‘column1_data1′,’column1_data2′,’column1_data3′],’column_header2’:[‘column2_data1′,’column2_data2′,’column2_data3′],’column_header3’:[‘column3_data1′,’column3_data2′,’column3_data3’] }, columns = [‘column_header1′,’column_header2′,’column_header3’])  
Pandas I/O
  • CSV
    • pd.csv_read(‘file.csv’, header = none, nrows =5)
    • pd.to_csv(‘myDataFrame.csv’)
  • Excel
    • pd.read_excel(‘file.xlsx’)
    • pd.to_excel(‘dir/myDataFrame.xlsx’, sheet_name =’sheet1′)
Exploring Data Sets
.shape
.head()
.tail()
.sample()
.ndim
.columns
.iloc([:,:])
.describe
.ix
.value_counts

.hist()

Each of the packages already has its own perfectly written tutorial. Make good use of the type(), dir() and help() and practice with real data, this is how I get started in python.

Python – ETL

I have recently done a small project in a financial institution. The Oracle SQL query has became so complex and took very long query execution time due to the schema design and other restriction of the Oracle database configuration. Therefore, I used the python Pandas to do the ETL instead.

Below is just a simplified version of the original Python script for the project, but the thought process should be similar in general:

import cx_Oracle
import pandas as pd
from pandas import DataFrame
con = cx_Oracle.connect(‘user/pass@host”)
cur = con.cursor()
rows = cur.execute(“select * from table”)
result = rows.fetchall()
result[10]
df = DataFrame(result, columns = [‘a’, ‘b’, ‘c’])
/# alternatively use below pandas method to import sql query result directly into dataframe #/

sql1 = “SELECT * from table ”  [Note: if the sql query is very long, use “”” to wrap the sql to avoid syntax error.]
df1 = pd.read_sql(sql1, con)
sql2 = “””SELECT * from table “””
df2 = pd.read_sql(sql2, con)
sql3 = “””SELECT * from table “””
df3 = pd.read_sql(sql3, con)
sql4 = “””SELECT * from table “””
df4 = pd.read_sql(sql4, con)
sql5 = “””SELECT * from table “””
df5 = pd.read_sql(sql5, con)
wip2 = pd.merge(txn_df, price_df, how= ‘outer’, left_on =[‘stock_ID’], right_on =[‘stock_ID’])
wip2[‘PERIOD’] = wip2.DATA_MONTH.dt.to_period(‘M’) – wip2.TRADE_MONTH.dt.to_period(‘M’)
wip2 = wip2[(wip2.PERIOD =-18)]
pd.set_option(‘display.max_columns’, None)
wip3 = pd.merge(wip2, df3, how =’left’, left_on = [‘Month’,’CODE’,’ID’], right_on =[‘MONTH’,’code’,’ID’])
wip4 = pd.merge(wip3, df4, how =’left’, left_on = [‘Month’,’code’], right_on =[‘MONTH’,’code’])
wip5 = pd.merge(wip4, df5, how =’left’, left_on = [‘MONTH2′,’Code’], right_on =[‘MONTH’,’code’])
wip5 = wip5.drop(wip5.columns[38], axis=1)
wip5 = wip5.rename(columns={‘old_column_name’: ‘new_column_name’})
df = df.rename(columns={“””TRUNC(DATA_DATE,’MM’)”””: ‘DATA_DATE’})
wip5 = pd.merge(wip4, df5, how =’left’, left_on = [‘Date’,’CODE’], right_on =[‘DATE’,’code’])
wip5[‘BUY_BASE’] = wip5.BUY_AMT / wip5.BUY_QTN
wip5[‘SELL_BASE’] = wip5.SELL_AMT / wip5.SELL_QTN
conditions = [
    (wip5[‘TRANSACTION_TYPES’] == ‘BUY’) & (wip5[‘TRADED_QUANTITY’] == wip5[‘HOLDING_QUANTITY’]),
    (wip5[‘TRANSACTION_TYPES’] == ‘SELL’) & (wip5[‘TRADED_QUANTITY’] == wip5[‘HOLDING_QUANTITY’]),
    (wip5[‘TRANSACTION_TYPES’] == ‘BUY’) & (wip5[‘TRADED_QUANTITY’] wip5[‘HOLDING_QUANTITY’]),
    (wip5[‘TRANSACTION_TYPES’] == ‘SELL’) & (wip5[‘TRADED_QUANTITY’] wip5[‘HOLDING_QUANTITY’])]
choices = [‘X_BUY’, ‘X_SELL’, ‘X_x’, ‘X_t”]
wip5[‘ACTION’] = np.select(conditions, choices)
wip5[‘BUY_RETURN’] = (((wip5.PRICE * wip5.MONTHLY_EXCHANGE_RATE) – (wip5.BUY_BASE*wip5.BASE_EXCHANGE_RATE))/ (wip5.BUY_BASE*wip5.BASE_EXCHANGE_RATE))
wip5[‘SELL_RETURN’] = (((wip5.PRICE * wip5.MONTHLY_EXCHANGE_RATE) – (wip5.SELL_BASE*wip5.BASE_EXCHANGE_RATE))/ (wip5.SELL_BASE*wip5.BASE_EXCHANGE_RATE))
wip5.loc[:,’BUY_BASE’:’SELL_RETURN’]
exchange_df.EXCHANGE_MONTH.value_counts().sort_index()
wip5.groupby([‘PORTFOLIO_CODE’,’ACTION’,’TRADE_MONTH’,’DATA_MONTH’,’PERIOD’])[‘BUY_RETURN’,’SELL_RETURN’].mean()
FINAL=wip5.groupby([‘PORTFOLIO_CODE’,’ACTION’,’TRADE_MONTH’,’DATA_MONTH’,’PERIOD’])[‘BUY_RETURN’,’SELL_RETURN’].mean()
FINAL.to_csv(‘result.csv’, sep=’\t’)
DATA = wip5.loc[wip5[‘code’] == ‘xxxxx’]
DATA.to_csv(‘xxxxx_DATA.csv’,sep=’\t’)

Related Posts

2 thoughts on “Python Basic, Cheat Sheets & ETL

Comments are closed.