Old Post: Data Analysis and ETL Transformation using Oracle SQL

This is another old post I wrote quite a while ago, just some of the oracle sql scripts I used for data analysis, ETL transformation technical test etc.

Basic usage pattern of Oracle SQL:
Select Distinct i, j, Count(*) as cnt From table a
Left/Right/Inner Jointable b 
Where  a.x = b.y
Group by i, j
Order by i Asc/Desc
Sub Query pattern of Oracle SQL:
a)
Select Distinct subq.i, subq.j, subq.max_date From
(Select i, j, Max(date) as max_date  Fromtable a
Where  end_date  = ‘01-jan-9999’
Group by i, j
Order by i Asc/Desc) subq
b)
Select Distinct subq.i, subq.j From
(Select i, j, Max(date) Over (Partition by i) From table a
Where  end_date  = ‘01-jan-9999’) subq
c)
Select Distinctsubq.i, subq.j From
(Select i, j,  rank() Over (Partition by i Order by date Asc) rnk From table a
Where  end_date  = ‘01-jan-9999’ ) subq
Where subq.rnk =1
Data Exploration/Data Manipulation in SQL:
1. categorical variable data exploration
SELECT categorical_variable, COUNT(*) as cnt
FROM table
GROUP BY categorical_variable
ORDER BY 1
2. categorical variable + numerical variable data exploration
SELECT categorical_variable,
SUM(CASE WHEN 0 <= numerical_variable AND numerical_variable < x
THEN 1 ELSE 0 END) as cnt_x,
SUM(CASE WHEN x <= numerical_variable AND numerical_variable < y
THEN 1 ELSE 0 END) as cnt_y,
SUM(CASE WHEN y <= numerical_variable AND numerical_variable < z
THEN 1 ELSE 0 END) as cnt_z,
SUM(CASE WHEN numerical_variable >= z THEN 1 ELSE 0 END) as cnt_za,
COUNT(*) as cnt, SUM(numerical_variable) as sum_num
FROM orders
GROUP BY categorical_variable
ORDER BY 1
(Note: x,y,z = number)
3.  Special case – Assign credit card types based on the pre-fix of the credit card number
SELECT (CASE WHEN LEFT(credit_card_no, 2) IN (‘51’, ‘52’, ‘53’, ‘54’, ‘55’)
THEN ‘MASTERCARD’
WHEN LEFT(credit_card_no, 1) IN (‘4’) THEN ‘VISA’
WHEN LEFT(credit_card_no, 2) IN (‘34’, ‘37’) THEN ‘AMERICAN EXPRESS’
WHEN LEFT(credit_card_no, 3) IN (‘300’, ‘301’, ‘302’, ‘303’, ‘304’,
‘305’) OR
LEFT(credit_card_no, 2) IN (‘36’, ‘38’, ‘55’)
THEN ‘DINERS CLUB’
WHEN LEFT(credit_card_no, 4) IN (‘6011’) THEN ‘DISCOVER’
WHEN LEFT(credit_card_no, 4) IN (‘2014’, ‘2149’) THEN ‘ENROUTE’
WHEN LEFT(credit_card_no, 2) IN (‘35’) OR
LEFT(credit_card_no, 4) IN (‘2131’, ‘1800’)
THEN ‘JCB’
WHEN LEFT(credit_card_no, 3) IN (‘560’, ‘561’) THEN ‘DEBIT’
ELSE ‘OTHER’ END) as Credit_Card_Types_Description
4.  Histogram
SELECT categorical variable, SUM(numerical variable) as numerical variable, SUM(pop) as pop
FROM ((SELECT o.categorical variable, COUNT(*) as numerical variable, 0 as pop
FROM orders o
GROUP BY o.categorical variable)
UNION ALL
(SELECT categorical variable, 0 as numerical variable, SUM(pop) as pop
FROM zipcensus
GROUP BY categorical variable)) summary
GROUP BY categorical variable
ORDER BY 2 DESC
          Fixed value Histogram
SELECT (CASE WHEN cnt >= 100 THEN categorical variable ELSE ‘OTHER’ END) as categorical variable,
SUM(cnt) as cnt
FROM (SELECT o.categorical variable, COUNT(*) as cnt
FROM orders o
GROUP BY o.categorical variable
) a
GROUP BY (CASE WHEN cnt >= 100 THEN categorical variable ELSE ‘OTHER’ END)

ORDER BY 2 desc
Minimum and Maximum Values
SELECT MIN(

), MAX(

)
FROM
SELECT SUM(CASE WHEN

= minv THEN 1 ELSE 0 END) as freqminval,
SUM(CASE WHEN

= maxv THEN 1 ELSE 0 END) as freqmaxval
FROM t CROSS JOIN
(SELECT MIN(

) as minv, MAX(

) as maxv
FROM ) vals
The Most Common Value (Mode)
Calculating Mode Using Standard SQL
SELECT

, COUNT(*) as freq
FROM
GROUP BY

ORDER BY 2
SELECT

, COUNT(*) as freq
FROM
GROUP BY

HAVING COUNT(*) = (SELECT MAX(freq)
FROM (SELECT

, COUNT(*) as freq
FROM GROUP BY

) b)
SELECT MIN(

) as minmode
FROM (SELECT

, COUNT(*) as freq
FROM
GROUP BY

HAVING COUNT(*) = (SELECT MAX(freq)
FROM (SELECT

, COUNT(*) as freq
FROM GROUP BY

) b)
) a
What Characters Are in a String?
SELECT SUBSTRING(category variable, 1, 1) as onechar,
ASCII(SUBSTRING(category variable, 1, 1)) as asciival,
COUNT(*) as numerical variable
FROM table name
GROUP BY SUBSTRING(category variable, 1, 1)
ORDER BY 1
Oracle Analytical Functions
To understand Oracle SQL Analytical Functions,  in SQL categorical variablements, 4 different parts are to be looked into:
Syntax Patterns:
Analytical Functions([arguments]) over ([query_partition_clause] [order_by_clause [windowing_clause]])
a) Analytical Functions (Required)
Categories of Analytical Functions:
1. Ranking (rank, dense_rank, cume_dist, percent_rank, ntile, row_number)
2. Windowing Aggregate (sum, avg, max, min, count, stddev, variance, first_value, last_value)
3. Reporting Aggregate
4. LAG/LEAD
5. FIRST/LAST
6. Inverse percentile (percentile_cont, percentile_disc)
7. Hypothetical Rank and distribution (rank, dense_rank, percent_rank, cume_dist)
8. Linear Regression (regr_count, regr_avgy, regt_avgx, regr_slope, regr_intercept, regr_R2, regrr_sxx, regr_syy, regtr_sxy)
9. Other Statistics
10. Width_bucket
b) Partitioning Clause (Optional)
c) Order by Clause (Optional)
d) Windowing Clause (Optional)
examples:
select empno, deptno, sal, avg(sal) OVER (PARTITION BY deptno ) AS avg_dept_sal from emp;
select empno, deptno, sal, FIRST_VALUE(sal IGNORE NULLS) OVER (PARTITION BY deptno ORDER BY sal ASC NULLS LAST) AS first_val_in_dept from emp;
default order by clause:
order by  xxx asc NULL LAST
order by xxx desc NULL FIRST
windowing clause is an extension of the order by clause
2 basic form of windowing clause
1. range between start_point and end_point
2. row between start_point and end_point
Possible values for ‘start_point’ and ‘end_point’ are:
1. Unbounded preceding
2. unbounded following
3. current row
4. value_expr preceding
5. value_expr following
Default windowing clause:
Range between unbounded preceding and current row
Normal usage should be unbounded preceding and unbounded following
e.g. getting running average
select empno, deptno, sal, avg(sal) OVER (PARTITION BY deptno ORDER BY sal ) AS running_average from emp;
e.g. getting department average
select empno, deptno, sal, avg(sal) OVER (PARTITION BY deptno ) AS avg_dept_sal from emp;
select empno, deptno, sal, avg(sal) OVER (PARTITION BY deptno ORDER BY sal  unbounded preceding and unbounded following) AS running_average from emp;
e..g getting overall average
select empno, deptno, sal, avg(sal) OVER () AS running_average from emp;
using windowing clause to get lag and lead effects:
GETTING LAG – Previous record
select empno, deptno, sal, FIRST_VALUE(sal) OVER (ORDER BY sal ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS previous_sal from emp;
select empno, deptno, sal, LAG(sal, 1, 0) OVER (ORDER BY sal) as previous_sal from emp;
GETTING LEAD – following record
select empno, deptno, sal, LAST_VALUE(sal) OVER (ORDER BY sal ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS next_sal from emp;
select empno, deptno, sal, LEAD(sal, 1) OVER (ORDER BY sal) as next_sal from emp;
  
How to remove duplicated records with Advanced Analytical Functions
With v_transformation_1  as
(Select case when a.col1 is not null
then a.col1||a.colz         
else b.col1||a.colz          
end as primary_key,
a.col2,
 a.col3,
(case when a.col1 is not null
                Then (row_num() over (partition by a.col1 order by a.col4 asc, a.col5 desc))
                Else (row_num() over (partition by b.col1 order by a.col4 asc, a.col5 desc))
                End) as prio_1
From table_name_1 a
Left outer join table_name_2 b
on a.col_common = b.col_common and b.col_filter = conditions
where a.col_filter = conditions)
select primary_key, col2, col3
from v_transformation_1
where (primary_key, prio_1)
in (select primary_key, min(prio_1) from v_transformation_1 group by primary_key)

Related Posts

3 thoughts on “Old Post: Data Analysis and ETL Transformation using Oracle SQL

Comments are closed.