Pig Case Study – Complain Raised by different Software Companies

Pig Case Study – Complain Raised by different Software Companies

Pig Case Study :

Problem Statement: To build a script which produces a report listing each Company & State and number of complaints raised by them.

We have a data set of complaints (incidents or trouble tickets) of a software system raised by different companies.

The details of the fields (i.e. data dictionary) of each record is in below format:

Date received = dtr:chararray,

Product = prd:chararray,

Sub-product =  sprd:chararray,

Issue = iss:chararray,

Sub-issue = siss:chararray,

Company public response = pubrsp:chararray,

Company = cmpny:chararray,

State = stt:chararray,

ZIP Code = zip:chararray,

TAG = tag:chararray,

Consumer consent provided = consent:chararray,

Submitted via = via:chararray,

Date sent to company = dts:chararray,

Company response to consumer = conrsp:chararray,

Timely response = tmly:chararray,

Consumer disputed = dsput:chararray,

Complaint ID = cmpid:int

Sample data-set:

4/1/2016|Debt collection|”Other (i.e. phone, health club, etc.)”|Disclosure verification of debt|Not given enough info to verify debt||”Convergent Resources, Inc.”|IL|60035|||Web|4/1/2016|Closed with explanation|Yes|No|1859891

4/1/2016|Consumer Loan|Vehicle loan|Managing the loan or lease|||Ally Financial Inc.|TX|779XX|||Web|4/1/2016|In progress|Yes|No|1861161

4/1/2016|Bank account or service|Savings account|”Account opening, closing, or management”|||Ally Financial Inc.|TN|381XX|||Web|4/1/2016|In progress|Yes|No|1861175

I am working on single distributed node cluster and data is located at /home/rahul/data and name of file is ccd-200-pid.txt

Differences
Since, I am working on Pig distributed mode, I will make sure Hadoop Cluster is UP.  I am going to copy data to HDFS location. I am creating pigdata directory at HDFS location and copying data to this location.

Command to create directory:  hadoop fs –mkdir pigdata

Command to copy data from local directory to HDFS location:  Hadoop fs –copyFromLocal <source> <destination>

Hadoop fs –copyFromLocal ccd-200-pipe.txt /user/rahul/pigdata/

Differences

Command to list file at HDFS location:  Hadoop fs –ls <hdfs directory location>

Hadoop fs –ls /user/rahul/pigdata
Differences

Now, I will open pig terminal.
Differences

My data is located in HDFS location at /user/rahul/pigdata

In Pig, we have alias concept. It means we have to use new alias at every step. We need to process every time with the alias only.

I am storing data of file name ccd-200-pipe.txt into alias inputdata. We can have any name of the alias.  I am executing below codes as per below data schema.

inputdata = LOAD ‘/user/rahul/pigdata/ccd-200-pipe.txt’ USING PigStorage(‘|’) AS (dtr:chararray, prd:chararray, sprd:chararray, iss:chararray, siss:chararray, pubrsp:chararray, cmpny:chararray, stt:chararray, zip:chararray, tag:chararray, consent:chararray, via:chararray, dts:chararray, conrsp:chararray, tmly:chararray, dsput:chararray, cmpid:int);
Differences

Pig offers you dump and store comamnd to trigger your executions. I am going to check successful execution of load command using dump command.
Differences

Below is output on screen and we can see that data is loaded successfully  into alias inputdata
Differences

Now I am using Filter Command to filter null complain id into the dataset.

Command:  fltd1 = FILTER inputdata  BY cmpid IS NOT NULL;
Differences

Below is output on screen and we can see that data is loaded successfully  into alias fltd1;
DifferencesDifferences

Now, I am using SUBSTRING function to filter out data and Month and assigning data type. I am assigning date and month into new column.

trd1 = FOREACH fltd1 GENERATE cmpid, prd, cmpny, stt, SUBSTRING(dtr,5,9), SUBSTRING(dtr,0,1);
Differences

We can use describe function to check schema of alias and datatype.
Differences

I am assigning data type to date and month for the new column.

trd1 = FOREACH fltd1 GENERATE cmpid, prd, cmpny, stt, SUBSTRING(dtr,5,9)as year, SUBSTRING(dtr,0,1) as month;
Differences

I am grouping alias trd1 based on company and state.

grd1 = GROUP trd1 BY (cmpny, stt);
Differences
Differences

Now, I am counting alias trd1 based on grouping.

aggrd1 = FOREACH grd1 GENERATE group, COUNT(trd1);
Differences

Dump aggrd1;
Differences

Now, I am counting the total number of complains and getting output in descending order.

srtd1 = ORDER aggrd1 BY $1 DESC;
Differences

I am getting below output by using command Dump srtd1;
Differences

I got the expected output i.e. Number of tickets raised by different company along with the state. I am storing the final output into HDFS for BI processing.

STORE srtd1 INTO ‘pigoutput’ ;

Now I can check using HDFS command – new directory being created at HDFS location.

hadoop fs –ls pigoutput

We can see success file and part-r-00000 files created inside pigoutput directory. Actual content is present on filename part-r-00000

Differences