PROJECT

INTERVIEW QUESTIONS

● Which programming language did you use in your project?

● Which ETL tool have you worked on?

● What was your role in your project?

● What exactly is your project based on, from a business perspective?

● Which tools have you worked on?

● How did you load your data into Spark?

● What types of files are there?

● How do you perform SQL query operations in Spark?
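
A minimal PySpark sketch of the usual answer: register a DataFrame as a temporary view, then query it with spark.sql(). The data and names below are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    # Any DataFrame (from files, JDBC, Hive, ...) can be exposed as a temp view
    df = spark.createDataFrame([("east", 10), ("west", 5), ("east", 7)], ["region", "amount"])
    df.createOrReplaceTempView("sales")

    # Standard SQL runs against the view and returns a new DataFrame
    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()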

● Explain the architecture of your data pipeline.

● Spark optimization techniques.

● What is the difference between repartition and coalesce?
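
One way to frame it, with a toy sketch: repartition() triggers a full shuffle and can increase or decrease the partition count, while coalesce() only merges existing partitions (no full shuffle), so it can only decrease the count and is cheaper, e.g. before writing fewer output files.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()
    df = spark.range(1_000_000)  # toy DataFrame

    # repartition: full shuffle, evenly rebalances; count may go up or down
    print(df.repartition(200).rdd.getNumPartitions())  # 200

    # coalesce: merges existing partitions without a full shuffle; only reduces
    print(df.coalesce(4).rdd.getNumPartitions())       # 4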

● What is a task in EMR?

● Describe the workflow for: "I want data from a DB, then transformation, and then storage on HDFS."

● How many nodes are you using?

● What is the difference between groupByKey and reduceByKey?
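
A small RDD sketch of the difference: reduceByKey combines values locally on each partition before shuffling (a map-side combine), so it moves far less data than groupByKey, which ships every value to the reducer.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bykey-demo").getOrCreate()
    sc = spark.sparkContext

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    # groupByKey: all values for a key cross the network, then we aggregate
    sums_gbk = pairs.groupByKey().mapValues(sum)

    # reduceByKey: partial sums are computed per partition first, then merged
    sums_rbk = pairs.reduceByKey(lambda x, y: x + y)

    print(sums_rbk.collect())  # [('a', 4), ('b', 2)] (order may vary)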

● Suppose we have two tables, A and B, and Table B also contains Table A's data. How do I get only the Table B rows that are not present in A?
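
A common answer is a left anti join (or NOT EXISTS in plain SQL). A minimal sketch with a single illustrative key column:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("anti-join-demo").getOrCreate()

    a = spark.createDataFrame([(1,), (2,)], ["id"])
    b = spark.createDataFrame([(2,), (3,), (4,)], ["id"])

    # Rows of B whose id does not appear in A
    b.join(a, on="id", how="left_anti").show()  # ids 3 and 4

    # SQL equivalent: SELECT * FROM B WHERE NOT EXISTS
    #   (SELECT 1 FROM A WHERE A.id = B.id)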

● Which scheduler have you used?

● What is a security group?

● In Hive we used staging and final tables; which table types were they?

● I have a table consisting of three columns, i.e. salary, dept_id, and name. How do I get the 3rd minimum salary per dept_id?
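
One way to answer, sketched with Spark SQL and illustrative data: a DENSE_RANK window per dept_id ordered by salary, filtered to rank 3 (DENSE_RANK picks the 3rd-lowest distinct salary even with ties).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("nth-salary-demo").getOrCreate()

    rows = [(100, 10, "a"), (200, 10, "b"), (300, 10, "c"),
            (150, 20, "d"), (250, 20, "e")]
    spark.createDataFrame(rows, ["salary", "dept_id", "name"]).createOrReplaceTempView("emp")

    spark.sql("""
        SELECT dept_id, salary, name
        FROM (
            SELECT *,
                   DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary) AS rnk
            FROM emp
        ) t
        WHERE rnk = 3
    """).show()  # dept 10 -> 300; dept 20 has no 3rd salary, so no row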

● What is a query?

● How did you read/import data from Oracle in your project?
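
A hedged sketch of a JDBC read from Oracle in PySpark; every connection detail below is a placeholder, and the Oracle JDBC driver jar (e.g., ojdbc8.jar) must be supplied to Spark (for example via --jars).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oracle-read-demo").getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")  # hypothetical host/service
          .option("dbtable", "SALES.ORDERS")                         # hypothetical table
          .option("user", "etl_user")                                # placeholder credentials
          .option("password", "********")
          .option("driver", "oracle.jdbc.OracleDriver")
          .option("numPartitions", "8")                              # parallel partitioned read
          .option("partitionColumn", "ORDER_ID")
          .option("lowerBound", "1")
          .option("upperBound", "1000000")
          .load())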

● Which tool was used for scheduling?

● Which operators have you used in Airflow?

● How do you set a dependency between 2 tasks?

● How do you set a dependency between 2 DAGs?
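
A minimal Airflow sketch covering both dependency questions, assuming Airflow 2.4+ (DAG and task ids are illustrative). Between tasks, the >> operator sets the order; between DAGs, TriggerDagRunOperator starts a downstream DAG (alternatively, an ExternalTaskSensor in the downstream DAG can wait on an upstream task).

    import pendulum
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from airflow.operators.trigger_dagrun import TriggerDagRunOperator

    with DAG(
        dag_id="upstream_dag",
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        schedule=None,
        catchup=False,
    ) as dag:
        extract = EmptyOperator(task_id="extract")
        load = EmptyOperator(task_id="load")

        # Task-level dependency: load runs only after extract succeeds
        extract >> load

        # Cross-DAG dependency: fire another DAG once this one is done
        trigger = TriggerDagRunOperator(
            task_id="trigger_downstream",
            trigger_dag_id="downstream_dag",  # hypothetical DAG id
        )
        load >> trigger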

● How do you check logs in Airflow?

● How have you run a job for the previous day?

● How many jobs are running in production?

● Which version control tool have you used? Git

● Which ETL tools were used in the project?

● How did you handle incremental data in the project?
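
A hedged sketch of one common pattern (a watermark-based incremental load); the paths, column names, and control value below are all placeholders:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("incremental-demo").getOrCreate()

    # In practice the watermark is read from a control/audit table
    last_watermark = "2024-01-01 00:00:00"

    delta = (spark.read.parquet("s3://my-bucket/source/orders/")  # hypothetical source
             .filter(F.col("updated_at") > F.lit(last_watermark)))

    # Append only new/changed rows to the target layer
    delta.write.mode("append").parquet("s3://my-bucket/target/orders/")

    # Persist the new high-water mark for the next run
    new_watermark = delta.agg(F.max("updated_at")).first()[0]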

● How do you check an issue in prod, and how does the monitoring team help?

● What is data skewness in Spark, and how do you handle it?
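
Skew means a few keys own most of the rows, so one or two tasks run far longer than the rest. One classic fix is salting, sketched below with toy data; on Spark 3.x, Adaptive Query Execution can also split skewed join partitions automatically.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("skew-demo").getOrCreate()

    N = 10  # number of salt buckets
    facts = spark.range(1_000_000).withColumn("key", F.lit("hot"))  # toy skewed side
    dim = spark.createDataFrame([("hot", "x")], ["key", "attr"])    # small side

    # Split the hot key into N sub-keys on the big side...
    salted_facts = facts.withColumn("salt", (F.rand() * N).cast("int"))
    # ...and replicate the small side once per salt value
    salted_dim = dim.crossJoin(spark.range(N).withColumnRenamed("id", "salt"))

    joined = salted_facts.join(salted_dim, on=["key", "salt"])

    # Spark 3.x alternative: let AQE handle skewed joins
    # spark.conf.set("spark.sql.adaptive.enabled", "true")
    # spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")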

● What types of reports were generated?

● What were the KPIs of the project?

● What was the project goal?

● What is the Catalyst optimizer?
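
Catalyst is Spark SQL's query optimizer: it turns a parsed query into an analyzed and then optimized logical plan (predicate pushdown, column pruning, constant folding, ...) before picking a physical plan. A quick way to see it at work:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

    df = spark.range(100).filter(F.col("id") > 50).select("id")

    # Prints the parsed, analyzed, and optimized logical plans plus the
    # physical plan that Catalyst chose
    df.explain(extended=True)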

● Different optimization techniques in Spark.

● Different transformations in the project?

● Where were you keeping your project data?

● How do you deploy code?

● Which AWS services have you used?

● What is the difference between Athena and Redshift?

● Can I truncate an external table?

● Which storage classes are there in AWS S3?

● Maximum data you have processed? 120-150 TB

● How many columns in one table? 121 columns

● How many records have you seen in a table? Around 40-50 million

● Version of each module?

● What is the maximum time taken by any job?

● What is the average time taken by any job?

● Clustered and non-clustered indexes?

● How can we remove duplicate records from a table?
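
Two standard approaches, sketched in PySpark with toy data: dropDuplicates() on the DataFrame, or ROW_NUMBER() in SQL when you must keep a specific row per key.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dedup-demo").getOrCreate()

    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])

    df.dropDuplicates().show()        # drop fully identical rows
    df.dropDuplicates(["id"]).show()  # one row per key (arbitrary survivor)

    # Deterministic survivor per key via a window function
    df.createOrReplaceTempView("t")
    spark.sql("""
        SELECT id, val FROM (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY val) AS rn
            FROM t
        ) ranked
        WHERE rn = 1
    """).show()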

● Which techniques did you use in the project? Explain the project flow.

● How do you import data into HBase?

● Why are you using HBase? Why are you not importing data directly into Hive?

● How do you generate reports?

● Which type of data processing do you perform (batch or real-time)?

● Which technologies did you use in the project?

● What is the project duration?

● Which tables are you using in Hive?

● How do you manage your Spark jobs?

● What was your project all about?

● What was your part in the whole project?

● What difficulties did you face while doing your part of the project?

● What technologies did you use, and why did you choose those in particular?

● How long did it take to complete your part of the project?

● What tools did you use on the project?

● Who assigns tasks to you?

● What was the project mode?

● How would you describe a project plan?

● What did you check in unit testing?

● How do you check logs?

● How do you decide the number of cores and the executor memory?
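
A worked example of one widely quoted heuristic (an assumption, not a universal rule): about 5 cores per executor, one core and ~1 GB per node reserved for the OS and Hadoop daemons, and roughly 10% of executor memory set aside for overhead.

    # Hypothetical cluster: 10 nodes, 16 cores and 64 GB RAM each
    nodes, cores_per_node, mem_per_node_gb = 10, 16, 64

    usable_cores = cores_per_node - 1                 # 15 (1 core for OS/daemons)
    executors_per_node = usable_cores // 5            # 3 executors at 5 cores each
    total_executors = nodes * executors_per_node - 1  # 29 (1 slot for the AM/driver)

    mem_per_executor = (mem_per_node_gb - 1) / executors_per_node  # 21 GB
    executor_memory_gb = int(mem_per_executor * 0.9)               # 18 GB after overhead

    print(f"{total_executors} executors x 5 cores, {executor_memory_gb} GB each")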

● If we get 10 GB of delta data but we have allocated only 1-2 GB, how do you handle such a situation?

● How is a connection made to Redshift?

● How did you do the deployment?

● Why did you use an on-premises cluster in development, and on which platform?

● After writing the code in development, what did you do?

● Why do we use HBase in our project?

● Why do we use Hive and not Athena or Redshift?

● In which phase is your project? Do your job responsibilities change according to the project phase?

● How exactly does report generation according to client requirements work in the project?

● In which part of your project do you use Airflow?

● How do you do reconciliation in your project?

● How do you achieve parallelism in the project?

● How do you find how many resources/executors are available on the cluster, and how do you allocate the number of executors?

● Which big data distribution does your company use? If on-premises, how is Hadoop installed (e.g., Cloudera or MapR)?

● What was the project mode? Was it support or development?

● Who does the testing in your project? Is there a testing team, or do you test yourself?

● What is the major blocker you faced in your project?

● What types of reports did you generate?

● What is Spark? Why do you use Spark specifically?

● Why do you use HBase? Why not other tools?

● Why is a DAG created?

● How do you use the DAG to solve an issue?

● What technique is used to convert an external table into an internal table? Explain with an example.
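
One widely used technique is flipping the Hive table property, shown here through spark.sql() (the same statement works in the Hive CLI; the table name is illustrative and the session needs Hive support):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("table-type-demo")
             .enableHiveSupport().getOrCreate())

    # External -> managed (internal); set 'TRUE' to go the other way
    spark.sql("ALTER TABLE sales_ext SET TBLPROPERTIES('EXTERNAL'='FALSE')")

    # Verify: the Type field now reads MANAGED_TABLE
    spark.sql("DESCRIBE FORMATTED sales_ext").show(truncate=False)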

● Where did you use external and internal tables in this project?

● What is S3 replication?

● Explain clustered and non-clustered indexes.

● What is a decorator, and what are its types?
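
A minimal Python sketch: a decorator is a callable that wraps another function (or class) to add behaviour without editing its body. Common kinds are function decorators, class decorators, and built-ins like @staticmethod, @classmethod, and @property.

    import functools

    def log_calls(func):
        @functools.wraps(func)          # preserve the wrapped function's name/docstring
        def wrapper(*args, **kwargs):
            print(f"calling {func.__name__} with {args}, {kwargs}")
            return func(*args, **kwargs)
        return wrapper

    @log_calls
    def add(a, b):
        return a + b

    add(2, 3)  # prints the call details, returns 5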

● What is versioning in S3?

● What is the cluster manager in Spark?

● What is a file system?

● To which location did the daily delta file arrive during the project?

● What types of client requirements actually come in, and how do you implement the logic for a requirement?

● Why did we use a clustered index?

● Who gives the requirements in your project?

● Who will put the code into the production environment?

● In which domain was your project?

● What was your team size?

● Have you provided support for your project?

● What type of support are you providing?

● What was your project duration?

● What is the role of a Data Engineer?

● What is the difference between a customer and a subscriber?

● What was your project goal? What was the main intention behind analyzing the data?

● Which tools have you used in your project?

● What was your project cluster configuration?

● Which type of data processing have you used in your project?

● What is the main reason for loading data into the L1 layer in your project?

● In your project, are you processing historic data on a daily basis?

● How are you capturing delta data?

● Who sends this delta data file?

● Why are you using an S3 bucket as the target in your project?

● Why is staging required?

● Why are you keeping daily data in the staging layer?

● Why did you get one file in the S3 bucket?

● How do you debug a failed job?

● Which command is used to store the log in a file?

● How do you identify an error in the log?

● How do you check whether a job is running or has failed?

● How do you terminate a job?

● How do you list the jobs that are running?

● Where can you check logs if a job has failed?

● What is the port number of the Application Manager or the Spark History Server?

● What are the different issues you faced while implementing this project, and how did you resolve them?

● What are the optimization techniques you used in your project?

● How have you done validation/reconciliation in your project?

● How do you log in to the Phoenix terminal?

● Which command is used to import data from RDS to HBase?

● How do you run an .hql script from Hadoop?

● Which command is used to import data from HBase to Hive?

● Which command is used to import data from Hive_staging to Hive_final?

● What is the difference between OLAP and OLTP?

● Why are we getting the source data from an RDBMS?

● Why are we not getting the source data from Hive, HBase, etc.?

● Who prepares the requirement document?

● Who prepares the scoping document?