● Which programming language did you use in your project?
● Which ETL tool did you work on?
● What was your role in your project?
● What exactly was your project about from a business perspective?
● Which tools did you work on?
● How did you load your data into Spark?
● What types of files were there?
● How do you perform SQL query operations in Spark?
● Explain the architecture of your data pipeline.
● Spark optimization techniques.
● Difference between repartition and coalesce?
● What is a task in EMR?
● Tell me the workflow for "I want data from a DB, then transformation, and then store it on HDFS."
● How many nodes were you using?
● What is the difference between groupByKey and reduceByKey?
● Suppose we have two tables, A and B, and table B also contains table A's data. How do I get only the rows of table B that are not present in A?
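One common answer is an anti-join, sketched here with Python's built-in sqlite3 for portability (table contents are illustrative); the same `NOT EXISTS` pattern, or a `LEFT ANTI` join in Spark SQL, works in most engines:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (id INTEGER)")
cur.execute("CREATE TABLE b (id INTEGER)")
cur.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
cur.executemany("INSERT INTO b VALUES (?)", [(1,), (2,), (3,), (4,)])

# Rows of B that have no match in A. A LEFT JOIN ... WHERE a.id IS NULL
# or Spark SQL's LEFT ANTI join expresses the same thing.
only_b = cur.execute(
    """
    SELECT b.id FROM b
    WHERE NOT EXISTS (SELECT 1 FROM a WHERE a.id = b.id)
    ORDER BY b.id
    """
).fetchall()
print(only_b)  # [(3,), (4,)]
```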
● Which scheduler have you used?
● What is a security group?
● In Hive we used staging and final tables; which types of tables were they?
● I have a table consisting of three columns, i.e., salary, dept_id, and name. I want the 3rd minimum salary per dept_id.
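A standard answer uses `DENSE_RANK` over salary ascending, sketched here with sqlite3 (sample rows are illustrative; the same window-function query runs in Hive or Spark SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp (name TEXT, dept_id INTEGER, salary INTEGER)")
cur.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("a", 1, 100), ("b", 1, 200), ("c", 1, 300), ("d", 1, 300),
     ("e", 2, 50), ("f", 2, 60), ("g", 2, 70)],
)

# DENSE_RANK over salary ascending per department; rank 3 is the 3rd
# minimum salary (DENSE_RANK does not skip ranks when salaries tie).
third_min = cur.execute(
    """
    SELECT DISTINCT dept_id, salary FROM (
        SELECT dept_id, salary,
               DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary) AS rnk
        FROM emp
    ) WHERE rnk = 3
    ORDER BY dept_id
    """
).fetchall()
print(third_min)  # [(1, 300), (2, 70)]
```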
● What is a query?
● How did you read/import data from Oracle in your project?
● Which tool was used for scheduling?
● Which operators have you used in Airflow?
● How do you set a dependency between two tasks?
● How do you set a dependency between two DAGs?
● How do you check logs in Airflow?
● How have you run a job for the previous day?
● How many jobs are running in production?
● Which version control tool have you used? Git
● Which ETL tools were used in the project?
● How did you handle incremental data in the project?
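A common incremental-load pattern is watermark-based extraction, sketched here with sqlite3 (table and column names are illustrative): remember the highest timestamp already loaded, pull only newer rows, then advance the watermark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (id INTEGER, updated_at TEXT)")
cur.executemany(
    "INSERT INTO src VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-03")],
)

# The watermark is the max updated_at already loaded; each run pulls only
# rows newer than it, then the watermark is advanced for the next run.
watermark = "2024-01-01"
delta = cur.execute(
    "SELECT id FROM src WHERE updated_at > ? ORDER BY id", (watermark,)
).fetchall()
print(delta)  # rows 2 and 3 form the incremental batch
```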
● How do you check an issue in production, and how does the monitoring team help?
● What is data skewness in Spark, and how do you handle it?
● What type of reports were generated?
● What were the KPIs of the project?
● What was the project goal?
● What is the Catalyst optimizer?
● Different optimization techniques in Spark.
● Different transformations in the project?
● Where were you keeping your project data?
● How do you deploy code?
● Which AWS services have you used?
● Difference between Athena and Redshift?
● Can I truncate an external table?
● Which storage classes are there in AWS S3?
● What is the maximum data you have processed? 120-150 TB
● How many columns in one table? 121 columns
● How many records have you seen in a table? Around 40-50 million
● Version of each module?
● Maximum time taken by any job?
● Average time taken by any job?
● Clustered and non-clustered indexes?
● How can we remove duplicate records from a table?
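One portable answer, sketched with sqlite3 (table contents are illustrative): keep the first physical copy of each duplicate group and delete the rest. In Spark the equivalent is `df.dropDuplicates()`; in engines with window functions, `ROW_NUMBER()` with `rn > 1` is the usual variant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (name TEXT, dept INTEGER)")
cur.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1), ("a", 1), ("b", 2), ("b", 2), ("c", 3)],
)

# Keep only the lowest rowid of each (name, dept) group; every other row
# carrying the same values is deleted.
cur.execute(
    """
    DELETE FROM t
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM t GROUP BY name, dept)
    """
)
remaining = cur.execute("SELECT name, dept FROM t ORDER BY dept").fetchall()
print(remaining)  # [('a', 1), ('b', 2), ('c', 3)]
```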
● Which techniques did you use in the project? Explain the project flow.
● How do you import data into HBase?
● Why are you using HBase? Why are you not importing data directly into Hive?
● How do you generate reports?
● Which type of data processing did you perform (batch or real time)?
● Which technologies did you use in the project?
● What was the project duration?
● Which tables are you using in Hive?
● How do you manage your Spark jobs?
● What was your project all about?
● What was your part in the whole project?
● What difficulties did you face while doing your part of the project?
● Which technologies did you use, and why those in particular?
● How long did it take to complete your part of the project?
● What tools did you use on the project?
● Who assigned tasks to you?
● What was the project mode?
● How would you describe a project plan?
● What did you check in unit testing?
● How do you check logs?
● How do you decide the number of cores and the executor memory?
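A common rule of thumb for executor sizing, worked through as arithmetic (the cluster numbers are an illustrative example, not a prescription): leave one core and ~1 GB per node for the OS and Hadoop daemons, cap each executor at about 5 cores, reserve one executor slot for the driver, and budget roughly 7% of executor memory for off-heap overhead.

```python
# Illustrative cluster: 10 nodes, 16 cores and 64 GB RAM each.
nodes, cores_per_node, mem_per_node_gb = 10, 16, 64

usable_cores = cores_per_node - 1        # 1 core per node for OS/daemons
usable_mem = mem_per_node_gb - 1         # 1 GB per node for OS/daemons

cores_per_executor = 5                   # ~5 cores keeps HDFS throughput healthy
executors_per_node = usable_cores // cores_per_executor  # 3 per node
total_executors = nodes * executors_per_node - 1         # minus 1 for the driver

mem_per_executor = usable_mem // executors_per_node      # 21 GB per executor
heap_per_executor = int(mem_per_executor * 0.93)         # ~7% goes to overhead

print(total_executors, cores_per_executor, heap_per_executor)  # 29 5 19
```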
● If we get 10 GB of delta data but we have allocated only 1 or 2 GB, how do you handle such a situation?
● How is a connection made to Redshift?
● How did you do the deployment?
● Why did you use an on-premise cluster in development, and on which platform?
● After writing the code in development, what did you do?
● Why do we use HBase in our project?
● Why do we use Hive and not Athena or Redshift?
● In which phase is your project? Do our job responsibilities change according to the project phase?
● How exactly does report generation according to client requirements work in the project?
● In which part of your project did you use Airflow?
● How do you do reconciliation in your project?
● How do you achieve parallelism in the project?
● How do you find how many resources/executors are available on the cluster, and how do we allocate the number of executors?
● Which big data distribution does your company use? If on-premises, how is Hadoop installed on it (e.g., Cloudera or MapR)?
● What was the project mode? Was it support or development?
● Who does the testing in your project? Is there a testing team, or do you test yourself?
● What was the major blocker you faced in your project?
● What types of reports did you generate?
● What is Spark? Why do you use Spark in particular?
● Why do you use HBase? Why not other tools?
● Why is a DAG created?
● How do you use the DAG to solve an issue?
● Which technique is used to convert an external table into an internal table? Explain with an example.
● Where did you use external and internal tables in this project?
● What is S3 replication?
● Explain clustered and non-clustered indexes.
● What is a decorator, and what are its types?
● What is versioning in S3?
● What is the cluster manager in Spark?
● What is a file system?
● To which location does the daily delta file go during the project?
● What types of client requirements actually come in, and how do you implement logic for a requirement?
● Why did we use a clustered index?
● Who gives the requirements in your project?
● Who puts the code into the production environment?
● In which domain was your project?
● What was your team size?
● Have you provided support for your project?
● What type of support are you providing?
● What was your project duration?
● What is the role of a Data Engineer?
● What is the difference between a customer and a subscriber?
● What was your project goal? What was the main intention behind analyzing the data?
● Which tools have you used in your project?
● What was your project's cluster configuration?
● Which type of data processing did you use in your project?
● What is the main reason for loading data into the L1 layer in your project?
● In your project, are you processing historical data on a daily basis?
● How are you capturing delta data?
● Who sends this delta data file?
● Why are you using an S3 bucket as the target in your project?
● Why is staging required?
● Why are you keeping daily data in the staging layer?
● Why did you get one file in the S3 bucket?
● How do you debug a failed job?
● Which command is used for storing the log in a file?
● How do you identify an error in a log?
● How do you check whether a job is running or has failed?
● How do you terminate a job?
● How do you list the jobs which are running?
● Where can you check the logs if a job has failed?
● What is the port number of the Application Manager or the Spark History Server?
● What different issues did you face while implementing this project, and how did you resolve them?
● What optimization techniques have you used in your project?
● How have you done validation/reconciliation in your project?
● How do you log in to the Phoenix terminal?
● Which command is used to import data from RDS to HBase?
● How do you run a .hql script from Hadoop?
● Which command is used to import data from HBase to Hive?
● Which command is used to import data from Hive staging to Hive final?
● Difference between OLAP and OLTP?
● Why are we getting the source data from an RDBMS?
● Why are we not getting the source data from Hive, HBase, etc.?
● Who prepares the requirement document?
● Who prepares the scoping document?