COMPANY-WISE INTERVIEW QUESTIONS

Cognizant :

● Draw the architecture of your project

● What is the source of data from the client: CSV files or SQL tables?

● Password file in Sqoop

● Compressed import in Sqoop

● What is incremental append?

● Where is the last value of an incremental append saved, and can we view it?

● DDL statements in Hive

● Partitioning & bucketing

● Why did you use managed tables in your project?

● Types of Hive tables, and where you used each type in your project

● Sqoop import with incremental append and lastmodified: where and when did you use them in your project?

● What types of files do you get in your project?

● Draw and explain your project architecture if the given data looks like: 12,34,56  23,56,86  45,67,56,87  56,77,66

● Only one row has one extra column; how do you process it?

● What Hive processing are you doing in your project? Explain.

● What are the requirements for Hive table creation, and how are you creating tables in your project?

● How do you automate that Hive process, and what are the requirements for the automation?

● Explain the requirements in detail

● What type of tables are you using in your project?

● Are you using partitioning and bucketing in your project?

● What is the default partition used in your project?

● Is it possible to delete and update records in a Hive table? Have you used this in your project?

● What are the important requirements for spark-submit?

● What are the processes you are working on? Explain.

● What is your cluster size and the volume of data handled?

● How do you transfer data from one cluster to another?

● What are the optimization techniques used in Hive?

● Explain bucketing and partitioning. In which scenarios do you use each in your project?

● Write the query to create a bucketed table (see the sketch after this list)

● Difference between CLUSTER BY and DISTRIBUTE BY

● What is meant by a broadcast variable?

● Explain transferring data from SQL to Hadoop

● Write Spark Scala code to create a temp view from a text file (see the sketch after this list)
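
Two of the items above ask for actual code: the bucketed-table DDL and a temp view built from a text file. Below is a minimal Spark Scala sketch; the Hive DDL is shown as a comment since it would normally be run in Hive itself, and the table names, input path, and delimiter are assumptions, not part of the original questions.

    import org.apache.spark.sql.SparkSession

    // The bucketed-table DDL as it would be written in Hive
    // (table and column names are hypothetical):
    //
    //   CREATE TABLE sales_bucketed (id INT, amount DOUBLE)
    //   CLUSTERED BY (id) INTO 4 BUCKETS
    //   STORED AS ORC;

    object TempViewDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("TempViewDemo")
          .getOrCreate()

        // Temp view from a plain text file (assumed comma-delimited, no header).
        val df = spark.read
          .option("inferSchema", "true")
          .csv("/data/input/customers.txt")
        df.createOrReplaceTempView("customers_view")
        spark.sql("SELECT * FROM customers_view LIMIT 10").show()
      }
    }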

Wipro :

● Tell me about your project

● How will you merge small files in Hadoop?

● Why are you going for big data?

● Explain MapReduce with an example

● Difference between Avro and Parquet

● How will you insert values into a Hive table from another table using a CASE condition? (see the sketch after this list)

● Syntax of a map-side join, and why do we go for a map-side join?

● What is the maximum size of the small table in a map-side join?

● If you create an internal and an external table on the same directory and then delete the external table, can you still see the data?

● Internal table details

● Explain Hadoop 2.0 architecture with an example

● Split-brain scenario
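
A minimal sketch of the CASE-condition insert, run here through spark.sql so the example stays in Scala; the same INSERT ... SELECT runs in Hive directly. The source and target tables and the banding rule are hypothetical. For the map-side join syntax question, the usual Hive form is a /*+ MAPJOIN(small_table) */ hint on the SELECT.

    import org.apache.spark.sql.SparkSession

    object CaseInsertDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("CaseInsertDemo")
          .enableHiveSupport()
          .getOrCreate()

        // INSERT ... SELECT with a CASE condition (hypothetical tables).
        spark.sql(
          """INSERT INTO TABLE orders_tagged
            |SELECT id,
            |       amount,
            |       CASE WHEN amount >= 1000 THEN 'HIGH'
            |            WHEN amount >= 100  THEN 'MEDIUM'
            |            ELSE 'LOW' END AS amount_band
            |FROM staging_orders""".stripMargin)
      }
    }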

Maverick :

● Project details

● Sqoop command for incremental append

● What is the Hive metastore, and where is it saved in the production cluster?

● Use of Hive external tables

● What do you mean by a data lake?

● How do you utilize Hive buckets in Spark? (see the sketch after this list)
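
A minimal sketch of writing a bucketed table from Spark with the DataFrameWriter bucketBy API; the bucket count, column, and table name are assumptions. Spark manages its own bucketing metadata, which is not the same as Hive's on-disk bucketing layout, so this is a Spark-side answer.

    import org.apache.spark.sql.SparkSession

    object BucketWriteDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("BucketWriteDemo")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        val orders = Seq((1, "A", 100.0), (2, "B", 250.0)).toDF("id", "customer", "amount")

        // Bucketed, sorted table registered in the metastore; joins and
        // aggregations on the bucket column can then avoid a full shuffle.
        orders.write
          .bucketBy(4, "id")
          .sortBy("id")
          .mode("overwrite")
          .saveAsTable("orders_bucketed")
      }
    }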

Legato systems :

● Two input files, a .csv and a .parquet, with policy and insured details

● You have to join them on policy number and, after applying the join and some business logic, insert the data into a Hive table

● This was supposed to be done using the DataFrame DSL or Spark SQL (see the sketch after this list)
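
A minimal DataFrame DSL sketch of that exercise; the file paths, the policy_number join key, the status filter standing in for the business logic, and the target table name are all assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object PolicyJoinDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("PolicyJoinDemo")
          .enableHiveSupport()
          .getOrCreate()

        val policies = spark.read.option("header", "true").csv("/data/in/policies.csv")
        val insured  = spark.read.parquet("/data/in/insured.parquet")

        // Join on the policy number, then apply a placeholder business rule.
        val joined = policies.join(insured, Seq("policy_number"))
          .filter(col("status") === "ACTIVE")

        joined.write.mode("append").saveAsTable("policy_insured")
      }
    }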

HCL Technologies:

● How is the folder structure constructed in your project in the Eclipse IDE?

● How do you access a global variable in your Spark code?

● What are the starting lines you write in Spark code?

● The explode function in Spark

● What to do when a Sqoop job fails after some time

● How to restart a failed Sqoop job

● What to do if 1 mapper out of 2 fails every time

● How to join 2 files in Hive without loading them into Hive tables

● How do you pass different parameter values to a Sqoop job?

● What is a case class in Spark?

● How to process nested JSON or nested arrays in Hive and Spark

● How to select a few columns from an RDD without converting it to a DataFrame

● If the data type is not correct, how do you enforce the data type on a column after the DataFrame is created? (see the sketch after this list)

● How to read a Parquet file in Spark

● Difference between foldLeft and foldRight in Scala

● Difference between map and flatMap

● What is a curried function in Scala?

● Closures and their benefits

● Write a Hive query to join 3 tables, where the second-largest table should go to memory, with optimization (see the sketch after this list)
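
A minimal sketch for the two coding asks above: casting a column to the right type after the DataFrame is created, and a 3-table join where the hinted table is held in memory. In Hive that is the /*+ MAPJOIN(...) */ hint; Spark SQL accepts the same hint as a broadcast hint. Table names, columns, and the CSV path are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.IntegerType

    object HclSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("HclSketch")
          .enableHiveSupport()
          .getOrCreate()

        // Enforce a data type on a column after the DataFrame is created.
        val raw   = spark.read.option("header", "true").csv("/data/in/accounts.csv")
        val typed = raw.withColumn("account_id", raw("account_id").cast(IntegerType))
        typed.printSchema()

        // Three-table join; the MAPJOIN hint asks for the named table to be kept in memory.
        spark.sql(
          """SELECT /*+ MAPJOIN(t2) */ t1.id, t2.name, t3.amount
            |FROM t1
            |JOIN t2 ON t1.id = t2.id
            |JOIN t3 ON t1.id = t3.id""".stripMargin).show()
      }
    }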

FIS Global :

● Hadoop copy

● Sqoop incremental append

● Using Sqoop, how do you import the contents of a particular table?

● Creating an external table (see the sketch after this list)

● Loading data into an external table

● Process followed in your company

● What elements are present in your data?
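
A minimal sketch of the external-table creation and load, issued through spark.sql; the same DDL runs unchanged in Hive. The schema, delimiter, and HDFS paths are assumptions.

    import org.apache.spark.sql.SparkSession

    object ExternalTableDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ExternalTableDemo")
          .enableHiveSupport()
          .getOrCreate()

        // External table: only the metadata is managed by Hive; the data stays at LOCATION.
        spark.sql(
          """CREATE EXTERNAL TABLE IF NOT EXISTS claims_ext (
            |  claim_id INT, policy_no STRING, amount DOUBLE)
            |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
            |STORED AS TEXTFILE
            |LOCATION '/data/external/claims'""".stripMargin)

        // Move a landed file into the table's location.
        spark.sql("LOAD DATA INPATH '/data/landing/claims.csv' INTO TABLE claims_ext")
      }
    }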

DATA PEACE :

● Explain YARN architecture

● Describe all the phases in MapReduce

● Does Hive support OLTP operations?

● How does Hive work internally?

● Difference between Hadoop 1.0 and Hadoop 2.0

● What is the Hive metastore?

● Who submits the job: the NameNode or the DataNode?

● Who executes the job: the NameNode or the DataNode?

● If Hive supports only OLAP, how can it support INSERT commands?

● Does Hive also work on the write-once-read-many model?

GSPANN :

● Bucketing, static and dynamic partitioning in Hive

● How to use explode in Hive (see the sketch after this list)

● About the trim function in Hive

● Hive performance tuning

● Purpose of PageFilter in HBase

● WAL in HBase

● How to use joins in HBase (I don't think HBase supports joins; we need to depend on MapReduce?)

● Can you run HBase and MapReduce on the same cluster? If yes, how?
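
A minimal sketch of Hive-style explode with LATERAL VIEW, run here through spark.sql; the orders_raw table and its array-typed items column are hypothetical.

    import org.apache.spark.sql.SparkSession

    object ExplodeDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ExplodeDemo")
          .enableHiveSupport()
          .getOrCreate()

        // LATERAL VIEW explode produces one output row per array element.
        spark.sql(
          """SELECT order_id, item
            |FROM orders_raw
            |LATERAL VIEW explode(items) t AS item""".stripMargin).show()
      }
    }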

DST Worldwide Services :

● Tell me about yourself & your project

● Difference between HDFS and Hive

● What different kinds of tables can we create in Hive?

● How to recursively delete a directory in HDFS

● What are data warehousing concepts?

● What is a Type 2 dimension?

● What is a data warehouse?

● Difference between a database and a data warehouse

● How to load data through Spark where a few records are updates and a few are new?

● Hive and Spark optimizations

● Can we write a stored procedure in Hive?

● Can we create a view in Hive?

● How much cost and processing time did you save through optimizations?

● How big was the data you were processing?

● Can we define foreign and primary keys in Hive tables?

● How can we do data quality checks in Spark and Hive?

● How to replace null values with some other value, or discard rows with null values, in Spark (see the sketch after this list)

● How to force data type checks at the column level in Scala
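
A minimal sketch covering the last two items: an explicit schema forces column-level types at read time, and df.na either fills or drops nulls. The file path and column names are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

    object NullHandlingDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("NullHandlingDemo").getOrCreate()

        // Force column-level data types by supplying an explicit schema.
        val schema = StructType(Seq(
          StructField("id", IntegerType, nullable = false),
          StructField("name", StringType, nullable = true),
          StructField("amount", DoubleType, nullable = true)))

        val df = spark.read.option("header", "true").schema(schema).csv("/data/in/customers.csv")

        // Replace nulls with defaults, or drop rows containing any null.
        val filled  = df.na.fill(Map("name" -> "UNKNOWN", "amount" -> 0.0))
        val dropped = df.na.drop()

        println(s"filled=${filled.count()}, dropped=${dropped.count()}")
      }
    }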

Michelin Tyres :

● Project architecture

● How to process a log file in Spark and store the result (see the sketch after this list)

● Spark architecture

● What are SerDe properties?

● Hive context in Spark

● Oozie workflow

● Partitions and buckets in Hive

● HBase architecture and how it works
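
A minimal sketch of log processing in Spark: read raw lines, keep the ERROR entries, split them into columns, and store the result in a table. The log layout (date, time, level, message), the path, and the table name are assumptions.

    import org.apache.spark.sql.SparkSession

    object LogProcessingDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("LogProcessingDemo")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // Keep only ERROR lines and split them into columns.
        val errors = spark.read.textFile("/data/logs/app.log")
          .filter(_.contains(" ERROR "))
          .map { line =>
            val parts = line.split(" ", 4)   // assumed layout: date time level message
            (parts(0), parts(1), parts(3))
          }
          .toDF("log_date", "log_time", "message")

        errors.write.mode("append").saveAsTable("app_error_logs")
      }
    }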

Global Edge Software, Bangalore :

● Tell me about yourself & your project

● Explain Hadoop architecture

● Difference between Hadoop 1 and 2

● When a file is stored in HDFS, can we modify that file?

● Can multiple clients write to the same file at the same time?

● If a file is being written and another client wants to read the same file, is that possible?

● If I have 10 nodes and a job is running on them, the job has finished on 8 nodes, and then the NameNode goes down, what will happen?

● What is speculative execution?

● What is Spark?

● What is an RDD?

● What are the properties of an RDD?

● Explain your Kafka-Spark POC

● Explain the NiFi tool

● What is Hive?

● What is the use of external tables in Hive?

● If I have millions of records in a CSV file and I want to load them into a Hive table, how do I do that? (see the sketch after this list)

● What is Oozie?

● Can Kafka start without ZooKeeper?
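
A minimal sketch of loading a large CSV into a Hive table with Spark; the path, header and schema options, and table name are assumptions. Spark reads and writes the file in parallel, so the row count does not change the code.

    import org.apache.spark.sql.SparkSession

    object CsvToHiveDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("CsvToHiveDemo")
          .enableHiveSupport()
          .getOrCreate()

        val df = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/data/landing/transactions.csv")

        df.write.mode("overwrite").saveAsTable("transactions")
      }
    }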

Capgemini :

● Higher-order functions and their advantages? (see the sketch after this list)

● Why did you use Scala for Spark?

● How do you do packaging (managing packages)?

● How do you add dependencies for your project/application?

● Why do you need dependencies?

● pom.xml: will things work if we rename pom.xml?

● What are the minimum mandatory imports required for a Scala application?

● What is Kafka, and why is it used rather than another message queue?

● Can we use analytic functions on an RDD? (I don't exactly know what he meant)

● What is vectorization and how does it improve performance?
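
A minimal plain-Scala sketch of a higher-order function (one that takes a function as a parameter) together with a curried form; the function names and values are just for illustration.

    object HigherOrderDemo {
      // Higher-order: applyTwice takes a function as a parameter.
      def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

      // Curried: parameters in separate lists allow partial application.
      def multiply(a: Int)(b: Int): Int = a * b

      def main(args: Array[String]): Unit = {
        println(applyTwice(_ + 3, 10))   // 16
        val double = multiply(2) _       // partially applied: Int => Int
        println(double(21))              // 42
      }
    }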

T-Systems :

● Tell me about yourself & your project

● What are your short-term and long-term goals, and where do you want to see yourself in the coming years?

● In Spark, what are the lower-level and higher-level abstractions?

● What are a DataFrame and a Dataset, and what is the difference between them?

● Let's say you have 5 TB of data and 16 GB of RAM, and you want to provide the top 20 customers; what would your approach be? (see the sketch after this list)

● One scenario-based question on Hive joins (it was quite tricky; it actually didn't require a join but a CASE condition)

● When we submit a job to the Spark cluster, what happens at the back end?

● Can we import data directly into a Hive table through Sqoop?

● How to supply the password and other details through a file in a Sqoop import

● What is the architecture of Flume?

● Which kind of channel in Flume is the most reliable?

● What is an SCD, and how many types of SCDs are there?

● What are OLAP and OLTP database schemas?

● What is a surrogate key in a database?
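
A minimal sketch of one possible approach to the top-20 question: the 5 TB never has to fit in memory, because the aggregation runs distributed across executors and only the final 20 rows come back. The table and column names are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{desc, sum}

    object TopCustomersDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("TopCustomersDemo")
          .enableHiveSupport()
          .getOrCreate()

        // Aggregate per customer, order by spend, keep only the top 20.
        spark.table("transactions")
          .groupBy("customer_id")
          .agg(sum("amount").as("total_amount"))
          .orderBy(desc("total_amount"))
          .limit(20)
          .show()
      }
    }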

Some other companies :

● What are the components you have used so far?

● Which versions of Spark and Hive?

● What distribution are you using?

● Difference between Spark 1.6 and Spark 2.0

● If you used S3, what is the default number of buckets in S3?

● How to read nested JSON data in Spark (see the sketch after this list)

● How to convert an RDD to a DataFrame

● What is StructType?

● Is it possible to define an array field in a struct?

● What is the data type of an array field in a struct?

● Is it possible to convert a DataFrame into an RDD?

● What is the output of df.rdd?

● Which IDE do you use to develop Spark code?

● How do you deploy the code to production?

● How to use the same Spark jar for different variable inputs
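
A minimal sketch of the nested-JSON and RDD/DataFrame items: dotted paths select struct fields, explode flattens arrays, toDF turns an RDD into a DataFrame, and df.rdd returns an RDD of Rows. The JSON path and field names are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, explode}

    object NestedJsonDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("NestedJsonDemo").getOrCreate()
        import spark.implicits._

        // Nested JSON: struct fields via dotted paths, arrays via explode.
        val people = spark.read.json("/data/in/people.json")
        people.select(col("name"), col("address.city"), explode(col("phones")).as("phone")).show()

        // RDD -> DataFrame, and back again (df.rdd is an RDD[Row]).
        val rdd = spark.sparkContext.parallelize(Seq((1, "a"), (2, "b")))
        val df  = rdd.toDF("id", "label")
        println(df.rdd.first())
      }
    }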

Impetus :

● How do you manage duplicate data in Hive?

● Difference between an input split and a block

● How to change the replication factor in HDFS

● What is the default block size, and how do you change it?

● If a job is running and we increase the block size, what will happen to the job?

● How to recover the cluster if both the NameNode and the standby NameNode fail

● What is the fsimage?

● How many ResourceManagers do we have in a cluster?

● Difference between internal and external tables

● Difference between partitioning and bucketing

● If you have a table in Hive, write a query to add a new column (see the sketch after this list)

● How to use a user-defined function in Hive

● Is it possible to create buckets without partitions?

● Syntax of bucketing

● What type of data do we store in Hive?

● Is it possible to create a table in Hive with data like a .log file?

● Which is the default file format in Spark?

● Difference between ORC and Parquet

● Different types of joins used in Spark

● Optimization techniques used in Spark

● Explain sort-merge and broadcast joins

● Difference between map and flatMap

● Difference between map and mapPartitions
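
A minimal sketch of adding a column to an existing Hive table, issued through spark.sql; the same ALTER TABLE statement runs in Hive directly. The table and column names are hypothetical.

    import org.apache.spark.sql.SparkSession

    object AlterTableDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("AlterTableDemo")
          .enableHiveSupport()
          .getOrCreate()

        // Add a new column to an existing Hive table, then confirm the schema.
        spark.sql("ALTER TABLE sales ADD COLUMNS (discount DOUBLE COMMENT 'per-order discount')")
        spark.sql("DESCRIBE sales").show(false)
      }
    }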

Cognizant :

● Project flow

● Difference between partitioning & bucketing

● How to put data on S3?

● What are narrow and wide transformations?

● What is a case statement in Scala? (see the sketch after this list)

● Difference between a DataFrame and a Dataset

● Which file format have you used in your project?

● What is a singleton object in Scala?

● How do multiple tasks get created in a spark-submit job?

● How is the customer reading the final report?

● Explain Spark architecture

● What is a SerDe in Hive?
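
A minimal plain-Scala sketch of a match (case) expression inside a singleton object; the values matched are just for illustration.

    object MatchDemo {   // "object" declares a singleton: exactly one instance, created lazily
      def describe(x: Any): String = x match {
        case 0               => "zero"
        case n: Int if n > 0 => "positive int"
        case s: String       => s"string of length ${s.length}"
        case _               => "something else"
      }

      def main(args: Array[String]): Unit = {
        println(describe(0))
        println(describe(42))
        println(describe("spark"))
      }
    }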

Others :

● Total no. of years of relevant hands-on experience in Big Data Analytics?

● Total no. of years of relevant hands-on experience in Spark using Scala?

● How many projects deployed in production using Spark?

● Total no. of years of relevant Spark streaming experience using Scala?

● How many projects deployed in production using Spark Streaming?

● Rate yourself on Spark Dataframe API from 1 to 5, 5 being highest.

● How many projects deployed in production using Spark Dataframe API?

● Max. size of data processed on a daily basis, in GB/TB/PB?

● How many productionized projects processed data in TB?

● Total no. of years of experience with Hive partitioning, bucketing, and vectorization?

● How many projects deployed in production using Hive?

● Rate yourself on Hive from 1 to 5, 5 being highest.

● Have you used Appworks/Oozie/AirFlow/Control-M scheduler in production?

● How many pipelines were deployed using the scheduler in production?

● Total no. of years of experience on Azure platform/Cloud?

● How many projects were deployed in production on the Azure Data Lake platform?

● Any experience with AWS or GCP Big data technologies?

● How many projects were deployed in production on AWS or GCP platform?

● Have you worked on CI/CD or Git Pipeline?

● What Hadoop distribution was used on premises?

● Any Hadoop or cloud certifications?

● Any exposure to manufacturing domain?

● How much experience with SQL?

● Any experience with performance optimization in SQL?

● Any experience with performance optimization techniques in Spark?

● What Big Data technologies are used in current project?

Accenture :

● What are the storage services in AWS?

● What is an object in S3?

● How do we access data in S3 from EC2?

● How to put data into S3

● Maximum limit on data files in S3

● Can we delete a bucket in S3?

● How do we recover data from EBS when the EC2 instance has failed?

● What is the difference between EBS and S3?

● Various databases in AWS

● Use of DynamoDB, Redshift, and RDS

● Can we create the same user on machines in different regions?

● What is the use of EC2?

● What is a security group?

● Role of a security group

● What is VPC peering?

● How can we transfer data between machines in two regions?

● Types of EC2 instances

● What is an AMI?

● Can we share an AMI?

● Role of a VPC

● Can we change the region of an EC2 instance?

● What can we use for text-to-speech conversion?

● What do we use for huge data transfers?

● Can we run multiple websites on a single machine?

● Big data with AWS