● Draw architecture of your project
● Source of the data from the client: CSV files or SQL tables?
● Password file in Sqoop
● Compress and import in Sqoop
● What is incremental append in Sqoop?
● Where is the last value of an incremental append saved, and can we view it?
● DDL functions of hive
● Partition & Bucketing
● Why did you use a managed table in your project?
● Types of Hive tables, and where did you use each type in your project?
● Sqoop import with incremental append and lastmodified: where and when did you use these in your project?
● What types of files do you get in your project?
● Draw and explain your project architecture.
● If the given data looks like 12,34,56 / 23,56,86 / 45,67,56,87 / 56,77,66, where only one row has one extra column, how do you process it?
● What Hive processing are you doing in your project? Explain.
● What are the requirements for Hive table creation, and how are you creating tables in your project?
● How do you automate that Hive process, and what are the requirements for the automation?
● Explain the requirements in Details
● What types of tables are you using in your project?
● Are you using partitioning and bucketing in your project?
● What is the default partition used in your project?
● Is it possible to delete and update in a Hive table? Have you used this in your project?
● What are the important requirements for spark-submit?
● What are the processes you are working on? Explain.
● What is your cluster size and the size of data you handle?
● How to Transfer data from One Cluster to Another cluster
● What are the Optimization techniques used in Hive
● Explain Bucketing and Partitioning In which scenarios you use both in your Project
● Write the query to create a bucketed table.
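One way to answer, sketched under assumptions: the table, column names, and bucket count are hypothetical, and salesDf stands in for an existing DataFrame in a spark-shell session.

```scala
// HiveQL for a bucketed (and partitioned) table - run in Beeline / the Hive CLI:
//   CREATE TABLE sales_bucketed (order_id BIGINT, customer_id INT, amount DOUBLE)
//   PARTITIONED BY (order_date STRING)
//   CLUSTERED BY (customer_id) INTO 8 BUCKETS
//   STORED AS ORC;
//
// Spark-native equivalent when persisting a DataFrame as a managed bucketed table:
salesDf.write
  .partitionBy("order_date")
  .bucketBy(8, "customer_id")
  .sortBy("customer_id")
  .format("parquet")
  .saveAsTable("sales_bucketed_spark")
```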
● Difference between CLUSTERED BY and DISTRIBUTED BY.
● What is meant by a broadcast variable?
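A minimal sketch of a broadcast variable: a small, read-only lookup is shipped to every executor once instead of being serialized with each task. The lookup values are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-demo").getOrCreate()
val sc    = spark.sparkContext

// Small lookup map broadcast once to every executor.
val countryLookup = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

val codes = sc.parallelize(Seq("IN", "US", "IN"))
val names = codes.map(code => countryLookup.value.getOrElse(code, "Unknown"))
names.collect().foreach(println)
```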
● Explain Data Transferring from SQL to Hadoop
● Write Spark Scala code to create a temp view from a text file.
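A minimal sketch, assuming a comma-delimited text file at a hypothetical HDFS path with id, name, and age fields.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tempview-demo").getOrCreate()
import spark.implicits._

// Read the raw lines, split them, and give the columns names.
val customers = spark.read.textFile("/data/input/customers.txt")   // hypothetical path
  .map(_.split(","))
  .map(a => (a(0).toInt, a(1), a(2).toInt))
  .toDF("id", "name", "age")

customers.createOrReplaceTempView("customers_vw")
spark.sql("SELECT name, age FROM customers_vw WHERE age > 30").show()
```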
● Tell me about your project
● How will you merge small files in Hadoop?
● Why are you moving to big data?
● Explain MapReduce with an example.
● Difference between Avro and Parquet.
● How will you insert values into a Hive table from another table using a CASE condition?
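A hedged sketch of one common answer: a CASE expression derives a column while inserting from one Hive table into another. Table and column names are hypothetical; the same statement works directly in Hive or through spark.sql in a spark-shell session.

```scala
spark.sql("""
  INSERT INTO TABLE customer_tier
  SELECT customer_id,
         total_spend,
         CASE WHEN total_spend >= 10000 THEN 'GOLD'
              WHEN total_spend >= 1000  THEN 'SILVER'
              ELSE 'BRONZE'
         END AS tier
  FROM customer_stage
""")
```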
● Syntax of a map-side join, and why do we go for a map-side join?
● What is the maximum size of the small table in a map-side join?
● If you create an internal and an external table over the same directory and then drop the external table, can you still see the internal table's details (data)?
● Explain Hadoop 2.0 architecture with example
● Split brain scenario
● Project details
● Sqoop code incremental append
● What is the Hive metastore, and where is it stored on a production cluster?
● Use of Hive external tables.
● What do you mean by a data lake?
● How to utilize hive buckets in spark?
● Two input files, a .csv and a .parquet, with policy and insured details: join them on the policy number and, applying the join and some business logic, insert the data into a Hive table. This was to be done using the DataFrame DSL or Spark SQL.
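A DataFrame-DSL sketch of this scenario, in a spark-shell session. The paths, column names, the 'ACTIVE' filter standing in for the business logic, and the target Hive table are all assumptions.

```scala
import org.apache.spark.sql.functions.col

val policies = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/in/policy.csv")                 // hypothetical path

val insured = spark.read.parquet("/data/in/insured.parquet")

// Join on the policy number and apply a placeholder business rule.
val result = policies
  .join(insured, Seq("policy_number"), "inner")
  .filter(col("status") === "ACTIVE")

// The target Hive table is assumed to exist with a matching schema.
result.write.mode("append").insertInto("insurance_db.policy_insured")
```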
● How is the folder structure of your project organized in the Eclipse IDE?
● How do you access a global variable in your Spark code?
● What are the starting lines you write in Spark code?
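What is usually meant are the imports and the SparkSession bootstrap. A typical Spark 2.x opening, with names chosen purely for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MyJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("my-job")
      .enableHiveSupport()          // only if Hive tables are used
      .getOrCreate()
    import spark.implicits._

    // ... read, transform, write ...

    spark.stop()
  }
}
```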
● Explode function in Spark
● What do you do when a Sqoop job fails after some time?
● How do you restart a failed Sqoop job?
● What do you do if one mapper out of two fails every time?
● How do you join two files in Hive without loading them into Hive tables?
● How do you pass different parameter values to a Sqoop job?
● What is a case class in Spark (Scala)?
● How do you process nested JSON or nested arrays in Hive and Spark?
● How do you select a few columns from an RDD without converting it to a DataFrame?
● If the data type is not correct, how do you enforce a data type on a column after the DataFrame is created?
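One common approach, sketched with a hypothetical rawDf whose age column was read as a string: cast it after the fact, or supply an explicit schema at read time.

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Cast an existing column to the intended type.
val typedDf = rawDf.withColumn("age", col("age").cast(IntegerType))

// Alternative: skip inferSchema and pass an explicit schema to spark.read
// so the types are enforced as the DataFrame is created.
```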
● How do you read a Parquet file in Spark?
● Difference between foldLeft and foldRight in Scala.
● Difference between map and flatMap.
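A small illustration of the difference (spark here is the spark-shell session): map produces exactly one output element per input, while flatMap flattens the per-element collections.

```scala
val lines = spark.sparkContext.parallelize(Seq("a b", "c d e"))

lines.map(_.split(" ")).collect()      // Array(Array(a, b), Array(c, d, e)) - one array per line
lines.flatMap(_.split(" ")).collect()  // Array(a, b, c, d, e)               - flattened words
```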
● What is currying function in Scala
● What are closures, and what are their benefits?
● Write a Hive query to join three tables where the second-largest table should go into memory, with optimization.
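A hedged sketch with hypothetical tables orders (largest), customers (second largest), and region: the MAPJOIN hint asks for the hinted table to be held in memory for a map-side join, and Spark SQL reads the same hint as a broadcast-join hint.

```scala
val query = """
  SELECT /*+ MAPJOIN(c) */
         o.order_id, c.customer_name, r.region_name, o.amount
  FROM   orders o
  JOIN   customers c ON o.customer_id = c.customer_id
  JOIN   region    r ON c.region_id   = r.region_id
"""
spark.sql(query)

// In Hive itself, map joins can also be enabled automatically:
//   SET hive.auto.convert.join = true;
// with the size threshold controlled by hive.mapjoin.smalltable.filesize.
```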
● Hadoop copy
● Sqoop incremental append
● Using Sqoop, how do you import the contents of a particular table?
● External table create
● Load data to external table
● Process in your company
● What elements are present in your data?
● explain yarn architecture
● describe all the phases in Map Reduce
● does hive support OLTP operations?
● how does hive work internally?
● difference between Hadoop 1.0 and Hadoop 2.0
● what is hive Metastore?
● Who submits the job: the NameNode or a DataNode?
● Who executes the job: the NameNode or a DataNode?
● Hive supports only OLAP, so how can it support INSERT commands?
● does hive also work on Write Once Read Many?
● bucketing, static and dynamic partitioning in hive
● How do you use explode in Hive?
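A minimal sketch assuming a hypothetical table orders(order_id INT, items ARRAY&lt;STRING&gt;): LATERAL VIEW explode() produces one row per array element. The query is plain HiveQL and also runs through spark.sql.

```scala
spark.sql("""
  SELECT o.order_id, item
  FROM   orders o
  LATERAL VIEW explode(o.items) itemTable AS item
""").show()
```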
● about trim function in hive
● hive performance tuning
● Purpose of PageFilter in HBase.
● WAL in HBase.
● How do you use joins in HBase? (I don't think HBase supports joins; we need to depend on MapReduce?)
● Can you run HBase and MapReduce on the same cluster? If yes, how?
● tell me about yourself & your project
● Difference between HDFS and Hive.
● What different kinds of tables can we create in Hive?
● How do you recursively delete a directory in HDFS?
● what are data warehousing concepts
● What is a Type 2 dimension (SCD Type 2)?
● what is a data warehouse
● Difference between database and data warehouse
● How do you load data through Spark when a few records are updates and a few are new?
● Hive and spark optimizations
● Can we write a stored procedure in hive ?
● Can we create a view in hive
● How much cost and processing time did you save through optimizations?
● How big was the data you were processing?
● can we define foreign and primary key in hive tables
● how can we do the data quality check in spark and hive
● How do you replace null values with some other value, or discard rows with null values, in Spark?
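A sketch of the usual DataFrameNaFunctions answer; df and the column names are placeholders.

```scala
// Replace nulls in chosen columns with defaults...
val filled = df.na.fill(Map("age" -> 0, "city" -> "UNKNOWN"))

// ...or drop rows containing nulls (in any column, or only in selected ones).
val noNulls   = df.na.drop()
val noNullAge = df.na.drop(Seq("age"))
```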
● How do you enforce data type checks at the column level in Scala?
● Project architecture
● How do you process a log file in Spark and store the result?
● Spark architecture
● What are SerDe properties?
● Hive context in spark
● Oozie workflow
● Partitions and buckets in hive
● Hbase architecture and working
● Tell me about yourself and your project.
● Explain the Hadoop architecture.
● Difference between Hadoop 1 and Hadoop 2.
● when a file is stored in hdfs, can we modify that file ?
● can multiple clients write the same file at the same time ?
● if a file is being written and another client wants to read the same file, is it possible?
● If I have 10 nodes and a job is running on them, and the job has finished on 8 nodes when the NameNode goes down, what will happen?
● what is speculative execution
● what is Spark
● What is an RDD?
● what are the properties of RDD
● Explain your Kafka-Spark POC.
● Explain the NiFi tool.
● what is hive
● what is the use of external tables in hive
● If I have millions of records in a CSV file and I want to load them into a Hive table, how do I do that?
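One way, sketched with a hypothetical landing path and target table: read the CSV with Spark and save it as a Hive table. A pure-Hive alternative is an external table or LOAD DATA INPATH over the same directory.

```scala
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/landing/transactions.csv")     // hypothetical path

df.write.mode("overwrite").format("orc").saveAsTable("sales_db.transactions")
```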
● what is Oozie ?
● Can kafka start without zookeeper ?
● What is a higher-order function, and what are its advantages?
● Why did you use Scala for Spark?
● How do you do packaging (managing packages)
● How do you add dependencies for your project/application?
● Why do you need dependencies?
● pom.xml: will things work if we rename pom.xml?
● What are the minimum mandatory imports required for a Scala application?
● What is Kafka, and why is it used rather than other message queues?
● Can we use analytic functions on an RDD? (I don't know exactly what he meant.)
● What is vectorization, and how does it improve performance?
● Tell me about yourself and your project.
● What are your short-term and long-term goals, and where do you want to see yourself in a few years?
● In Spark, what are the lower-level and higher-level abstractions?
● What is a DataFrame and a Dataset, and what is the difference between them?
● Let's say you have 5 TB of data and 16 GB of RAM, and you want to report the top 20 customers; what would your approach be?
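The key point is that the 5 TB never has to fit in RAM: aggregate per customer first (the shuffle reduces the data to one row per customer), then sort the small aggregated result and take 20. A sketch with assumed columns and an assumed input path, in a spark-shell session:

```scala
import org.apache.spark.sql.functions._

val top20 = spark.read.parquet("/data/transactions")   // hypothetical 5 TB input
  .groupBy("customer_id")
  .agg(sum("amount").as("total_spend"))
  .orderBy(desc("total_spend"))
  .limit(20)

top20.show()
```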
● One scenario-based question on Hive joins (a tricky one: it actually didn't require a join but a CASE condition).
● When we submit a job to the Spark cluster, what happens at the back end?
● Can we import data directly to hive table through Sqoop
● How do you supply the password and other details through a file in a Sqoop import?
● what is the architecture of Flume
● Which kind of Flume channel is the most reliable?
● what is SCD and how many types of SCDs are there
● What are OLAP and OLTP database schemas?
● what is a surrogate key in database
● What are the components you have used so far?
● Which versions of Spark and Hive have you used?
● Which distribution are you using?
● difference between spark 1.6 and spark 2.0
● If you used S3, what is the default limit on the number of buckets in S3?
● How do you read nested JSON data in Spark?
● How do you convert an RDD to a DataFrame?
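A minimal sketch of the usual routes, with a made-up case class, in a spark-shell session:

```scala
case class Customer(id: Int, name: String)

import spark.implicits._
val rdd = spark.sparkContext.parallelize(Seq(Customer(1, "Asha"), Customer(2, "Ravi")))

val df = rdd.toDF()     // reflection on the case class; spark.createDataFrame(rdd) also works
df.show()

// For an RDD[Row], use spark.createDataFrame(rowRdd, schema) with an explicit StructType.
```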
● What is StructType?
● Is it possible to define an array field in a struct?
● What is the data type of an array field in a struct?
● Is it possible to convert a DataFrame into an RDD?
● What is the output of df.rdd?
● Which IDE do you use to develop Spark code?
● How do you deploy the code to production?
● How do you use the same Spark JAR for different input variables?
● How do you manage duplicate data in Hive?
● Difference between input split and block
● How to change replication factor in hdfs
● what is default block size and how to change it
● If a job is running and we increase the block size, what will happen to the job?
● How do you recover the cluster if both the active and the standby NameNode fail?
● What is the FsImage?
● How many ResourceManagers do we have in a cluster?
● Difference between internal and external tables.
● Difference between partitioning and bucketing.
● If you have a table in Hive, write a query to add a new column.
● How do you use a user-defined function (UDF) in Hive?
● Is it possible to create buckets without partitions?
● Syntax of bucketing.
● What type of data do we store in Hive?
● Is it possible to create a table in Hive over data like a .log file?
● Which is the default file format in Spark?
● Difference between ORC and Parquet
● Different types of joins used in Spark.
● Optimization techniques used in Spark.
● Explain sort merge and broadcast join
● Difference between map and flat map
● Difference between map and mapPartitions.
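A small illustration (spark here is the spark-shell session): map calls the function once per element, mapPartitions once per partition with an iterator, which is where per-partition setup such as opening a database connection belongs.

```scala
val nums = spark.sparkContext.parallelize(1 to 10, numSlices = 2)

val doubled = nums.map(_ * 2)                  // function invoked 10 times

val doubledByPartition = nums.mapPartitions { iter =>
  // any expensive setup here runs once per partition (twice in total)
  iter.map(_ * 2)
}
```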
● Project flow
● Difference between partitioning and bucketing.
● How do you put data on S3?
● What are narrow and wide transformations?
● What is a case (match) statement in Scala?
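Usually this means Scala's match/case pattern matching. A self-contained sketch with made-up types:

```scala
sealed trait Event
case class Click(page: String)      extends Event
case class Purchase(amount: Double) extends Event

def describe(e: Event): String = e match {
  case Click(page)                  => s"clicked $page"
  case Purchase(amt) if amt > 100.0 => "big purchase"
  case Purchase(_)                  => "purchase"
}

println(describe(Purchase(250.0)))  // prints: big purchase
```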
● Difference between a DataFrame and a Dataset.
● Which file format have you used in your project?
● What is a singleton object in Scala?
● How do multiple tasks get created in a spark-submit job?
● How does the customer read the final report?
● Explain Spark Architecture
● what is SerDe in Hive ?
● Total no. of years of relevant hands-on experience in Big Data Analytics?
● Total no. of years of relevant hands-on experience in Spark using Scala?
● How many projects deployed in production using Spark?
● Total no. of years of relevant Spark streaming experience using Scala?
● How many projects deployed in production using Spark Streaming?
● Rate yourself on Spark Dataframe API from 1 to 5, 5 being highest.
● How many projects deployed in production using Spark Dataframe API?
● Max. size of data processed on daily basis in GB/TB/PB?
● How many productionalized projects processed data in TB?
● Total no. of years of Hive partitioning, bucketing, vectorization?
● How many projects deployed in production using Hive?
● Rate yourself on Hive from 1 to 5, 5 being highest.
● Have you used Appworks/Oozie/AirFlow/Control-M scheduler in production?
● How many pipelines were deployed using the scheduler in production?
● Total no. of years of experience on Azure platform/Cloud?
● How many projects were deployed in production on the Azure Data Lake platform?
● Any experience with AWS or GCP Big data technologies?
● How many projects were deployed in production on AWS or GCP platform?
● Have you worked on CI/CD or Git Pipeline?
● What hadoop distribution was used on premises?
● Any Hadoop or cloud certifications?
● Any exposure to manufacturing domain?
● How much experience with SQL?
● Any experience with performance optimization in SQL?
● Any experience with performance optimization techniques in Spark?
● What Big Data technologies are used in current project?
● What are the storage services in AWS?
● What is an object in S3?
● How do we access data in S3 from EC2?
● How do we put data into S3?
● What is the maximum limit on the number of data files in S3?
● Can we delete a bucket in S3?
● How do we recover data from EBS when an EC2 instance fails?
● What is the difference between EBS and S3?
● Various Databases in AWS.
● Use of DynamoDB, Redshift, and RDS.
● Can we create the same user on machines in different regions?
● What is the use of EC2?
● What is a security group?
● Roles of a security group.
● What is VPC peering?
● How can we transfer data between machines in two regions?
● Types of EC2 instances.
● What is IAM?
● Can we share an AMI?
● Role of a VPC.
● Can we change the region of an EC2 instance?
● What can we use for text-to-speech conversion?
● What do we use for huge data transfers?
● Can we run multiple websites on a single machine?
● Big data with AWS.