● What is the role of the NameNode, DataNode, and Checkpoint node?
● What are the challenges with big data, and how does Hadoop solve them?
● What is the role of the JobTracker and TaskTracker?
● Explain the rack awareness concept.
● Explain High Availability in Hadoop.
● What is HDFS Federation?
● What are the different steps of a MapReduce job?
● Can you name any five HDFS commands?
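By way of illustration, here are five commonly used HDFS shell commands (the paths are placeholders):

```bash
hdfs dfs -ls -R /user/data               # list files and directories recursively
hdfs dfs -mkdir -p /user/data/in         # create a directory, with parents
hdfs dfs -put local.txt /user/data/in/   # copy a local file into HDFS
hdfs dfs -cat /user/data/in/local.txt    # print a file's contents
hdfs dfs -rm -r /user/data/in            # delete a directory recursively
```

Note that -ls -R against the root path also covers the next question about listing all files.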
● How do you list all the files in HDFS?
● What is the default replication factor and block size in Hadoop?
● Can we store small files on HDFS?
● What is the difference between MapReduce and Spark?
● What is a combiner?
● What is the difference between hdfs dfs and hadoop fs?
● Why are small files not recommended on HDFS?
● Why should the HDFS block size be neither too small nor too large?
● How do we run a MapReduce job on a cluster?
● Explain the mapper and reducer phases of MapReduce.
● Explain the read and write operations in HDFS.
● What is the default replication factor, and how will you change it at the file level?
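The default replication factor is 3 (the dfs.replication property). One way to change it for a single existing file, as a sketch with a placeholder path:

```bash
# set the replication factor to 2 for one file; -w waits for the change to complete
hdfs dfs -setrep -w 2 /user/data/part-r-00000
```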
● Why do we need a replication factor greater than 1 in a production Hadoop cluster?
● How will you combine the four part-r-* output files of a MapReduce job?
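One common answer is getmerge, which concatenates every file under an HDFS directory into one local file; the paths below are placeholders:

```bash
# merge all part-r-* files in a job's output directory into a single local file
hadoop fs -getmerge /user/out/job1 merged_output.txt
```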
● What are the compression techniques in HDFS, and which is the best one and why?
● How will you view compressed files via HDFS commands?
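For viewing, the -text option decompresses recognized formats (such as gzip and SequenceFiles) that -cat would print as raw bytes; the path is a placeholder:

```bash
hdfs dfs -text /user/out/part-r-00000.gz
```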
● What is the Secondary NameNode and what are its functions? Why do we need it?
● What is the Backup node, and how is it different from the Secondary NameNode?
● What are the fsimage and edit logs, and how are they related?
● What is the default block size in HDFS, and why is it so large?
● How will you copy a large file of 50 GB into HDFS in parallel?
● What is balancing in HDFS?
● What is expunge in HDFS?
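Hedged sketches for the two questions above (the threshold value is only an example):

```bash
# spread blocks more evenly across DataNodes; threshold is the allowed
# deviation, in percent, of each node's disk usage from the cluster average
hdfs balancer -threshold 10

# permanently delete trash contents older than the configured retention interval
hdfs dfs -expunge
```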
● What is the default URI for the HDFS Web UI? Can we create files via the HDFS Web UI?
● How can we check the existence of a non-zero-length file via HDFS commands?
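The -test option covers this; a minimal sketch with a placeholder path:

```bash
# -test -s exits with status 0 only if the path exists and is non-empty
# (non-zero length for a file)
hdfs dfs -test -s /user/data/file.txt && echo "non-empty file exists"
```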
● What is the role of IOUtils in the HDFS API, and how is it useful?
● Can we archive files in HDFS? If yes, how can we do that?
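Yes, via Hadoop Archives (HAR); an illustrative example with placeholder paths:

```bash
# pack /user/in/dir1 into a Hadoop Archive stored under /user/out
hadoop archive -archiveName data.har -p /user/in dir1 /user/out

# the archive can then be browsed through the har:// scheme
hdfs dfs -ls har:///user/out/data.har
```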
● What is safe mode in Hadoop, and what are the restrictions during safe mode?
● Can we come out of safe mode manually? If yes, how?
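Yes, through dfsadmin, for example:

```bash
hdfs dfsadmin -safemode get    # check whether safe mode is on
hdfs dfsadmin -safemode leave  # force the NameNode out of safe mode
```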
● Why is the block size in Hadoop kept very large compared to traditional filesystem block sizes?
● What are Sequence files, and how are they different from text files?
● What are the limitations of Sequence files?
● What are Avro files?
● Can an Avro file created with the Java API on one machine be read on another machine with the Ruby API?
● Where is the schema of an Avro file stored if the file is transferred from one host to another?
● How do we handle small files in HDFS?
● What is a delegation token in Hadoop, and why is it important?
● What is 'fsck' in Hadoop?
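A typical invocation, as an illustration:

```bash
# check the health of the whole namespace; report files, their blocks,
# and the DataNodes each block lives on
hdfs fsck / -files -blocks -locations
```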
● Can we append data records to an existing file in HDFS?
● Can we get a count of files in a directory on HDFS via the command line?
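Illustrative commands for the two questions above (paths are placeholders):

```bash
# append a local file's records to an existing HDFS file
hdfs dfs -appendToFile more.log /logs/app.log

# print directory count, file count, and content size for a path
hdfs dfs -count /logs
```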
● How do we achieve security on a Hadoop cluster?
● Can we create multiple files in HDFS with different block sizes?
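Yes; the block size can be overridden per file at write time. A sketch with an illustrative 256 MB value:

```bash
# write one file with a 256 MB block size (the value is in bytes)
hdfs dfs -D dfs.blocksize=268435456 -put big.dat /user/data/
```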
● What is the importance of dfs.namenode.name.dir in Hadoop?
● What is the need for fsck in Hadoop?
● Do HDFS block boundaries fall between records or across records?
● What is speculative execution?
● What is the Distributed Cache?
● What is the workflow of a MapReduce job?
● How will you globally sort the output of a MapReduce job?
● What is the difference between a map-side and a reduce-side join?
● What is MapReduce chaining?
● How will you pass parameters to a mapper or reducer?
● How will you create custom key and value types?
● How do you sort based on a column other than the key?
● How will you create custom input formats?
● How will you process a huge number of small files in a MapReduce job?
● Can we run a reducer without a mapper?
● Do mapper and reducer tasks run in parallel? If not, why do we sometimes see progress such as (map 80%, reduce 10%)?
● How will you set up a custom counter to detect bad records in the input?
● How will you schedule MapReduce jobs?
● What is a combiner? Tell me one scenario where it is not suitable.
● How will you submit a MapReduce job through the command line?
● How will you kill a running MapReduce job?
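For the last two questions, typical commands look like the following; the jar name, main class, paths, and job ID are all placeholders:

```bash
# submit a job: jar, main class, input path, output path
hadoop jar wordcount.jar com.example.WordCount /user/in /user/out

# list running MapReduce jobs, then kill one by its job ID
mapred job -list
mapred job -kill job_1700000000000_0001
```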
● How will you trace the root cause of a failed MapReduce job?
● What will you do if a MapReduce job fails with a Java heap space error message?
● How many map tasks and reduce tasks will run on each DataNode by default?
● What is the minimum RAM capacity needed for such a DataNode?
● What is the difference between MapReduce and YARN?
● What is the Tez framework?
● What is the difference between Tez and MapReduce?
● What are InputSplit, InputFormat, and RecordReader in MapReduce programming?
● Does MapReduce support processing of Avro files? If yes, what are the main classes of the API?
● How will you process a dataset in JSON format in a MapReduce job?
● Can we create a multi-level directory structure (year/month/date) in MapReduce based on the input data?
● What is the relation between TextOutputFormat and KeyValueTextInputFormat?
● What is LazyOutputFormat in MapReduce, and why do we need it?
● How do we prevent file splitting in MapReduce?
● What is the difference between the Writable and WritableComparable interfaces, and which is sufficient for a value type in a MapReduce job?
● What is the role of the ApplicationMaster in running a MapReduce job through YARN?
● What is an Uber task?
● What are the IdentityMapper and IdentityReducer classes?
● How do we create a JAR file from .class files in a directory through the command line?
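With the JDK's jar tool, for instance (file names are placeholders):

```bash
# bundle every .class file under classes/ into job.jar
jar -cvf job.jar -C classes/ .
```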
● What is the default port for the YARN Web UI?
● How can we distribute our application’s JARs to all of the nodes in the YARN cluster that need them?
● How do we include native libraries in YARN jobs?
● What is the default scheduler inside the YARN framework for starting tasks?
● How do we handle record boundaries in text files or Sequence files in MapReduce InputSplits?
In MapReduce, an InputSplit’s RecordReader starts and ends at a record boundary. In SequenceFiles, a 20-byte sync mark is written roughly every 2 KB between records. These sync marks let the RecordReader seek to the start of its InputSplit (which is defined by a file, an offset, and a length) and then scan forward to the first sync mark after the start of the split. The RecordReader continues processing records until it reaches the first sync mark after the end of the split. Text files are handled similarly, using newlines instead of sync marks.
● Sometimes MapReduce jobs fail when the same job is submitted by a different user. What is the cause, and how do we fix it?
● How do we change the default location of a MapReduce job’s intermediate data?
● If a map task fails once during MapReduce job execution, will the job fail immediately?
● What are the 4 V’s of big data?
● Which one is the most important?
● What file formats can you use in Hadoop?
● What is the difference between a NameNode and a DataNode?
● What is HDFS?
● What is the purpose of YARN?
● What is a data lake?
● What is a data warehouse?
● Are there data lake warehouses (i.e., lakehouses)?
● Can there be two data lakes within a single warehouse?
● What is a data mart?
● What is a slowly changing dimension, and what are its types?
● What is a surrogate key, and why use one?