● What is the role of the NameNode, DataNode, and Checkpoint node?
● What are the challenges with big data, and how does Hadoop solve them?
● What is the role of the JobTracker and TaskTracker?
● Explain the rack awareness concept.
● Explain High Availability in Hadoop.
● What is HDFS Federation?
● What are the different steps of a MapReduce job?
● Can you name any five HDFS commands?
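By way of illustration, here are five commonly used HDFS shell commands (the paths are placeholders):

```bash
hdfs dfs -ls -R /user/data               # list files and directories recursively
hdfs dfs -mkdir -p /user/data/in         # create a directory, with parents
hdfs dfs -put local.txt /user/data/in/   # copy a local file into HDFS
hdfs dfs -cat /user/data/in/local.txt    # print a file's contents
hdfs dfs -rm -r /user/data/in            # delete a directory recursively
```

Note that -ls -R against the root path also covers the next question about listing all files.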
● How do you list all the files in HDFS?
● What is the default replication factor and block size in Hadoop?
● Can we store small files on HDFS?
● What is the difference between MapReduce and Spark?
● What is a combiner?
● What is the difference between hdfs dfs and hadoop fs?
● Why are small files not recommended on HDFS?
● Why should the HDFS block size be neither too small nor too large?
● How do we run a MapReduce job on a cluster?
● Explain the mapper and reducer phases of MapReduce.
● Explain the read and write operations in HDFS.
● What is the default replication factor, and how will you change it at the file level?
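The default replication factor is 3 (the dfs.replication property). One way to change it for a single existing file, as a sketch with a placeholder path:

```bash
# set the replication factor to 2 for one file; -w waits for the change to complete
hdfs dfs -setrep -w 2 /user/data/part-r-00000
```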
● Why do we need a replication factor greater than 1 in a production Hadoop cluster?
● How will you combine the four part-r-* output files of a MapReduce job?
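One common answer is getmerge, which concatenates every file under an HDFS directory into one local file; the paths below are placeholders:

```bash
# merge all part-r-* files in a job's output directory into a single local file
hadoop fs -getmerge /user/out/job1 merged_output.txt
```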
● What are the compression techniques in HDFS, and which is the best one and why?
● How will you view compressed files via HDFS commands?
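For viewing, the -text option decompresses recognized formats (such as gzip and SequenceFiles) that -cat would print as raw bytes; the path is a placeholder:

```bash
hdfs dfs -text /user/out/part-r-00000.gz
```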
● What is the Secondary NameNode and what are its functions? Why do we need it?
● What is the Backup node, and how is it different from the Secondary NameNode?
● What are the fsimage and edit logs, and how are they related?
● What is the default block size in HDFS, and why is it so large?
● How will you copy a large file of 50 GB into HDFS in parallel?
● What is balancing in HDFS?
● What is expunge in HDFS?
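Hedged sketches for the two questions above (the threshold value is only an example):

```bash
# spread blocks more evenly across DataNodes; threshold is the allowed
# deviation, in percent, of each node's disk usage from the cluster average
hdfs balancer -threshold 10

# permanently delete trash contents older than the configured retention interval
hdfs dfs -expunge
```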
● What is the default URI for the HDFS Web UI? Can we create files via the HDFS Web UI?
● How can we check the existence of a non-zero-length file via HDFS commands?
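The -test option covers this; a minimal sketch with a placeholder path:

```bash
# -test -s exits with status 0 only if the path exists and is non-empty
# (non-zero length for a file)
hdfs dfs -test -s /user/data/file.txt && echo "non-empty file exists"
```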
● What is the role of IOUtils in the HDFS API, and how is it useful?
● Can we archive files in HDFS? If yes, how can we do that?
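Yes, via Hadoop Archives (HAR); an illustrative example with placeholder paths:

```bash
# pack /user/in/dir1 into a Hadoop Archive stored under /user/out
hadoop archive -archiveName data.har -p /user/in dir1 /user/out

# the archive can then be browsed through the har:// scheme
hdfs dfs -ls har:///user/out/data.har
```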
● What is safe mode in Hadoop, and what are the restrictions during safe mode?
● Can we come out of safe mode manually? If yes, how?
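Yes, through dfsadmin, for example:

```bash
hdfs dfsadmin -safemode get    # check whether safe mode is on
hdfs dfsadmin -safemode leave  # force the NameNode out of safe mode
```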
● Why is the block size in Hadoop kept very large compared to traditional filesystem block sizes?
● What are Sequence files, and how are they different from text files?
● What are the limitations of Sequence files?
● What are Avro files?
● Can an Avro file created with the Java API on one machine be read on another machine with the Ruby API?
● Where is the schema of an Avro file stored if the file is transferred from one host to another?
● How do we handle small files in HDFS?
● What is a delegation token in Hadoop, and why is it important?
● What is 'fsck' in Hadoop?
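A typical invocation, as an illustration:

```bash
# check the health of the whole namespace; report files, their blocks,
# and the DataNodes each block lives on
hdfs fsck / -files -blocks -locations
```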
● Can we append data records to an existing file in HDFS?
● Can we get a count of files in a directory on HDFS via the command line?
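Illustrative commands for the two questions above (paths are placeholders):

```bash
# append a local file's records to an existing HDFS file
hdfs dfs -appendToFile more.log /logs/app.log

# print directory count, file count, and content size for a path
hdfs dfs -count /logs
```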
● How do we achieve security on a Hadoop cluster?
● Can we create multiple files in HDFS with different block sizes?
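Yes; the block size can be overridden per file at write time. A sketch with an illustrative 256 MB value:

```bash
# write one file with a 256 MB block size (the value is in bytes)
hdfs dfs -D dfs.blocksize=268435456 -put big.dat /user/data/
```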
● What is the importance of dfs.namenode.name.dir in Hadoop?
● What is the need for fsck in Hadoop?
● Do HDFS block boundaries fall between records or across records?
● What is speculative execution?
● What is the Distributed Cache?
● What is the workflow of a MapReduce job?
● How will you globally sort the output of a MapReduce job?
● What is the difference between a map-side and a reduce-side join?
● What is MapReduce chaining?
● How will you pass parameters to a mapper or reducer?
● How will you create custom key and value types?
● How do you sort based on a column other than the key?
● How will you create custom input formats?
● How will you process a huge number of small files in a MapReduce job?
● Can we run a reducer without a mapper?
● Do mapper and reducer tasks run in parallel? If not, why do we sometimes see progress such as (map 80%, reduce 10%)?
● How will you set up a custom counter to detect bad records in the input?
● How will you schedule MapReduce jobs?
● What is a combiner? Tell me one scenario where it is not suitable.
● How will you submit a MapReduce job through the command line?
● How will you kill a running MapReduce job?
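For the last two questions, typical commands look like the following; the jar name, main class, paths, and job ID are all placeholders:

```bash
# submit a job: jar, main class, input path, output path
hadoop jar wordcount.jar com.example.WordCount /user/in /user/out

# list running MapReduce jobs, then kill one by its job ID
mapred job -list
mapred job -kill job_1700000000000_0001
```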
● How will you trace the root cause of a failed MapReduce job?
● What will you do if a MapReduce job fails with a Java heap space error message?
● How many map tasks and reduce tasks will run on each DataNode by default?
● What is the minimum RAM capacity needed for such a DataNode?
● What is the difference between MapReduce and YARN?
● What is the Tez framework?
● What is the difference between Tez and MapReduce?
● What are InputSplit, InputFormat, and RecordReader in MapReduce programming?
● Does MapReduce support processing of Avro files? If yes, what are the main classes of the API?
● How will you process a dataset in JSON format in a MapReduce job?
● Can we create a multi-level directory structure (year/month/date) in MapReduce based on the input data?
● What is the relation between TextOutputFormat and KeyValueTextInputFormat?
● What is LazyOutputFormat in MapReduce, and why do we need it?
● How do we prevent file splitting in MapReduce?
● What is the difference between the Writable and WritableComparable interfaces, and which is sufficient for a value type in a MapReduce job?
● What is the role of the ApplicationMaster in running a MapReduce job through YARN?
● What is an Uber task?
● What are the IdentityMapper and IdentityReducer classes?
● How do we create a JAR file from .class files in a directory through the command line?
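With the JDK's jar tool, for instance (file names are placeholders):

```bash
# bundle every .class file under classes/ into job.jar
jar -cvf job.jar -C classes/ .
```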
● What is the default port for the YARN Web UI?
● How can we distribute our application’s JARs to all of the nodes in the YARN cluster that need them?
● How do we include native libraries in YARN jobs?
● What is the default scheduler inside the YARN framework for starting tasks?
● How do we handle record boundaries in text files or Sequence files in MapReduce InputSplits?
In MapReduce, an InputSplit’s RecordReader starts and ends at a record boundary. In SequenceFiles, a 20-byte sync mark is written roughly every 2 KB between records. These sync marks let the RecordReader seek to the start of its InputSplit (which is defined by a file, an offset, and a length) and then scan forward to the first sync mark after the start of the split. The RecordReader continues processing records until it reaches the first sync mark after the end of the split. Text files are handled similarly, using newlines instead of sync marks.
● Sometimes MapReduce jobs fail when the same job is submitted by a different user. What is the cause, and how do we fix it?
● How do we change the default location of a MapReduce job’s intermediate data?
● If a map task fails once during MapReduce job execution, will the job fail immediately?
● What are the 4 V’s of big data?
● Which one is the most important?
● What file formats can you use in Hadoop?
● What is the difference between a NameNode and a DataNode?
● What is HDFS?
● What is the purpose of YARN?
● What is a data lake?
● What is a data warehouse?
● Are there data lake warehouses (i.e., lakehouses)?
● Can there be two data lakes within a single warehouse?
● What is a data mart?
● What is a slowly changing dimension, and what are its types?
● What is a surrogate key, and why use one?