SPARK

INTERVIEW QUESTIONS

● Why do we need Spark when Hadoop already exists?

● Why are both Spark and Hadoop needed?

● How can you use a Machine Learning library such as scikit-learn, which is written in Python, with the Spark engine?

● Why is Spark good at low-latency iterative workloads, e.g. graphs and Machine Learning?

● Which kinds of data processing are supported by Spark?

● How do you define SparkContext?

● How can you define SparkConf?

● What are the ways to configure Spark properties? Order them from least important to most important.

● What is the Default level of parallelism in Spark?

● Is it possible to have multiple SparkContexts in a single JVM?

● Can an RDD be shared between SparkContexts?

● In spark-shell, which contexts are available by default?

● Give a few examples of how an RDD can be created using SparkContext.

● How would you broadcast a collection of values to the Spark executors?

● What is the advantage of broadcasting values across a Spark cluster?

● Can we broadcast an RDD?

● How can we distribute JARs to workers?

● How can you stop a SparkContext, and what is the impact of stopping it?

● Which scheduler is used by SparkContext by default?

● How would you set the amount of memory to allocate to each executor?

● How do you define RDD?

● What does it mean that an RDD is lazily evaluated?

● How would you control the number of partitions of an RDD?

● What are the possible operations on an RDD?

● How does an RDD help with parallel job processing?

● What is a transformation?

● How do you define actions?

● How can you create an RDD for a text file?

● What are Preferred Locations?

● What is an RDD Lineage Graph?

● Please explain how execution starts and ends for an RDD or a Spark job.

● Give examples of transformations that do trigger jobs.

● How many types of transformations exist?

● What are Narrow Transformations?

● What are Wide Transformations?

● Data is spread across all the nodes of the cluster; how does Spark try to process this data?

● How would you hint at a minimum number of partitions in a transformation?

● What limits the maximum size of a partition?

● When Spark works with file.txt.gz, how many partitions can be created?

● What is the coalesce transformation?

● What is the difference between the cache() and persist() methods of an RDD?

● You have an RDD storage level defined as MEMORY_ONLY_2; what does the _2 mean?

● What is Shuffling?

● Does shuffling change the number of partitions?

● What is the difference between groupByKey and reduceByKey?
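The semantic difference behind this question can be sketched in plain Python (an illustrative analogy, not the PySpark API): groupByKey shuffles every value across the network before aggregating, while reduceByKey combines values per key map-side first, so far less data is shuffled.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# groupByKey: ships every (key, value) pair across the network,
# then materializes the full list of values per key.
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)
group_then_sum = {k: sum(vs) for k, vs in grouped.items()}

# reduceByKey: combines values locally (map-side) first, so only
# one partial result per key per partition is shuffled.
reduced = {}
for k, v in pairs:
    reduced[k] = reduced.get(k, 0) + v

assert group_then_sum == reduced == {"a": 9, "b": 6}
```

Both produce the same result; the difference is in how much data crosses the network during the shuffle.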

● When you call join operation on two pair RDDs e.g. (K, V) and (K, W), what is the result?

● What is checkpointing?

Checkpointing saves an RDD to a reliable distributed (HDFS) or local file system; RDD checkpointing saves the actual intermediate RDD data to a reliable distributed file system. You mark an RDD for checkpointing by calling RDD.checkpoint(). The RDD will be saved to a file inside the checkpoint directory and all references to its parent RDDs will be removed. This function has to be called before any job has been executed on this RDD.

● What do you mean by Dependencies in an RDD lineage graph?

● Which script will you use to run a Spark application using spark-shell?

● Define Spark architecture

● What is the purpose of Driver in Spark Architecture?

● Can you define the purpose of master in Spark architecture?

● What are the workers?

● Please explain how workers work when a new job is submitted to them.

● Please define executors in detail.

● What is the DAG Scheduler and how does it perform?

● What is a stage, with regard to Spark job execution?

● What is a task, with regard to Spark job execution?

● What is Speculative Execution of tasks?

● Which cluster managers can be used with Spark?

● What is a BlockManager?

● What is Data locality / placement?

● What is the master URL in local mode?

● Define the components of YARN.

● What is a Broadcast Variable?

● How can you define Spark Accumulators?

● What are the data sources Spark can process?

● What is the Apache Parquet format?

● What is Apache Spark Streaming?

● What happens when we submit a Spark job?

● Broadcast vs. accumulator

● repartition() vs. coalesce()

● cache() vs. persist()

● Wide transformations vs. narrow transformations

● What is normalization (in databases)?

● What is vectorization?

● How do you decide driver memory, executor memory, executor cores, and the number of executors?

● Difference between groupByKey and reduceByKey

● How do you supply your own schema for a file?

● How do you combine two DataFrames?

● How do you add a new column to a DataFrame?

● What is an RDD? Give 3 methods to create one.

● Create an RDD from 1 to 10 and find the even numbers.

● Difference between RDD, DataFrame, and Dataset

● Difference between map and flatMap
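One way to answer the map vs flatMap question is with a plain-Python sketch (the helper names below are hypothetical, not Spark APIs): map produces exactly one output element per input element, while flatMap produces zero or more per input and flattens the results into a single collection.

```python
# Plain-Python analogues of the two transformations (illustrative only).
def map_rdd(data, f):
    # One output element per input element.
    return [f(x) for x in data]

def flat_map_rdd(data, f):
    # Zero or more outputs per input, flattened into one list.
    return [y for x in data for y in f(x)]

lines = ["hello world", "spark"]

mapped = map_rdd(lines, lambda line: line.split(" "))
flat_mapped = flat_map_rdd(lines, lambda line: line.split(" "))

assert mapped == [["hello", "world"], ["spark"]]   # nested lists
assert flat_mapped == ["hello", "world", "spark"]  # flattened
```

The word-count example is the classic case: splitting lines into words needs flatMap, because map would leave you with a list of lists.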

● What are transformations and actions?

● Input string = AAAABBCC
Output = 4A2B2C
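A minimal run-length encoder answers this question; note that AAAABBCC contains two C's, so the encoding comes out as 4A2B2C.

```python
from itertools import groupby

def run_length_encode(s):
    """Compress runs of repeated characters: AAAABBCC -> 4A2B2C."""
    return "".join(f"{len(list(group))}{ch}" for ch, group in groupby(s))

assert run_length_encode("AAAABBCC") == "4A2B2C"
```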

● Explain shared variables in Spark.

● The spark-submit command: which parameters do we use with it?

● Types of transformations

● Types of deployment modes in spark-submit

● Lineage graph and DAG

● What happens when we run the spark-submit command?

● reduceByKey and combineByKey
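The combineByKey contract (createCombiner, mergeValue, mergeCombiners) can be sketched in plain Python over simulated partitions; this is an illustrative analogy, not the PySpark API. Per-key average is the classic use case, because the accumulator type (sum, count) differs from the value type, which plain reduceByKey cannot express.

```python
# The three functions combineByKey expects:
def create_combiner(v):
    return (v, 1)                    # first value seen for a key

def merge_value(acc, v):
    return (acc[0] + v, acc[1] + 1)  # fold a value into a (sum, count)

def merge_combiners(a, b):
    return (a[0] + b[0], a[1] + b[1])  # merge accumulators across partitions

# Two simulated "partitions" of (key, value) pairs.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("a", 5)]]

# Phase 1: aggregate within each partition.
per_partition = []
for part in partitions:
    accs = {}
    for k, v in part:
        accs[k] = merge_value(accs[k], v) if k in accs else create_combiner(v)
    per_partition.append(accs)

# Phase 2: merge the per-partition accumulators (this is the shuffle).
merged = {}
for accs in per_partition:
    for k, acc in accs.items():
        merged[k] = merge_combiners(merged[k], acc) if k in merged else acc

averages = {k: s / c for k, (s, c) in merged.items()}
assert averages == {"a": 3.0, "b": 2.0}
```

reduceByKey is the special case where the accumulator has the same type as the values and createCombiner is the identity function.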

● Remove the first two lines from a file using Spark.

● The 1st table has 10 records and the 2nd has 5 records, all of which are also in the 1st table; find the records present only in the 1st table (solve using DataFrames).

● What are RDD, DataFrame, and Dataset? What are the differences between them? Which API is fastest and why?

● What is the difference between map, flatMap, and foreach?

● How do you add a column to a DataFrame?

● How do you give your own column names to a DataFrame?

● How do you load a big file (CSV, Parquet) into Spark for analysis?

● How do you combine two DataFrames?

● How do you remove duplicates from a DataFrame?

● What is the difference between distinct and dropDuplicates?

● What is the main difference between var and val, and where do we use var and val in terms of Hadoop?

● How do you decide the number of partitions in Spark?

● Difference between coalesce and repartition?

● What are DAG and lineage?

● What is the difference between cache and persist? How do they work, and in which scenarios have you used them?

● What is a shared variable? When do you use broadcast and accumulator variables?

● What is a transformation? What are the types of transformations? What is an action?

● What is lazy evaluation and what advantage does it give us?

● What is the syntax of the spark-submit command? What is the difference between client and cluster mode?
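A typical spark-submit invocation might look like the following; the class name, resource sizes, and jar name are illustrative placeholders.

```shell
# --deploy-mode client runs the driver on the submitting machine;
# cluster mode runs it inside the cluster (e.g. in a YARN container).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --driver-memory 4g \
  --num-executors 10 \
  --executor-memory 8g \
  --executor-cores 4 \
  --conf spark.sql.shuffle.partitions=200 \
  my-app.jar arg1 arg2
```

For a PySpark job, the --class flag is omitted and a .py file takes the place of the jar.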

● What is the difference between static allocation and dynamic allocation?

● How can we set executors and cores with dynamic allocation? What are the properties for it?

● How do you decide executors and cores for huge data?

● What happens when we submit a Spark job? What is the Catalyst optimizer?

● What are the optimization techniques in Spark?

● How do you process nested JSON?

● What is skewed data? How do you avoid it? What techniques can handle skewed data?

● What happens when we are loading a 500 MB file into HDFS and only 200 MB of it has been written to HDFS blocks so far: if a user tries to access that file at the same time, will it be fetched, or will they get an error?

● What problems did you face during development? What was your role? Were you involved in the whole flow?

● Difference between MapReduce and Spark? Why is Spark faster?

● What kind of configuration did you do for Spark?

● How do you apply conditions in a DataFrame?

● How do you handle bad or null records in Spark? What is FAILFAST? What is DROPMALFORMED?

● Write the syntax for creating a DataFrame. How do you convert a list or an array into a DataFrame?

● How do you select a few columns from an RDD without converting it to a DataFrame?

● How much memory will the spark pro

● Static and dynamic partitioning?

● Why do we store the metastore in an RDBMS instead of HDFS?

● If one dataset is big and one is small, how do you join them?

● What is data skewness? What is the salting technique?
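Salting can be sketched in plain Python (illustrative only, not a Spark API): a hot key is split into N sub-keys so its records spread across N reducers instead of landing on one, and the salt is stripped in a second aggregation pass.

```python
# N salted variants per key; hot keys spread over N "reducers".
N = 4
records = [("hot", 1)] * 8 + [("cold", 1)] * 2

# Pass 1: salt the key, then aggregate by salted key.
salted_counts = {}
for i, (k, v) in enumerate(records):
    salted_key = (k, i % N)          # deterministic salt for the demo
    salted_counts[salted_key] = salted_counts.get(salted_key, 0) + v

# Pass 2: strip the salt and merge the partial aggregates.
final_counts = {}
for (k, _salt), v in salted_counts.items():
    final_counts[k] = final_counts.get(k, 0) + v

# "hot" was spread over 4 salted keys instead of 1, yet the totals match.
assert sum(1 for (k, _s) in salted_counts if k == "hot") == 4
assert final_counts == {"hot": 8, "cold": 2}
```

In real Spark jobs the salt is typically a random integer appended to the join or group key, with the small side of a join replicated across all N salt values.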

● Can we perform transformations like join/count on Oracle tables using Spark?

● What is the difference between Spark and Hadoop?

● What are the differences between functional and imperative languages, and why is functional programming important?

● What is a resilient distributed dataset (RDD)? Explain with diagrams.

● Explain transformations and actions (in the context of RDDs)

● What are the Spark use cases?

● Why do we need transformations? What is lazy evaluation and why is it useful?

● What is ParallelCollectionRDD?

● Explain how reduceByKey and groupByKey work.

● What is the common workflow of a Spark program?

● Explain the Spark environment for the driver.

● What are the transformations and actions that you have used in Spark?

● How can you minimize data transfers when working with Spark?

● What is a lineage graph?

● Describe the major libraries that constitute the Spark Ecosystem

● What are the different file formats that can be used in SparkSql?

● What are Pair RDDs?

● What is the difference between persist() and cache()

● What are the various levels of persistence in Apache Spark?

● Which Storage Level to choose?

● Explain advantages and drawbacks of RDD

● Explain why Datasets are preferred over RDDs.

● How to share data from Spark RDD between two applications?

● Does Apache Spark provide checkpointing?

● Explain the internal working of caching?

● What is the function of Block manager?

● Why does Spark SQL consider the support of indexes unimportant?

● How to convert existing UDTFs in Hive to Scala functions and use them from Spark SQL?

● Why use dataframes and datasets when we have RDD?

● What is a Catalyst and how does it work?

● What are the top challenges developers face while writing Spark applications?

● Explain the difference in implementation between DataFrames and DataSet?

● How is memory handled in Datasets?

● What are the limitations of dataset?

● What are the contentions with memory?

● Show the command to run Spark in YARN client mode.

● Show the command to run Spark in YARN cluster mode.

● What is Standalone and YARN mode?

● Explain client mode and cluster mode in Spark?

● Which cluster managers are supported by Spark?

● What is Executor memory?

● What is a DStream, and what is the difference between batch processing and DStreams in Spark Streaming?

● How does Spark Streaming work?

● Difference between map() and flatMap()?

● What is reduce() action, Is there any difference between reduce() and reduceByKey()?

● What is the disadvantage of reduce() action and how can we overcome this limitation?

● What are Accumulators and when are accumulators truly reliable?

● What is Broadcast Variables and what advantage do they provide?

● What is piping? Demonstrate with an example of a data pipeline.

● What is a driver?

● What does a Spark Engine do?

● What are the steps that occur when you run a Spark application on a cluster?

● What is a schema RDD/DataFrame?

● What are Row objects?

● How does Spark achieve fault tolerance?

● What parameter is set if cores need to be defined across executors?

● Name a few Spark Master system properties.

● Define partitions in reference to the Spark implementation.

● Differences between how Spark and MapReduce manage cluster resources under YARN.

● What is GraphX and what is PageRank?

● What does MLlib do?

● What is a Parquet file?

● Why is Parquet used for Spark SQL?

● What is schema evolution and what is its disadvantage? Explain schema merging in reference to Parquet files.

● Name the different types of Cluster Managers in Spark

● In how many ways can we create RDDs? Show examples.

● How do you flatten rows in Spark? Explain with an example.

● What is Hive on Spark?

● Explain Spark Streaming Architecture?

● What are the types of Transformations on DStreams?

● What is Receiver in Spark Streaming, and can you build custom receivers?

● Explain the process of storing live-streamed DStream data to a database.

● How is Spark Streaming fault tolerant?

● Explain the transform() method used on a DStream.

● What file systems does Spark support?

● How is data security achieved in Spark?

● Explain Kerberos security?

● Name the various types of distribution that Spark supports.

● Show some example queries using the Scala DataFrame API.

● Under what conditions can the Spark driver parallelize datasets as RDDs?

● Can the repartition() operation decrease the number of partitions?

● What is the drawback of the repartition() and coalesce() operations?

● In a join operation, for example val joinVal = rddA.join(rddB), will it generate a partition?

● Consider the following code in Spark: what is the final value in the fVal variable?

● Scala pattern matching: show the various ways the code can be written.

● What is the return result when a query is executed using Spark SQL or Hive? Hint: an RDD or a DataFrame/Dataset?

● If we want to display just the schema of a DataFrame/Dataset, what method is called?

● Show various implementations for the following query in Spark.

● What are the most important factors you want to consider when you start a machine learning project?

● As a data scientist, which algorithm would you suggest if legal aspects and ease of explanation to non-technical people are the main criteria?

● For a supervised learning algorithm, what percentage of data is split between the training and test datasets?
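A common convention is 70-80% training and 20-30% test. A minimal sketch of an 80/20 split in plain Python (illustrative; libraries like scikit-learn provide train_test_split for the same purpose):

```python
import random

data = list(range(100))
rng = random.Random(42)          # fixed seed for reproducibility

# Shuffle before splitting so the split is not biased by record order.
rng.shuffle(data)

split = int(len(data) * 0.8)     # 80/20 split
train, test = data[:split], data[split:]

assert len(train) == 80 and len(test) == 20
assert sorted(train + test) == list(range(100))  # nothing lost or duplicated
```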

● Compare the performance of the Avro and Parquet file formats and their usage (in the context of Spark).

● Spark Master exposes a set of REST APIs to submit and monitor applications. Which data format is used for these web services?

● When should you not use Spark?

● Can you use Spark to access and analyze data stored in Cassandra databases?

● With which mathematical properties can you achieve parallelism?

● What are various types of Partitioning in Apache Spark?

● How to set partitioning for data in Apache Spark?

● How did you handle logging in Spark jobs?

● What is Spark GraphX?

● How did you handle errors in Spark jobs?

● What is the best way to write a DataFrame to a database (JDBC)? Connection pooling.

● Get the number of records processed from an input file (the input file is very large: 50 million records).

● Can we create a Dataset with the Row datatype?

● Can we create a DataFrame with a Person datatype?

● How do you provide security to your Spark job?

● Spark architecture

● What is ETL?

● What is YARN?

● What is Spark?

● What is the cost-based optimization model in Spark?

● Write a PySpark program for importing data from RDS to HBase.

● What is a broadcast variable? What is an accumulator?

● When do we use DataFrames, RDDs, and Datasets?

● Which transformations and actions have you used?

● What is Spark Streaming? When is it used?