# Data Science Interview Questions

**Category:** Data Science

**Posted:** Mar 15, 2019

**By:** Alvera Anto

**1. What is Data Science?**

Data Science is a mixture of statistics, technical skills, and business vision, used to analyze existing data and predict future trends.

**2. What is the difference between Big data, Data Science, and Data Analytics?**

**Big Data:**

Huge volumes of data: structured, unstructured, and semi-structured.

It requires a basic knowledge of statistics and mathematics.

**Data Science:**

It deals with slicing and dicing the data.

It requires in-depth knowledge of statistics and mathematics.

**Data Analytics:**

It contributes operational insights to complex business scenarios.

It requires adequate knowledge of statistics and mathematics.

**3. What is the difference between Supervised and Unsupervised learning?**

**Supervised Learning:**

- In supervised learning, the input data is labeled.
- It uses a training dataset.
- It is used for prediction.
- It enables classification and regression.

**Unsupervised Learning:**

- In unsupervised learning, the input data is unlabeled.
- It uses the input dataset alone.
- It is used for analysis.
- It enables classification, density estimation, and dimension reduction.
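As a minimal sketch (the toy data, the 1-nearest-neighbour rule, and the midpoint split are all hypothetical illustrative choices), the contrast can be shown in a few lines: the supervised model uses labels to predict, while the unsupervised one groups the same points without labels.

```python
# Supervised: labeled examples train a simple 1-nearest-neighbour classifier.
train = [(1.0, "low"), (1.2, "low"), (8.0, "high"), (8.5, "high")]

def predict(x):
    # Return the label of the closest training point.
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

# Unsupervised: the same points without labels, split into two groups
# around the midpoint of the data range (a crude form of clustering).
points = [1.0, 1.2, 8.0, 8.5]
mid = (min(points) + max(points)) / 2
clusters = [0 if p < mid else 1 for p in points]

print(predict(2.0))   # prediction made using the labels
print(clusters)       # grouping discovered without labels
```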

**4. What are the important skills to have in Python for data analysis?**

The following are some of the important skills to possess which will come in handy when performing data analysis using Python.

- Good understanding of the built-in data types, especially lists, dictionaries, tuples, and sets.
- Mastery of N-dimensional NumPy Arrays.
- Mastery of Pandas data frames.
- Ability to perform element-wise vector and matrix operations on NumPy arrays.
- Familiarity with the Anaconda distribution and the conda package manager.
- Familiarity with Scikit-learn.
- Ability to write efficient list comprehensions instead of traditional for loops.
- Ability to write small, clean functions (important for any developer), preferably pure functions that don’t alter objects.
- Knowing how to profile the performance of a Python script and how to optimize bottlenecks.
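Two of the skills above, element-wise NumPy operations and list comprehensions, can be sketched as follows (the array values are arbitrary examples):

```python
import numpy as np

# Element-wise vector operations on NumPy arrays: no explicit loops needed.
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
total = a + b    # element-wise addition
scaled = a * 2   # a scalar is broadcast across the whole array

# A list comprehension replacing a traditional for loop.
squares = [x * x for x in range(5)]

print(total, scaled, squares)
```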

**5. What is Selection Bias?**

Selection bias is the term used to describe the situation where an analysis has been conducted on a subset of the data (a sample) with the goal of drawing conclusions about the population, but the resulting conclusions are likely to be wrong (biased) because the subgroup differs from the population in some important way. Selection bias is usually introduced when the sampling, or the selection of data for analysis, is not properly randomized.

**6. Which language is more suitable for text analytics: R or Python?**

Python is generally the more suitable language, for the following reasons:

- Its Pandas library provides easy-to-use data structures and high-performance data analysis tools.
- R is better suited to statistical and machine learning tasks than to text analysis.
- Python tends to perform faster for most kinds of text analytics.

**7. Compare Python, SAS, and R languages.**

**R:** R is an open-source tool and hence is widely used by academia and the research community. It is a robust tool for statistical computation, graphical representation, and reporting. Due to its open-source nature, it is constantly updated with the latest features, which are then readily available to everybody.

**Python:** Python is a powerful open-source programming language that is easy to learn, works well with most other tools and technologies. Python has many libraries and community-created modules making it very robust. It has functions for statistical operation, model building, and more.

**SAS:** SAS is one of the most widely used analytics tools, adopted by some of the biggest companies on earth. It has some of the best statistical functions and a graphical user interface, but it comes with a price tag, so smaller enterprises cannot always readily adopt it.

**8. What are the benefits of the R language?**

The R programming language comprises a set of software tools used for graphical representation, statistical computing, data manipulation, and calculation. Some of the benefits of the R programming environment are:

- It is an extensive collection of tools for data analysis.
- It provides data analysis techniques for graphical representation.
- It is a highly developed yet simple and effective programming language.
- It supports machine learning applications.
- It acts as a connecting link between various software, tools, and datasets.
- It provides a robust package ecosystem for different needs.
- It is useful in solving data-oriented problems.
- It enables high-quality, reproducible analysis that is flexible and powerful.

**9. How do Data Scientists use statistics?**

With the help of statistics, Data Scientists can look into the data for patterns and hidden insights and convert Big Data into big insights. It helps to get a better idea of what customers are expecting. Using statistics, Data Scientists can learn about consumer behavior, interest, engagement, retention, and conversion. It helps them to create powerful data models to validate certain implications and predictions. All this can be converted into a powerful business proposition by giving users precisely what they want, when they want it.


**10. What is Normal Distribution?**

Data can be distributed in different ways: with a bias to the left or to the right, or all jumbled up. There is, however, a chance that data is distributed around a central value without any bias to the left or right, reaching a normal distribution in the form of a bell-shaped curve. In that case the random variable is distributed as a symmetrical, bell-shaped curve.

Some of the properties of Normal Distribution are:

- Unimodal: one mode
- Symmetrical: left and right halves are mirror images
- Bell-shaped: maximum height (mode) at the mean
- Mean, mode, and median are all located in the center
- Asymptotic: the tails approach, but never touch, the horizontal axis
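These properties can be checked empirically. In the sketch below (the mean and standard deviation are arbitrary illustrative values), the sample mean and median of normally distributed draws land on the same center:

```python
import random
import statistics

# Draw samples from a normal distribution with a chosen center and spread.
random.seed(42)  # fixed seed so the sketch is reproducible
mu, sigma = 50.0, 5.0
samples = [random.gauss(mu, sigma) for _ in range(100_000)]

mean = statistics.mean(samples)
median = statistics.median(samples)

# For a normal distribution, mean, median, and mode all coincide at mu.
print(round(mean, 2), round(median, 2))
```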

**11. How does data cleaning play an important role in Data Analysis?**

Data cleaning plays an important role in data analysis in the following ways:

- Since data comes from multiple sources, cleaning helps transform it into a format that data analysts and data scientists can work with.
- Data cleaning is the process of detecting and correcting faulty data records, so it helps to enhance the accuracy of a machine learning model.
- Cleaning can take up to 80% of the time spent on an analysis, making it a critical part of the task.
- It is a cumbersome process: as the number of data sources increases, the time taken to clean the data grows rapidly with the number of sources and the volume of data they generate.
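A typical cleaning pass might look like the pandas sketch below (the small dataset and the chosen steps, deduplication, dropping missing rows, and fixing a column's type, are hypothetical):

```python
import pandas as pd

# A small, deliberately messy hypothetical dataset.
raw = pd.DataFrame({
    "customer": ["a", "a", "b", "c", None],
    "amount": ["10", "10", "20", None, "40"],   # numbers stored as strings
})

clean = (
    raw
    .drop_duplicates()   # remove exact duplicate rows
    .dropna()            # drop rows with missing values
    .assign(amount=lambda d: d["amount"].astype(float))  # fix the dtype
    .reset_index(drop=True)
)
print(clean)
```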

**12. What is the difference between univariate, bivariate, and multivariate analysis?**

**Univariate Analysis:** A univariate analysis involves only one variable, so there are no relationships or causes to examine. The major aspect of univariate analysis is to summarize the data and find patterns within it in order to make actionable decisions.

**Bivariate Analysis:** It deals with the relationship between two sets of data. These sets of paired data come from related sources or samples.

**Multivariate Analysis:** It deals with the study of more than two variables to comprehend the effect of variables on the responses.
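As a brief sketch (the height and weight figures are hypothetical), a univariate analysis summarizes a single variable, while a bivariate analysis measures the relationship between two:

```python
import numpy as np

heights = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
weights = np.array([55.0, 60.0, 66.0, 70.0, 76.0])

# Univariate: summarize one variable on its own.
print(heights.mean(), heights.std())

# Bivariate: measure how two variables move together.
r = float(np.corrcoef(heights, weights)[0, 1])
print(round(r, 3))   # close to +1: taller goes with heavier in this toy data
```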

**13. What is Logistic regression?**

Logistic regression is a statistical technique or model used to examine a dataset and predict a binary outcome. The result must be binary, i.e., either zero or one, yes or no.
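A minimal from-scratch sketch of the idea (the hours-studied data, learning rate, and iteration count are all hypothetical choices) fits a sigmoid to a binary outcome by gradient descent:

```python
import math

# Hypothetical data: x = hours studied, y = passed (1) or failed (0).
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit weight w and bias b by gradient descent on the log-loss.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)   # predicted probability of y = 1
        w -= lr * (p - y) * x
        b -= lr * (p - y)

# Threshold the probability at 0.5 to get a binary prediction.
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
print(predictions)
```

In practice a library implementation (for example, scikit-learn's `LogisticRegression`) would be used instead of hand-rolled gradient descent.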

**14. Explain the difference between long and wide format data.**

In the **wide** format, a subject’s repeated responses will be in a single row, with each response in a separate column, whereas in the **long** format, each row is one time point per subject. You can recognize data in wide format by the fact that columns generally represent groups.
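With pandas, `melt` converts wide to long; a small sketch on hypothetical repeated measurements:

```python
import pandas as pd

# Wide format: one row per subject, one column per repeated measurement.
wide = pd.DataFrame({
    "subject": ["s1", "s2"],
    "week1": [5.0, 6.0],
    "week2": [5.5, 6.5],
})

# Long format: one row per subject per time point.
long = wide.melt(id_vars="subject", var_name="week", value_name="score")
print(long)
```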

**15. What is the objective of A/B Testing?**

It is a statistical hypothesis test for a randomized experiment with two variants, A and B.

The objective of A/B Testing is to identify changes to a web page that maximize or increase an outcome of interest. A/B testing is a fantastic method for figuring out the best online promotional and marketing strategies for your business. It can be used to test everything from website copy to sales emails to search ads.
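The comparison between the two variants is commonly settled with a two-proportion z-test; a sketch with hypothetical conversion counts:

```python
import math

# Hypothetical A/B results: conversions out of visitors for each variant.
conv_a, n_a = 200, 1000   # variant A: 20% conversion
conv_b, n_b = 260, 1000   # variant B: 26% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)      # pooled conversion rate
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se                          # standardized difference

# Two-sided p-value from the standard normal CDF.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(round(z, 2), round(p_value, 4))
```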

**16. What is Machine Learning?**

Machine learning is the study and creation of algorithms that can learn from data and make predictions on it. It is closely related to computational statistics. It is used to devise complex models and algorithms that lend themselves to prediction, which in commercial use is known as predictive analytics.

**18. Explain the different features of a Machine Learning process.**

**Domain Knowledge:** This is the first step, where we understand how to extract the various features from the data and learn more about the data we are dealing with. It has to do with the type of domain we are working in and familiarizing the system with it.

**Feature Selection:** In this step, we select from the set of features we have. Sometimes there are a lot of features, and we have to make an intelligent decision about which ones to select to go ahead with our machine learning endeavor.

**Algorithm:** This is an important step, since the algorithm chosen will have a major impact on the entire machine learning process. You can choose between linear and non-linear algorithms; a few examples are Support Vector Machines, Decision Trees, Naïve Bayes, and K-Means Clustering.

**Training:** This is the most important part of the machine learning technique. Training is done on the data we have, providing more real-world experience. With each subsequent training step, the machine gets better and smarter and is able to take improved decisions.

**Evaluation:** In this step, we evaluate the decisions taken by the machine to decide whether they are up to the mark. Various metrics are involved in this process, and we have to apply each of them closely to decide on the efficacy of the whole machine learning endeavor.

**Optimization:** This process involves improving the performance of the machine learning process using various optimization techniques. It is one of the most vital components, wherein the performance of the algorithm is vastly improved. The best part of optimization techniques is that machine learning is not just a consumer of optimization techniques; it also provides new ideas for optimization.

**Testing:** Here, various tests are carried out, some on unseen test cases. The data is partitioned into a test and a training set, and various testing techniques, like cross-validation, are used to deal with multiple situations.

**19. How is Machine Learning deployed in real-time scenarios?**

Here are some of the scenarios in which machine learning finds applications in the real world:

- E-commerce: understanding customer churn, deploying targeted advertising, remarketing
- Search engines: ranking pages based on the personal preferences of the searcher
- Finance: evaluating investment opportunities and risks, detecting fraudulent transactions
- Healthcare: designing drugs based on the patient’s history and needs
- Robotics: handling situations that are out of the ordinary
- Social media: understanding relationships and recommending connections
- Information extraction: framing questions for getting answers from databases over the web

**20. What is Linear Regression?**

It is the most commonly used method in predictive analytics. Linear Regression describes the relationship between a dependent variable and an independent variable; the main task is fitting a single line to a scatter plot. Linear Regression involves the following three steps:

- Determining and analyzing the correlation and direction of the data
- Deploying the estimation of the model
- Ensuring the usefulness and validity of the model

It is extensively used in scenarios where a cause-and-effect model comes into play, for example when you want to know the effect of a certain action on various outcomes, and the extent to which the cause determines the final outcome.
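The line-fitting step above reduces to two closed-form least-squares formulas; a sketch on hypothetical (x, y) pairs:

```python
# Hypothetical data: x = ad spend, y = sales.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 6.0, 8.2, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = cov(x, y) / var(x); the intercept makes the fitted line pass
# through the point of means (mean_x, mean_y).
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

def predict(x):
    return slope * x + intercept

print(round(slope, 2), round(intercept, 2))
```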

**21. What is a recommender system?**

Recommender systems are widely used in multiple fields: movie recommendations, music preferences, social tags, research articles, search queries, and so on. They work via collaborative filtering, content-based filtering, or a personality-based approach. This type of system uses a person’s past behavior to build a model for the future, predicting future product purchases, movie viewing, or book reading by people.

**22. What is K-means clustering?**

K-means clustering is a basic unsupervised learning algorithm. It is used to classify data into a certain number of clusters, called K clusters, grouping the data to find similarity within it.

It works by defining K centers, one per cluster, with K being predefined. The K points are selected at random as initial cluster centers, and each object is assigned to its nearest center. The objects within a cluster are as closely related to one another as possible and differ as much as possible from the objects in other clusters. K-means clustering works very well for large sets of data.
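The assign/update loop can be written out directly; a minimal 1-D sketch with K = 2 (the points and initial centers are hypothetical):

```python
# Hypothetical 1-D data with two obvious groups.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers = [points[0], points[-1]]   # two initial cluster centers

for _ in range(10):   # a few assign/update rounds are enough here
    # Assignment step: each point joins the cluster of its nearest center.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda k: abs(p - centers[k]))
        clusters[nearest].append(p)
    # Update step: each center moves to the mean of its cluster.
    centers = [sum(c) / len(c) for c in clusters]

print(centers)
print(clusters)
```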

**23. What is Machine Learning best suited for?**

Machine Learning is good at replacing labor-intensive decision-making systems that rely on hand-coded decision rules or manual analysis. Machine learning is suitable for five types of analysis:

1. Classification (predicting the class/group membership of items)

2. Regression (predicting real-valued attributes)

3. Clustering (finding natural groupings in data)

4. Multi-label classification (tagging items with labels)

5. Recommendation engines (connecting users to items)


**24. What are the most important machine learning techniques?**

In **Association rule learning**, computers are presented with a large set of observations, each made up of multiple variables. The task is then to learn relations between variables, such as A ∧ B ⇒ C (if A and B happen, then C will also happen).

In **Clustering**, computers learn how to partition observations into subsets so that each partition is made up of similar observations according to some well-defined metric. Algorithms like K-Means and DBSCAN also belong to this class.

In **Density estimation**, computers learn how to find statistical values that describe the data. Algorithms like Expectation-Maximization also belong to this class.
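An association rule such as A ∧ B ⇒ C is usually scored by its support and confidence; a sketch over hypothetical market-basket transactions:

```python
# Hypothetical transactions: each set is one customer's basket.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
]

# Transactions containing both A and B, and those also containing C.
ab = [t for t in transactions if {"A", "B"} <= t]
abc = [t for t in ab if "C" in t]

support = len(abc) / len(transactions)   # how often A, B, C co-occur overall
confidence = len(abc) / len(ab)          # P(C | A and B)
print(support, round(confidence, 2))
```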

**25. What is the difference between overfitting and underfitting?**

In statistics and machine learning, one of the most common tasks is to fit a *model* to a set of training data, so as to be able to make reliable predictions on unseen data.

In *overfitting*, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.

*Underfitting* occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model would also have poor predictive performance.
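The contrast can be seen numerically by fitting polynomials of different complexity to the same noisy linear data (the data, degrees, and random seed below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(0, 0.1, size=10)   # the true trend is linear

simple = np.polyfit(x, y, 1)     # matches the underlying relationship
complex_ = np.polyfit(x, y, 5)   # flexible enough to chase the noise

def sse(coeffs):
    # Sum of squared errors on the training data.
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

# The complex model scores better on training data by fitting the noise,
# which is exactly why it tends to generalize worse to unseen data.
print(sse(simple), sse(complex_))
```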

**26. What is cross-validation?**

Cross-validation is a model validation technique for evaluating how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting, and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a portion of the data to test the model during the training phase, limiting problems like overfitting and gaining insight into how the model will generalize to an independent data set.
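A minimal sketch of the k-fold splitting that underlies cross-validation (k and the sample indices are hypothetical; in practice a library helper such as scikit-learn's `KFold` would be used):

```python
# Nine hypothetical sample indices split into k = 3 folds.
data = list(range(9))
k = 3
fold_size = len(data) // k

for i in range(k):
    test_fold = data[i * fold_size:(i + 1) * fold_size]
    train_folds = data[:i * fold_size] + data[(i + 1) * fold_size:]
    # In practice: fit the model on train_folds, score it on test_fold,
    # then average the k scores for the final performance estimate.
    print(i, test_fold, train_folds)
```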