Data Science Interview Q’s — I

Rupak (Bob) Roy - II
12 min read · Apr 10, 2022


A walkthrough of the essentials of data science interviews.

Hi there, thanks for the continuous support of my previous articles. Today we will go through the essential questions commonly asked by interviewers to probe root-level knowledge of data science, rather than fancy advanced questions.

“If your foundation pillar is strong, you can build anything.”

1. What are the feature selection methods used to select the right variables?

There are three main categories of feature selection methods: filter, wrapper, and intrinsic (embedded) methods.

Filter Methods: Linear Discriminant Analysis, ANOVA, Chi-square

Wrapper Methods: Forward Selection, Backward Selection, Recursive Feature Elimination

Intrinsic (Embedded) Methods: tree-based models
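For illustration, a minimal sketch of a filter method (chi-square) and a wrapper method (RFE) using scikit-learn; the iris dataset and the choice of keeping two features are placeholders, not part of the question.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: keep the 2 features with the highest chi-square scores
X_filter = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Wrapper method: recursively eliminate features using a logistic regression
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
X_wrapper = rfe.fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape)  # (150, 2) for both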

2. You are given a dataset consisting of variables with more than 30% missing values. How will you deal with them?

Answer: We can simply remove the rows with missing values. It is the quickest way, but it comes at the cost of throwing away valuable information.

For a smaller dataset, we can substitute missing values with the mean, median, or mode, or even predict them with models such as linear models or XGBoost.
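As a rough sketch (assuming scikit-learn and pandas; the toy DataFrame and its columns are made up), mean imputation looks like this:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 35], "salary": [50000, 60000, np.nan, 65000]})

# strategy can be "mean", "median", or "most_frequent" (mode)
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)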

3. What are dimensionality reduction and its benefits?

Dimensionality reduction refers to the process of converting a dataset with vast dimensions into data with fewer dimensions (fields) while conveying similar information concisely.

This reduction helps in compressing data and reducing storage space. It also reduces computation time, as fewer dimensions lead to less computing. It removes redundant features; for example, there is no point in storing a value in two different units (meters and inches).
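A minimal sketch of dimensionality reduction with PCA; the random 10-feature data and the choice of 3 components are just placeholders.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)                  # 100 samples, 10 original features
pca = PCA(n_components=3)                    # keep the 3 directions with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (100, 3)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained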

4. For the given points, how will you calculate the Euclidean distance in Python?

plot1 = [1,3]

plot2 = [2,5]

The Euclidean distance can be calculated as follows:

from math import sqrt

euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
print(euclidean_distance)  # sqrt(5) ≈ 2.236

5. How can you select K for K-means?

We use the elbow method to select K for K-means clustering. The idea of the elbow method is to run K-means clustering on the dataset for a range of values of K (the number of clusters) and compute the within-cluster sum of squares for each.

The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid. We pick the K at the "elbow" of the WSS curve, beyond which adding more clusters yields little improvement.
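A minimal sketch of the elbow method (assuming scikit-learn and matplotlib; X here is random toy data):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)

wss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)   # inertia_ is the within-cluster sum of squares

plt.plot(range(1, 10), wss, marker="o")
plt.xlabel("K")
plt.ylabel("WSS")
plt.show()   # pick K at the elbow, where the curve flattens out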

6. What is the significance of the p-value?

A p-value smaller than or equal to 0.05 indicates strong evidence against the null hypothesis, so we reject the null hypothesis (H0) and accept the alternative hypothesis (H1).

A p-value greater than 0.05 indicates weak evidence against the null hypothesis, so we fail to reject the null hypothesis.
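For example, a two-sample t-test with SciPy returns a p-value that can be read off against the 0.05 threshold; the synthetic data below is only for illustration.

import numpy as np
from scipy import stats

group_a = np.random.normal(loc=10.0, scale=2.0, size=50)
group_b = np.random.normal(loc=11.0, scale=2.0, size=50)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
if p_value <= 0.05:
    print(f"p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f}: fail to reject the null hypothesis")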

7. You are given a dataset on cancer detection. You have to build a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model's performance? What can you do about it?

Cancer detection data is typically imbalanced. On an imbalanced dataset, accuracy should not be used as the measure of performance. It is important to focus on the remaining 4%, which represents the patients who were wrongly diagnosed. Early diagnosis is crucial in cancer detection and can greatly improve a patient’s prognosis.

Hence, to evaluate model performance we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F-measure to determine the class-wise performance of the classifier.
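A minimal sketch of computing these metrics from a confusion matrix with scikit-learn; the labels below are made up.

from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate
print(sensitivity, specificity, f1_score(y_true, y_pred))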

8. You have run the association rules algorithm on your dataset, and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true?

Choose the right answer.

  1. {banana, apple, grape, orange} must be a frequent itemset.
  2. {banana, apple} => {orange} must be a relevant rule.
  3. {grape} => {banana, apple} must be a relevant rule.
  4. {grape, apple} must be a frequent itemset.

The answer is 4: {grape, apple} must be a frequent itemset, because every subset of a frequent itemset must itself be frequent.

9. Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to website visitors has any impact on their purchase decisions. Which analysis method should you use?

  1. One-way ANOVA
  2. K-means Clustering
  3. Association Rules
  4. Student's t-test

Answer: 1. One-way ANOVA (there are three groups to compare: coupon A, coupon B, and no coupon).

10. What are feature vectors?

A feature vector is an n-dimensional vector of numerical features that represent an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics(called features) of an object in a mathematical way that’s easy to analyze.

11. What is root cause analysis?

Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique for isolating the root cause of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.

12. Do gradient descent methods always converge to similar points?

They do not, because in some cases they reach a local minimum or local optimum point rather than the global optimum. This is governed by the data and the starting conditions.

13. What is the goal of A/B Testing?

This is statistical hypothesis testing for randomized experiments with two variants, A and B. The objective of A/B testing is to identify changes to a web page that maximise or increase the outcome of a strategy.

14. What are the drawbacks of the linear model?

  • The assumption of linearity of the errors.
  • It cannot be used for count outcomes or binary outcomes.
  • There are overfitting problems that it cannot solve.

15. What is the law of large numbers?

It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It states that the sample mean, sample variance, and sample standard deviation converge to what they are trying to estimate as the sample size grows.
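A quick simulation sketch with NumPy shows the idea: the sample mean of fair die rolls drifts toward the true mean of 3.5 as the number of rolls grows.

import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 10_000, 1_000_000]:
    rolls = rng.integers(1, 7, size=n)   # fair six-sided die
    print(n, rolls.mean())               # approaches 3.5 as n grows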

16. What are confounding variables?

These are extraneous variables in a statistical model that correlate, directly or inversely, with both the dependent and the independent variable. The estimate fails to account for the confounding factor.

17. What is star schema?

It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be joined to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes, star schemas involve several layers of summarization to recover information faster.

18. What are eigenvalues and eigenvectors?

Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; they are the key to understanding linear transformations.

Eigenvalues are the factors by which the transformation stretches or compresses along each of those directions. In data analysis, we usually calculate the eigenvectors and eigenvalues of a correlation or covariance matrix.
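A minimal sketch of computing eigenvalues and eigenvectors of a covariance matrix with NumPy; the random 3-feature data is only a placeholder.

import numpy as np

X = np.random.rand(100, 3)
cov = np.cov(X, rowvar=False)                     # 3x3 covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh is suited to symmetric matrices
print(eigenvalues)        # amount of variance along each direction
print(eigenvectors)       # columns are the directions (as used in PCA)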

19. Why is resampling done?

Resampling is done in any of these cases.

  • Estimating the accuracy of sample statistics by using subsets of accessible data, or drawing randomly with replacement from a set of data points.
  • Substituting labels on data points when performing significance tests.
  • Validating models by using random subsets (bootstrapping, cross-validation).

20. What is selection bias?

Selection bias is a problematic situation in which error is introduced due to a non-random population sample.

21. What are the types of biases that can occur during sampling?

  1. Selection bias
  2. Undercoverage bias
  3. Survivorship bias

22. What is survivorship bias?

Survivorship bias is the logical error of focusing on aspects that support surviving a process and casually overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways.

23. If you have 4 GB of RAM in your machine and you want to train your model on a 10 GB dataset, how would you go about this problem? Have you ever faced this kind of problem in your machine learning/data science experience so far?

For neural networks: train in mini-batches (e.g., feeding NumPy arrays or a data generator batch by batch), so only one batch needs to sit in memory at a time.

For traditional ML: scikit-learn estimators trained with gradient descent, such as SGDClassifier for linear models, support incremental learning (non-linear problems can be handled by adding a kernel approximation step). All we need to do is call the model's partial_fit() function and pass the data batch by batch.
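A minimal sketch of incremental learning with partial_fit; the CSV file name, chunk size, and "target" column are hypothetical.

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])   # all classes must be declared on the first partial_fit call

for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    X_batch = chunk.drop(columns=["target"]).values
    y_batch = chunk["target"].values
    model.partial_fit(X_batch, y_batch, classes=classes)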

24. What is selection bias?

Selection bias is the bias introduced by the selection of individuals, groups, or data for analysis in such a way that proper randomization is not achieved, so that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase “selection bias” most often refers to the distortion of statistical analysis resulting from the method of collecting samples. If selection bias is not taken into account, some conclusions of the study may not be accurate.

25. Explain regularization and why it is useful.

Regularization is the process of adding a tuning parameter to a model to induce smoothness and prevent overfitting. This is most often done by adding a penalty on the weight vector, typically its L1 norm (Lasso) or L2 norm (Ridge), multiplied by a constant. The model predictions should then minimize the loss function calculated on the regularized training set.
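A minimal sketch of L2 (Ridge) and L1 (Lasso) regularization with scikit-learn; the synthetic regression data and alpha=1.0 are placeholders for the tuning parameter.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)    # penalizes the sum of squared weights
lasso = Lasso(alpha=1.0).fit(X, y)    # penalizes the sum of absolute weights (can zero some out)

print(ridge.coef_)
print(lasso.coef_)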

26. What is TF/IDF vectorization?

tf-idf is short for term frequency-inverse document frequency. It is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps adjust for the fact that some words appear more frequently in general.
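A minimal sketch of tf-idf vectorization with scikit-learn on a tiny made-up corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)    # sparse matrix: documents x terms
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))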

27. What is Box Cox Transformation?

A Box-Cox transformation is a way to transform non-normal dependent variables into a normal shape.
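A minimal sketch with SciPy, which also estimates the lambda parameter of the transform; the skewed exponential sample is just an example, and Box-Cox requires strictly positive data.

import numpy as np
from scipy import stats

data = np.random.exponential(scale=2.0, size=1000)    # skewed, non-normal, positive

transformed, fitted_lambda = stats.boxcox(data)
print(fitted_lambda)    # the lambda that brings the data closest to normal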

28. Is it possible to capture the correlation between a continuous and a categorical variable?

Yes, we can use the analysis of covariance (ANCOVA) technique to capture the association between continuous and categorical variables.

29. Does treating a categorical variable as a continuous variable result in a better predictive model?

A categorical variable should be treated as continuous only when it is ordinal in nature; in that case, doing so can result in a better predictive model.

30. What is Power Analysis?

Power analysis is an integral part of experimental design. It helps you determine the sample size required to detect an effect of a given size with a specific level of confidence. It also allows you to estimate the probability of detecting that effect under a given sample-size constraint.

31. What are different ranking algorithms?

Traditional ML algorithms solve a prediction problem (classification or regression) on a single instance at a time. For example, if you are doing spam detection on email, you will look at all the features associated with that email and classify it as spam or not. The aim of traditional ML is to come up with a class (spam or not spam) or a single numerical score for that instance.

Ranking algorithms, i.e. learning to rank (LTR), solve a ranking problem on a list of items. The aim of LTR is to come up with an optimal ordering of those items. As such, LTR does not care much about the exact score each item gets, but cares more about the relative ordering among all the items. RankNet, LambdaRank, and LambdaMART are all LTR algorithms developed by Chris Burges and his colleagues at Microsoft Research.

  1. RankNet — the cost function of RankNet aims to minimize the number of inversions in the ranking. RankNet optimizes this cost function using stochastic gradient descent.
  2. LambdaRank — Burges et al. found that during RankNet training you do not need the costs themselves, only the gradients of the cost with respect to the model scores. You can think of these gradients as little arrows attached to each document in the ranked list, indicating the direction we’d like those documents to move. They further found that scaling the gradients by the change in NDCG obtained by swapping each pair of documents gave good results. The core idea of LambdaRank is to use this new cost function for training a RankNet. On experimental datasets, it shows both speed and accuracy improvements over the original RankNet.
  3. LambdaMART — LambdaMART combines LambdaRank and MART (Multiple Additive Regression Trees). While MART uses gradient boosted decision trees for prediction tasks, LambdaMART uses gradient boosted decision trees with a cost function derived from LambdaRank to solve a ranking task. On experimental datasets, LambdaMART has shown better results than LambdaRank and the original RankNet.

32. Assumptions of Linear Regression.

  1. The relationship between X and y must be linear.
  2. The features must be independent of each other (no multicollinearity).
  3. Homoscedasticity — the variance of the errors must be constant across different input values.
  4. The distribution of y along X (i.e., the residuals) should be normal.

33. What are Type 1 and Type 2 errors? In which scenarios are Type 1 and Type 2 errors significant?

Rejecting a true null hypothesis is known as a Type 1 error. In simple words, false positives are Type 1 errors.

Not rejecting a false null hypothesis is known as a Type 2 error. False negatives are Type 2 errors.

A Type 1 error is significant where a false positive is costly. For example, if a man who is not suffering from a particular disease is marked as positive for that infection, the medications given to him might damage his organs. A Type 2 error is significant where a missed positive is costly. For example, an alarm has to be raised in case of a burglary in a bank, but a system that identifies it as a false case won't raise the alarm on time, resulting in a heavy loss.

34. What are the conditions for overfitting and underfitting?

In overfitting, the model performs well on the training data but fails to generalize to new data. In underfitting, the model is too simple and is not able to identify the correct relationship. The corresponding bias and variance conditions are as follows.

Overfitting — low bias and high variance result in an overfitted model. Decision trees are more prone to overfitting.

Underfitting — high bias and low variance. Such a model does not perform well on test data either. For example, linear regression is more prone to underfitting.

35. What do you mean by Normalisation? What is the difference between Normalisation and Standardization?

Normalisation is the process of bringing the features into a similar range so that the model can perform well and does not become inclined towards any particular feature. For example, suppose we have a dataset where one feature is age, in the range 18–60, and another is salary, ranging from 20,000 to 2,000,000. The values are very different: age is a two-digit integer while salary lies in a significantly higher range. To bring the features into a comparable range, we need Normalisation.

Both Normalisation and Standardization are methods of feature scaling; however, the conversions differ. After Normalisation the data is scaled into the range 0–1, while in Standardization the data is scaled so that its mean comes out to be 0 (and its standard deviation 1).
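A minimal sketch contrasting the two with scikit-learn scalers; the toy age/salary values are made up.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[18, 20000], [30, 500000], [45, 1200000], [60, 2000000]], dtype=float)  # age, salary

print(MinMaxScaler().fit_transform(X))    # normalisation: each column squeezed into [0, 1]
print(StandardScaler().fit_transform(X))  # standardization: each column centred at 0 with unit variance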

36. What do you mean by Regularisation? What are L1 and L2 Regularisation?

Regularisation is a method to improve an overfitted model by introducing extra terms into the loss function. This helps the model perform better on unseen data.

There are two types of Regularisation:

L1 Regularisation — in L1 we add lambda times the absolute values of the weights to the loss function, so the feature weights are penalized on the basis of their absolute value.

L2 Regularisation — in L2 we add lambda times the squared weights to the loss function, so the feature weights are penalized on the basis of their squared values.
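A minimal sketch of how the two penalty terms enter the loss; all values here are toy numbers.

import numpy as np

weights = np.array([0.5, -1.2, 3.0])
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.3])
lam = 0.1

mse = np.mean((y_true - y_pred) ** 2)
l1_loss = mse + lam * np.sum(np.abs(weights))    # L1 / Lasso-style penalty
l2_loss = mse + lam * np.sum(weights ** 2)       # L2 / Ridge-style penalty
print(l1_loss, l2_loss)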

37. Explain Naive Bayes Classifier and the principle on which it works?

The Naive Bayes classifier is a probabilistic model that works on the principle of the Bayes theorem. Its accuracy can be increased significantly by combining it with kernel functions to build a stronger classifier.

Bayes Theorem — this theorem describes conditional probability: when we need to identify the probability of occurrence of event A given that event B has already occurred, we are dealing with conditional probability.
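As a tiny worked example of Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B), with made-up numbers:

p_a = 0.01          # prior probability of event A (say, having a disease)
p_b_given_a = 0.95  # probability of B (a positive test) given A
p_b = 0.05          # overall probability of B

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.19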

38. Explain the DBSCAN clustering technique. In what ways is DBSCAN better than K-Means clustering?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering technique that splits the points into different groups based on a minimum distance and the number of points lying within that range. In DBSCAN clustering we have two significant parameters:

Epsilon — the minimum radius or distance between two data points for them to be tagged in the same cluster.

Min Sample Points — the minimum number of samples that must fall within that range to be identified as one cluster.

The DBSCAN clustering technique has a few advantages over other clustering algorithms:

1. In DBSCAN we do not need to provide a fixed number of clusters; as many clusters can form as the distribution of the data points supports. In K-Means, by contrast, we need to provide the number of clusters to split our data into.

2. DBSCAN also identifies a noise cluster, which helps us spot outliers. This sometimes also acts as a useful signal for tuning the hyperparameters of a model accordingly.
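A minimal sketch of DBSCAN with scikit-learn; the two-moons toy data and the eps and min_samples values are placeholders.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks the noise points (potential outliers)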

The article is already too long to continue so,

Next, we will walk through a Linear Regression and clustering questionnaire in Part II, which will surprise you!

Thanks again, for your time, if you enjoyed this short article there are tons of topics in advanced analytics, data science, and machine learning available in my medium repo. https://medium.com/@bobrupakroy

Some of my alternative internet presences Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy

Let me know if you need anything. Talk Soon.
